Tokenization is the process of breaking text down into smaller units called tokens—which can be words, parts of words, or even individual characters—that AI language models can process. For example, the sentence "Ice cream is amazing" might be split into tokens like ["Ice", "cream", "is", "amazing"], allowing the model to analyze and generate language piece by piece. This step is essential because AI models don't understand raw text directly; instead, each token is mapped to a numerical representation, and the model learns patterns and meaning in language from those numbers.
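As a rough illustration, the sketch below splits the example sentence into word-level tokens and maps each one to an integer ID. This is a simplified, assumed scheme for demonstration only; tokenizers used by production language models typically split text into subwords (e.g., byte-pair encoding or WordPiece) rather than whole words, but the basic idea of converting text into numbered units is the same.

```python
# Minimal word-level tokenization sketch (illustrative only; real LLM
# tokenizers usually use subword schemes such as BPE or WordPiece).

def tokenize(text: str) -> list[str]:
    # Split the text into word-level tokens on whitespace.
    return text.split()

def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign each unique token a numerical ID, since models work with numbers.
    return {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}

text = "Ice cream is amazing"
tokens = tokenize(text)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['Ice', 'cream', 'is', 'amazing']
print(token_ids)  # [0, 1, 2, 3]
```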

