Tokenization is the process of breaking text down into smaller units called tokens—which can be words, parts of words, or even individual characters—that AI language models can process. For example, the sentence "Ice cream is amazing" might be split into tokens like ["Ice", "cream", "is", "amazing"], allowing the model to analyze and generate language piece by piece. This step is essential because AI models don't understand raw text directly; instead, each token is mapped to a numerical representation, and the model learns patterns and meaning in language from those numbers.
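As a rough illustration, the sketch below splits the example sentence into word-level tokens and maps each one to an integer ID. This is a simplified, assumed scheme for demonstration only; tokenizers used by production language models typically split text into subwords (e.g., byte-pair encoding or WordPiece) rather than whole words, but the basic idea of converting text into numbered units is the same.

```python
# Minimal word-level tokenization sketch (illustrative only; real LLM
# tokenizers usually use subword schemes such as BPE or WordPiece).

def tokenize(text: str) -> list[str]:
    # Split the text into word-level tokens on whitespace.
    return text.split()

def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign each unique token a numerical ID, since models work with numbers.
    return {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}

text = "Ice cream is amazing"
tokens = tokenize(text)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['Ice', 'cream', 'is', 'amazing']
print(token_ids)  # [0, 1, 2, 3]
```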

