What is Tokenization? | Stanford HAI
What is Tokenization?

Tokenization is the process of breaking down text into smaller units called tokens—which can be words, parts of words, or even individual characters—that AI language models can process. For example, the sentence "Ice cream is amazing" might be split into tokens like ["Ice", "cream", "is", "amazing"], allowing the model to analyze and generate language piece by piece. This step is essential because AI models don't understand text directly; they work with these numerical representations of tokens to learn patterns and meaning in language.
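The idea can be sketched in a few lines of Python. This is a minimal, hypothetical word-level tokenizer with a toy vocabulary; production models typically use subword schemes (e.g., byte-pair encoding) rather than plain whitespace splitting, and the vocabulary here is invented for illustration.

```python
def tokenize(text, vocab):
    """Split text on whitespace, then map each token to an integer ID.

    Unknown words fall back to the <unk> ID, a common convention.
    """
    tokens = text.split()
    unk_id = vocab["<unk>"]
    ids = [vocab.get(tok, unk_id) for tok in tokens]
    return tokens, ids

# Toy vocabulary (invented for this example).
vocab = {"<unk>": 0, "Ice": 1, "cream": 2, "is": 3, "amazing": 4}

tokens, ids = tokenize("Ice cream is amazing", vocab)
print(tokens)  # ['Ice', 'cream', 'is', 'amazing']
print(ids)     # [1, 2, 3, 4]
```

The integer IDs are what the model actually consumes: each ID indexes a learned embedding vector, which is why the text must be tokenized before any pattern learning can happen.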

Tokenization mentioned at Stanford HAI

Explore Similar Terms:

Large Language Model (LLM) | Embeddings | Natural Language Processing (NLP)

See Full List of Terms & Definitions

Improving Equity and Access to Non-English Large Language Models
Prabha Kannan
Apr 22
news

The lessons learned from the fine-tuning and evaluation of Vietnamese LLMs could help broaden access to models beyond English speakers.

Natural Language Processing
