
Tokenization

Also known as: text tokenization, subword tokenization. Closely related: byte-pair encoding (BPE), a widely used subword tokenization algorithm

The process of breaking text into smaller units called tokens before passing it to a language model. A token is roughly a word fragment, a whole short word, or a punctuation mark, and the number of tokens in your input determines how much of the model's context window you use and how much you pay.

Before a language model can process text, that text has to be converted into a sequence of numbers. Tokenization is the step that does this. The text gets split into small units called tokens, each mapped to an ID in the model's vocabulary. Those IDs are what the model actually sees. Most modern models use subword tokenization: common words stay intact as a single token, while rare or long words get split into two or more fragments. The word "tokenization" itself, for example, might become two tokens.
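To make the text-to-IDs step concrete, here is a minimal sketch of subword splitting. It uses greedy longest-match against a tiny hand-written vocabulary, which is a simplified stand-in for what real tokenizers do (BPE, for instance, applies learned merge rules instead). The vocabulary, the IDs, and the function names `toy_tokenize` and `toy_encode` are all hypothetical and not taken from any real model.

```python
# Hypothetical toy vocabulary: fragment -> ID. Real vocabularies hold
# tens of thousands of entries learned from data.
TOY_VOCAB = {"token": 0, "ization": 1, "the": 2, "cat": 3, " ": 4, "s": 5}


def toy_tokenize(text: str, vocab: dict) -> list:
    """Split text greedily, always taking the longest fragment found in vocab.

    Characters that match nothing become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: emit one unknown character.
            tokens.append(text[i])
            i += 1
    return tokens


def toy_encode(text: str, vocab: dict) -> list:
    """Map each token to its ID; unknown tokens share one extra ID."""
    unknown_id = len(vocab)
    return [vocab.get(tok, unknown_id) for tok in toy_tokenize(text, vocab)]


print(toy_tokenize("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(toy_encode("the cats", TOY_VOCAB))        # [2, 4, 3, 5]
```

With this toy vocabulary, "tokenization" splits into two fragments because the whole word is not an entry, while "the" stays intact as a single token.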

The practical reason builders care about tokenization: tokens are the unit of measurement for context windows, latency, and pricing. At a rule of thumb of about 0.75 English words per token, a model with a 128K-token context window can hold roughly 90–100K words of English before it runs out of room. API costs are typically quoted per million tokens, with separate rates for input and output. A prompt that feels short to you might contain far more tokens than expected if it includes code, URLs, or non-English text, which tend to tokenize less efficiently than plain English prose.
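The arithmetic behind these figures is simple enough to sketch. The helpers below are illustrative only: the function names are hypothetical, the 0.75 words-per-token ratio is a rough rule of thumb for English, the prices are made-up placeholders, and the context check assumes that prompt and output tokens share one window, which is common but worth confirming for your model.

```python
def approx_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Rough English word count for a token budget (rule of thumb only)."""
    return int(tokens * words_per_token)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Cost of one request when prices are quoted per million tokens."""
    input_cost = input_tokens * input_price_per_million / 1_000_000
    output_cost = output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int) -> bool:
    """Assumes prompt and generated output count against the same window."""
    return prompt_tokens + max_output_tokens <= context_window


# A 128K-token window at ~0.75 words per token.
print(approx_words(128_000))

# Placeholder prices: 3.00 per million input tokens, 15.00 per million output.
print(round(estimate_cost(200_000, 50_000, 3.0, 15.0), 2))

# A 120K-token prompt leaves no room for a 16K-token reply in a 128K window.
print(fits_in_context(120_000, 16_000, 128_000))
```

Substitute the current rates from your provider's pricing page; the structure of the calculation stays the same.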

Different model families use different tokenizers, so the same text can produce a different token count depending on the model. For OpenAI models, the tiktoken library and the web tool at platform.openai.com/tokenizer let you count tokens before sending requests; other providers generally offer their own counting tools. When you're optimizing prompts for cost or fitting a long document into a context window, knowing your token counts is the first practical step.
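As a sketch of counting tokens before sending a request, the snippet below uses tiktoken when it is installed and can load its encoding, and otherwise falls back to a crude estimate. The function name `count_tokens` is hypothetical, the `cl100k_base` encoding is just one of tiktoken's encodings and may not match your target model, and the four-characters-per-token fallback is a rough approximation for English, not a real count.

```python
import math

try:
    import tiktoken  # third-party package: pip install tiktoken
except ImportError:
    tiktoken = None


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken, or estimate if it is unavailable."""
    if tiktoken is not None:
        try:
            encoding = tiktoken.get_encoding(encoding_name)
            return len(encoding.encode(text))
        except Exception:
            # For example, the encoding data could not be downloaded.
            pass
    # Assumption: about 4 characters per token for English prose.
    return math.ceil(len(text) / 4)


prose = "Tokenization splits text into tokens."
url = "https://example.com/a/very/long/path?query=value&other=123"
print(count_tokens(prose), count_tokens(url))
```

Comparing a prose sentence with a URL of similar length is a quick way to see how differently the two tokenize. For anything other than OpenAI models, use the tokenizer or counting tool published for that model family instead.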

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
