AI/TLDR

What Is a Token in an LLM? Tokenization Explained for Beginners

Learn the unit every model reads, every API prices, and every context limit is counted in.

Beginner · 9 min read · Updated 2026-06-11

In plain English

Before an AI model like ChatGPT or Claude reads anything you type, your text gets chopped into tokens — small chunks of characters that the model treats as indivisible units. A token is usually a whole short word (the), a piece of a longer word (ization), a word with its leading space attached ( magic), or a punctuation mark. The model never sees letters, and it never quite sees words. It only ever sees tokens.

Think of Scrabble tiles, except each tile holds a frequently used chunk of text instead of one letter. The model ships with a fixed bag of tiles — its vocabulary — typically somewhere between 50,000 and 250,000 distinct chunks, chosen once before training ever starts. Every prompt you send and every answer it writes is assembled from exactly those tiles, nothing else. If your word doesn't have its own tile, the tokenizer snaps it together out of smaller ones.

Here's a real example, produced with one of OpenAI's published tokenizers. The sentence "Tokenization isn't magic." becomes five tokens: "Token", "ization", " isn't", " magic", and ".". Common words usually get one token each. Rare words shatter: supercalifragilisticexpialidocious splits into ten pieces. And notice the spaces: they don't disappear, they get glued onto the front of the next word.

Why it matters

Tokens sound like an implementation detail. They're not — they're the unit of account for everything you do with an LLM:

  • Money. Nearly every LLM API charges per token: both the tokens you send in (input) and the tokens the model writes back (output), often at different rates. Billing per word or per request is rare. Cut your prompt's token count in half and you cut that part of the bill in half.
  • Memory. A model's context window — how much text it can consider at once — is a hard limit measured in tokens, not pages. "128K context" means 128,000 tokens, roughly the length of a full novel.
  • Speed. Models generate output one token at a time, like very fast typing. An answer that's twice as many tokens takes roughly twice as long to finish streaming.
  • Failure modes. Many famously dumb LLM moments — miscounting the letters in strawberry, fumbling arithmetic on long numbers — trace straight back to how the text was tokenized.
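
To make the money point concrete, here is a minimal sketch of the arithmetic behind a per-token bill. The two prices are made-up placeholders, not any provider's real rates, and `estimate_cost` is a hypothetical helper rather than part of any SDK.

```python
# Hypothetical prices in US dollars per million tokens. Real prices vary
# by provider and model, so look them up before trusting the result.
INPUT_PRICE_PER_MILLION = 2.00
OUTPUT_PRICE_PER_MILLION = 8.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars of a single request."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


full = estimate_cost(input_tokens=4_000, output_tokens=500)
trimmed = estimate_cost(input_tokens=2_000, output_tokens=500)
print(f"${full:.4f}")     # $0.0120
print(f"${trimmed:.4f}")  # $0.0080
```

Halving the prompt halves only the input part of the bill. The output part stays the same because the answer is just as long.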

So who should care? Anyone who pays an AI API bill, anyone deciding whether a document fits in a prompt, and anyone debugging why a model acts strangely around spelling, digits, or non-English text. Tokens also solved a real problem: older NLP systems worked word-by-word with a clumsy catch-all "unknown word" bucket for anything outside their dictionary. Subword tokens replaced that — any text, in any language, can always be broken into pieces the model knows.

How it works

A tokenizer has two jobs: split text into chunks, and map each chunk to an integer ID. The mapping lives in the vocabulary — a big lookup table that's frozen before training begins. The chunk magic might be entry 19,745; the model's entire world is sequences of integers like that.
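
The two jobs can be sketched with a toy tokenizer. Everything here is an illustration: the five-entry vocabulary and its IDs are invented, and the greedy longest-match splitting is a stand-in chosen for brevity, not how production tokenizers decide where to cut.

```python
# Toy vocabulary: chunk -> integer ID. Invented for this example only.
TOY_VOCAB = {"Token": 0, "ization": 1, " isn't": 2, " magic": 3, ".": 4}
ID_TO_CHUNK = {token_id: chunk for chunk, token_id in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Split text into known chunks and map each chunk to its ID."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate chunk first, then shrink.
        for end in range(len(text), pos, -1):
            chunk = text[pos:end]
            if chunk in TOY_VOCAB:
                ids.append(TOY_VOCAB[chunk])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to chunks and glue them together."""
    return "".join(ID_TO_CHUNK[token_id] for token_id in ids)


ids = encode("Tokenization isn't magic.")
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # Tokenization isn't magic.
```

The model only ever receives the list of integers. The text is recovered by running the lookup table backwards.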

Tokens are IDs, not meanings

Inside the model, token 19,745 is not "the letters m-a-g-i-c with a space in front". It's just a row number in a giant table of learned vectors. The model learns what that token means from seeing it in billions of contexts — but it literally cannot look at the characters inside it. That's why a model that writes brilliant essays can fail to count the r's in "strawberry": the word arrives as st + raw + berry, three opaque IDs, not ten letters.
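
Here is a rough sketch of what "a row number in a giant table" means. The table below is tiny and filled with random numbers, where a real model has learned values and far larger dimensions. The sizes and the `embed` helper are assumptions made up for illustration.

```python
import random

VOCAB_SIZE = 8      # a real vocabulary has tens of thousands of rows or more
EMBEDDING_DIM = 4   # a real model uses far wider vectors

# Random stand-in for the learned embedding table: one row per token ID.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one vector per token ID. No characters are involved."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same ID, same row
```

Nothing in the lookup knows how the token was spelled, which is why letter-level questions are hard for the model.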

Generation runs the same pipeline in reverse, one step at a time. The model outputs a score for every token in its vocabulary, one token gets picked, it's appended to the sequence, and the loop runs again. That's next-token prediction — the heartbeat of every modern LLM — and tokens are the beats.
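
The loop can be sketched in a few lines. The score table below is a hand-written stand-in for a trained network, and picking the single highest score (greedy decoding) is just one possible selection rule; real systems often sample instead. All names and numbers are invented for the example.

```python
VOCAB = ["<end>", "Token", "ization", " isn't", " magic", "."]

# Hypothetical stand-in for the model: for each previous token ID,
# a fixed list of scores over the whole vocabulary.
TOY_SCORES = {
    1: [0.0, 0.1, 2.0, 0.3, 0.2, 0.1],  # after "Token", favor "ization"
    2: [0.0, 0.1, 0.1, 1.5, 0.4, 0.2],  # after "ization", favor " isn't"
    3: [0.0, 0.1, 0.1, 0.1, 1.8, 0.3],  # after " isn't", favor " magic"
    4: [0.2, 0.1, 0.1, 0.1, 0.1, 1.9],  # after " magic", favor "."
    5: [2.5, 0.1, 0.1, 0.1, 0.1, 0.1],  # after ".", favor "<end>"
}


def next_token_scores(token_ids: list[int]) -> list[float]:
    """Return one score per vocabulary entry (toy: looks at the last token only)."""
    return TOY_SCORES[token_ids[-1]]


def generate(prompt_ids: list[int], max_new_tokens: int = 10) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Greedy choice: take the ID with the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == 0:  # the "<end>" token stops generation
            break
        ids.append(next_id)  # append and loop again
    return ids


out = generate([1])
print(out)                             # [1, 2, 3, 4, 5]
print("".join(VOCAB[i] for i in out))  # Tokenization isn't magic.
```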

Where does the vocabulary come from? Almost every modern model uses byte-pair encoding (BPE) or a close cousin: start from single bytes, scan a mountain of training text, and repeatedly merge the most frequent adjacent pair into a new token until you hit the target vocabulary size. Frequent strings ( the, ing, http) earn their own tokens; rare strings stay as multiple pieces. The full algorithm gets its own article: How does tokenization work?
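
As a rough sketch of that merge loop, the function below trains on a single short string. It is a simplified teaching version under stated assumptions: real tokenizers train on huge corpora and usually pre-split text before merging, and `train_bpe` is a hypothetical name, not a library function.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Repeatedly merge the most frequent adjacent pair of tokens."""
    # Start from single bytes.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging gains nothing
        merges.append((left, right))
        # Rewrite the token list with the chosen pair fused into one token.
        merged_tokens = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged_tokens.append(left + right)
                i += 2
            else:
                merged_tokens.append(tokens[i])
                i += 1
        tokens = merged_tokens
    return merges, tokens


merges, tokens = train_bpe("aaabdaaabac", num_merges=1)
print(merges)  # [(b'a', b'a')]
print(tokens)  # [b'aa', b'a', b'b', b'd', b'aa', b'a', b'b', b'a', b'c']
```

Each merge adds one entry to the vocabulary, so the number of merges sets the final vocabulary size (256 base bytes plus one token per merge).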

See it yourself in Python

You don't have to take any of this on faith. OpenAI publishes its tokenizer as an open-source library called tiktoken, and you can inspect real tokens locally in a few lines:

tokens.py (Python)
# pip install tiktoken
import tiktoken

# o200k_base is one of OpenAI's published encodings
enc = tiktoken.get_encoding("o200k_base")

text = "Tokenization isn't magic."

ids = enc.encode(text)
print(ids)        # [4421, 2860, 12471, 19745, 13]
print(len(ids))   # 5  <- the number your API bill is based on

# See exactly which chunk each ID stands for
print([enc.decode([i]) for i in ids])
# ['Token', 'ization', " isn't", ' magic', '.']

Run it and you'll see the five IDs and the five chunks from earlier — including the spaces glued onto isn't and magic. Then try your own strings: paste in a long email, some Python code, a sentence in another language, a wall of digits. Watching where the splits land builds better intuition than any article can.

Five surprises that bite beginners

All of these were measured with the same tokenizer as above. Other models split differently in the details, but the patterns hold everywhere:

| Input | Tokenizes as | The surprise |
| --- | --- | --- |
| "cat" vs " cat" | two different single tokens | A leading space changes the token entirely: "cat" and " cat" are unrelated entries in the vocabulary. |
| 12345 | 123 + 45 | Numbers split into arbitrary chunks, one reason LLMs are shaky at arithmetic on long digit strings. |
| strawberry | st + raw + berry | Even everyday words can split, and the model can't see the letters inside the pieces. |
| ChatGPT is great | Chat + GPT + is + great | Brand names, code identifiers, and jargon often cost more tokens than ordinary words. |
| Same greeting, four languages | English 7 · German 9 · Japanese 10 · Hindi 10 | Non-English text usually costs more tokens for the same meaning (the "token tax"). With older tokenizers the gap was far larger. |

The practical upshot: token counts are not intuitive. When money or a context limit is at stake, never eyeball them — measure.
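
Once you have a measured count, the context check itself is simple arithmetic. This sketch assumes the prompt and the reply share one window, which is common but worth confirming for the model you use. The 128,000 figure is the example limit from earlier, and `fits_in_context` is a hypothetical helper.

```python
CONTEXT_WINDOW = 128_000  # example limit, in tokens


def fits_in_context(prompt_tokens: int, reserved_for_output: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """Check that the prompt plus the room kept for the answer fits the window."""
    return prompt_tokens + reserved_for_output <= context_window


print(fits_in_context(prompt_tokens=120_000, reserved_for_output=4_000))  # True
print(fits_in_context(prompt_tokens=126_000, reserved_for_output=4_000))  # False
```

Feed it a count from a real tokenizer, not a guess based on word count.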

Going deeper

Vocabulary size is a genuine engineering trade-off. GPT-2 shipped with 50,257 tokens; current frontier models run vocabularies several times larger. A bigger vocabulary compresses text into fewer tokens, so the same document costs less to process and more of it fits in the context window. The price: the embedding table and the output layer both scale with vocabulary size, and the rarest tokens appear so seldom in training that their vector representations stay half-baked. That under-training produced the infamous glitch tokens — strings like SolidGoldMagikarp in GPT-2/GPT-3-era vocabularies that triggered bizarre behavior because the token existed in the vocabulary but was almost never seen during training.

Most modern tokenizers are byte-level, or fall back to bytes when nothing larger matches. The base alphabet is the 256 possible byte values, so any input (emoji, rare scripts, even garbled text) can always be encoded. The old "unknown word" failure mode is gone entirely. The flip side is graceful degradation rather than failure: text the vocabulary wasn't optimized for dissolves into many near-byte tokens, burning through your context budget fast.
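
You can see the byte-level floor directly with Python's built-in UTF-8 encoding. This sketch shows only the worst case, where every byte becomes its own token; a real tokenizer merges common byte runs into larger tokens. The helper names are made up for the example.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer (0-255) per byte."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_ids: list[int]) -> str:
    """Reverse the encoding: bytes back to text."""
    return bytes(byte_ids).decode("utf-8")


# "h\u00e9llo" is "héllo" written with an explicit escape for the accented e.
for sample in ["magic", "h\u00e9llo", "日本語", "🙂"]:
    ids = to_byte_tokens(sample)
    assert from_byte_tokens(ids) == sample  # nothing is ever "unknown"
    print(f"{len(sample)} characters -> {len(ids)} bytes: {ids}")
```

The four samples come out as 5, 6, 9, and 4 bytes. Plain ASCII costs one byte per character, while the Japanese text and the emoji cost three and four bytes per character, which is where the extra tokens come from when the vocabulary has no larger chunk to offer.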

Vocabularies reserve special tokens that never come from your text: end-of-sequence markers like <|endoftext|>, plus the structural markers chat models use to tell system, user, and assistant turns apart. APIs typically strip or neutralize these when they show up in user input, because smuggling a fake "end of user turn" marker into a prompt is a real injection vector. Chat templates and special tokens get their own article in this series.

A tokenizer is welded to its model. Every weight in the network was learned against one specific vocabulary, so you can't swap tokenizers after training, and extending the vocabulary — say, to add domain jargon or better support a language — means adding new embedding rows and training them from scratch. This lock-in is also why token counts across model families are apples-to-oranges: the "same" prompt is literally a different integer sequence for each model.

The open question is whether tokens should exist at all. Tokenization is the last hand-engineered stage in an otherwise end-to-end learned pipeline, and it causes real damage: spelling blindness, the multilingual token tax, brittle number handling. Byte-level models like Google's ByT5 skip tokenization and consume raw bytes, at the cost of much longer sequences. More recently, Meta's Byte Latent Transformer groups bytes into learned "patches" whose size adapts to how predictable the text is, showing that byte-level models can compete with token-based ones at scale. If that research line wins, this article becomes a history lesson — but every major production model today still runs on tokens.

FAQ

How many words is 1,000 tokens?

Roughly 750 words of ordinary English — about a page and a half of single-spaced text. The ratio shifts with content: code, URLs, digits, and non-English languages pack fewer words per token, so measure with a real tokenizer when the number actually matters.
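
The rule of thumb is easy to turn into a quick estimator. This is only a sketch of the 750-words-per-1,000-tokens ratio quoted above for ordinary English, with hypothetical function names; it is not a substitute for counting with a real tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # rough average for ordinary English prose


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Rough word estimate from a token count."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_tokens(750))    # 1000
print(estimate_words(1_000))   # 750
print(estimate_words(128_000)) # 96000
```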

Is a token the same as a word?

No. Short common words are usually one token each, but longer or rarer words split into several pieces, and a token often includes the leading space. In English a token averages about three-quarters of a word.

Do spaces and punctuation count as tokens?

Yes — nothing is free. Spaces are usually glued onto the front of the following word ( magic is a single token), and punctuation marks typically get tokens of their own. Whitespace-heavy text like deeply indented code carries a real token cost.

Why do AI APIs charge per token instead of per word?

Because tokens are what the model actually computes over. Every input token is processed by every layer of the network, and every output token costs a full forward pass to generate. Compute scales with token count, so billing does too.

Why does the same text give different token counts on different models?

Each model family trains its own tokenizer with its own vocabulary, so the same string splits into different chunks. A count from OpenAI's tiktoken won't match a Llama tokenizer's count. Always count with the tokenizer that belongs to the model you're calling.

Are LLM tokens related to crypto tokens?

Not at all — it's a pure name collision. An LLM token is a chunk of text the model reads and writes; a crypto token is a blockchain asset. (Confusingly, "API tokens" — the secret keys you authenticate with — are a third, also unrelated, thing.)

Further reading