AI/TLDR

What Is Tokenizer Vocabulary Size? Why Models Have ~100K Tokens

Learn what a tokenizer's 'vocabulary' is, why it's a fixed list of around 100,000 tokens, and how its size trades shorter token sequences against larger embedding and output layers.

INTERMEDIATE · 10 MIN READ · UPDATED 2026-06-13

In plain English

Before a language model can read your text, it has to chop that text into pieces called tokens. A token is usually a common word, a word fragment, or even a single character. The vocabulary is the complete, fixed list of every token the model is allowed to use — and the vocabulary size is simply how many entries are on that list. For most modern models, it's somewhere around 100,000 to 250,000.

Think of it like the keys on a giant piano. The model can only play notes that have a key. The vocabulary is the set of keys; the vocabulary size is how many keys the piano has. A small piano can still play any song, but it has to hammer out long melodies one tiny note at a time. A bigger piano has dedicated keys for whole chords, so it covers the same song in far fewer presses. Either way, every piece of music must be expressed using keys that already exist — there is no key for a note the piano was never built with.

That last point is the crucial one. The vocabulary is decided once, before training begins, and then it is frozen forever. The model never invents a new token at runtime. When it meets a word it has no single token for — say a rare medical term or a typo — it doesn't fail; it just spells that word out using several smaller tokens it does have. So vocabulary size isn't about what a model can say. It's about how many tokens it takes to say it.
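
To make the "spell it out from smaller tokens" behaviour concrete, here is a minimal sketch. It assumes a hypothetical toy vocabulary (`TOY_VOCAB`) and uses greedy longest-match splitting, which is a simplification: real BPE tokenizers apply learned merges instead. The point is only that a word with no dedicated token still encodes, just in more pieces.

```python
import string

def tokenize(text, vocab):
    """Greedy longest-match: at each position, take the longest vocabulary entry that fits."""
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

# Hypothetical toy vocabulary: every lowercase letter plus a few longer entries.
TOY_VOCAB = set(string.ascii_lowercase) | {"head", "ache", "the", "itis"}

print(tokenize("headache", TOY_VOCAB))   # ['head', 'ache']
print(tokenize("hepatitis", TOY_VOCAB))  # ['h', 'e', 'p', 'a', 't', 'itis']
```

Both words are encoded successfully; the rarer one simply costs six tokens instead of two.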

Why it matters

Vocabulary size sounds like an obscure internal setting, but it quietly shapes three things you care about as a builder: cost, speed, and how well the model handles languages and code outside English.

  • Cost and context. APIs bill per token, and a model's context window is measured in tokens. A bigger vocabulary packs more characters into each token, so the same paragraph becomes fewer tokens. Fewer tokens means a smaller bill and more of your actual content fitting inside the window.
  • Speed. A model generates one token at a time. If a sentence is 20 tokens instead of 30, that's a third fewer generation steps for the same output — directly faster responses, since latency scales with token count, not character count.
  • Fairness across languages. Early tokenizers were trained mostly on English. They had dedicated tokens for English words but broke many other languages into single characters or even individual bytes. The result: the same sentence in Hindi or Thai could cost three to five times more tokens than in English. Bigger, more multilingual vocabularies shrink that gap.
  • Code and structure. Programming has its own dialect — indentation, brackets, common keywords. Modern vocabularies reserve tokens for these patterns so a code file doesn't explode into single-character tokens.

This is why you'll notice that newer model generations keep nudging their vocabularies upward. Going from a ~32K vocabulary to ~100K or beyond isn't vanity — it's a measured efficiency win that lowers per-request cost and helps non-English users on the very same model. The trade-off, as we'll see, is that a bigger vocabulary also makes parts of the model itself larger.
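
The cost argument is simple arithmetic, sketched below. Every number here is a made-up assumption for illustration (the document length, the characters-per-token ratios, and the price); none of them describes a real model or a real price list.

```python
import math

def estimate_tokens(num_chars, chars_per_token):
    """Rough token count for a text, given an average characters-per-token ratio."""
    return math.ceil(num_chars / chars_per_token)

def estimate_cost(num_tokens, price_per_million_tokens):
    """Cost of processing num_tokens at a flat per-token price."""
    return num_tokens / 1_000_000 * price_per_million_tokens

DOC_CHARS = 12_000   # hypothetical document length in characters
PRICE = 2.00         # hypothetical price per million tokens

# Hypothetical ratios: a smaller vocabulary packs fewer characters into each token.
small_vocab_tokens = estimate_tokens(DOC_CHARS, chars_per_token=3.0)
large_vocab_tokens = estimate_tokens(DOC_CHARS, chars_per_token=4.0)

print(small_vocab_tokens, estimate_cost(small_vocab_tokens, PRICE))
print(large_vocab_tokens, estimate_cost(large_vocab_tokens, PRICE))
```

Under these assumptions the same document is 4,000 tokens with the smaller vocabulary and 3,000 with the larger one, so both the bill and the share of the context window it occupies drop by a quarter.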

How it works

The vocabulary is built once, during a step called tokenizer training, which happens before the model itself is trained. An algorithm — usually Byte Pair Encoding (BPE) — scans a huge sample of text and decides which character sequences are common enough to deserve their own token. The full mechanism is covered in how tokenization works; here we focus on the size dial and what it controls.

BPE starts from the raw bytes and repeatedly merges the most frequent adjacent pair into a new token. You tell it a target vocabulary size — say 100,000 — and it keeps merging until the list reaches that count. A bigger target means more merges, which means longer, more complete chunks (whole words, common code patterns) earn their own token. A smaller target stops earlier, leaving more text to be assembled from short fragments.
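
The merge loop can be sketched in a few lines. This is a toy byte-level BPE trainer written for illustration, not any particular library's implementation: the names `train_bpe` and `merge_pair` are made up here, and production tokenizers add pre-splitting rules, tie-breaking conventions, and far larger corpora.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, target_vocab_size):
    """Toy byte-level BPE. Returns (merges, token ids of `text` after all merges)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}                       # (left_id, right_id) -> new token id
    next_id = 256
    while next_id < target_vocab_size and len(ids) >= 2:
        pair, count = Counter(zip(ids, ids[1:])).most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would not shorten the text
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges, ids

sample = "low lower lowest " * 20
for target in (260, 300):
    merges, ids = train_bpe(sample, target)
    print(f"vocabulary={256 + len(merges)}  tokens for sample={len(ids)}")
```

Raising the target lets the loop run more merges, so the same sample text is encoded in fewer tokens. That is the size dial in miniature.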

Once frozen, that vocabulary connects to the model at two ends. At the input end, every token id maps to a row in the embedding table — a learned vector the model uses to represent that token. At the output end, the model's final layer produces one score for every token in the vocabulary, then picks the next token from those scores. Both of these layers are sized exactly to the vocabulary.

This is the heart of the trade-off. Make the vocabulary bigger and you get shorter token sequences (cheaper and faster), but the embedding table and the output layer both grow in lockstep (more parameters, more memory, and a heavier final softmax over every token). The transformer middle, meaning the attention and feed-forward layers that do the actual reasoning, keeps the same parameter count whatever the vocabulary size; it simply has fewer positions to process when sequences are shorter. So you're trading sequence length against the width of the two ends of the network.
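
To put rough numbers on "the two ends grow in lockstep", the sketch below counts the parameters in the embedding table and the output layer. It assumes a hypothetical hidden size of 4,096, separate (untied) input and output matrices, and no bias terms; real models vary on all three.

```python
def vocab_layer_params(vocab_size, hidden_size):
    """Parameters in the two vocabulary-sized layers, assuming they are not shared."""
    embedding = vocab_size * hidden_size  # one learned row per token id
    output = hidden_size * vocab_size     # one score per token at the final layer
    return embedding + output

HIDDEN_SIZE = 4096  # hypothetical model width

for vocab_size in (32_000, 100_000, 250_000):
    params = vocab_layer_params(vocab_size, HIDDEN_SIZE)
    print(f"{vocab_size:>7,} tokens -> {params:,} parameters")
```

At this width, the two layers go from about 0.26 billion parameters at 32,000 tokens to about 2 billion at 250,000, while everything between them stays the same size.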

The core trade-off, made concrete

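Engineers picking a vocabulary size are balancing two opposing forces. A larger vocabulary shortens token sequences, which lowers cost and cuts the number of generation steps. It also enlarges the embedding table and the output layer, which adds parameters, memory, and a wider softmax at every step.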

There's a subtler reason you can't just crank the vocabulary to a million. Rare tokens are seen far less often during training, so their embeddings get poorly tuned — a token that appears a handful of times never gets a good vector. Past a point, adding tokens stops shortening real text much (you've already covered the common stuff) while burning parameters on entries the model barely learns. That diminishing return is roughly why the industry clustered around the ~100K mark: it captures the frequent words of many languages plus code, without wasting capacity on the long tail.

It's worth separating two numbers that beginners conflate. Vocabulary size is how many distinct tokens exist (a property of the tokenizer). Tokens used is how many tokens a specific piece of text turns into (what you pay for). A bigger vocabulary lowers the second number for the same text — that's the entire payoff. If you want to see the second number for your own text, see how to count tokens.

Typical vocabulary sizes

To build intuition, here's the rough trajectory the field has followed. Treat these as orders of magnitude, not exact specs — vendors tune the precise number per model.

Era / style                 | Approx. vocabulary | What it reflects
----------------------------|--------------------|----------------------------------------
Early word-piece models     | ~30,000            | English-first, lots of word-splitting
First large BPE models      | ~50,000            | Better English, weak on other languages
Modern multilingual models  | ~100,000–130,000   | Strong multilingual + code coverage
Newest, most multilingual   | ~150,000–260,000   | Pushing fairness across many languages

One detail that surprises people: vocabulary sizes are often a round-ish number like 100,000 or a number padded up to a multiple of 64 or 128 (for example 128,000). That padding isn't cosmetic — GPUs run matrix math fastest when dimensions are nicely divisible, so engineers round the vocabulary up to a hardware-friendly size. This connects to why LLMs need GPUs.
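
The rounding itself is a one-line calculation. The helper below is a sketch with a made-up name (`pad_vocab_size`); it rounds a vocabulary size up to the next multiple of a chosen value. In a real model the extra ids are typically left unused.

```python
def pad_vocab_size(vocab_size, multiple=128):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(100_000, 64))   # 100032
print(pad_vocab_size(100_000, 128))  # 100096
print(pad_vocab_size(128_000, 128))  # 128000 (already a multiple, so unchanged)
```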

Common misunderstandings

  • "A bigger vocabulary means the model knows more words." No. A model can express any string regardless of vocabulary size — a small vocabulary just spells unknown words out from fragments. Size affects efficiency, not coverage.
  • "Vocabulary size is the same as context window." Different things. Vocabulary is how many distinct tokens exist; the context window is how many tokens fit in one prompt. A bigger vocabulary helps you use a context window more efficiently, but they're separate numbers.
  • "You can grow the vocabulary after training." Not without consequences. The embedding and output layers are sized to the original vocabulary. Adding tokens means adding untrained rows that need fine-tuning, and you can't shrink it cleanly either.
  • "Token count equals word count." Only loosely. Because of sub-word splitting, one word can be several tokens and one token can be part of a word — this is exactly the gap explained in tokens vs words, and it's why models stumble on letter-level puzzles like the strawberry problem.

Going deeper

Once the basics click, a few finer points reward attention.

Special tokens live in the vocabulary too. Beyond ordinary text, the vocabulary reserves a handful of control tokens — markers for the start and end of a turn, system/user/assistant roles, end-of-sequence, padding, and tool-call delimiters. These are how a model's chat template and special tokens structure a conversation. They count toward the vocabulary size and occupy real embedding rows.
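
As a small illustration of control tokens sharing the same id space, the sketch below builds a toy id table. The token strings are invented for this example, and placing special tokens after the text tokens is an assumption; real tokenizers choose their own names and positions.

```python
def build_vocabulary(text_tokens, special_tokens):
    """Assign one id per token; special tokens are numbered after the text tokens."""
    vocabulary = {}
    for token in list(text_tokens) + list(special_tokens):
        if token in vocabulary:
            raise ValueError(f"duplicate token: {token!r}")
        vocabulary[token] = len(vocabulary)
    return vocabulary

TEXT_TOKENS = ["the", " cat", " sat", "ing", "."]                     # toy text tokens
SPECIAL_TOKENS = ["<|start_of_turn|>", "<|end_of_turn|>", "<|pad|>"]  # hypothetical names

vocab = build_vocabulary(TEXT_TOKENS, SPECIAL_TOKENS)
print(len(vocab))        # 8: five text tokens plus three special tokens
print(vocab["<|pad|>"])  # 7
```

The vocabulary size, and therefore the number of embedding rows, is the combined count.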

Weight tying. Many models share the same matrix between the input embedding table and the output layer (called tied embeddings or weight tying). This halves the parameter cost of a large vocabulary, since one big table does double duty. It's a major reason a ~100K+ vocabulary is affordable: you pay for the table once, not twice.
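
Here is a minimal sketch of what "one table doing double duty" means, using plain Python lists instead of a tensor library. The sizes are tiny and hypothetical, the weights are random rather than trained, and the helper names are made up for this example.

```python
import random

def make_table(vocab_size, hidden_size, seed=0):
    """A vocab_size x hidden_size table of random weights (stand-in for learned ones)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]

def embed(table, token_id):
    """Input end: look up the row for a token id."""
    return table[token_id]

def logits(table, hidden):
    """Output end: one score per token, the dot product of `hidden` with each row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in table]

def count_params(*tables):
    """Count weights, counting a table only once even if it is passed twice."""
    unique = {id(t): t for t in tables}
    return sum(len(t) * len(t[0]) for t in unique.values())

VOCAB_SIZE, HIDDEN_SIZE = 1000, 16  # hypothetical toy sizes

shared = make_table(VOCAB_SIZE, HIDDEN_SIZE)
input_table = make_table(VOCAB_SIZE, HIDDEN_SIZE, seed=1)
output_table = make_table(VOCAB_SIZE, HIDDEN_SIZE, seed=2)

tied_params = count_params(shared, shared)               # same table at both ends
untied_params = count_params(input_table, output_table)  # two separate tables
scores = logits(shared, embed(shared, 5))

print(tied_params, untied_params, len(scores))  # 16000 32000 1000
```

With tying, the same rows serve as input vectors and as output scoring weights, so the vocabulary-sized parameters are counted once.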

The softmax tax. At every generation step, the model computes a score for every token in the vocabulary and normalizes them — a vocabulary-wide softmax. Double the vocabulary and that final step roughly doubles in width. For most models this is a small slice of total compute, but it's the reason vocabulary size isn't free even when sequences get shorter.
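
The sketch below shows the two pieces of that final step: a numerically stable softmax over a list of scores, and a rough multiply-add count for the output projection. The hidden size of 4,096 is a hypothetical value, and the count ignores everything except the projection itself.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_layer_multiply_adds(vocab_size, hidden_size):
    """Multiply-adds to score every token once: one dot product per vocabulary entry."""
    return vocab_size * hidden_size

probs = softmax([2.0, 1.0, 0.1])  # a three-token "vocabulary" for illustration
print(probs.index(max(probs)))    # 0, the highest score gets the highest probability

HIDDEN_SIZE = 4096  # hypothetical model width
print(output_layer_multiply_adds(100_000, HIDDEN_SIZE))
print(output_layer_multiply_adds(200_000, HIDDEN_SIZE))
```

The work for this step is proportional to the vocabulary size, so doubling the vocabulary doubles it, and it is paid once for every generated token.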

Why not one token per byte? You could use a tiny vocabulary of just 256 byte values — truly universal, never an unknown token. Some research explores exactly this byte-level direction. The catch is that text then becomes very long in tokens, so the model must reason over far longer sequences, which is expensive in a different way. Vocabulary size is the dial that balances sequence length against layer width, and ~100K happens to be a sweet spot for today's hardware and training data. This sits alongside broader scaling-law trade-offs.
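
A byte-level "tokenizer" in this sense needs almost no code, which is part of its appeal. The sketch below treats each UTF-8 byte as one token; the function names are made up for illustration.

```python
def byte_tokenize(text):
    """One token per UTF-8 byte, so every id is in the range 0..255."""
    return list(text.encode("utf-8"))

def byte_detokenize(ids):
    """Inverse of byte_tokenize."""
    return bytes(ids).decode("utf-8")

text = "caf\u00e9"  # "café"; the accented letter takes two bytes in UTF-8
ids = byte_tokenize(text)

print(len(text), len(ids))           # 4 characters, 5 tokens
print(byte_detokenize(ids) == text)  # True
```

Nothing is ever out of vocabulary, but every character costs at least one token, and characters outside ASCII cost two or more.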

Where to go next. If you want the algorithm that builds the vocabulary, read how tokenization works. For the bigger picture of how these pieces sit inside a model, see how LLMs work. The durable takeaway: vocabulary size is a deliberate efficiency choice — it decides how compactly text is encoded, and it pays for that compactness with wider layers at the two ends of the network.

FAQ

What is tokenizer vocabulary size?

It's the total number of distinct tokens in a model's fixed token list. The tokenizer can only encode text using tokens from this list, and the size — often around 100,000 for modern models — is chosen once before training and then frozen for the life of the model.

Why do many models have around 100,000 tokens?

Roughly 100K hits a sweet spot. It's large enough to give dedicated tokens to the common words of many languages plus code (so text encodes compactly), but not so large that you waste parameters on rare tokens the model barely sees during training. Past that point, adding tokens shortens real text very little while bloating the embedding and output layers.

Does a bigger vocabulary make a model smarter?

Not directly. A model can express any text regardless of vocabulary size — a smaller vocabulary just spells uncommon words out from fragments. A bigger vocabulary mainly makes text cheaper and faster to process and improves fairness for non-English languages, rather than adding knowledge or reasoning ability.

How does vocabulary size affect API cost?

APIs charge per token, so a bigger vocabulary that packs more characters into each token turns the same text into fewer tokens — lowering your bill and letting more content fit in the context window. When comparing models, compare cost on the same text, not just the per-token rate, since vocabularies differ.

Can you change a model's vocabulary after training?

Not cleanly. The embedding table and output layer are sized exactly to the original vocabulary. You can add new tokens, but their embeddings start untrained and need fine-tuning, and you can't shrink the vocabulary without retraining. In practice the vocabulary is treated as permanent once chosen.

Is vocabulary size the same as the context window?

No. Vocabulary size is how many distinct tokens exist in the tokenizer's list. The context window is how many tokens fit into a single prompt. They're separate numbers — though a bigger vocabulary helps you use a given context window more efficiently because the same text becomes fewer tokens.

Further reading