In plain English
Here is a question that trips people up: what happens when you type a word an LLM has never seen before — a brand-new product name, a typo, a made-up word, or text in a language it barely knows? You might expect it to choke, or to spit out an "unknown" placeholder. It almost never does. The short answer is that a modern LLM has no concept of a whole word in the first place, so it can't be stumped by one.

Instead of a dictionary of words, the model has a fixed list of subword pieces, typically tens of thousands and in some recent models a couple hundred thousand: common chunks like the, ing, pro, tion, plus single characters and even raw bytes. Before the model reads anything, a tokenizer chops your text into the largest known pieces it can. A word it has seen a million times becomes one token; a word it has never seen becomes a string of smaller pieces stitched together.
Think of it like spelling out an unfamiliar name over the phone. You don't have a single sound for "Szczepański" — so you fall back to syllables, then to individual letters: "S, Z, C, Z, E...". You can pronounce any name that way, even one you've never met, because letters are a universal fallback. Subword tokenization does the same thing for an LLM. The worst case is falling all the way down to single characters or bytes — and since every byte is in the vocabulary, there is always a way to represent the text.
Why it matters
To see why this is a big deal, you have to picture how language models worked before subwords. Early systems kept a fixed vocabulary of whole words — say the 50,000 most common ones. Anything outside that list was replaced with a single catch-all token, usually written <UNK> (for "unknown").
That created three nasty failures the moment real-world text arrived:
- Information just vanished. A rare medical term, a person's surname, or a new company name all collapsed into the same <UNK> token. The model literally could not tell them apart: every unknown word looked identical.
- New words were hopeless. Language invents words constantly: rizz, vibecoding, Ozempic. A fixed word list froze on its training day and could never represent anything coined afterward.
- Typos and morphology exploded the vocabulary. run, runs, running, runner, runnin — to a word-level model these are five unrelated entries, and a single dropped letter (runnig) produced yet another unknown.
Subword tokenization makes all three problems shrink. Because any string can be broken into known pieces — down to single bytes if needed — the model can represent anything you throw at it. A new drug name might split into Oz + emp + ic; the pieces carry partial meaning and spelling, so the model has something real to work with instead of a blank <UNK>.
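To make the <UNK> failure concrete, here is a minimal sketch of a word-level encoder. The four-entry vocabulary and the function name `encode_word_level` are invented for illustration; real word-level systems held tens of thousands of entries but behaved the same way at the edges.

```python
# Toy word-level vocabulary (hypothetical); ID 0 is the catch-all <UNK>.
WORD_VOCAB = {"<UNK>": 0, "the": 1, "doctor": 2, "prescribed": 3}

def encode_word_level(text: str) -> list[int]:
    """Map each whitespace-separated word to its ID, or to <UNK> if missing."""
    unk = WORD_VOCAB["<UNK>"]
    return [WORD_VOCAB.get(word.lower(), unk) for word in text.split()]

a = encode_word_level("the doctor prescribed Ozempic")
b = encode_word_level("the doctor prescribed rizz")
print(a)       # [1, 2, 3, 0]
print(b)       # [1, 2, 3, 0]
print(a == b)  # True: the two unknown words are indistinguishable
```

Both sentences encode to the same IDs, so whatever distinguished the last word is gone before the model ever sees it.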
How it works
The tokenizer is built once, before training, by an algorithm that scans a huge pile of text and learns which character sequences are worth keeping as single tokens. The two common algorithms are Byte-Pair Encoding (BPE) and WordPiece, and both work on the same greedy principle: merge frequent neighbors into bigger pieces, keep the most useful ones, and always keep the smallest units (characters or bytes) as a guaranteed fallback.
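The "merge frequent neighbors" idea can be sketched as a toy BPE trainer. This is a simplified illustration, not any library's implementation: the function name `train_bpe` is made up, it works on characters rather than bytes, and it ignores word boundaries and pre-tokenization rules that production tokenizers apply.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges

merges = train_bpe(["run", "runs", "running", "runner"], num_merges=2)
print(merges)  # after two merges, the fused piece is "run"
```

Because all four words share the prefix, two merges are enough for "run" to become a single piece, while the single characters stay available as the fallback.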
At runtime, encoding your text is a matching game. In the simplest picture, which is how WordPiece works, the tokenizer walks through the string and repeatedly grabs the longest piece in its vocabulary that matches at the current position, then continues from there. BPE gets to a similar place by replaying its learned merges in order rather than by strict longest-match. Either way, a common word comes out as one long piece, while a rare word finds no long match and gets covered by several short pieces. The rarer the text, the smaller the pieces, and the more tokens it takes.
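The matching game can be sketched as a greedy longest-match loop. The vocabulary `TOY_VOCAB` and the function name are hypothetical, and real BPE encoders apply merge rules instead of scanning like this, so treat it as an illustration of the idea rather than a faithful tokenizer.

```python
def encode_longest_match(text: str, vocab: set[str]) -> list[str]:
    """Cover `text` with the longest vocabulary piece at each position."""
    pieces, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            # Nothing matched: fall back to the single character.
            pieces.append(text[pos])
            pos += 1
    return pieces

TOY_VOCAB = {"token", "ization", "run", "ig", "n"}
print(encode_longest_match("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(encode_longest_match("runnig", TOY_VOCAB))        # ['run', 'n', 'ig']
print(encode_longest_match("xyz", TOY_VOCAB))           # ['x', 'y', 'z']
```

The `else` branch on the `for` loop is the fallback: when no known piece matches, the loop still advances by one character, so encoding always terminates.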
The "bytes if needed" step is the safety net that kills the out-of-vocabulary (OOV) problem for good. Many modern tokenizers are byte-level, or at least carry a byte fallback: their vocabulary includes all 256 possible byte values. Any text in any language, any emoji, any symbol, any corrupted character is, at the lowest level, a sequence of bytes, and every byte is a known token. So there is no string such a tokenizer can fail to encode. The fallback might be ugly (one character spread across several byte-tokens), but it never breaks.
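A quick way to convince yourself of the byte-level guarantee is to look at the UTF-8 bytes directly. The helper names below are made up for this sketch; it uses only Python's built-in `str.encode` and `bytes.decode`, and it stands in for the lowest layer of a tokenizer rather than a full one.

```python
def byte_fallback(text: str) -> list[int]:
    """Worst-case encoding: one ID per UTF-8 byte, each in range(256)."""
    return list(text.encode("utf-8"))

def from_bytes(ids: list[int]) -> str:
    """Reverse the fallback by gluing the bytes back together."""
    return bytes(ids).decode("utf-8")

for sample in ["hello", "Szczepa\u0144ski", "🦄"]:
    ids = byte_fallback(sample)
    assert all(0 <= i < 256 for i in ids)  # every byte is a known "token"
    assert from_bytes(ids) == sample       # and the round trip is lossless
    print(sample, len(ids))
# hello 5
# Szczepański 12
# 🦄 4
```

The surname has 11 characters but 12 bytes because ń takes two bytes in UTF-8, and the emoji takes four. That is the "ugly but never broken" trade-off in miniature.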
A concrete encode/decode
Here's roughly what happens to a familiar word versus an invented one, using the kind of subword splits a GPT-style tokenizer produces. The pipe | marks a token boundary:
"tokenization" -> token | ization (2 tokens)
"Glorptastic" -> G | lor | pt | astic (4 tokens)
"runnig" (typo) -> run | n | ig (3 tokens)
"🦄" (emoji) -> \xf0 | \x9f | \xa6 | \x84 (4 byte-tokens)

Notice that nothing was ever rejected. The model receives a list of integer token IDs and never sees the raw text at all. When it generates a reply, it emits token IDs and the tokenizer reverses the process, gluing the pieces (and bytes) back into readable text. This is also why an LLM can write a word it has never seen in training: it just produces the right sequence of subword pieces.
Fragmentation: the real cost of rare words
So unknown words don't break anything — but they aren't free either. The price you pay is fragmentation: a single rare word can swell into many tokens, and that has two real consequences.
1. You pay more and use up your context
LLM pricing and context windows are measured in tokens, not words. If a chunk of text fragments into 3x as many tokens as ordinary English, it costs roughly 3x as much to process and eats 3x the space in the prompt. This hits some inputs much harder than others:
| Kind of text | Why it fragments | Token impact |
|---|---|---|
| Common English | Whole words are single tokens | ~0.75 words per token (efficient) |
| Code / identifiers | getUserById splits into pieces | Moderate inflation |
| Rare names, jargon, typos | No whole-word match | Several tokens per word |
| Non-English / non-Latin scripts | Vocabulary is English-heavy | Often 2–4x more tokens |
| Emoji, symbols, random strings | Falls back to raw bytes | Up to several tokens each |
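The arithmetic behind the table is simple enough to sketch. The price of 1.00 USD per million tokens is a placeholder, not any provider's actual rate, and the function name is invented; only the ~0.75 words-per-token figure and the 3x inflation example come from the text above.

```python
def estimate_cost(words: int, tokens_per_word: float, usd_per_million_tokens: float) -> float:
    """Rough cost of processing `words` words at a given fragmentation rate."""
    tokens = words * tokens_per_word
    return tokens * usd_per_million_tokens / 1_000_000

PRICE = 1.00  # hypothetical USD per million tokens

# ~0.75 words per token means ~1.33 tokens per word for ordinary English.
english = estimate_cost(1_000, tokens_per_word=1 / 0.75, usd_per_million_tokens=PRICE)
# Text that fragments into 3x as many tokens.
inflated = estimate_cost(1_000, tokens_per_word=3 / 0.75, usd_per_million_tokens=PRICE)

print(round(inflated / english, 2))  # 3.0
```

The ratio does not depend on the price, which is the point: fragmentation multiplies both your bill and your context usage by the same factor.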
2. The model can lose the thread inside a word
When a word is split into pieces, the model sees the pieces, not the whole. Usually that's fine: it has learned that Oz + emp + ic tend to appear together. But working in tokens rather than letters is also why LLMs are oddly bad at character-level tasks: counting the letters in a word, spelling it backwards, or rhyming. The model never receives individual letters; a common word arrives as one chunky token, and a rare one as a few opaque pieces. That's the root of the famous "how many r's in strawberry" problem.
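The gap between what code sees and what the model sees can be shown in a few lines. The split and the ID numbers below are invented for illustration; the actual pieces and IDs depend on the tokenizer.

```python
word = "strawberry"

# With character access, counting letters is trivial.
print(word.count("r"))  # 3

# Hypothetical split and IDs; a real tokenizer may split differently.
pieces = ["str", "aw", "berry"]
token_ids = [101, 102, 103]

# The model receives only the integers. Nothing in [101, 102, 103]
# says how many times "r" occurs; that has to be learned indirectly.
print(token_ids)
```

Any letter-level fact about a token is something the model has to have picked up from training data, not something it can read off its input.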
Old word-level vs. modern subword models
It's worth seeing the two worlds side by side, because the shift from one to the other is exactly why "unknown word" stopped being a phrase you hear in modern LLM work.
Old word-level models:
- Fixed list of whole words
- Unknown word → single `<UNK>` token
- All unknowns look identical
- Can't represent new coinages
- Spelling and morphology lost

Modern subword models:
- Vocabulary of subword pieces + bytes
- Unknown word → many small pieces
- Pieces preserve spelling and parts
- Any string is representable
- Cost is the only penalty
The key conceptual flip: an old model asked "is this exact word in my list?" and gave up if not. A subword model asks "what's the longest known piece I can match right now?" and keeps going until the whole string is covered. The first question has a hard failure mode; the second one cannot fail, because at worst it falls back to bytes. That single design choice is what retired the out-of-vocabulary problem.
Going deeper
Once the basics click, a few nuances are worth knowing — especially if you're debugging odd model behavior or optimizing token cost.
"Unknown to the tokenizer" vs. "unknown to the model." These are two different things, and conflating them causes confusion. The tokenizer can encode anything — it never hits a true unknown. But the model may still have weak knowledge of a rare word, because it saw those particular subword pieces in that order only a handful of times during training. So a model can faithfully read and echo a brand-new term while having almost no real understanding of what it means. Representable is not the same as understood.
Glitch tokens. A strange edge case: some tokens exist in the vocabulary but appeared so rarely (or never) in training that the model has essentially no learned behavior for them. Feeding such a token can produce bizarre, broken output — these are nicknamed "glitch tokens." They're a reminder that a token being in the vocabulary doesn't guarantee the model knows what to do with it.
Special tokens are a different category. Beyond text pieces, tokenizers reserve a handful of control tokens — markers for the start of a message, the end of a turn, a system role, and so on. These never come from your text; the chat framework inserts them. They're how the model tells your instructions apart from the conversation structure — see chat templates and special tokens.
Whitespace and casing live inside tokens. In byte-level tokenizers, a leading space is part of the token, so " hello" and "hello" can tokenize differently, and Hello may differ from hello. This is why two strings that look almost identical to you can have surprisingly different token counts — and why pasting text with odd spacing can quietly inflate your bill.
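A toy matcher makes the whitespace and casing effect visible. The vocabulary here is hypothetical and deliberately tiny; it only assumes, as described above, that a leading space can be part of a piece and that capitalized forms may not have their own entry.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match with single-character fallback (toy version)."""
    pieces, pos = [], 0
    while pos < len(text):
        # Longest matching end position, or one character if nothing matches.
        end = next(
            (e for e in range(len(text), pos, -1) if text[pos:e] in vocab),
            pos + 1,
        )
        pieces.append(text[pos:end])
        pos = end
    return pieces

VOCAB = {" hello", "hello", "H", "ello"}  # hypothetical entries

print(tokenize(" hello", VOCAB))   # [' hello']
print(tokenize("hello", VOCAB))    # ['hello']
print(tokenize("Hello", VOCAB))    # ['H', 'ello']
print(tokenize("  hello", VOCAB))  # [' ', ' hello']
```

One extra space or one capital letter changes the token count, even though the strings look nearly identical on screen.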
Where to go next. If you want the full mechanics of how the vocabulary is built and matched, read how tokenization works. To understand the unit itself, start with what is a token and tokens vs. words. The durable takeaway: modern LLMs don't have an unknown-word problem — they have a fragmentation problem, and managing token count is the skill that replaces worrying about OOV.
FAQ
Do LLMs have an out-of-vocabulary (OOV) problem?
Not in the old sense. Modern LLMs use subword tokenization, so any string — including words they've never seen — can be broken into smaller known pieces, falling back to single bytes if necessary. Because every byte is in the vocabulary, there is no text the tokenizer can fail to represent.
How are rare words and brand-new terms tokenized?
The tokenizer matches the longest piece in its vocabulary at each position. A rare word finds no whole-word match, so it's covered by several smaller subword pieces (for example Oz + emp + ic). The pieces preserve spelling and partial meaning, so the model still has something real to work with.
What happens when you type a typo into an LLM?
Nothing breaks. A misspelled word simply fails to match its usual single token and gets split into smaller pieces instead — runnig might become run + n + ig. The model still reads it as valid token IDs, which is why LLMs are generally good at understanding text despite typos.
Why do rare words and other languages cost more tokens?
Tokenizer vocabularies are usually trained on English-heavy data, so common English words become single tokens while rare words, code, and non-Latin scripts fragment into many pieces. Since you pay per token, the same sentence can cost 2–4x more in some languages than in English.
What is the difference between a word being unknown to the tokenizer and unknown to the model?
The tokenizer can always encode a word into pieces — it never truly fails. But the model may have seen those pieces in that order very rarely during training, so it can read and repeat a new term while barely understanding its meaning. Representable is not the same as understood.
Can an LLM write a word it has never seen before?
Yes. Because it generates output as a sequence of subword pieces rather than whole words, it can assemble brand-new or rare words from familiar fragments. This is the same mechanism that lets it read unfamiliar words on the way in.