Why Do LLMs Charge More for Non-English Text? Tokenization and Languages

Understand why a sentence in Thai or Arabic can cost several times more tokens than the same sentence in English, and what that means for global users.

INTERMEDIATE · 10 MIN READ · UPDATED 2026-06-13

In plain English

An LLM never reads letters or words directly. It reads tokens — small chunks of text, usually a few characters each, that a piece of software called a tokenizer carves your text into before the model ever sees it. You are billed per token, and the model's memory limit (its context window) is measured in tokens too. So the number of tokens your text turns into is not a detail — it is the price tag and the size limit, both at once.

[Illustration: Tokenizing Other Languages]

Here is the catch. The same meaning, written in different languages, does not turn into the same number of tokens. The English sentence "The cat is sleeping" might be 4 or 5 tokens. The exact same sentence in Hindi, Thai, or Burmese can be 15, 20, or even more. Nothing about the idea got bigger — only the token count did. And since you pay per token, the non-English speaker pays several times more for the identical thought.

Think of a tokenizer like a delivery company that sells boxes pre-cut for one customer. The boxes were sized around English's most common letter-clumps, so an English sentence drops in neatly using a few big boxes. A Thai or Tamil sentence has to be packed into many tiny boxes instead, because no big box was ever cut to fit its shapes. Same shipment, far more boxes, far bigger bill — purely because of how the boxes were designed, not because of what is inside.

Why it matters

For a builder or a global product, this is not an academic curiosity. The token gap shows up in four very practical places, and all four hit non-English users hardest.

Cost. APIs charge per token, input and output. If a language uses 3x the tokens for the same content, every request, every summary, every chat reply costs roughly 3x as much. A product that is cheap to run in English can be quietly expensive to run in Hindi or Arabic.
Context window exhaustion. The model can only hold so many tokens at once. A document that fits comfortably in a context window in English may overflow it in Japanese or Telugu, so you can paste less of the same book, fit fewer chat turns, and lose the start of a conversation sooner.
Speed and latency. Models generate text one token at a time. More tokens for the same answer means more steps, which means slower responses for those users — the wait grows with the token count, not the word count.
Quality. Languages that fragment into many tiny tokens also tend to have less training data, and the two problems compound. The model sees broken-up, unfamiliar pieces and often reasons over them less reliably than over clean English tokens.

Put together, this means a single price and a single context limit can deliver a meaningfully worse deal to most of the planet. Anyone building a multilingual app — support bots, translation tools, education products, anything serving users outside the English-speaking world — needs to know this before they set pricing or promise a feature works "the same" in every language.
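The cost, context, and latency points above come down to simple arithmetic, sketched below. Every number in it (the prices, the 3x multiplier, the 8,000-token window, the generation speed) is a made-up placeholder for illustration, not any provider's real figure.

```python
# Illustrative arithmetic only: every constant below is a hypothetical placeholder.

def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request when input and output are billed per 1,000 tokens."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def scaled_for_language(english_tokens, multiplier):
    """Token count for the same content in a language that needs `multiplier`x the tokens."""
    return round(english_tokens * multiplier)


PRICE_IN, PRICE_OUT = 0.50, 1.50   # hypothetical currency units per 1,000 tokens
CONTEXT_WINDOW = 8000              # hypothetical context limit, in tokens
TOKENS_PER_SECOND = 50             # hypothetical generation speed
MULTIPLIER = 3.0                   # hypothetical token multiplier versus English

english_in, english_out = 1200, 400
other_in = scaled_for_language(english_in, MULTIPLIER)
other_out = scaled_for_language(english_out, MULTIPLIER)

english_cost = request_cost(english_in, english_out, PRICE_IN, PRICE_OUT)
other_cost = request_cost(other_in, other_out, PRICE_IN, PRICE_OUT)

# How much English-equivalent content fits in the same window.
english_equivalent_capacity = CONTEXT_WINDOW / MULTIPLIER

print(f"English request cost: {english_cost:.2f}")
print(f"Same request at {MULTIPLIER:.0f}x tokens: {other_cost:.2f}")
print(f"Window holds the equivalent of {english_equivalent_capacity:.0f} English tokens")
print(f"Generation time: {english_out / TOKENS_PER_SECOND:.0f}s vs {other_out / TOKENS_PER_SECOND:.0f}s")
```

Because both input and output scale with the multiplier, the cost and the wait scale with it too, while the usable context shrinks by the same factor.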

How it works

To see why this happens, you have to look at how a tokenizer is built. Most modern LLMs use a method called byte-pair encoding (BPE) or a close relative. The full mechanics are covered in how tokenization works; here we only need the one idea that drives the language gap.

The tokenizer learns its vocabulary from data

A tokenizer is not hand-written. It is trained on a big pile of text before the model is trained. It starts from single characters (or, in byte-level variants, single bytes) and repeatedly merges the most frequently seen adjacent pair into a single token. Common clumps such as "the", "ing", "tion", and "and" appear constantly in the training text, so they get merged into single, efficient tokens. Rare clumps never get merged and stay split into many small pieces.
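To make the merge idea tangible, here is a toy character-level BPE trainer. It is a minimal sketch, not how any production tokenizer is implemented: real tokenizers work on bytes, pre-split text, and learn tens of thousands of merges. The tiny corpus and the function names (`train_bpe`, `merge_pair`, `encode`) are invented for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges: repeatedly merge the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical, English-heavy training text with one rare Cyrillic word.
training_words = ["the"] * 20 + ["thing"] * 10 + ["sing"] * 10 + ["дом"] * 1
merges = train_bpe(training_words, num_merges=4)

for word in ["the", "sing", "дом"]:
    print(word, "->", encode(word, merges))
```

With only four merges to spend, all of them go to the frequent English pairs, so "the" becomes a single token while the rare word stays at one token per character. A real vocabulary is far larger, but the budget is spent the same way.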

The decisive fact: that training text is overwhelmingly English and other Latin-script languages. So the tokenizer spends almost all of its merge budget learning English-shaped chunks. English ends up with a rich vocabulary of big, meaning-carrying tokens. Languages that were rare in the training data — and most of the world's languages were — never earned those merges, so their text falls back to tiny pieces: often one token per character, or even multiple tokens per character.

// Why one language gets bigger tokens than another

Training text (mostly English) → Learn frequent pairs (merge common clumps) → Vocabulary (big English tokens, few others) → Result (English packs tight, others fragment)

Why non-Latin scripts fragment the worst

There is a second, sharper reason scripts like those used for Chinese, Thai, Arabic, Hindi, and many other languages fragment. Many modern tokenizers operate on raw UTF-8 bytes under the hood. An English letter is one byte. But a single character in many other scripts is encoded as two, three, or four bytes. If the tokenizer never learned a merge for that character, it can split into one token per byte. A single Devanagari or CJK character, which is typically three bytes in UTF-8, can therefore cost up to three tokens on its own, and that is before counting the word-level merges the script never received.
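The byte counts are easy to check with nothing but the standard library. The snippet below is a small sketch that prints how many UTF-8 bytes one character from several scripts takes; the sample characters are arbitrary picks. For a byte-level tokenizer with no merges for a character, that byte count is also its worst-case token count.

```python
def utf8_len(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# One arbitrary sample character per script.
samples = {
    "Latin": "a",
    "Arabic": "ب",
    "Devanagari": "अ",
    "Thai": "ก",
    "CJK": "中",
    "Emoji": "😀",
}

for script, ch in samples.items():
    print(f"{script:11} {ch!r}: {utf8_len(ch)} byte(s)")
```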

// The same idea, very different token counts

English

"hello" → roughly 1 token
Latin letters = 1 byte each
Common words pre-merged
Best-case cost and context

Morphologically rich

Long glued-together words
Many rare word forms
Few merges learned for them
2–3x more tokens typical

Non-Latin script

Multi-byte characters
Often 1 token per byte
Little to no merging
3–8x more tokens common

Two forces stack on top of each other

So the token tax comes from two overlapping causes. First, script: non-Latin characters take multiple bytes and rarely got merges, so they shatter into pieces. Second, morphology: languages like Finnish, Turkish, or Tamil build long words by gluing many parts together, and each rare combined form was too uncommon to earn its own token. A language that is both non-Latin and morphologically rich — say, Tamil or Georgian — gets hit by both forces at once, which is how you reach 5x, 8x, or worse versus English.

A worked example: the same sentence in four languages

Let's make it concrete with a tiny script. You can run this against any tokenizer to see the gap for yourself. The point is not the exact numbers — they vary by model and tokenizer — but the shape of the result: English is small, and other languages are reliably larger for identical meaning.

compare_languages.py (Python)

# Compare token counts for the same meaning across languages.
# Uses tiktoken (OpenAI's BPE tokenizer) as a stand-in; the
# pattern is identical for any tokenizer you can call.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English":  "The weather is nice today.",
    "Spanish":  "El clima está agradable hoy.",
    "Hindi":    "आज मौसम अच्छा है।",
    "Japanese": "今日はいい天気です。",
}

baseline = len(enc.encode(sentences["English"]))
for lang, text in sentences.items():
    n = len(enc.encode(text))
    print(f"{lang:9} {n:3d} tokens  ({n / baseline:.1f}x English)")

Run that and you will see English come out smallest, with the other languages costing noticeably more tokens for a sentence that means exactly the same thing. The multiplier depends on the language and the tokenizer, but English is almost always the cheapest, and non-Latin scripts are almost always the most expensive.

What you can actually do about it

You can't rewrite a model's tokenizer, but you are not helpless. A few practical moves blunt the token tax in a real product.

Budget per language, not globally. Measure real token counts on sample text in each language you support, then set cost limits and context budgets from the worst case, not from English.
Trim the prompt aggressively. Long system prompts and few-shot examples are paid for on every call, and they are even more expensive when the surrounding language is token-heavy. Keep instructions tight.
Pick a tokenizer-friendly model when you can. Different model families use different tokenizers, and some are markedly more efficient on a given language than others. If a language matters to you, test a few models and compare token counts on the same text before committing.
Consider an English pivot for internal steps. For non-user-facing reasoning (planning, classification, tool calls), running the model's internal work in English and only translating the final reply can cut tokens — though it adds a translation step and can lose nuance, so test quality carefully.
Cap output length explicitly. Since generation is per-token, a verbose answer in a token-heavy language is doubly costly. Ask for concise answers and set a sensible max.
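The first tip, budgeting from the worst case rather than from English, can be sketched as follows. The helper names are hypothetical, and the token counter is pluggable: `utf8_byte_count` is only a stand-in so the example runs without dependencies (for a byte-level tokenizer it is a pessimistic upper bound on the token count, not a real count). In practice you would pass your model's actual tokenizer.

```python
def utf8_byte_count(text):
    """Stand-in token counter (assumption): UTF-8 byte length, a pessimistic proxy."""
    return len(text.encode("utf-8"))


def language_multipliers(samples, count_tokens, baseline="English"):
    """Token count of each sample relative to the baseline language's sample."""
    base = count_tokens(samples[baseline])
    return {lang: count_tokens(text) / base for lang, text in samples.items()}


def worst_case_budget(context_tokens, multipliers):
    """Return the costliest language and the English-equivalent tokens that fit for it."""
    worst_lang = max(multipliers, key=multipliers.get)
    return worst_lang, int(context_tokens / multipliers[worst_lang])


samples = {
    "English": "The weather is nice today.",
    "Spanish": "El clima está agradable hoy.",
    "Hindi": "आज मौसम अच्छा है।",
    "Japanese": "今日はいい天気です。",
}

multipliers = language_multipliers(samples, utf8_byte_count)
worst_lang, budget = worst_case_budget(8000, multipliers)  # 8000 is a hypothetical window

for lang, m in multipliers.items():
    print(f"{lang:9} {m:.2f}x English")
print(f"Plan for {worst_lang}: about {budget} English-equivalent tokens of context")
```

A single short sentence per language is only enough to show the mechanics. For real budgets, measure on a representative sample of your own traffic.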

None of these fully closes the gap. They manage it. The honest framing for stakeholders is: serving some languages simply costs more per unit of meaning, and that cost should be planned for, not discovered on the bill.

Going deeper

Once the basics click, a few deeper threads are worth knowing — both for fairness conversations and for engineering decisions.

It is a documented fairness issue, not folklore. Researchers have measured the disparity across dozens of languages and found that some can require many times more tokens than English for equivalent text. Because API pricing and context limits are uniform, this translates directly into non-English speakers paying more and getting smaller effective context — a structural inequality baked into the tooling, not the model's intent.

Newer tokenizers are getting better, slowly. Each generation of frontier models tends to ship a larger, more multilingual vocabulary than the last, which shrinks (but does not erase) the gap. A bigger vocabulary can afford more merges for non-English scripts. This is why the same sentence can tokenize differently across model versions, and why it is worth re-checking token counts when you upgrade models rather than assuming last year's numbers hold.
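Re-checking counts on an upgrade can be automated with a small drift check. This is a sketch under stated assumptions: `token_drift` is a hypothetical helper, and the two counters below are stand-ins (UTF-8 bytes for the "old" tokenizer, code points for the "new" one) chosen only so the example runs without dependencies. In real use, pass the counting functions of the two tokenizer versions you are comparing.

```python
def token_drift(samples, old_count, new_count, threshold=0.10):
    """Return {language: relative change} where the count moved by more than `threshold`."""
    drift = {}
    for lang, text in samples.items():
        old, new = old_count(text), new_count(text)
        change = (new - old) / old
        if abs(change) > threshold:
            drift[lang] = change
    return drift


def old_tokenizer(text):
    """Stand-in for the old tokenizer (assumption): one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


def new_tokenizer(text):
    """Stand-in for the new tokenizer (assumption): one token per code point."""
    return len(text)


samples = {
    "English": "The weather is nice today.",
    "Japanese": "今日はいい天気です。",
}

drift = token_drift(samples, old_tokenizer, new_tokenizer)
for lang, change in drift.items():
    print(f"{lang}: token count changed by {change:+.0%}")
```

Languages that do not appear in the result stayed within the threshold, so their existing budgets can be kept.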

The fragmentation also hurts model behavior, not just cost. When a word shatters into many tiny pieces, the model has a harder time treating it as a single unit of meaning. This is closely related to why models stumble on character-level tasks — the famous strawberry problem of miscounting letters — and the effect is often worse in heavily-fragmented languages. So tokenization bias can quietly degrade quality, not only economics.

Special and control tokens add a fixed overhead too. Every chat request is wrapped in template and role markers, covered in chat templates and special tokens. That overhead is the same in every language, so on short messages it is a larger share of a cheap English request — a smaller effect than the script gap, but worth knowing when you profile real traffic.
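The share of a request taken up by fixed overhead is a one-line calculation. The numbers below are hypothetical: a 10-token template overhead, a 5-token English message, and the same message at 4x the tokens in another language.

```python
def overhead_share(message_tokens, overhead_tokens):
    """Fraction of the request's tokens that are fixed template overhead."""
    return overhead_tokens / (overhead_tokens + message_tokens)


OVERHEAD = 10            # hypothetical template and role-marker tokens per request
english_message = 5      # hypothetical short English message
other_message = 5 * 4    # same message at a hypothetical 4x multiplier

print(f"English: {overhead_share(english_message, OVERHEAD):.0%} overhead")
print(f"4x language: {overhead_share(other_message, OVERHEAD):.0%} overhead")
```

The absolute overhead is identical in both cases. Only its share of the total changes, which is why it matters most when profiling short messages.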

Where to go next: solidify the foundation with what is a token and how tokenization works, then zoom out to how LLMs work to see where tokenization sits in the whole pipeline. The durable takeaway: the model is multilingual, but the tokenizer — the thing that decides your bill — was trained on a lopsided slice of the world's text, and that bias flows straight through to cost, speed, context, and quality.

FAQ

Why do non-English languages use more tokens?

Tokenizers are trained mostly on English-heavy text, so they learn big, efficient tokens for English clumps but few for other languages. Non-Latin scripts also use multi-byte characters that often split into one token per byte. The result is that the same meaning fragments into far more tokens in many other languages.

Does that mean non-English text actually costs more money in an LLM?

Yes. APIs bill per token for both input and output, so if a language uses two or three times the tokens for the same content, it costs roughly two or three times as much. The price gap is real and falls on non-English users for identical meaning.

Which languages are the most expensive in tokens?

Generally, languages written in non-Latin scripts are hardest hit, and the gap is widest when the language is also morphologically rich. Hindi, Thai, Burmese, Tamil, Telugu, and many others can run several times more tokens than English. Latin-script European languages like Spanish or French are usually only mildly more expensive.

Can I reduce token costs for non-English text?

You can manage it: keep prompts tight, cap output length, compare models since their tokenizers differ, and for internal (non-user-facing) steps consider reasoning in English. These reduce the gap but do not fully close it, because the underlying tokenizer bias remains.

Does using more tokens also make the model worse, not just pricier?

Often, yes. When words shatter into many tiny pieces, the model has a harder time treating them as single units of meaning, and heavily-fragmented languages also tend to have less training data. So cost, context limits, and answer quality can all suffer together.

Is this token bias being fixed?

Slowly. Newer models tend to ship larger, more multilingual tokenizer vocabularies, which shrinks the gap but does not erase it. Because of this, token counts for the same sentence can change between model versions, so it is worth re-measuring when you upgrade.
