Why LLMs Can't Count Letters: The Strawberry Test

In plain English

Ask a chatbot "how many R's are in strawberry?" and for years the classic answer was a confident, wrong two. The real answer is three: one in "straw" and two in "berry". The same model can write you a working web app or explain quantum tunneling, yet it trips over a question a six-year-old gets right. That gap is so famous it has a name: the strawberry problem.

Here is the secret: the model never sees the word strawberry as the letters s-t-r-a-w-b-e-r-r-y. Before the text reaches the model, a piece of software called the tokenizer chops it into a few chunks — something like straw + berry, or st + raw + berry — and hands the model ID numbers for those chunks, not individual letters. Asking the model to count R's is like asking you to count the letters in a word you've only ever heard out loud and never seen written down.

Why it matters

The strawberry test looks like a party trick, but it's a window into a whole family of failures — and understanding the cause stops you from being surprised (or worse, trusting a wrong answer) in real work.

Counting letters or substrings — "how many R's / vowels / double letters" questions.
Spelling and reversing — spell a word backwards, or insert a hyphen between every letter.
Rhyme, anagrams, acrostics — anything that hinges on the characters inside a word.
Naive arithmetic — long numbers get split into odd token chunks, so column-by-column math goes sideways.
Wordplay and ciphers — Pig Latin, Caesar shifts, "words that start and end with the same letter."

If you've ever seen a model botch one of these while nailing far harder tasks, you weren't watching it get "dumber" — you were watching the tokenizer hide the raw characters. The same mechanism quietly affects token counts and therefore your bill and your context budget, which is why it's worth understanding alongside Tokens vs Words vs Characters. It also explains a chunk of "why is the model wrong here?" confusion that gets wrongly blamed on hallucination.

How it works

Nearly every production LLM starts with the same step: text comes in, the tokenizer chops it into chunks ("tokens"), and each chunk becomes an integer ID. The model only ever sees those integers. It learns deep statistical relationships between token IDs, but the letters that live inside a token are invisible to it unless it memorized them from training text.

// What the model actually receives

"strawberry" (the text you typed)
  -> Tokenizer (BPE merge rules)
  -> ["st", "raw", "berry"] (subword chunks)
  -> [267, 1618, 19772] (integer IDs, what the model sees)

Most modern tokenizers use Byte-Pair Encoding (BPE): start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token. Common words become single tokens; rarer words get split into a few pieces. The exact split depends on the vocabulary, so the same word splits differently across models. That's covered in depth in How Tokenization Works.
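
The merge loop described above can be sketched in a few lines. This is a toy illustration, not any production tokenizer: it works on characters rather than raw bytes, the five-word corpus is made up, and the helper names `merge_pair` and `train_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair.

    Characters stand in for bytes here (a simplifying assumption).
    """
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


# Made-up corpus: "berry" is frequent, "strawberry" is rare.
words = ["berry", "berry", "berry", "strawberry", "straw"]
for n in (4, 8):
    merges, corpus = train_bpe(words, n)
    print(n, "merges ->", corpus[3])
```

On this toy corpus the frequent word "berry" collapses into a single symbol first, and the rarer "strawberry" ends up as a couple of pieces. That mirrors the common-words-merge, rare-words-split behavior described above, though real vocabularies are learned from far larger data and will split differently.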

To count the R's in straw + berry, the model would have to know that the token straw contains one R and berry contains two, then add them. Nothing in next-token prediction forces it to store that fact reliably. It may have picked up some spelling from training data — which is why models are often close — but it's recalling a fuzzy association, not reading characters off the page. (How next-token prediction works is explained in How Do LLMs Actually Work?.)

Numbers make this even clearer. A long number like 1234567 doesn't split into clean digits — it splits into whatever multi-digit chunks the BPE vocabulary happens to contain. The model has no stable notion of "the digit in the hundreds place," so carrying across columns is genuinely hard for it. Same root cause, different symptom.
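
A small sketch makes the place-value mismatch concrete. It assumes a tokenizer that groups digits left to right in runs of up to three, which is how some BPE pre-tokenization rules behave; actual splits depend on the vocabulary. Both function names are hypothetical.

```python
def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into runs of up to `size`, starting from the left.

    Assumption: stands in for a tokenizer's digit-grouping rule.
    """
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_by_place_value(digits: str, size: int = 3) -> list[str]:
    """Split from the right, the way thousands separators do."""
    head = len(digits) % size
    parts = [digits[:head]] if head else []
    parts += [digits[i:i + size] for i in range(head, len(digits), size)]
    return parts


print(chunk_left_to_right("1234567"))   # ['123', '456', '7']
print(chunk_by_place_value("1234567"))  # ['1', '234', '567']
```

Under that assumption the chunks do not line up with thousands, so the same digit can sit in a different position inside its chunk depending on how long the number is.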

See it yourself in 10 lines

You don't have to take this on faith. OpenAI's tiktoken library lets you run the exact tokenizers their models use and watch strawberry fall apart. Two encodings matter (both verified current as of mid-2026): cl100k_base (used by GPT-4 and GPT-3.5-turbo) and o200k_base (used by GPT-4o and the o-series reasoning models).

strawberry_tokens.py (python)

import tiktoken

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode("strawberry")
    chunks = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} tokens -> {chunks}")

# Example output (your chunks may vary by version):
# cl100k_base: 3 tokens -> ['str', 'aw', 'berry']
# o200k_base:  3 tokens -> ['st', 'raw', 'berry']
#
# The model sees 3 chunks. It NEVER sees s-t-r-a-w-b-e-r-r-y.
# Counting R's means knowing each chunk's R count -- which it doesn't store.

Now the fix. The reliable way to make a model count letters is to not make it count letters in its head — give it a tool, exactly like a human reaching for a calculator. This is the same idea behind function calling and tool use in agents.

count_with_a_tool.py (python)

# Instead of asking the model 'how many R in strawberry?',
# let it call a tiny function. Python sees real characters; the model doesn't.

def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # -> 3, every time

# In production you'd expose this as a tool the model can invoke.
# The model decides WHEN to call it; Python does the character work.
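
Exposing the function as a tool might look roughly like the sketch below. The `TOOLS` registry, its schema layout, and `run_tool_call` are hypothetical; every provider's tool-calling API defines its own format, and the JSON string here is hand-written to stand in for what a model might emit.

```python
import json


def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())


# Hypothetical tool registry; real APIs each define their own schema format.
TOOLS = {
    "count_letter": {
        "description": "Count how many times a letter appears in a word.",
        "parameters": {"word": "string", "letter": "string"},
        "handler": count_letter,
    }
}


def run_tool_call(call_json: str) -> str:
    """Execute a tool call the model emitted as JSON and return a JSON result."""
    call = json.loads(call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    try:
        result = tool["handler"](**call.get("arguments", {}))
    except TypeError as exc:  # missing or unexpected arguments
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


# Hand-written stand-in for a model's tool call.
model_output = '{"name": "count_letter", "arguments": {"word": "strawberry", "letter": "r"}}'
print(run_tool_call(model_output))  # {"result": 3}
```

The model only has to produce the tool name and arguments; the character-level work happens in Python, and the result is fed back to the model as text.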

Do reasoning models fix it?

Mostly, yes, and the story is a great illustration of why reasoning helps. OpenAI's first reasoning model line was codenamed "Strawberry" (it shipped as o1, followed by o3), a name widely read as a nod to this exact failure. These models are trained to think step by step before answering, and a model that spells the word out one letter at a time gives itself fresh tokens to count.

// Same question, two strategies

Answer immediately

Sees [st, raw, berry]
Recalls a fuzzy 'R count'
Often guesses 2
Wrong

Reason step by step

Writes out: s-t-r-a-w-b-e-r-r-y
Each letter is now its own token
Counts the R tokens: 3
Right

You can trigger the same behavior on any model with a prompt: "Spell the word out letter by letter, then count." That's chain-of-thought prompting, and it works because forcing the letters into the output re-tokenizes them into countable pieces. As of mid-2026, the frontier reasoning models — Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Grok 4.3 — get the classic strawberry question right almost every time, because step-by-step reasoning is baked in.
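
If you use the spell-then-count prompt in a script, it helps to check the model's answer against Python's own count rather than trust it. The sketch below is one way to do that. The prompt wording, the "Answer: N" convention, and both function names are assumptions, and `sample_reply` is a hand-written stand-in for a model response, not real output.

```python
import re


def spell_then_count_prompt(word: str, letter: str) -> str:
    """Build a prompt that asks for the spelling before the count."""
    return (
        f'Spell the word "{word}" out letter by letter, separated by hyphens. '
        f'Then count how many times "{letter}" appears and end with '
        f'"Answer: <number>".'
    )


def check_reply(reply: str, word: str, letter: str) -> bool:
    """Compare the reply's final 'Answer: N' against Python's own count."""
    match = re.search(r"Answer:\s*(\d+)", reply)
    if match is None:
        return False
    return int(match.group(1)) == word.lower().count(letter.lower())


print(spell_then_count_prompt("strawberry", "r"))

# Hand-written stand-in for a model reply.
sample_reply = "s-t-r-a-w-b-e-r-r-y. The letter r appears at positions 3, 8, 9. Answer: 3"
print(check_reply(sample_reply, "strawberry", "r"))  # True
```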

Going deeper

If tokenization is the root cause, the radical fix is to get rid of the tokenizer. That's an active research direction, and it's worth knowing about because it reframes the whole problem.

Byte-level and tokenizer-free models

The Byte Latent Transformer (BLT) — from researchers at Meta, the University of Washington, and the University of Chicago — drops the fixed tokenizer entirely. Instead of learned subword chunks, it reads raw bytes and dynamically groups them into patches based on how predictable the next byte is: predictable stretches get long patches, surprising spots get short ones. Because the model sees bytes, character-level information is right there.

Approach	Smallest unit	Sees letters directly?	Strawberry-class tasks
BPE tokenizer (most LLMs)	Subword chunk	No	Weak without reasoning/tools
Byte / patch (BLT)	Byte, grouped by entropy	Yes	Strong by design
Reasoning model + BPE	Subword chunk	No (but spells it out)	Strong via step-by-step

On the CUTE benchmark, which probes character-level understanding, the byte-based model reached 99.9% spelling accuracy versus a token-based Llama 3 model that landed near the floor, a gap of nearly 100 points on that task. The same byte view also makes the model more robust to typos (pizya and pizza share almost all their bytes, where a tokenizer might map them to wildly different tokens).

// Byte Latent Transformer (simplified)

Raw bytes in: s t r a w b e r r y
  -> Local encoder -> dynamic patches (by entropy)
  -> Large global transformer (patch level)
  -> Local decoder -> bytes out
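
To get a feel for entropy-based patching, here is a toy sketch. It is not the BLT implementation: a bigram count table stands in for a learned next-byte predictor, the training string and threshold are arbitrary, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropy(training: bytes):
    """Build a bigram table and return a function giving next-byte entropy in bits."""
    following = defaultdict(Counter)
    for prev, nxt in zip(training, training[1:]):
        following[prev][nxt] += 1

    def entropy(prev: int) -> float:
        counts = following.get(prev)
        if not counts:
            return 8.0  # unseen context: treat as maximally uncertain (assumption)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    return entropy


def patch(data: bytes, entropy, threshold: float) -> list[bytes]:
    """Start a new patch whenever the next byte is hard to predict."""
    if not data:
        return []
    patches, current = [], bytearray([data[0]])
    for prev, nxt in zip(data, data[1:]):
        if entropy(prev) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(nxt)
    patches.append(bytes(current))
    return patches


entropy = next_byte_entropy(b"strawberry blueberry raspberry straw berry")
print(patch(b"strawberry", entropy, threshold=1.0))
```

Predictable stretches stay inside one patch and uncertain spots open a new one, which is the grouping idea described above in miniature. The patches always concatenate back to the original bytes, so no character information is lost.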

Why we still mostly use tokenizers

If bytes are so clean, why does nearly every production model still tokenize? Efficiency. Tokens pack several characters into one unit, so a tokenized sequence is far shorter than its byte sequence — fewer positions to process means less compute and a roomier context window. Byte and patch models claw that efficiency back with clever dynamic grouping, but the tokenizer's compression is the reason it has stuck around. There's a real trade-off between seeing every character and running cheaply at scale — which ties into the broader story of scaling laws.
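
A back-of-the-envelope sketch shows the size of that gap. It assumes roughly four characters per token, a common rule of thumb for English text rather than a measured value, and it uses the fact that standard self-attention cost grows roughly with the square of sequence length. The function name is hypothetical.

```python
def sequence_costs(text: str, chars_per_token: float = 4.0) -> dict:
    """Compare byte-level and token-level sequence lengths for a text.

    Assumption: `chars_per_token` is a rough heuristic, not a measurement.
    """
    n_bytes = len(text.encode("utf-8"))
    n_tokens = max(1, round(len(text) / chars_per_token))
    ratio = n_bytes / n_tokens
    return {
        "bytes": n_bytes,
        "tokens_estimate": n_tokens,
        "length_ratio": ratio,
        # Self-attention compares every position with every other one,
        # so cost grows roughly with the square of sequence length.
        "attention_cost_ratio": ratio ** 2,
    }


print(sequence_costs("The quick brown fox jumps over the lazy dog. " * 20))
```

Under these assumptions a byte sequence is about four times longer than its tokenized form, and naive attention over it costs on the order of sixteen times more. That is the overhead dynamic patching tries to win back.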

FAQ

Why can't ChatGPT count the R's in strawberry?

Because it never sees the individual letters. A tokenizer splits strawberry into a few subword chunks (like st + raw + berry) and feeds the model ID numbers for those chunks, not the characters inside them. Counting R's would require the model to know each chunk's R count and add them up — something next-token prediction doesn't reliably store. Newer reasoning models work around it by spelling the word out first.

Why are LLMs bad at spelling and reversing words?

Same root cause as the strawberry problem: the model operates on subword tokens, not letters. Spelling backwards, inserting hyphens between letters, finding anagrams, and counting vowels all require character-level access that the tokenizer hides. Asking the model to spell the word out letter by letter first usually fixes it, because that forces the letters into the output where they become separate tokens.

Do reasoning models like o1 and o3 solve the strawberry problem?

Largely, yes. OpenAI's reasoning line was even codenamed 'Strawberry', which many took as a nod to this problem. By thinking step by step, effectively spelling the word out before answering, these models give themselves countable letter-tokens. As of mid-2026 the frontier models get the classic question right almost every time, but it's a workaround, not a cure: the tokenizer is unchanged, so unusual or rarer words can still trip them up.

What's the most reliable way to make an LLM count letters?

Give it a tool. Let the model call a tiny function (one line of Python that does word.count(letter)) instead of counting in its head. Python sees real characters; the model doesn't. This is the same tool-use pattern that makes models reliable at arithmetic — offload the character/number work to code and let the model decide when to call it.

Why do LLMs also get arithmetic with long numbers wrong?

It's the strawberry problem wearing a math hat. Long numbers don't split into clean digits — the tokenizer breaks them into whatever multi-digit chunks its vocabulary contains, so the model has no stable sense of place value. That makes carrying across columns hard. The fix is the same: reason digit by digit, or hand the calculation to a tool.

Will tokenizer-free models fix this for good?

They might. Byte-level approaches like the Byte Latent Transformer read raw bytes and group them dynamically, so characters are visible by design — one such model hit 99.9% spelling accuracy on the CUTE benchmark where a token-based Llama 3 model scored near zero. But tokenizers stick around because they compress text and cut compute, so for now most production models still tokenize and rely on reasoning or tools to route around the limitation.
