Tokens vs Words vs Characters: How to Estimate Text Size

Get reliable rules of thumb for converting between tokens, words, and characters — and know when those rules lie.

BEGINNER · 10 MIN READ · UPDATED 2026-06-12

In plain English

An LLM doesn't read words. It reads tokens: chunks of text the model was trained to recognize. A token is often a whole common word (`the`, `dog`), but it can also be part of a word (`token` + `ization`), a single character, a space, or a punctuation mark. If you're fuzzy on what a token even is, start with "What Is a Token in an LLM?".

So when you want to know "will my 4,000-word document fit?" or "how much will this prompt cost?", you have to translate between three units: characters (what you typed), words (how you think), and tokens (what the model and the bill actually count).

Here's the everyday analogy. Think of packing a suitcase. Characters are individual socks. Words are outfits. Tokens are the little packing cubes the airline weighs. You plan in outfits, but you get charged by the cube — and an outfit doesn't always fit in one cube. Estimating tokens is just learning the rough number of cubes per outfit, plus knowing the cases where one outfit needs three cubes.

Why it matters

Tokens are the unit everything is measured in. You don't pay per word and you don't get a word limit — you pay per token and you get a token limit. Three concrete reasons to be able to estimate them on the back of an envelope:

Cost. API providers bill per million input and output tokens. If you can eyeball that a 10-page report is ~6,000 tokens, you can predict the bill before you press send. See LLM API Pricing.
Fit. Every model has a context window measured in tokens — the max it can read at once. "Does my 300-page PDF fit in 200K tokens?" is a token-estimation question, and getting it wrong means content gets silently dropped.
Speed. Output is generated one token at a time, so longer answers literally take longer. A 500-token reply streams roughly twice as long as a 250-token one.

The catch: the conversion rate is a rough average, not a law of physics. Code, numbers, non-English text, and weird formatting all break the 0.75-words rule — sometimes badly. Knowing both the rule and its exceptions is the whole skill.
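The cost and fit questions above reduce to simple arithmetic. The sketch below shows one way to do it; the prices are placeholders invented for the example (not any provider's real rates), and the function names are hypothetical.

```python
# Hypothetical prices in USD per million tokens; check your provider's price sheet.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

TOKENS_PER_WORD = 1.33  # rule of thumb for English prose


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD under the placeholder prices above."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def fits(words: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if an English document of `words` words plausibly fits the window."""
    estimated_tokens = round(words * TOKENS_PER_WORD)
    return estimated_tokens + reserved_for_output <= context_window


report_tokens = 6_000  # the 10-page report from the text
print(f"${estimate_cost(report_tokens, 500):.4f}")
print(fits(words=4_000, context_window=8_000, reserved_for_output=1_000))
```

Because the word-to-token ratio is only an average, treat a "fits" answer that lands close to the limit as a prompt to count exactly.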

How it works

Modern LLMs tokenize text with subword algorithms — usually a variant of Byte-Pair Encoding (BPE), covered in How Does Tokenization Work?. The idea: build a fixed vocabulary (tens of thousands of pieces) where common words get their own token and rarer words get split into familiar fragments. The pipeline from your text to a bill looks like this:

// From characters to a bill

Your text (characters you typed) → Tokenizer (splits into subword pieces) → Token IDs (integers the model reads) → Counted & billed (per-token, in & out)

Why ~4 characters per token for English? The average English word is about 4 to 5 letters plus a space, so roughly 5 to 6 characters. Common words map to a single token, but enough longer or rarer words get split into 2+ pieces that the average settles near 4 characters per token. Dividing 5 to 6 characters per word by 4 characters per token gives about 1.3 to 1.5 tokens per word, which is roughly where the 0.75-words rule comes from. To see the three units side by side:

// Three units, same sentence: "Tokenization is unbelievably handy."

Characters

35 characters
including spaces + period

Words

4 words
how a human counts

Tokens

~7 tokens
`Token`+`ization`, `un`+`believ`+`ably`…
rare/long words split

Notice what happened: `is` is one token, but `unbelievably` got chopped into pieces. Common short words are cheap; long or unusual words cost more tokens than you'd expect. The rule of thumb works because these average out across a paragraph, so it gets shaky on a single short string.
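To make the splitting concrete, here is a toy sketch. It is not real BPE: it uses greedy longest-match against a tiny vocabulary invented for this example, so the pieces it produces are illustrative only and a real tokenizer will split the sentence differently.

```python
# Toy vocabulary, invented for this example; real vocabularies hold
# tens of thousands of learned pieces.
VOCAB = {"Token", "ization", " is", " un", "believ", "ably", " handy", "."}


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split; unknown characters become single-char tokens."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab piece matches.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                pieces.append(text[i:end])
                i = end
                break
        else:
            pieces.append(text[i])  # fallback: emit one character
            i += 1
    return pieces


sentence = "Tokenization is unbelievably handy."
pieces = toy_tokenize(sentence)
print(pieces)
print(len(sentence), "characters,", len(sentence.split()), "words,", len(pieces), "tokens")
```

With this made-up vocabulary the short word stays whole while the two long words break into several pieces, which is the pattern the three-unit comparison describes.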

The conversion cheat sheet

Here are the numbers worth keeping in your head. They assume ordinary English prose and a modern tokenizer. Treat them as ±10–20%, not exact.

You have	Operation	To estimate
Words	× 1.33	Tokens
Tokens	× 0.75	Words
Characters (English)	÷ 4	Tokens
Tokens	× 4	Characters
A4 / Letter page (~500 words)	× 1.33 per word	~650–700 tokens per page

Some anchors that come up constantly when you're sizing prompts and documents:

1,000 tokens ≈ 750 words ≈ 1.5 single-spaced pages at ~500 words per page (about 3 pages double-spaced).
A typical chat message (a sentence or two) is 20–60 tokens.
A short blog post (~800 words) is ~1,100 tokens.
A novel (~100,000 words) is ~130,000–150,000 tokens — which is why "does a whole book fit in a 200K context window?" is usually a yes now.
This article you're reading (~1,800 words) is roughly 2,400 tokens.
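The cheat-sheet conversions are easy to wrap in a few helpers. This is a minimal sketch using the article's English-prose averages; the function names are hypothetical, and the default margin is the upper end of the ±10–20% band mentioned above.

```python
TOKENS_PER_WORD = 1.33   # English prose average
WORDS_PER_TOKEN = 0.75   # the same rule, inverted and rounded
CHARS_PER_TOKEN = 4.0    # English prose average, spaces included


def tokens_from_words(words: int) -> int:
    return round(words * TOKENS_PER_WORD)


def words_from_tokens(tokens: int) -> int:
    return round(tokens * WORDS_PER_TOKEN)


def tokens_from_chars(chars: int) -> int:
    return round(chars / CHARS_PER_TOKEN)


def with_margin(tokens: int, margin: float = 0.20) -> tuple[int, int]:
    """Low and high bounds for an estimate (default: plus or minus 20%)."""
    return round(tokens * (1 - margin)), round(tokens * (1 + margin))


for label, words in [("blog post", 800), ("this article", 1_800), ("novel", 100_000)]:
    estimate = tokens_from_words(words)
    low, high = with_margin(estimate)
    print(f"{label}: ~{estimate:,} tokens (range {low:,} to {high:,})")
```

Reporting a range rather than a single number keeps the estimate honest about how rough the underlying ratio is.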

When the 0.75-words rule lies

The 0.75-words rule is tuned for English prose. The moment your text stops looking like an English novel, the ratio shifts — sometimes a lot. Know these four landmines:

1. Code and numbers cost more

Source code is full of symbols, indentation, camelCase, snake_case, and long digit strings, all of which fragment into many tokens. A line of Python often runs about 3 characters per token instead of 4, and a UUID or a long number can cost one token for every few characters. Budget code at roughly 2.5–3.5 characters per token to be safe.

2. Other languages cost much more

Tokenizers are trained mostly on English, so non-English text is split into smaller pieces. As of mid-2026, the rough picture, checked across providers, is this: most European languages run ~1.5–2× the tokens of equivalent English, and CJK languages (Chinese, Japanese, Korean) run ~2–3×, with Chinese often landing near 1 token per 2 characters and Japanese near 1 per 3. Some low-resource languages are far worse. Same meaning, more tokens, higher bill.

3. Whitespace and markup inflate counts

JSON, HTML, Markdown tables, and deeply-indented text spend tokens on braces, tags, and runs of spaces. A pretty-printed JSON blob can have a third of its tokens going to structure rather than data.
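One way to see the structural overhead is to serialize the same data twice. The sketch below compares character counts of pretty-printed and compact JSON using only the standard library; characters are just a proxy for tokens here, and the sample record is made up for the example.

```python
import json

# Sample payload invented for illustration.
record = {
    "user": {"id": 42, "name": "Ada", "roles": ["admin", "editor"]},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

pretty = json.dumps(record, indent=4)
compact = json.dumps(record, separators=(",", ":"))

# Characters are only a proxy for tokens; run a real tokenizer for exact numbers.
saved = 1 - len(compact) / len(pretty)
print(len(pretty), "chars pretty-printed")
print(len(compact), "chars compact")
print(f"{saved:.0%} fewer characters after stripping whitespace")
```

Both strings decode to the same data, so sending the compact form to a model loses nothing except human readability.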

4. The tokenizer itself changes between models

Different model families use different tokenizers, so the same text counts differently across providers — and even across versions of one provider. As of mid-2026, Anthropic's docs note that the tokenizer introduced with Claude Opus 4.7 produces roughly 30% more tokens for the same text than earlier Claude models. So a token count you measured last year may be wrong for this year's model.
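If you only have an old measurement, a rough stopgap is to scale it by the reported increase before re-checking your budget. The sketch below applies the roughly 30% figure quoted above; the token count and window size are example values, and the helper name is hypothetical. Re-counting with the new model's tokenizer remains the reliable option.

```python
def rescale_count(measured_tokens: int, inflation: float) -> int:
    """Scale an old token count by a fractional increase (0.30 means +30%)."""
    return round(measured_tokens * (1 + inflation))


old_count = 160_000        # measured on the older tokenizer (example value)
context_window = 200_000   # example window size
new_count = rescale_count(old_count, inflation=0.30)

print(new_count, "tokens after rescaling")
print("still fits" if new_count <= context_window else "no longer fits")
```

In this example a document that used to fit comfortably no longer does, which is exactly the kind of surprise a tokenizer change can cause.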

// Same idea, very different token counts

English prose

~4 chars / token
the friendly baseline

Code / JSON

~2.5–3.5 chars / token
symbols + indentation split

Chinese / Japanese

~2–3× the tokens
of equivalent English
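The characters-per-token figures above can be turned into a range estimate by content type. This is a sketch under the article's own ratios; the function and dictionary names are hypothetical, and CJK text is left out because a multiplier on "equivalent English" cannot be derived from a character count alone.

```python
# Characters-per-token ranges taken from the figures above: (fewest, most).
CHARS_PER_TOKEN = {
    "english": (4.0, 4.0),
    "code": (2.5, 3.5),
}


def estimate_token_range(text: str, kind: str) -> tuple[int, int]:
    """Return (low, high) token estimates for text of the given kind."""
    fewest_chars, most_chars = CHARS_PER_TOKEN[kind]
    chars = len(text)
    # More characters per token means fewer tokens, so the bounds flip.
    return round(chars / most_chars), round(chars / fewest_chars)


prose = "Tokenization is unbelievably handy."
snippet = "for i in range(len(items)): total += items[i].price"

print(estimate_token_range(prose, "english"))
print(estimate_token_range(snippet, "code"))
```

On strings this short the estimate is shaky, as noted earlier; the ranges become more trustworthy on whole files or documents.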

Count exactly, don't guess

When the estimate isn't good enough, run the real tokenizer. For OpenAI-family models, the open-source tiktoken library does it locally and for free — no API call. Its modern encoding (o200k_base) matches recent GPT models; the older cl100k_base matches GPT-3.5/4-class models.

count_tokens.py (Python)

# pip install tiktoken
import tiktoken

# o200k_base = encoding used by recent OpenAI models
enc = tiktoken.get_encoding("o200k_base")

samples = {
    "english": "Tokenization is unbelievably handy.",
    "code":    "for i in range(len(items)): total += items[i].price",
    "chinese": "标记化非常方便。",
}

for name, text in samples.items():
    ids = enc.encode(text)
    chars = len(text)
    # chars-per-token reveals where the 4.0 rule holds vs breaks
    print(f"{name:8} {len(ids):3} tokens  {chars:3} chars  "
          f"{chars/len(ids):.1f} chars/token")

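# Approximate output (exact counts depend on the encoding and tiktoken version):
# english  ~7 tokens  35 chars  ~5.0 chars/token
# code    ~17 tokens  51 chars  ~3.0 chars/token  <- more tokens per character
# chinese  ~9 tokens   8 chars  ~0.9 chars/token  <- far more tokens per character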

For Claude, the tokenizer isn't public, so you count via the API. Anthropic exposes a free count_tokens endpoint that returns the input-token count for a request, including system prompt, tools, images, and PDFs, before you actually send it. The call is not billed and uses the same tokenizer as the model you name; Anthropic's docs describe the result as an estimate, so the billed count can differ by a small amount.

claude_count.py (Python)

# pip install anthropic
import anthropic

client = anthropic.Anthropic()

result = client.messages.count_tokens(
    model="claude-opus-4-8",
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "How many tokens is this?"}],
)

print(result.input_tokens)  # input-token count; free, no message created

Going deeper

Once the basic conversions are second nature, a few subtler points separate a rough guess from a reliable one.

Tokenizer vocabulary size shifts the ratio

A bigger vocabulary can represent more words as a single token, so it tends to use fewer tokens for the same text. OpenAI's o200k_base has a ~200K-piece vocabulary versus ~100K for the older cl100k_base, and on typical English it produces slightly fewer tokens. But bigger isn't free — every vocabulary entry adds to the model's embedding table, which ties into why LLMs need GPUs. Vocabulary design is a real trade-off, not a pure win.
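The embedding-table cost is straightforward multiplication: one vector per vocabulary entry. The sketch below uses the round ~100K and ~200K vocabulary sizes from the text and a hidden size of 4096, which is an assumption chosen for illustration and not a figure for any particular model.

```python
def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Parameters in an input embedding table: one vector per vocabulary entry."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 4096  # hypothetical model width

for name, vocab_size in [("~100K vocabulary", 100_000), ("~200K vocabulary", 200_000)]:
    params = embedding_parameters(vocab_size, HIDDEN_SIZE)
    print(f"{name}: {params / 1e6:.1f}M embedding parameters")
```

Doubling the vocabulary doubles this table, and models that use a separate output projection of the same shape pay the cost twice.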

Token counts aren't additive across turns

It's tempting to count each message once and sum them. But in a chat, the entire conversation history is re-sent on every turn, plus a fixed overhead for role markers and chat-template tokens like <|im_start|> (model-specific). So a 10-turn conversation costs far more than 10 single messages: the input per request grows roughly linearly with turn count, so the total input billed over the whole conversation grows roughly quadratically unless you trim history. This is the practical reason long chats get expensive and eventually hit the wall described in What Happens When You Exceed the Context Window?.
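The growth is easy to simulate. The sketch below assumes every user and assistant message has the same length and that the full history is re-sent each turn; both are simplifications, and the function name and `overhead` parameter are hypothetical.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int, overhead: int = 0) -> int:
    """Total input tokens billed across a chat when full history is re-sent each turn.

    Assumes every message (user or assistant) has the same length and that each
    request carries `overhead` extra tokens for role markers and templates.
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_message   # new user message joins the history
        total += history + overhead     # the whole history is sent as input
        history += tokens_per_message   # assistant reply joins the history
    return total


for turns in (1, 10, 20):
    print(turns, "turns:", cumulative_input_tokens(turns, tokens_per_message=100), "input tokens")
```

Doubling the number of turns roughly quadruples the total input, which is the quadratic growth in practice.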

Why character-splitting causes the "strawberry" bug

Because the model sees tokens, not letters, it can't reliably count characters inside a token — it never saw strawberry as s-t-r-… in the first place. That's the root of the famous letter-counting failures, unpacked in the strawberry problem. The same blindness explains why models fumble reversing strings or doing digit-by-digit arithmetic on long numbers.

Output tokens are the expensive, slow ones

Input is processed in parallel, but output is generated one token at a time, each depending on the last. That's why output tokens are usually priced higher than input tokens and why long answers feel slow. If latency or cost matters, the highest-leverage move is often capping max_tokens and asking for concise output — not shrinking the prompt. This connects to how the model actually picks each next token, covered in How Do LLMs Actually Work?.
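A quick comparison shows why trimming output tends to help more than trimming the prompt. Every number below is a placeholder for illustration (the decoding rate and both prices are assumptions, not measurements), and the latency estimate ignores the time spent processing the prompt.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Rough generation time at a steady decoding rate (ignores prompt processing)."""
    return output_tokens / tokens_per_second


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in USD given prices per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# All numbers below are placeholders for illustration.
RATE = 50.0                        # output tokens per second
IN_PRICE, OUT_PRICE = 3.00, 15.00  # USD per million tokens

baseline = request_cost(2_000, 1_000, IN_PRICE, OUT_PRICE)
half_prompt = request_cost(1_000, 1_000, IN_PRICE, OUT_PRICE)
half_output = request_cost(2_000, 500, IN_PRICE, OUT_PRICE)

print(f"baseline     ${baseline:.4f}  {stream_seconds(1_000, RATE):.0f}s")
print(f"half prompt  ${half_prompt:.4f}  {stream_seconds(1_000, RATE):.0f}s")
print(f"half output  ${half_output:.4f}  {stream_seconds(500, RATE):.0f}s")
```

Under these assumptions, halving the output cuts both cost and generation time, while halving the prompt trims cost a little and leaves generation time unchanged.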

FAQ

How many tokens is one word?

For ordinary English, about 1.33 tokens per word on average — equivalently, 1 token ≈ 0.75 words. Short common words are usually one token; long or rare words get split into several, so the per-word number is an average, not a fixed rate.

How many words is 1000 tokens?

Roughly 750 words of English prose, or about 1.5 single-spaced pages (around 3 pages double-spaced). For code or non-English text it will be fewer words, because those use more tokens per word.

How many characters are in a token?

About 4 characters per token for English, including spaces. Code and JSON run denser (~2.5–3.5 chars/token), and languages like Chinese or Japanese can be near 1 character per token, meaning many more tokens for the same content.

Why does the same text cost more tokens on some models than others?

Each model family uses its own tokenizer with a different vocabulary, so identical text splits differently. As of mid-2026, Anthropic notes the tokenizer introduced with Claude Opus 4.7 produces roughly 30% more tokens than earlier Claude models — so always re-count when you switch models.

How do I count tokens exactly instead of estimating?

For OpenAI-family models use the free tiktoken library locally with the o200k_base encoding. For Claude, call the free count_tokens API endpoint, which returns the input-token count for your full request (a close estimate of what you'll be billed). Gemini and others offer an equivalent countTokens call.

Do I pay for output tokens too, or just my prompt?

Both. You're billed for input tokens (your prompt, system message, and any retrieved context) and output tokens (the model's reply). Output tokens are usually priced higher and are generated one at a time, so they're the slow, expensive part.
