In plain English
Before a language model ever reads your prompt, your text is chopped into small pieces called tokens. A token is roughly a word, part of a word, or a punctuation mark — cat is one token, but tokenization might split into token + ization. The model never sees raw characters; it sees a list of token IDs (just numbers). Every API bills you by the token, and every model has a hard ceiling on how many tokens fit in one request (the context window).

tiktoken is OpenAI's fast, open-source library that turns text into exactly the same tokens an OpenAI model would see — before you send anything over the network. It runs locally on your machine in microseconds. Give it a string, it gives you back the list of token IDs; ask for the length, and you know the token count.
Think of a shipping company that charges by the box, not by weight, and refuses any order over 50 boxes. tiktoken is the calculator at your desk that tells you exactly how many boxes your shipment will fill — so you can price it and check it fits — without driving to the depot first. You count at home, for free, and only ship when you know the number.
Why it matters
Tokens are the unit of three things you cannot ignore when you build on an LLM: price, the context limit, and truncation. tiktoken lets you measure all three on your own machine before you spend a cent or get a surprise error.
- Cost estimation. APIs price per million input and output tokens. If you can count the tokens in a prompt locally, you can estimate the bill for one call — or for a million calls in a batch job — before running it. No more guessing why the invoice doubled.
- Staying inside the context window. Every model rejects a request whose input is too large. Counting first lets you check
prompt_tokens + expected_output < context_limitand trim or chunk the input before the API says no. - Avoiding truncated output. The room left for the answer is
context_limit − input_tokens. If your prompt eats most of the window, the model runs out of space mid-sentence (see max tokens and truncation). Counting the input tells you how much output budget remains.
Who needs this? Anyone shipping LLM features at scale. A RAG pipeline that stuffs retrieved chunks into a prompt has to budget tokens so the context still fits. A summarizer processing thousands of documents needs a per-document cost forecast. A chat app must trim old turns before the conversation overflows the window. In every case, counting before calling is cheaper, faster, and more reliable than catching an error after the fact.
Counting locally is also free and instant. The API has a token-counting endpoint too, but that is a network round-trip. tiktoken runs in-process, so you can call it on every request — in a loop, in a validator, in a UI that shows a live token meter — without adding latency or cost.
How it works
tiktoken implements byte pair encoding (BPE). The idea is simple: start from raw bytes, then repeatedly merge the most common adjacent pair into a single new token, following a fixed merge table that was learned once when the tokenizer was trained. Common words and word-parts collapse into one token; rare or made-up strings stay split into many small pieces. That is why ordinary English averages about 4 characters per token, while a long URL, a code identifier, or non-English text can cost far more tokens per character.
Encodings: the same library, different rule sets
tiktoken doesn't have one tokenizer — it has several, called encodings, each a different merge table. Two models can use the same encoding or different ones. You pick the encoding either by name or, more safely, by model name, and tiktoken loads the matching rules. Using the wrong encoding gives you the wrong count, so always tie the encoding to the exact model you'll call.
import tiktoken
# Best practice: let tiktoken pick the encoding from the model name,
# so your count always matches the model you'll actually call.
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Tokenizing is fun!"
token_ids = enc.encode(text) # -> a list of integers
print(token_ids) # [2438, 4954, 382, 1424, 0]
print(len(token_ids)) # 5 -> this is your token count
# Decoding is exact and reversible: IDs back to the original text.
print(enc.decode(token_ids)) # "Tokenizing is fun!"The whole API is essentially two methods: encode(text) turns a string into a list of token IDs, and decode(ids) turns the list back into the exact original string. To count tokens, you encode and take the length. There is no model call, no key, no network — it is pure local computation over the merge table.
Worked example: budgeting a prompt
Say you're building a summarizer. Each call sends a system instruction plus one document, and you've capped the answer at 500 output tokens. You want two things before calling: confirm the input fits the window, and estimate the cost. Here's the pattern.
import tiktoken
MODEL = "gpt-4o"
CONTEXT_LIMIT = 128_000 # the model's context window (tokens)
MAX_OUTPUT = 500 # tokens we reserve for the answer
# Per-million-token prices vary by model and change over time —
# read the current numbers from the provider's pricing page.
INPUT_PRICE_PER_M = 2.50 # example placeholder, USD per 1M input tokens
enc = tiktoken.encoding_for_model(MODEL)
def count(text: str) -> int:
return len(enc.encode(text))
system = "Summarize the document in three bullet points."
document = open("report.txt").read()
input_tokens = count(system) + count(document)
# 1) Does it fit, leaving room for the answer?
if input_tokens + MAX_OUTPUT > CONTEXT_LIMIT:
raise ValueError(
f"Too big: {input_tokens} input + {MAX_OUTPUT} output "
f"exceeds the {CONTEXT_LIMIT}-token window. Trim or chunk it."
)
# 2) Estimate the input cost for this one call.
est_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
print(f"{input_tokens} input tokens, ~${est_cost:.4f} for input")Two cheap local checks now prevent two expensive surprises later: a 400 error for overflowing the window, and a bill you didn't expect. Multiply est_cost by your number of documents and you have a forecast for the whole batch — all without making a single API call.
tiktoken vs. other ways to count
tiktoken is one option among several. The right choice depends on which model you're targeting and whether you need a billing-exact number or a fast estimate.
| Method | Accurate for | Speed / cost | Use when |
|---|---|---|---|
| tiktoken (local) | OpenAI GPT-family models | Instant, free, offline | Pre-flight checks and cost estimates for OpenAI models |
| Provider count-tokens endpoint | That provider's models exactly | Network round-trip per call | You need a billing-exact count, including chat overhead |
chars / 4 rule of thumb | Rough English estimate only | Instant, free | A back-of-envelope guess, never a hard limit check |
| Hugging Face tokenizer | Open models (Llama, Mistral, etc.) | Local, free | Counting for open-weight models tiktoken doesn't cover |
The trap beginners fall into is assuming one tokenizer fits all models. It does not. tiktoken gives correct counts for OpenAI models and wrong counts for everything else — it typically undercounts other model families on normal text, and by more on code or non-English input. If you're calling Claude, don't reach for tiktoken: Anthropic's API exposes a dedicated count_tokens endpoint that returns the exact token count for the model you name, and that is the number to budget against. Match the counter to the model, every time.
Common pitfalls
- Wrong encoding for the model. Counting with an encoding that doesn't match your model gives a number that's close but not exact — and "close" is what overflows the window on a long prompt. Use
encoding_for_model(name), not a hard-coded encoding name, so the right rules load automatically. - Forgetting chat-message overhead.
len(enc.encode(text))counts your text, not the role markers and separators the chat format adds around each message. The real billed input is higher. For chat requests, apply the per-message overhead formula or call the provider's count-tokens endpoint. - Counting only the input. You're billed for output tokens too, usually at a higher rate. tiktoken can't predict how long the answer will be — budget for it with
max_tokensand price both sides. - Assuming tiktoken works for non-OpenAI models. It's OpenAI-specific. For Claude use the
count_tokensendpoint; for open models use that model's own tokenizer. A tiktoken count for Claude or Llama is simply wrong. - Trusting
chars / 4as a hard limit. It's a rough guess for English prose only. Code, URLs, emoji, and non-English text use far more tokens per character, so the estimate can be badly low right when it matters most.
Going deeper
Once the basics click, a few details separate a rough counter from a reliable one.
Why token counts feel unpredictable. Because BPE merges by frequency, identical-looking text can cost wildly different token counts. A leading space is part of the token ( the and the may differ), capitalization can change the split, and a rare word fragments into many pieces. Code is especially token-hungry — indentation, braces, and long identifiers all add up — which is why a code-heavy prompt costs more tokens than the same character count of plain prose.
Special tokens. Beyond ordinary text, encodings include special control tokens (end-of-text, and the role/format markers used in chat). These are how the API frames a conversation, and they count toward your bill. tiktoken can encode or refuse them depending on flags you pass — relevant when you're trying to reproduce the exact number the API charges rather than a text-only estimate.
Local estimate vs. billed truth. Treat a local tiktoken count as an excellent estimate, not a guarantee. The provider's own count-tokens endpoint is the source of truth for what you'll be billed, because it accounts for the full request envelope. The practical pattern: use tiktoken for fast, free, in-loop checks and live UI meters, and reserve the endpoint for the cases where being off by a few percent actually matters.
Performance and portability. tiktoken's core is written in Rust, so it tokenizes large texts very fast — fast enough to run on every request without a noticeable hit. There are community ports (for example, JavaScript implementations) so you can count tokens in a browser or Node service, not just Python. The merge tables are downloaded and cached on first use, which matters if you deploy to a locked-down or offline environment.
Where to go next. Tokenization underpins everything else about working with LLMs: it sets the context window you have to budget, drives the cost math behind choosing a model, and is the first thing to check when output is mysteriously truncated. The durable lesson: tokens, not characters or words, are the real currency of LLM APIs — so count them with the right tool for the model you're calling, and count before you call.
FAQ
What is tiktoken used for?
tiktoken is OpenAI's open-source library for counting tokens in text locally, before you call the API. You use it to estimate the cost of a request, check that a prompt fits inside the model's context window, and reserve enough room for the output so the answer isn't truncated. It runs in microseconds, offline and for free.
How do I count tokens with tiktoken in Python?
Install it with pip install tiktoken, then call enc = tiktoken.encoding_for_model("gpt-4o") and len(enc.encode(your_text)). The encode method returns a list of token IDs, and its length is the token count. Always pick the encoding by model name so the count matches the model you'll actually call.
Is tiktoken accurate for Claude or Gemini?
No. tiktoken is specific to OpenAI's GPT-family models and typically undercounts other model families, especially on code or non-English text. For Claude, use Anthropic's count_tokens endpoint, which returns the exact token count for the model you name; for open models like Llama, use that model's own tokenizer.
Why is my tiktoken count lower than what the API bills?
Counting your raw text misses the chat-format overhead — the role markers and separators the API wraps around every message — which all count as tokens. To match the billed number, apply OpenAI's documented per-message overhead formula, or use the provider's count-tokens endpoint, which already includes the full request envelope.
What is BPE and how does tiktoken use it?
BPE (byte pair encoding) is the algorithm tiktoken uses to split text. It starts from raw bytes and repeatedly merges the most common adjacent pair into a single token, following a fixed merge table learned during training. Common words become one token while rare strings split into several, which is why token counts don't map cleanly to character or word counts.
Can I estimate tokens without tiktoken?
For a rough guess on English prose, dividing the character count by about four gets you close. But never use that as a hard limit check — code, URLs, emoji, and non-English text use far more tokens per character, and an underestimate is exactly what gets a request rejected for overflowing the context window.