In plain English
When you send a request to an LLM API, you are not charged by the word or the sentence. You are charged by the token — the small chunk of text the model actually reads (see what is a token). Every prompt you send has a token count, and that number controls two things you care about: how much the request costs, and whether it even fits inside the model's context window.

Counting tokens before you send means measuring that number while you are still building the request, not after the bill arrives or after the API rejects it. It is the difference between weighing your suitcase at home and finding out at the airport that it is overweight.
Think of an airline check-in. The plane has a strict weight limit, and you pay by the kilogram. A careful traveler puts the bag on a bathroom scale before leaving the house: if it is too heavy, they take something out now, calmly, instead of repacking in a panic at the desk while a queue forms behind them. Counting tokens before sending is that bathroom scale for your API requests — a quick local check that lets you fix problems before they become errors or surprise charges.
Why it matters
If you build anything real on an LLM API, a handful of problems will eventually bite you, and all of them are preventable by counting tokens up front.
- Context overflow. Every model has a maximum number of tokens it can read in one request. Go over it and the API returns an error instead of an answer (often a 400 or a request_too_large). In a chatbot, this usually happens silently as the conversation grows: turn 20 is fine, turn 21 tips over the limit, and your app crashes for that one user. A pre-send count catches it before the call.
- Surprise cost. You pay for every input token on every call. A prompt that quietly grew from 2,000 to 60,000 tokens (because you kept appending history, or pasted a huge document) costs 30 times as much, and you only notice on the invoice. Knowing the count beforehand lets you budget, warn, or trim.
- Wasted round-trips. Sending an oversized request, waiting for the rejection, then trimming and resending burns time and latency. Counting locally is far faster than learning the size from a failed API call.
- Room for the answer. The context window is shared between your input and the model's output. If your prompt fills the window to the brim, there is no room left for a reply. You need to leave headroom, and you can only do that if you know your input size.
Who needs this? Anyone running multi-turn conversations (chatbots, agents), anyone feeding large documents in, and anyone whose costs need to be predictable. If your prompt size never changes and is tiny, you can skip it. The moment prompt size varies — which is almost always — counting earns its keep.
How it works
There are two ways to get a token count before sending: ask the provider's count-tokens endpoint (exact, one network call), or estimate it locally with a tokenizer library (instant, approximate). Both feed the same decision: does this request fit my budget and the context window?
Option A: the provider's count-tokens endpoint (exact)
Most providers expose an endpoint that takes the same request body you are about to send and returns its token count — without generating any output. With the Anthropic SDK it is client.messages.count_tokens(...), which calls POST /v1/messages/count_tokens and returns an input_tokens number. The crucial detail: pass the exact same model, messages, system, and tools you will use for the real call, because the count includes all of them, not just your visible text.

    from anthropic import Anthropic

    client = Anthropic()

    messages = [
        {"role": "user", "content": "Summarize this contract..."},
    ]

    # Ask the API how big this request is; no answer is generated.
    count = client.messages.count_tokens(
        model="claude-opus-4-8",  # same model you'll call
        messages=messages,
    )
    print(count.input_tokens)  # e.g. 1843

    # Only send the real request if it fits your budget.
    if count.input_tokens < 50_000:
        reply = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=1024,
            messages=messages,
        )

Option B: a local tokenizer (fast, approximate)
If you need a count instantly and offline (say, to update a live token counter as a user types), you run a tokenizer library in your own code. OpenAI publishes tiktoken for its models; some other providers publish tokenizers of their own. The trade-off is that a local count is only exact when the library matches the exact model you are calling, and it usually cannot see the extra tokens the provider adds for message structure. Treat a local count as a good estimate, and reserve the endpoint for the moment a precise answer matters (right before a borderline send).
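As a rough illustration of the estimate-first, verify-when-close pattern, the sketch below uses a characters-per-token heuristic as the local estimator. The 4-characters-per-token ratio, the function names, and the 20% tolerance band are all assumptions chosen for illustration; a real tokenizer library for your model would take the place of estimate_tokens.

```python
import math

CHARS_PER_TOKEN = 4.0  # assumption: rough average for English prose, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Cheap local estimate; fine for a live counter, not for billing."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def needs_exact_count(estimate: int, budget: int, tolerance: float = 0.2) -> bool:
    """True when the estimate is close enough to the budget that the
    approximation error could change the send-or-trim decision."""
    return estimate >= budget * (1 - tolerance)


draft = "Summarize this contract. " * 100
approx = estimate_tokens(draft)
if needs_exact_count(approx, budget=50_000):
    print("Borderline: confirm with the provider's count-tokens endpoint")
else:
    print(f"About {approx} tokens, well under budget")
```

The tolerance band is the design choice that matters here: the less you trust the local estimator for your model, the wider the band should be, so that more borderline requests get an exact count.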
What the count actually includes
The single biggest source of confusion is assuming your token count equals the length of the text you wrote. It does not. A request is more than your message — and you pay for all of it. This is exactly why a local len(text) / 4 guess drifts from reality, and why the count-tokens endpoint (which sees the assembled request) is more trustworthy.
| Part of the request | Counted as input? | Easy to forget? |
|---|---|---|
| Your user message text | Yes | No — this is the obvious part |
| The system prompt / instructions | Yes | Yes — it is sent on every call |
| The full conversation history | Yes | Yes — it grows silently each turn |
| Tool / function definitions | Yes | Yes — schemas can be large |
| Message structure markers | Yes | Yes — invisible formatting tokens |
| The model's reply | No (that's output) | Not input, but billed separately, usually at a higher rate |
Two practical lessons fall out of this table. First, in a multi-turn conversation the history is resent on every call, so your input count climbs with each turn even if each new message is short. Second, a long system prompt or a big set of tool definitions is a fixed tax paid on every single request — worth counting once and keeping an eye on.
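To see how quickly resent history adds up, here is a small worked calculation. The token sizes are made-up round numbers and the helper name is hypothetical; real values would come from an actual count.

```python
def input_tokens_per_turn(fixed_overhead, user_tokens, reply_tokens, turns):
    """Input size of each call in a chat where every call resends the
    system prompt and tools (fixed_overhead) plus all earlier turns."""
    sizes = []
    history = 0
    for _ in range(turns):
        history += user_tokens       # the new user message joins the history
        sizes.append(fixed_overhead + history)
        history += reply_tokens      # the reply is resent on the next call
    return sizes


sizes = input_tokens_per_turn(
    fixed_overhead=1_500, user_tokens=50, reply_tokens=300, turns=5
)
print(sizes)       # [1550, 1900, 2250, 2600, 2950]
print(sum(sizes))  # 11250
```

Under these assumed sizes, five 50-token user messages add up to 11,250 billed input tokens, most of it the fixed overhead and the resent replies.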
Budgeting input against the context window
Counting is only half the job. The point of the number is to make a decision: send as-is, or trim first. The budget you compare against is not the whole context window — you must subtract room for the reply.
The rule of thumb is simple: input budget = context window − the max_tokens you'll request for the reply − a small safety margin. If your counted input exceeds that budget, trim before sending. The most common trimming strategies for a chat history are to drop the oldest turns, or to summarize them into a short note that costs far fewer tokens.
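
    # Reuses `client` and `messages` (the chat history) from the earlier example.
    MODEL = "claude-opus-4-8"
    CONTEXT_WINDOW = 200_000  # look this up for your model
    REPLY_HEADROOM = 2_000    # the max_tokens you'll ask for
    SAFETY_MARGIN = 500
    budget = CONTEXT_WINDOW - REPLY_HEADROOM - SAFETY_MARGIN

    def count(messages):
        return client.messages.count_tokens(
            model=MODEL, messages=messages,
        ).input_tokens

    # Drop the oldest turns until the request fits the budget.
    # Each pass is one count-tokens call, so mind the endpoint's rate limit.
    while count(messages) > budget and len(messages) > 1:
        messages.pop(0)  # remove the oldest message
        # Keep the history starting on a user turn, which the API expects.
        while len(messages) > 1 and messages[0]["role"] != "user":
            messages.pop(0)

    reply = client.messages.create(
        model=MODEL, max_tokens=REPLY_HEADROOM, messages=messages,
    )

Going deeper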
Once the basic pre-send check is in place, a few finer points separate a toy from a robust production setup.
Counts are estimates of cost, not the final bill. The pre-send count tells you the input tokens. Your total cost also includes the output tokens the model generates, which you cannot know until the reply arrives, and output is usually priced higher than input. For a worst-case estimate, price your input count at the input rate and your max_tokens ceiling at the output rate, then add the two. After the call, check the usage numbers the API returns on the response to learn the actual output size. The same arithmetic helps you choose a model: at a lower per-token price, a bigger prompt can still fit your cost budget.
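A minimal sketch of that worst-case arithmetic follows. The function name is hypothetical and the prices are placeholders, not any provider's real rates; substitute the published per-million-token prices for the model you call.

```python
def worst_case_cost(input_tokens, max_tokens,
                    input_price_per_mtok, output_price_per_mtok):
    """Upper bound on one request's cost: the counted input plus a reply
    that uses the full max_tokens ceiling. Prices are per million tokens."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = max_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder prices in dollars per million tokens (assumptions, not real rates).
estimate = worst_case_cost(
    input_tokens=1_843, max_tokens=1_024,
    input_price_per_mtok=3.0, output_price_per_mtok=15.0,
)
print(f"worst case: ${estimate:.4f}")
```

The real figure is normally lower, because most replies stop well short of max_tokens; the usage numbers on the response tell you by how much.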
Tokenizers can change between model versions. A newer model in the same family often shares a tokenizer with its predecessor, so counts carry over — but not always. When you migrate to a different model, do not assume the old numbers hold. Re-measure a representative sample by calling the count endpoint once with each model and comparing, then re-baseline your budgets, cost calculators, and overflow thresholds. A blanket multiplier ("add 10%") is a guess; an actual recount is the truth.
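One way to structure that recount is sketched below. The two counter functions are stand-ins so the example runs offline and are not real tokenizers; in practice each would wrap a count-tokens call with a different model ID. All names here are hypothetical.

```python
def compare_counts(samples, count_old, count_new):
    """Count every sample with both counters and report the ratio.
    Assumes each sample yields a non-zero count from count_old."""
    rows = []
    for sample in samples:
        old, new = count_old(sample), count_new(sample)
        rows.append({"old": old, "new": new, "ratio": new / old})
    return rows


# Stand-in counters (NOT real tokenizers), used only to make the sketch runnable.
def fake_old(text):
    return len(text.split())


def fake_new(text):
    return len(text.split()) + 2


samples = [
    "Summarize this contract in three bullet points.",
    "Translate the clause below into plain English.",
]
rows = compare_counts(samples, fake_old, fake_new)
worst = max(row["ratio"] for row in rows)
print(f"largest increase across samples: {worst:.2f}x")
```

Sizing thresholds from the largest measured ratio rather than the average is the more conservative choice, since overflow is triggered by the worst request, not the typical one.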
Counting interacts with prompt caching. If your provider caches a stable prefix of your prompt (a long system prompt, fixed instructions) to make repeat calls cheaper, the counted token total stays the same but the cost of those cached tokens drops sharply on subsequent calls. So a high token count is not automatically a high bill once caching is in play — read the response's usage breakdown to see how many tokens were billed fresh versus served from cache.
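The sketch below shows how a usage breakdown might be turned into an input cost. It assumes the usage data is available as a plain dict with the field names Anthropic's API reports (input_tokens, cache_creation_input_tokens, cache_read_input_tokens). The price and both multipliers are placeholder assumptions; check your provider's pricing page for the real values.

```python
def billed_input_cost(usage, base_price_per_mtok,
                      cache_write_multiplier=1.25, cache_read_multiplier=0.1):
    """Input cost for one response, weighting cached tokens differently.
    The default multipliers are illustrative assumptions, not quoted prices."""
    fresh = usage.get("input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    weighted = (fresh
                + written * cache_write_multiplier
                + read * cache_read_multiplier)
    return weighted * base_price_per_mtok / 1_000_000


# Same 10,000-token prompt, with and without a cache hit on the stable prefix.
uncached = {"input_tokens": 10_000}
cached = {"input_tokens": 200, "cache_read_input_tokens": 9_800}
print(f"uncached: ${billed_input_cost(uncached, 3.0):.5f}")
print(f"cached:   ${billed_input_cost(cached, 3.0):.5f}")
```

Both requests would report the same size from a pre-send count; only the usage breakdown on the response reveals how much of it was billed at the cheaper cached rate.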
Overflow is not the only error to plan for. Counting tokens prevents context-overflow errors, but a production client still needs to handle the rest — rate limits, transient server errors, malformed requests. Token counting is one layer of defense; pair it with proper API error handling so the request that does go out fails gracefully when something else goes wrong. The durable principle: measure the request before you send it, leave room for the answer, and never trust a local estimate from the wrong tokenizer when an exact count is one cheap call away.
FAQ
How do I count tokens before sending a request to the Claude API?
Call the count-tokens endpoint with the exact same model, messages, system, and tools you plan to send. In the Anthropic SDK it is client.messages.count_tokens(...) (Python) or client.messages.countTokens(...) (TypeScript), which calls POST /v1/messages/count_tokens and returns input_tokens. It does not generate any output, so it is fast and not billed as a normal request.
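One way to guarantee the counted request and the sent request match is to build the parameters once and reuse them for both calls. The helper below is a hypothetical sketch: it takes the client as an argument, and the system prompt text and budget are made up for illustration.

```python
def count_then_send(client, request, max_tokens, input_budget):
    """Count the assembled request, then send it only if it fits.
    `request` holds model, messages, and optionally system and tools."""
    counted = client.messages.count_tokens(**request).input_tokens
    if counted > input_budget:
        return counted, None  # too big: the caller should trim and try again
    reply = client.messages.create(max_tokens=max_tokens, **request)
    return counted, reply


# Built once, so the count and the real call see identical inputs.
request = {
    "model": "claude-opus-4-8",
    "system": "You are a careful contracts analyst.",
    "messages": [{"role": "user", "content": "Summarize this contract..."}],
}
```

With a real client this would be called as count_then_send(Anthropic(), request, max_tokens=1024, input_budget=50_000). Note that max_tokens is passed only to the real call, since the count covers input only.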
Can I use tiktoken to count tokens for Claude?
No. tiktoken is OpenAI's tokenizer and uses different rules than Claude, so it typically undercounts Claude tokens by roughly 15-20% on plain English, and by more on code or non-English text. Use a count for the same model family you will actually call. For an exact Claude count, use the provider's count-tokens endpoint instead of any local OpenAI-style estimator.
Why does my local token count not match the API's count?
A local estimate ignores the extra tokens the provider adds for message structure, the system prompt, tool definitions, and special formatting markers. Those are real input tokens you pay for but never see. The count-tokens endpoint measures the fully assembled request, so it is always closer to the truth than a raw len(text)/4 style guess.
Does calling the count-tokens endpoint cost money or use my rate limit?
The count-tokens endpoint is designed to be cheap and does not produce or bill output tokens. It does count against a request-per-minute limit, so do not call it in a tight loop on every keystroke. Cache the count for a given prompt and only recount when the prompt actually changes.
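A minimal sketch of that caching idea: key the cache on a hash of the request so the endpoint is only called when something actually changed. The function names are hypothetical, and the counter is injected as a callable; a stand-in is used here so the example runs offline.

```python
import hashlib
import json

_count_cache = {}


def cached_count(request, count_fn):
    """Return the token count for `request`, calling `count_fn` only when
    this exact request (same model, messages, etc.) has not been counted."""
    payload = json.dumps(request, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()
    if key not in _count_cache:
        _count_cache[key] = count_fn(request)
    return _count_cache[key]


# Stand-in for a real count-tokens call (an assumption for illustration).
def fake_count_fn(request):
    return sum(len(m["content"]) for m in request["messages"])


demo = {"model": "claude-opus-4-8",
        "messages": [{"role": "user", "content": "Hello"}]}
print(cached_count(demo, fake_count_fn))  # first call: counts
print(cached_count(demo, fake_count_fn))  # second call: served from cache
```

For a live typing indicator, this pairs well with a local estimate on every keystroke and a cached exact count only when the user pauses or submits.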
How do I stop my conversation from overflowing the context window?
Before each call, count the tokens for the full request (history plus the new message) and compare it to the model's context window minus your max_tokens for the reply. If the request is too big, trim the oldest turns or summarize them until it fits. This pre-check is what turns a crash into a graceful trim.
Are token counts the same across different models?
Not always. Different model families can use different tokenizers, so the same text can produce different counts. Newer Claude models in the same family often share a tokenizer, but you should still pass the specific model ID to the count endpoint rather than assume a number measured on one model applies to another.