AI/TLDR

How LLM API Pricing Works: Input, Output, and Cached Tokens

Read any provider's pricing page fluently: per-million-token rates, why output costs more, and where cached tokens fit in the bill.

BEGINNER · 12 MIN READ · UPDATED 2026-06-11

In plain English

When you call an LLM through an API — Anthropic's Claude, OpenAI's GPT, Google's Gemini — you don't pay a flat monthly fee or a price per request. You pay by the token. A token is a small chunk of text, roughly three-quarters of an English word. Every token you send the model is input, and every token the model writes back is output. The bill is just: tokens used × the rate per token.

Think of it like an old-school phone bill or a taxi meter. The meter doesn't care whether you said something brilliant or asked it to count to ten — it counts the words going each way and charges accordingly. There's no monthly line rental and no per-call fee. A one-word reply costs almost nothing; a 2,000-word essay costs more, because it's more words off the meter.

The one twist that surprises everyone: the two directions cost different amounts. Input (the text you send) is cheap. Output (the text the model generates) is several times more expensive — commonly four to five times the input rate. So a chatbot that reads a short question and writes a long answer is paying mostly for the answer, not the question.
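
To put numbers on that, here is a minimal sketch of the split for one short-question, long-answer exchange. The rates ($5 input and $25 output per million tokens) and the token counts are illustrative assumptions, not any provider's live prices.

```python
# Illustrative rates in USD per 1,000,000 tokens (assumed, not live prices).
INPUT_RATE = 5.00
OUTPUT_RATE = 25.00

def split_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one call."""
    input_cost = input_tokens / 1_000_000 * INPUT_RATE
    output_cost = output_tokens / 1_000_000 * OUTPUT_RATE
    return input_cost, output_cost

# Hypothetical exchange: a 50-token question and a 600-token answer.
in_cost, out_cost = split_cost(50, 600)
output_share = out_cost / (in_cost + out_cost)
print(f"input ${in_cost:.5f}, output ${out_cost:.5f}, output share {output_share:.0%}")
```

With these assumed numbers, nearly all of the cost sits on the output side, even though the output rate is only five times the input rate: the answer is also twelve times longer than the question.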

Why it matters

If you only ever poke at a chatbot in a browser, pricing is invisible — you pay a subscription and forget about it. The moment you build anything on the API — a support bot, a RAG app, an agent — the token meter is running on every call, and it's your card on file. A design choice that feels harmless ("let's just stuff the whole manual into every prompt") can quietly 10x your bill.

Understanding the rate card is what lets you answer the questions that actually decide whether a product is viable:

  • "What does one user request cost me?" You can't price a SaaS feature, set a free-tier limit, or forecast a launch without this number. It's just (input tokens × input rate) + (output tokens × output rate).
  • "Can I afford the big model, or do I need the cheap one?" Within one provider, the top model can cost 10–20x the smallest. Knowing the gap tells you when a smaller model is the obvious call.
  • "Why did my bill spike?" Almost always: longer prompts, more output, or more calls than you expected. Reading the rate card turns a scary invoice into arithmetic you can debug.
  • "Where do I cut cost without hurting quality?" Prompt caching, batch mode, shorter outputs, and a cheaper model for easy tasks are all rate-card-driven levers.

Per-token billing replaced the old world of fixed software licenses with something closer to a utility: you pay for exactly what you consume, scaling smoothly from a hobby project's pennies to an enterprise's six-figure monthly bill. That's powerful, but it means cost is now an engineering concern, not just a procurement one. Treating it that way is a core part of running LLMs in production.

How it works

Every API call gets metered in the same way. The provider counts the tokens flowing in, counts the tokens flowing out, multiplies each by its rate, and adds them up. Crucially, input and output are billed separately because they cost the provider different amounts of compute.

Why output costs more than input

It comes down to how the model runs. The whole input is processed in one parallel pass — the model reads it all at once. But output is generated one token at a time: to write a 500-token answer the model runs 500 sequential forward passes, each one re-reading everything so far to predict the next token. That sequential, can't-parallelize work is expensive and slow, so providers price output several times higher. This is also why streaming exists — output trickles out token by token because that's literally how it's produced.
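
A toy sketch of that asymmetry, counting passes rather than running a real model. `fake_forward_pass` is a hypothetical stand-in, and the loop ignores real-world optimizations such as key-value caching; it only illustrates that the input needs one pass while each output token needs its own.

```python
passes = 0

def fake_forward_pass(tokens: list[str]) -> str:
    """Hypothetical stand-in for a model call: counts the pass, returns a dummy token."""
    global passes
    passes += 1
    return f"tok{len(tokens)}"

def generate(prompt_tokens: list[str], max_new_tokens: int) -> list[str]:
    """Read the prompt in one pass, then produce output one token per pass."""
    if max_new_tokens <= 0:
        return []
    context = list(prompt_tokens)
    # One pass over the whole prompt yields the first output token.
    next_token = fake_forward_pass(context)
    output = [next_token]
    # Every further token needs another pass over the grown context.
    while len(output) < max_new_tokens:
        context.append(next_token)
        next_token = fake_forward_pass(context)
        output.append(next_token)
    return output

answer = generate(["why", "is", "output", "pricier", "?"], max_new_tokens=500)
print(len(answer), "output tokens took", passes, "passes")
```

The prompt length never changes the number of passes in this sketch; only the requested output length does.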

What counts as input

Beginners often picture "input" as just the user's latest question. It's not. Input is everything you send on that single API call, because LLM APIs are stateless — the model has no memory between calls, so you re-send the full context every time. On a chat request that means: the system prompt, the entire prior conversation, any documents or retrieved context, the tool/function definitions, and finally the new user message. In a long conversation, the growing history is usually the biggest line item — every turn re-bills all the turns before it.
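
A rough sketch of how that re-billing compounds. The per-message token counts and the 500-token system prompt are made-up assumptions; the point is only that cumulative input grows much faster than the conversation itself.

```python
def billed_input_per_turn(system_tokens: int, user_tokens: int,
                          reply_tokens: int, turns: int) -> list[int]:
    """Input tokens billed on each turn when the full history is re-sent."""
    billed = []
    history = 0  # tokens from all earlier user messages and replies
    for _ in range(turns):
        # This call carries the system prompt, the history, and the new message.
        billed.append(system_tokens + history + user_tokens)
        history += user_tokens + reply_tokens
    return billed

per_turn = billed_input_per_turn(system_tokens=500, user_tokens=50,
                                 reply_tokens=300, turns=10)
print("turn 1 input: ", per_turn[0])
print("turn 10 input:", per_turn[-1])
print("total input billed over 10 turns:", sum(per_turn))
```

Under these assumptions the tenth turn bills several times the input of the first, although the user typed the same 50 tokens each time.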

Cached tokens: the third rate

Because re-sending the same big prefix on every call is so common, providers added prompt caching. The first time you send a large reusable block (a system prompt, a long document), the provider stores its internal representation. On the next call, if the prefix matches, those tokens are read from cache instead of reprocessed — and cache reads are dramatically cheaper than fresh input, often around one-tenth the normal input rate. There's usually a small premium to write the cache (a bit above the base input rate), so caching pays off when the same content is reused at least a couple of times within the cache's lifetime (commonly a few minutes). For Claude specifically, the pattern is roughly a 1.25x write cost and a 0.1x read cost relative to base input; the Claude API exposes this through a cache_control field.
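
The break-even arithmetic can be sketched using the Claude-style multipliers quoted above (1.25x to write, 0.1x to read). The base rate and prefix size are illustrative assumptions, and the sketch assumes every call after the first lands within the cache's lifetime.

```python
BASE_INPUT_RATE = 5.00   # USD per 1,000,000 tokens (assumed for illustration)
WRITE_MULTIPLIER = 1.25  # first call writes the prefix to cache
READ_MULTIPLIER = 0.10   # later calls read it back

def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Total USD spent on a reused prefix across `calls` requests."""
    per_call = prefix_tokens / 1_000_000 * BASE_INPUT_RATE
    if not cached or calls == 0:
        return per_call * calls
    # One cache write, then (calls - 1) cache reads.
    return per_call * (WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1))

for calls in (1, 2, 10):
    plain = prefix_cost(4_000, calls, cached=False)
    with_cache = prefix_cost(4_000, calls, cached=True)
    print(f"{calls:>2} calls: uncached ${plain:.4f}, cached ${with_cache:.4f}")
```

With these assumed multipliers, a prefix sent only once costs slightly more with caching, and the cached path is already cheaper by the second call.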

How to read a pricing page

Every provider's pricing page is the same table wearing different clothes. Once you know the columns, you can read any of them in seconds. Here's a model rate card with realistic relative numbers — treat them as illustration, not today's live prices (always check the official page for current rates):

Model                    Input /MTok   Cached input /MTok   Output /MTok
Flagship (big, smart)    $5.00         $0.50                $25.00
Mid (the workhorse)      $3.00         $0.30                $15.00
Small (fast, cheap)      $1.00         $0.10                $5.00

Read it column by column. Input is what you pay per million tokens sent. Cached input is the discounted rate for cache hits; note it's a fraction of the input column (one-tenth in this table). Output is the expensive one, and you can see the consistent pattern: in this table output is 5x input, and the big model costs 5x the small one. Two numbers, two ratios: that's the entire shape of LLM pricing. The exact multiples vary by provider and model generation, but the same shape shows up across Claude, GPT, and Gemini.
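
As a sketch, the same table can be held as data and priced against one workload. The rates are the illustrative ones from the table above; the tier keys and the request shape are assumptions chosen for the example.

```python
# Illustrative rates from the table above, USD per 1,000,000 tokens.
RATE_CARD = {
    "flagship": {"input": 5.00, "cached_input": 0.50, "output": 25.00},
    "mid":      {"input": 3.00, "cached_input": 0.30, "output": 15.00},
    "small":    {"input": 1.00, "cached_input": 0.10, "output": 5.00},
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one uncached request on the given tier."""
    rates = RATE_CARD[tier]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

for tier, rates in RATE_CARD.items():
    ratio = rates["output"] / rates["input"]
    price = request_cost(tier, input_tokens=1_200, output_tokens=400)
    print(f"{tier:<9} output/input = {ratio:.0f}x, request = ${price:.5f}")
```

Keeping the rate card in one dictionary means a price change is a one-line edit rather than a hunt through the codebase.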

Estimate a bill in code

You don't have to guess. Every API response reports exactly how many input and output tokens it used, in a usage block. Multiply by the rates and you have the cost of that call. Here's a tiny, self-contained estimator you can adapt — no API key needed to run the math part:

estimate_cost.py (Python)
# Rates are per 1,000,000 tokens (USD). Plug in your model's real numbers.
RATES = {
    "input":        5.00,
    "cached_input": 0.50,   # cache-hit reads are much cheaper
    "output":       25.00,
}

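def cost(input_tokens, output_tokens, cached_tokens=0):
    """Return the USD cost of one call. input_tokens includes any cached tokens."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_input = input_tokens - cached_tokens  # billed at the full input rate
    dollars = (
        fresh_input   / 1_000_000 * RATES["input"]
        + cached_tokens / 1_000_000 * RATES["cached_input"]
        + output_tokens / 1_000_000 * RATES["output"]
    )
    return dollars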

# A typical chat turn: 1,200 tokens in, 400 tokens out, nothing cached
print(f"${cost(1_200, 400):.5f}")          # $0.01600 per request

# Same turn, but 1,000 of the input tokens were a cached system prompt
print(f"${cost(1_200, 400, cached_tokens=1_000):.5f}")  # $0.01150

# Now scale to 100,000 of those requests a day
print(f"${cost(1_200, 400) * 100_000:,.2f}/day")        # $1,600.00/day

Notice three things. First, one request is fractions of a cent, which is why people underestimate cost. Second, multiplied by 100,000 requests a day it's $1,600, which is why they get a shock. Third, caching that 1,000-token system prompt knocked roughly 28% off the per-request cost in this example without changing the output at all (the estimator ignores the small one-time premium for writing the cache). To get the real token counts after a call, read them straight off the response:

read_usage.py (Python)
from anthropic import Anthropic

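# With no arguments, the client reads the key from the ANTHROPIC_API_KEY
# environment variable, which avoids hardcoding a secret in source.
client = Anthropic()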

resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=400,
    messages=[{"role": "user", "content": "Explain LLM pricing in one sentence."}],
)

u = resp.usage
print("input tokens: ", u.input_tokens)
print("output tokens:", u.output_tokens)
# When prompt caching is on, you also get:
#   u.cache_creation_input_tokens  (tokens written to cache)
#   u.cache_read_input_tokens      (cheap cache hits)
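
One way to turn those usage fields into dollars is sketched below. It assumes the Anthropic convention that `input_tokens` counts only uncached input, with cache writes and cache reads reported separately, plus the 1.25x write and 0.1x read multipliers mentioned earlier. Check your provider's documentation, since some APIs instead report cached tokens as a subset of the input count. The rates and the `usage_cost` helper are illustrative, not part of any SDK.

```python
from types import SimpleNamespace

# Illustrative rates, USD per 1,000,000 tokens (not live prices).
INPUT_RATE = 5.00
OUTPUT_RATE = 25.00
CACHE_WRITE_RATE = INPUT_RATE * 1.25  # assumed write premium
CACHE_READ_RATE = INPUT_RATE * 0.10   # assumed read discount

def usage_cost(usage) -> float:
    """USD cost of one call from a usage object with Anthropic-style fields."""
    # Cache fields may be absent or None when caching is not in use.
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    total = (usage.input_tokens * INPUT_RATE
             + cache_write * CACHE_WRITE_RATE
             + cache_read * CACHE_READ_RATE
             + usage.output_tokens * OUTPUT_RATE)
    return total / 1_000_000

# Stand-in for resp.usage so the math runs without an API key.
fake_usage = SimpleNamespace(input_tokens=200, output_tokens=400,
                             cache_creation_input_tokens=0,
                             cache_read_input_tokens=1_000)
print(f"${usage_cost(fake_usage):.5f}")
```

In real code you would pass `resp.usage` from the call above instead of `fake_usage`.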

Mistakes that blow up the bill

  • Treating input and output as the same price. They're not — output is the expensive direction. Capping max_tokens and asking the model to be concise is often the single biggest cost lever.
  • Re-sending a giant system prompt uncached. A 4,000-token instruction block paid in full on every request, across millions of requests, is real money. Cache the static prefix.
  • Letting conversation history grow forever. Every turn re-bills the whole history. Trim, summarize, or window old turns instead of carrying the full transcript indefinitely.
  • Using the flagship model for trivial tasks. Classification, routing, and short extractions rarely need the top model. A model 5–10x cheaper often scores the same on the easy stuff.
  • Forgetting tool/function definitions are input. Big function-calling schemas get sent — and billed — on every request, even when no tool is used.
  • Ignoring batch mode for offline jobs. Bulk summarizing, tagging, or evals that aren't time-sensitive can run at roughly half price through the Batch API.

None of these require a model swap or a rewrite. They're all just consequences of how the meter works — and once you can read the rate card, each one is obvious.
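
As one example, the history-trimming fix might look like the sketch below: keep only the most recent messages that fit a token budget. `count_tokens` is a crude hypothetical estimate based on the rough three-quarters-of-a-word rule; in practice you would use your provider's tokenizer or token-counting endpoint, and you would usually keep the system prompt outside the trimmed window.

```python
import math

def count_tokens(text: str) -> int:
    """Crude estimate (assumption): about 0.75 words per token."""
    return math.ceil(len(text.split()) / 0.75)

def window_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the newest messages whose estimated tokens fit within `budget`."""
    kept = []
    used = 0
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(messages):
        tokens = count_tokens(message["content"])
        if used + tokens > budget:
            break
        kept.append(message)
        used += tokens
    kept.reverse()  # restore chronological order
    return kept

history = [
    {"role": "user", "content": "word " * 300},
    {"role": "assistant", "content": "word " * 300},
    {"role": "user", "content": "word " * 30},
    {"role": "assistant", "content": "word " * 60},
    {"role": "user", "content": "What did you just say?"},
]
trimmed = window_history(history, budget=200)
print(f"kept {len(trimmed)} of {len(history)} messages")
```

A plain window like this drops old context entirely; summarizing the dropped turns into one short message is a common alternative when earlier details still matter.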

Going deeper

Caching is more nuanced than a single discount. Cache lifetime matters: a 5-minute cache and a 1-hour cache have different write premiums, and a hit only lands if your prefix matches exactly from the start, byte for byte. Put your stable content (system prompt, long reference docs) at the very front of the prompt and the volatile part (the user's new message) at the end — otherwise a tiny change near the top invalidates everything after it. Some providers cache automatically; others, like Anthropic, require you to mark cache breakpoints explicitly with cache_control.
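
A toy illustration of why ordering matters: if a cache hit depends on the longest matching prefix, a change at the front leaves nothing to reuse, while a change at the end leaves almost everything. The word-level split and the `shared_prefix_len` helper are simplifications for illustration; real providers match on their own token or block boundaries.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading items that are identical in both sequences."""
    length = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        length += 1
    return length

stable = "You are a support bot. Follow the policy manual below.".split()

# Stable content first, volatile user message last: the prefix survives.
call_1 = stable + "where is my order?".split()
call_2 = stable + "cancel my subscription".split()
print("stable-first reusable items:", shared_prefix_len(call_1, call_2))

# Volatile content first (for example a timestamp): nothing is reusable.
call_3 = ["12:00:01"] + stable
call_4 = ["12:00:02"] + stable
print("volatile-first reusable items:", shared_prefix_len(call_3, call_4))
```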

Reasoning models add a hidden output line item. Models that "think" before answering generate internal reasoning tokens, and those are billed as output even though you may never see most of them. A model that thinks for 3,000 tokens and then writes a 200-token answer is charged for ~3,200 output tokens. For these models, output cost can dominate in ways the visible reply length completely hides — always check the usage block, not the rendered text.
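
The example in that paragraph, worked through as a sketch. The $25 per million output rate is an illustrative assumption; the token counts are the ones from the text.

```python
OUTPUT_RATE = 25.00  # USD per 1,000,000 output tokens (assumed for illustration)

def output_cost(visible_tokens: int, reasoning_tokens: int = 0) -> float:
    """USD cost of output when reasoning tokens are billed as output."""
    return (visible_tokens + reasoning_tokens) / 1_000_000 * OUTPUT_RATE

visible_only = output_cost(200)
with_reasoning = output_cost(200, reasoning_tokens=3_000)
print(f"visible reply alone:  ${visible_only:.4f}")
print(f"including reasoning:  ${with_reasoning:.4f}")
print(f"multiplier: {with_reasoning / visible_only:.0f}x")
```

Estimating from the rendered reply alone would understate the output cost of this call by a wide margin, which is why the usage block is the number to trust.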

Multimodal pricing has its own rules. Images are converted to a token count based on resolution; audio and video are typically priced per second or per token of a transcribed/encoded stream, often at different rates than text. A single high-res image can equal a thousand-plus tokens, so a vision-heavy app's cost profile looks nothing like a pure-text one. Check the multimodal section of the rate card separately.

The same model can cost different amounts depending on where you call it. A frontier model is often available first-party (direct from the lab) and also through cloud marketplaces like AWS Bedrock and Google Vertex AI. Rates, caching support, batch availability, and regional/data-residency surcharges can all differ between those routes — and a region-pinned endpoint may carry a premium over the default global one. If you're cost-optimizing seriously, the deployment surface is part of the rate card.

Tokenizers differ, so token counts differ. "Cheaper per token" is meaningless if that model splits the same text into more tokens. Two providers can quote different per-MTok rates while the effective cost for your actual workload is reversed, because their tokenizers chunk text differently, and a provider can even change its tokenizer between model versions. The only honest comparison is to run a representative sample of your real prompts through each model and multiply the measured token counts by each rate card. Cost optimization is a whole production discipline, not a one-time spreadsheet.
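
A sketch of that comparison with made-up numbers. The two models, their rates, and the measured token counts are all hypothetical; in practice the counts would come from the usage block of real calls on a sample of your own prompts.

```python
# Hypothetical measurements for the same sample of prompts on two models.
MODELS = {
    "model_a": {"input_rate": 3.00, "output_rate": 15.00,
                "input_tokens": 1_000_000, "output_tokens": 300_000},
    "model_b": {"input_rate": 2.50, "output_rate": 12.50,
                "input_tokens": 1_300_000, "output_tokens": 390_000},
}

def effective_cost(m: dict) -> float:
    """USD cost of the whole sample: measured tokens times the per-MTok rates."""
    return (m["input_tokens"] * m["input_rate"]
            + m["output_tokens"] * m["output_rate"]) / 1_000_000

for name, m in MODELS.items():
    print(f"{name}: ${effective_cost(m):.2f} for the sample")
```

In this invented case model_b has the lower headline rates but needs 30% more tokens for the same text, so it ends up the more expensive of the two.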

FAQ

Why does output cost more than input on LLM APIs?

Because the model reads all your input in one parallel pass, but writes output one token at a time — each output token needs its own sequential forward pass over everything generated so far. That serial work is more expensive to run, so providers price output several times higher, commonly 4–5x the input rate.

What does 'price per million tokens' actually mean?

It's the cost of processing 1,000,000 tokens (chunks of text, ~0.75 of a word each). If input is $5 per million tokens, then sending 200,000 tokens costs 200,000 ÷ 1,000,000 × $5 = $1.00. Providers quote per-million because a single request is otherwise a tiny fraction of a cent.

How do I calculate the cost of one API call?

(input tokens × input rate) + (output tokens × output rate), with rates expressed per token (the per-million price ÷ 1,000,000). Every API response returns the exact input and output token counts in a usage field, so you never have to guess after the fact.

Do cached tokens really make LLM APIs cheaper?

Yes, when you reuse the same large prefix across calls. A cache hit typically costs around one-tenth of the normal input rate. There's a small premium to write the cache, so it pays off once the same content (a system prompt or long document) is reused at least a couple of times within the cache window.

Is the cheapest model always the best choice for cost?

Not always. A cheaper model can use a different tokenizer that splits your text into more tokens, and it may need longer prompts or more retries to hit the same quality. Compare effective cost by running your real prompts through each model and multiplying the measured token counts by each rate card, not by comparing headline per-million prices.

What's the easiest way to lower my LLM API bill?

In rough order of impact: cap output length, cache static system prompts and documents, use a smaller model for easy tasks, trim conversation history instead of resending it all, and run non-urgent bulk jobs through batch mode for roughly half price.

Further reading