In plain English
Every time you call an LLM API, the model processes every token you send — even if that text was identical on the last request. A 10,000-token system prompt, a document you attached, a list of tool definitions: the model re-reads all of it from scratch, charges you for every token, and discards the internal computation the moment it returns a response.
Prompt caching fixes that. When you mark part of your prompt as cacheable, the provider stores the model's internal state — specifically the key-value (KV) tensors computed for that prefix — on their servers. The next request that starts with the same prefix skips the expensive re-computation and pulls the cached state instead. You still pay, but at a fraction of the normal rate.
The analogy: imagine you're a chef who prepares the same mise en place every morning. Prompt caching is like keeping your pre-chopped vegetables in the fridge. You pay once to chop them (a cache write). Every dish you make that day uses them without re-chopping (a cache read). If you throw them out at the end of service and re-chop tomorrow, you pay again — that's the TTL expiring.
Why it matters
The economics shift dramatically once you understand the price multipliers. On Claude Sonnet 4.6, a cached token costs $0.30 per million — compared to $3.00/M for a normal input token. That is a 10× reduction. On OpenAI GPT-4o, caching gives a 50% discount automatically.
Real-world savings example
Suppose you run a support chatbot with a 6,000-token system prompt and serve 5,000 conversations per day. Without caching, you pay for 6,000 × 5,000 = 30 million input tokens/day just for the system prompt. With caching and a high cache-hit rate, those 30M tokens cost up to 10× less. At Claude Sonnet 4.6 rates that is about $90/day uncached versus $9/day cached, so the saving approaches $2,430 per 30-day month on that prompt alone, before the small cache-write surcharge.
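The arithmetic behind that figure can be sketched in a few lines. Treat it as a rough upper bound: it assumes a 30-day month, a cache hit on every request, and no cache-write surcharge. `monthly_prefix_cost` is a hypothetical helper written for this example, not part of any SDK.

```python
def monthly_prefix_cost(prompt_tokens, requests_per_day, price_per_mtok, days=30):
    """Dollar cost of sending the same prefix on every request for `days` days."""
    tokens_per_day = prompt_tokens * requests_per_day
    return tokens_per_day / 1_000_000 * price_per_mtok * days


# Prices per million tokens quoted in the text for Claude Sonnet 4.6.
uncached = monthly_prefix_cost(6_000, 5_000, price_per_mtok=3.00)
cached = monthly_prefix_cost(6_000, 5_000, price_per_mtok=0.30)
saving = uncached - cached

print(f"uncached: ${uncached:,.2f}  cached: ${cached:,.2f}  saving: ${saving:,.2f}")
```

Lowering the assumed hit rate or adding write costs would shrink the saving, so real numbers will land somewhat below this ceiling.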
Latency also drops. Skipping prefix computation reduces time-to-first-token by up to 85% for long prompts. For an agent that processes a 50-page document on every tool call, that is the difference between a 3-second wait and a sub-second response.
- Document Q&A — attach a large PDF once, cache it, answer many questions without re-uploading
- Multi-turn chat — cache the system prompt + growing conversation history separately
- Agentic loops — cache tool schemas and memory context that stay constant across many tool calls
- Batch processing — pre-warm the cache before a burst of requests all using the same context
- Code assistants — cache a large codebase or style guide shared across all users of a session
How it works
Under the hood, every transformer layer in the model produces key-value tensors as it processes each token. Normally these tensors are ephemeral — created for one request, used to generate the response, then discarded. Prompt caching serializes and stores those KV tensors on the provider's inference servers. On the next matching request, the model picks up exactly where those tensors left off and processes only the new tokens at the end.
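A toy model may help make the prefix idea concrete. The sketch below stores no real KV tensors; it only records which token prefixes have been "computed" and reports how many tokens of a new request could be reused. The class and function names are invented for illustration, and the linear scan is chosen for readability, not speed.

```python
import hashlib


def prefix_key(tokens):
    """Stable key for a token prefix (a stand-in for a provider's cache key)."""
    joined = "\x1f".join(tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class ToyPrefixCache:
    def __init__(self):
        self._stored = set()  # keys of prefixes whose state we pretend to hold

    def store(self, tokens, prefix_len):
        """Pretend to save the computed state for the first `prefix_len` tokens."""
        self._stored.add(prefix_key(tokens[:prefix_len]))

    def tokens_to_compute(self, tokens):
        """Return (reused, fresh) token counts for a request."""
        # Try the longest prefix first, then progressively shorter ones.
        for n in range(len(tokens), 0, -1):
            if prefix_key(tokens[:n]) in self._stored:
                return n, len(tokens) - n
        return 0, len(tokens)


cache = ToyPrefixCache()
system = ["You", "are", "a", "legal", "analyst"]

first = system + ["What", "is", "clause", "4?"]
print(cache.tokens_to_compute(first))    # nothing stored yet: everything is fresh
cache.store(first, len(system))          # keep the state for the system prompt

second = system + ["Summarise", "clause", "7"]
print(cache.tokens_to_compute(second))   # system prompt reused, question is fresh

shifted = ["Today:", "Monday"] + second
print(cache.tokens_to_compute(shifted))  # prefix changed at the start: no reuse
```

The last call shows why only prefixes can be reused: once the first token differs, no stored prefix matches.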
Cache breakpoints (Anthropic)
Anthropic's API requires you to place explicit cache breakpoints using the cache_control parameter. A breakpoint is a marker on a content block telling the server: cache all tokens from the beginning of this request up to and including this block. You can place up to 4 breakpoints per request, which is useful for separating tool definitions, a static system prompt, a shared document, and the conversation history, each with its own cache lifecycle.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. [long system prompt...]",
            "cache_control": {"type": "ephemeral"},  # Breakpoint 1: cache system prompt
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "[Full text of 50-page contract...]",
                    "cache_control": {"type": "ephemeral"},  # Breakpoint 2: cache document
                },
                {
                    "type": "text",
                    "text": "What are the termination clauses?",  # Not cached: changes each turn
                },
            ],
        }
    ],
)

print(response.usage.cache_creation_input_tokens)  # tokens written to cache
print(response.usage.cache_read_input_tokens)      # tokens read from cache
print(response.usage.input_tokens)                 # tokens processed normally
Automatic caching (OpenAI)
OpenAI takes the opposite approach: no configuration required. Their infrastructure automatically detects when a request's prefix matches a cached entry and applies a 50% discount on those tokens. The cache requires a minimum prefix of 1,024 tokens and expires after roughly 5–10 minutes of inactivity. The tradeoff is that you have less control — you can't force a cache write or inspect what was cached.
Cost math and TTL choices
Prompt caching is not free — the first request with a new cacheable prefix costs more than a regular request, because the provider has to store those KV tensors. The savings only materialise on subsequent hits. Understanding the math helps you decide whether caching is worth it for a given use case.
Anthropic price multipliers
- Cache write (5-min TTL): 1.25× the standard input price — the one-time cost to populate the cache
- Cache write (1-hour TTL): 2× the standard input price — higher upfront, but the cache survives longer gaps between requests
- Cache read: 0.1× the standard input price — 10× cheaper than normal input tokens
- No-cache input: 1× — standard rate for tokens after the last breakpoint
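The multipliers above can be folded into one small cost function. This is a sketch covering input cost only (output tokens are billed separately and ignored here); `input_cost_usd` and its parameter names are made up for this example, and the 20-token user question is an arbitrary illustration.

```python
# Multipliers relative to the standard input price, as listed above.
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.0
CACHE_READ = 0.1
UNCACHED = 1.0


def input_cost_usd(base_price_per_mtok, uncached_tokens=0, read_tokens=0,
                   write_tokens=0, one_hour_ttl=False):
    """Input-side cost of one request, in dollars."""
    write_multiplier = CACHE_WRITE_1H if one_hour_ttl else CACHE_WRITE_5M
    weighted_tokens = (
        uncached_tokens * UNCACHED
        + read_tokens * CACHE_READ
        + write_tokens * write_multiplier
    )
    return weighted_tokens / 1_000_000 * base_price_per_mtok


# A 4,000-token prefix plus a 20-token question at $3.00 per million tokens.
first_request = input_cost_usd(3.00, uncached_tokens=20, write_tokens=4_000)
later_request = input_cost_usd(3.00, uncached_tokens=20, read_tokens=4_000)
print(f"first (write): ${first_request:.5f}  later (read): ${later_request:.5f}")
```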
Break-even calculation
For a 5-minute TTL, the break-even point is just 2 requests. The first request costs 1.25× (write). The second costs 0.1× (read). Combined: 1.35× for two requests — cheaper than paying 1× twice. Every additional hit beyond that is pure saving.
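The same reasoning works for any write multiplier. The sketch below finds the smallest number of requests at which caching becomes cheaper than paying the standard rate every time, assuming every request after the first is a cache hit and the TTL never lapses in between. `break_even_requests` is a hypothetical helper.

```python
def break_even_requests(write_multiplier, read_multiplier=0.1):
    """Smallest n where one write plus (n - 1) reads costs less than n uncached requests."""
    if read_multiplier >= 1.0:
        raise ValueError("reads must be cheaper than uncached input to ever break even")
    n = 1
    # Cached cost: write + read * (n - 1). Uncached cost: 1.0 * n.
    while write_multiplier + read_multiplier * (n - 1) >= n:
        n += 1
    return n


print("5-minute TTL:", break_even_requests(1.25))
print("1-hour TTL:  ", break_even_requests(2.0))
```

With the multipliers listed above, the 1-hour write needs a third request before it pays for itself (2.0 + 0.1 + 0.1 = 2.2, against 3.0 uncached).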
Example: 4,000-token system prompt, Claude Sonnet 4.6 ($3.00/MTok)
Without caching (10 requests):
10 × 4,000 tokens × $3.00/M = $0.1200
With caching (1 write + 9 reads, 5-min TTL):
Write: 4,000 × $3.75/M = $0.0150
Reads: 9 × 4,000 × $0.30/M = $0.0108
Total: $0.0258
Saving: 78% on that prefix across 10 requests
Choosing between 5-minute and 1-hour TTL
The default 5-minute TTL resets on every cache hit — so as long as users keep interacting, the cache stays alive. It is the right choice for interactive chat applications where turns arrive frequently.
The 1-hour TTL (specified by adding "ttl": "1h" to cache_control) costs 2× to write but keeps the cache warm through long gaps. Use it for batch jobs, document analysis sessions where the user reads slowly, or server-side cache pre-warming before a surge of traffic.
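One way to encode this choice is a small helper that picks the cache_control value from the gap you expect between requests. The thresholds simply mirror the two TTLs described above and ignore the difference in write cost, so treat them as a starting heuristic; both function names are hypothetical.

```python
FIVE_MINUTES = 5 * 60
ONE_HOUR = 60 * 60


def cache_control_for_gap(expected_gap_seconds):
    """Pick a cache_control dict, or None when the cache would expire before reuse."""
    if expected_gap_seconds < FIVE_MINUTES:
        return {"type": "ephemeral"}               # default 5-minute TTL
    if expected_gap_seconds < ONE_HOUR:
        return {"type": "ephemeral", "ttl": "1h"}  # 1-hour TTL, 2x write cost
    return None                                    # gap too long: caching will not help


def cached_text_block(text, expected_gap_seconds):
    """Build a text content block, adding cache_control only when it is useful."""
    block = {"type": "text", "text": text}
    control = cache_control_for_gap(expected_gap_seconds)
    if control is not None:
        block["cache_control"] = control
    return block


print(cached_text_block("[long system prompt...]", expected_gap_seconds=30))
print(cached_text_block("[long system prompt...]", expected_gap_seconds=20 * 60))
print(cached_text_block("[long system prompt...]", expected_gap_seconds=3 * 60 * 60))
```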
Minimum cacheable token counts
Caching only activates above a minimum prefix size. On Anthropic's API: 1,024 tokens for most current Claude models (Claude Sonnet 4.6, Claude Opus 4.8), and 512 tokens for Claude Fable 5. There is no point placing a cache_control marker on a 200-token system prompt — it won't cache regardless.
When to use it — and when not to
Structure prompts for cache hits
Caching only works when the prefix is byte-for-byte identical across requests. The single most important rule: put stable content at the top of your prompt and dynamic content at the bottom. If you insert a timestamp, user name, or request ID near the beginning, it breaks the cache for everything that follows.
- System prompt / instructions — almost never changes; cache first
- Shared context (documents, tool schemas, knowledge base) — changes only when updated; cache second
- Conversation history — grows each turn; cache at the end of the history with its own breakpoint
- Current user message — unique each request; never cache this
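The effect of this ordering is easy to demonstrate. The sketch below assembles two consecutive prompts and measures how much of the start they share. It compares characters, not tokens, and `build_prompt` is a hypothetical helper, so read the output as an illustration of the principle only.

```python
import os


def build_prompt(system, context, history, user_message):
    """Concatenate prompt parts in stable-first order."""
    return "\n\n".join([system, context, *history, user_message])


def shared_prefix_chars(a, b):
    """Length of the common leading substring of two prompts."""
    # os.path.commonprefix compares character by character on any strings.
    return len(os.path.commonprefix([a, b]))


system = "You are a support agent."
docs = "[shared knowledge base...]"

# Dynamic content (a timestamp) kept at the bottom, inside the user message.
good_1 = build_prompt(system, docs, [], "[10:01] Where is my order?")
good_2 = build_prompt(system, docs, [], "[10:02] Cancel my order.")

# The same timestamp placed at the very top of the system prompt.
bad_1 = build_prompt("[10:01] " + system, docs, [], "Where is my order?")
bad_2 = build_prompt("[10:02] " + system, docs, [], "Cancel my order.")

print("stable-first shares", shared_prefix_chars(good_1, good_2), "characters")
print("timestamp-first shares", shared_prefix_chars(bad_1, bad_2), "characters")
```

In the second pair the prompts diverge inside the timestamp, so nothing after it could be served from a cache.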
Good use cases
- Long system prompts (>1,024 tokens) reused across many requests or users
- Document Q&A where the same document is queried multiple times in one session
- Agentic loops that pass the same tool schemas and memory on every iteration
- Multi-turn conversations — cache the conversation history up to the last assistant turn
- API pre-warming — send a max_tokens: 0 request at startup to populate the cache before real traffic arrives
When caching won't help
- One-off requests with no repeated prefix — there is no second request to hit the cache
- Short prompts under the minimum (< 1,024 tokens for most Claude models)
- Highly dynamic prompts where every token changes per request (e.g., personalised marketing copy)
- Serverless functions with cold starts spaced far apart — the 5-minute cache will have expired
Going deeper
Prompt caching is one layer in a broader cost-optimisation stack. Once you have caching in place, consider combining it with other techniques:
- Semantic caching — cache full responses when two questions are semantically similar, not just token-identical. Complements prompt caching for FAQ-style applications.
- Batch APIs — Anthropic and OpenAI both offer asynchronous batch endpoints at 50% discount for non-latency-sensitive workloads. Stack with caching for maximum savings.
- Context window management — trimming conversation history keeps prompts within the cache prefix boundary and avoids the lookback-window miss when history grows past 20 blocks.
- KV cache internals — understanding how attention's key-value tensors work explains exactly why prefix caching is possible and why only the prefix (not arbitrary middle sections) can be cached.
Limitations to be aware of
- Cache is per-account and per-model — a cache entry created with claude-sonnet-4-6 is not reused by claude-opus-4-8, and caches are not shared between different API keys
- No cache invalidation API — you cannot force-expire an entry; you wait for the TTL
- Automatic caching not available on Bedrock / Vertex AI — on those platforms you must use explicit cache_control markers
- Not a substitute for streaming — caching reduces time-to-first-token but the model still generates output sequentially; combine with streaming for perceived responsiveness
- Prefix must be exact — even a single changed character in the cached portion causes a cache miss and a full re-write
FAQ
Does prompt caching change the model's response?
No. The cached KV tensors are mathematically identical to what the model would compute fresh — the output is the same. Caching only affects cost and speed, not quality or content.
How do I know if my cache is actually being hit?
Check the usage object in the API response. Anthropic returns cache_read_input_tokens (tokens served from cache) and cache_creation_input_tokens (tokens written to cache). If cache_read_input_tokens is greater than zero, you had a cache hit. OpenAI reports cached_tokens inside usage.prompt_tokens_details.
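A small helper can turn those counters into a hit/miss summary for logging. This sketch takes a plain dict holding the three Anthropic field names mentioned in this article, and it assumes the total input is the sum of the three (that is, input_tokens counts only the uncached remainder). `cache_summary` is a hypothetical name.

```python
def cache_summary(usage):
    """Summarise cache behaviour from a dict of Anthropic-style usage counters."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return {
        "hit": read > 0,
        "total_input_tokens": total,
        "read_fraction": read / total if total else 0.0,
    }


# Example counters: a first request (cache write) and a follow-up (cache read).
first = {"cache_creation_input_tokens": 9_000, "cache_read_input_tokens": 0, "input_tokens": 1_000}
later = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 9_000, "input_tokens": 1_000}

print(cache_summary(first))
print(cache_summary(later))
```

Logging the read fraction per request makes a silently broken prefix (for example, a timestamp that crept into the system prompt) show up as a sudden drop to zero.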
What happens when the TTL expires?
The cached KV state is discarded. The next request processes the full prefix normally and pays the standard 1.25× write cost to repopulate the cache. The 5-minute TTL resets on every successful cache hit, so active sessions rarely see expiry.
Can I cache the middle of a prompt, not just the beginning?
No — only prefixes can be cached. LLMs process tokens left-to-right; caching stores computation state at a fixed point. You can place multiple breakpoints to cache several contiguous sections from the start, but you cannot skip the beginning and cache only the middle.
Is prompt caching worth it for short prompts?
Usually not. Anthropic requires a minimum of 1,024 tokens for most models before a prefix is cached at all. Below that threshold the cache_control marker has no effect: nothing is written, and you get no savings. Focus caching efforts on system prompts, documents, or tool schemas that exceed that threshold.
Does OpenAI charge for cache writes?
No — OpenAI's automatic caching has no cache-write surcharge. You simply pay 50% of the normal rate when a cache hit occurs, and the full normal rate when there is a miss. Anthropic charges 1.25× for a 5-minute write or 2× for a 1-hour write, offset by the 10× saving on reads.
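The two pricing models can be compared side by side for a prefix reused across n requests. The sketch below works in units of each provider's own standard input price, so it compares the shape of the discount, not dollar amounts, and it assumes every request after the first is a cache hit. Both function names are hypothetical.

```python
def anthropic_relative_cost(n, write=1.25, read=0.1):
    """One cache write followed by (n - 1) cache reads, 5-minute TTL multipliers."""
    return write + read * (n - 1)


def openai_relative_cost(n, hit=0.5):
    """One full-price miss followed by (n - 1) hits at the discounted rate."""
    return 1.0 + hit * (n - 1)


print(f"{'requests':>8} {'anthropic':>10} {'openai':>8} {'uncached':>9}")
for n in (1, 2, 5, 10):
    print(f"{n:>8} {anthropic_relative_cost(n):>10.2f} {openai_relative_cost(n):>8.2f} {n:>9.2f}")
```

On a single request the write surcharge makes the explicit model slightly more expensive; from the second hit onward the deeper read discount dominates.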