AI/TLDR

What Is Prompt Caching?

Understand how prompt caching stores KV state for reusable prompt prefixes — and how to structure requests so 90% of your input tokens cost almost nothing.

INTERMEDIATE | 10 MIN READ | UPDATED 2026-06-12

In plain English

Every time you call an LLM API, the model processes every token you send, even if that text was identical on the last request. A 10,000-token system prompt, a document you attached, a list of tool definitions: the model re-reads all of it from scratch, the provider bills you for every token, and the internal computation is discarded the moment the response is returned.

Prompt caching fixes that. When you mark part of your prompt as cacheable, the provider stores the model's internal state — specifically the key-value (KV) tensors computed for that prefix — on their servers. The next request that starts with the same prefix skips the expensive re-computation and pulls the cached state instead. You still pay, but at a fraction of the normal rate.

The analogy: imagine you're a chef who prepares the same mise en place every morning. Prompt caching is like keeping your pre-chopped vegetables in the fridge. You pay once to chop them (a cache write). Every dish you make that day uses them without re-chopping (a cache read). If you throw them out at the end of service and re-chop tomorrow, you pay again — that's the TTL expiring.

Why it matters

The economics shift dramatically once you understand the price multipliers. On Claude Sonnet 4.6, cache reads cost $0.30 per million tokens, compared to $3.00/M for normal input tokens. That is a 10× reduction. On OpenAI GPT-4o, caching applies a 50% discount automatically.

Real-world savings example

Suppose you run a support chatbot with a 6,000-token system prompt and serve 5,000 conversations per day. Without caching, you pay for 6,000 × 5,000 = 30 million input tokens/day just for the system prompt, which is $90/day at Claude Sonnet 4.6's $3.00/M rate. With caching and a high cache-hit rate, those tokens are billed at roughly a tenth of that, about $9/day. The saving is roughly $81/day, or about $2,430 over a 30-day month on that prompt alone, before the small cost of occasional cache writes.
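The arithmetic behind the $2,430 figure can be checked with a short sketch. It assumes a 30-day month and a 100% cache-hit rate, and it ignores the cache-write surcharge, so the result is best read as an upper bound. The function and parameter names are illustrative only.

```python
def monthly_prefix_saving(prefix_tokens, requests_per_day, price_per_mtok,
                          read_multiplier=0.1, days=30):
    """Estimate monthly savings on a repeated prefix.

    Assumes every request is a cache hit and ignores cache-write costs.
    """
    tokens_per_day = prefix_tokens * requests_per_day
    uncached_per_day = tokens_per_day / 1_000_000 * price_per_mtok
    cached_per_day = uncached_per_day * read_multiplier
    return (uncached_per_day - cached_per_day) * days


# 6,000-token system prompt, 5,000 conversations/day, $3.00 per million tokens
saving = monthly_prefix_saving(6_000, 5_000, 3.00)
print(f"Estimated saving: ${saving:,.2f} per month")
```

Lowering the assumed hit rate or adding periodic writes would reduce the estimate, which is why a measured hit rate is worth plugging in once real traffic is available.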

Latency also drops. Skipping prefix computation reduces time-to-first-token by up to 85% for long prompts. For an agent that processes a 50-page document on every tool call, that is the difference between a 3-second wait and a sub-second response.

Document Q&A — attach a large PDF once, cache it, answer many questions without re-uploading
Multi-turn chat — cache the system prompt + growing conversation history separately
Agentic loops — cache tool schemas and memory context that stay constant across many tool calls
Batch processing — pre-warm the cache before a burst of requests all using the same context
Code assistants — cache a large codebase or style guide shared across all users of a session

How it works

Under the hood, every transformer layer in the model produces key-value tensors as it processes each token. Normally these tensors are ephemeral — created for one request, used to generate the response, then discarded. Prompt caching serializes and stores those KV tensors on the provider's inference servers. On the next matching request, the model picks up exactly where those tensors left off and processes only the new tokens at the end.

Prompt caching request lifecycle

First request: full prompt sent with cache_control marker
Cache write: KV tensors for the marked prefix stored server-side (1.25× cost)
Response generated: dynamic suffix processed normally; answer returned
Subsequent request: same prefix detected; cached KV tensors loaded
Cache read: prefix computation skipped entirely (0.1× cost)
Response generated: only the new dynamic suffix is processed; answer returned faster
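The lifecycle can be mimicked with a toy model. The sketch below is not how any provider implements caching: it stores a placeholder string instead of real KV tensors, counts "tokens" by splitting on whitespace, and never expires entries. It only illustrates the write-then-read billing pattern, using the 1.25× and 0.1× multipliers from the lifecycle. `ToyPrefixCache` is a hypothetical name.

```python
import hashlib

WRITE_MULT = 1.25  # cache write (5-minute TTL)
READ_MULT = 0.10   # cache read


class ToyPrefixCache:
    """Illustrative stand-in for a provider-side prefix cache."""

    def __init__(self):
        self._store = {}

    def request(self, prefix: str, suffix: str):
        """Return (event, billed_units); one unit = one normally priced input token."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        prefix_tokens = len(prefix.split())  # crude whitespace tokenizer (assumption)
        suffix_tokens = len(suffix.split())
        if key in self._store:
            # Same prefix seen before: bill the prefix at the read rate.
            return "cache_read", prefix_tokens * READ_MULT + suffix_tokens
        # New prefix: store a placeholder and bill the prefix at the write rate.
        self._store[key] = "kv-state-placeholder"
        return "cache_write", prefix_tokens * WRITE_MULT + suffix_tokens


cache = ToyPrefixCache()
prefix = "You are a helpful analyst . " * 100  # 600 whitespace tokens
first = cache.request(prefix, "What are the termination clauses ?")
second = cache.request(prefix, "Who are the parties ?")
print(first)
print(second)
```

The first call pays the write premium on the prefix, and the second pays only the read rate plus its own short suffix.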

Cache breakpoints (Anthropic)

Anthropic's API requires you to place explicit cache breakpoints using the cache_control parameter. A breakpoint is a marker on a content block telling the server: cache all tokens from the beginning of this request up to and including this block. You can place up to 4 breakpoints per request, which is useful for separating tool definitions, a static system prompt, a shared document, and the conversation history, so that each segment has its own cache lifecycle.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. [long system prompt...]",
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1: cache system prompt
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "[Full text of 50-page contract...]",
                    "cache_control": {"type": "ephemeral"}  # Breakpoint 2: cache document
                },
                {
                    "type": "text",
                    "text": "What are the termination clauses?"  # Not cached; changes each turn
                }
            ]
        }
    ]
)

print(response.usage.cache_creation_input_tokens)  # tokens written to cache
print(response.usage.cache_read_input_tokens)      # tokens read from cache
print(response.usage.input_tokens)                 # tokens processed normally
```

Automatic caching (OpenAI)

OpenAI takes the opposite approach: no configuration required. Its infrastructure automatically detects when a request's prefix matches a cached entry and applies a 50% discount to those tokens. The prefix must be at least 1,024 tokens long, and entries expire after roughly 5–10 minutes of inactivity. The tradeoff is less control: you can't force a cache write or choose where the cached prefix ends, and the only feedback is the cached-token count reported in the response usage.
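To see what the 50% discount means for a single request, here is a small sketch that works on a plain dictionary shaped like the usage data, with `cached_tokens` nested under `prompt_tokens_details`. The payload and the price are made-up placeholders rather than quoted rates, and the helper name is hypothetical.

```python
def effective_input_cost(usage: dict, price_per_mtok: float,
                         cached_discount: float = 0.5) -> float:
    """Input cost in dollars when cached prompt tokens are billed at a discount."""
    prompt_tokens = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    uncached = prompt_tokens - cached
    billed_tokens = uncached + cached * (1 - cached_discount)
    return billed_tokens / 1_000_000 * price_per_mtok


# Hypothetical usage payloads: one partial cache hit, one miss.
hit_usage = {"prompt_tokens": 2_048, "prompt_tokens_details": {"cached_tokens": 1_024}}
miss_usage = {"prompt_tokens": 2_048}

hit_cost = effective_input_cost(hit_usage, price_per_mtok=2.50)   # placeholder price
miss_cost = effective_input_cost(miss_usage, price_per_mtok=2.50)
print(f"hit: ${hit_cost:.6f}  miss: ${miss_cost:.6f}")
```

Because there is no write surcharge in this model, a miss simply costs the full rate and a hit can only reduce the bill.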

Cost math and TTL choices

Prompt caching is not free — the first request with a new cacheable prefix costs more than a regular request, because the provider has to store those KV tensors. The savings only materialise on subsequent hits. Understanding the math helps you decide whether caching is worth it for a given use case.

Anthropic price multipliers

Cache write (5-min TTL): 1.25× the standard input price — the one-time cost to populate the cache
Cache write (1-hour TTL): 2× the standard input price — higher upfront, but the cache survives longer gaps between requests
Cache read: 0.1× the standard input price — 10× cheaper than normal input tokens
No-cache input: 1× — standard rate for tokens after the last breakpoint
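The four multipliers above can be folded into one cost function. This is a sketch for estimation only: the `TokenCounts` fields are illustrative names for the four billing categories rather than API field names, and the base price is whatever standard input rate applies to your model.

```python
from dataclasses import dataclass

# Multipliers relative to the standard input price, taken from the list above.
MULTIPLIERS = {"write_5m": 1.25, "write_1h": 2.0, "read": 0.1, "input": 1.0}


@dataclass
class TokenCounts:
    write_5m: int = 0  # tokens written to cache with the 5-minute TTL
    write_1h: int = 0  # tokens written to cache with the 1-hour TTL
    read: int = 0      # tokens served from cache
    input: int = 0     # uncached tokens after the last breakpoint


def request_cost(counts: TokenCounts, base_price_per_mtok: float) -> float:
    """Input-side cost of one request in dollars."""
    units = sum(getattr(counts, name) * mult for name, mult in MULTIPLIERS.items())
    return units / 1_000_000 * base_price_per_mtok


first = request_cost(TokenCounts(write_5m=4_000, input=50), 3.00)  # populates the cache
later = request_cost(TokenCounts(read=4_000, input=50), 3.00)      # hits the cache
print(f"first: ${first:.5f}  later: ${later:.5f}")
```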

Break-even calculation

For a 5-minute TTL, the break-even point is just 2 requests. The first request costs 1.25× (write). The second costs 0.1× (read). Combined: 1.35× for two requests — cheaper than paying 1× twice. Every additional hit beyond that is pure saving.
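The same reasoning generalises to any write multiplier. The sketch below searches for the smallest number of requests at which caching becomes strictly cheaper, assuming one write followed only by hits (that is, every request arrives before the TTL expires). The function name is illustrative.

```python
def break_even_requests(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest n where one write plus (n - 1) reads costs less than n uncached passes.

    Cached cost:   write_mult + (n - 1) * read_mult
    Uncached cost: n
    """
    n = 1
    while write_mult + (n - 1) * read_mult >= n:
        n += 1
    return n


print(break_even_requests(1.25))  # 5-minute TTL write
print(break_even_requests(2.0))   # 1-hour TTL write
```

Under these assumptions the 5-minute write pays for itself on the second request, while the 2× write needs a third request, because 2 + 0.1 is still more than the 2 you would pay without caching.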

```text
Example: 4,000-token system prompt, Claude Sonnet 4.6 ($3.00/MTok)

Without caching (10 requests):
  10 × 4,000 tokens × $3.00/M = $0.1200

With caching (1 write + 9 reads, 5-min TTL):
  Write:  4,000 × $3.75/M     = $0.0150
  Reads:  9 × 4,000 × $0.30/M = $0.0108
  Total:  $0.0258

Saving: 78% on that prefix across 10 requests
```

Choosing between 5-minute and 1-hour TTL

The default 5-minute TTL resets on every cache hit — so as long as users keep interacting, the cache stays alive. It is the right choice for interactive chat applications where turns arrive frequently.

The 1-hour TTL (specified by adding "ttl": "1h" to cache_control) costs 2× to write but keeps the cache warm through long gaps. Use it for batch jobs, document analysis sessions where the user reads slowly, or server-side cache pre-warming before a surge of traffic.
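One way to compare the two TTLs is to replay a request pattern against each. The simulation below is a simplification: it tracks a single prefix, measures cost in units of "one uncached pass over the prefix", and assumes the 1-hour TTL is refreshed on every hit in the same way as the 5-minute one. The arrival times are invented.

```python
def simulate(arrival_minutes, ttl_minutes, write_mult, read_mult=0.1):
    """Return (writes, reads, cost) for one prefix under a refresh-on-hit TTL."""
    writes = reads = 0
    expires_at = None
    for t in arrival_minutes:
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still alive: cache hit
        else:
            writes += 1  # first request or expired entry: cache write
        expires_at = t + ttl_minutes  # a write or a hit restarts the clock
    return writes, reads, writes * write_mult + reads * read_mult


arrivals = [0, 2, 4, 20, 22, 50]      # minutes; bursts separated by quiet gaps
five = simulate(arrivals, 5, 1.25)    # 5-minute TTL
hour = simulate(arrivals, 60, 2.0)    # 1-hour TTL
print("5-min:", five)
print("1-hour:", hour)
print("uncached:", len(arrivals))
```

With gaps longer than five minutes between bursts, the short TTL keeps paying for re-writes, so the pricier 1-hour write comes out ahead for this particular pattern. Tightly spaced traffic would favour the 5-minute TTL instead.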

Minimum cacheable token counts

Caching only activates above a minimum prefix size. On Anthropic's API: 1,024 tokens for most current Claude models (Claude Sonnet 4.6, Claude Opus 4.8), and 512 tokens for Claude Fable 5. There is no point placing a cache_control marker on a 200-token system prompt — it won't cache regardless.

When to use it — and when not to

Structure prompts for cache hits

Caching only works when the prefix is byte-for-byte identical across requests. The single most important rule: put stable content at the top of your prompt and dynamic content at the bottom. If you insert a timestamp, user name, or request ID near the beginning, it breaks the cache for everything that follows.

System prompt / instructions — almost never changes; cache first
Shared context (documents, tool schemas, knowledge base) — changes only when updated; cache second
Conversation history — grows each turn; cache at the end of the history with its own breakpoint
Current user message — unique each request; never cache this
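The effect of ordering can be seen by measuring how much of two prompts is identical from the start. The sketch below compares characters rather than tokens, which is only a proxy for what a provider actually matches, and the prompt text and helper names are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a support agent for Acme. Follow the policy below.\n" * 20


def stable_first(timestamp: str, question: str) -> str:
    # Stable instructions first, per-request details last.
    return f"{SYSTEM}Question: {question}\nTime: {timestamp}"


def dynamic_first(timestamp: str, question: str) -> str:
    # Timestamp at the top: everything after it stops matching.
    return f"Time: {timestamp}\n{SYSTEM}Question: {question}"


args_a = ("2026-06-12T10:00:00", "Where is my order?")
args_b = ("2026-06-12T10:05:00", "How do I reset my password?")

good = shared_prefix_len(stable_first(*args_a), stable_first(*args_b))
bad = shared_prefix_len(dynamic_first(*args_a), dynamic_first(*args_b))
print(f"stable-first shares {good} chars; dynamic-first shares {bad}")
```

With the stable layout, the whole system prompt is reusable. With the timestamp on top, the shared prefix ends inside the timestamp itself.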

Good use cases

Long system prompts (>1,024 tokens) reused across many requests or users
Document Q&A where the same document is queried multiple times in one session
Agentic loops that pass the same tool schemas and memory on every iteration
Multi-turn conversations — cache the conversation history up to the last assistant turn
API pre-warming — send a max_tokens: 0 request at startup to populate the cache before real traffic arrives
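For the multi-turn case, the breakpoint has to move forward as the history grows. The helper below is a sketch that only builds the `messages` list; it makes no API call. It follows the content-block shape used in the Anthropic example earlier, and the choice to strip older markers (to stay under the 4-breakpoint limit) is a design assumption, as is the name `build_messages`.

```python
import copy


def build_messages(history, new_user_text):
    """Mark the end of the existing history as cacheable, then append the new turn.

    `history` is a non-empty-content list of {"role", "content"} dicts, where
    content is either a string or a list of content blocks.
    """
    messages = copy.deepcopy(history)  # never mutate the caller's history
    for msg in messages:
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        for block in msg["content"]:
            block.pop("cache_control", None)  # drop markers from earlier turns
    if messages:
        # Breakpoint on the last block of the last prior turn.
        messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    # The current user message changes every request, so it gets no marker.
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": new_user_text}]})
    return messages


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]
msgs = build_messages(history, "What is your refund policy?")
print(msgs[1]["content"][-1])
```

The returned list could then be passed as `messages` in a request like the one shown in the cache breakpoints section.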

When caching won't help

One-off requests with no repeated prefix — there is no second request to hit the cache
Short prompts under the minimum (< 1,024 tokens for most Claude models)
Highly dynamic prompts where every token changes per request (e.g., personalised marketing copy)
Serverless functions with cold starts spaced far apart — the 5-minute cache will have expired

Going deeper

Prompt caching is one layer in a broader cost-optimisation stack. Once you have caching in place, consider combining it with other techniques:

Semantic caching — cache full responses when two questions are semantically similar, not just token-identical. Complements prompt caching for FAQ-style applications.
Batch APIs — Anthropic and OpenAI both offer asynchronous batch endpoints at 50% discount for non-latency-sensitive workloads. Stack with caching for maximum savings.
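Context window management — on Anthropic's API, a cache lookup only checks about 20 content blocks back from each breakpoint, so a long history can miss an older entry unless you add an earlier breakpoint. Keep in mind that trimming or editing early turns changes the prefix and forces a fresh cache write.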
KV cache internals — understanding how attention's key-value tensors work explains exactly why prefix caching is possible and why only the prefix (not arbitrary middle sections) can be cached.

Limitations to be aware of

Cache is per-account and per-model — a cache entry created with claude-sonnet-4-6 is not reused by claude-opus-4-8, and caches are never shared between different organizations
No cache invalidation API — you cannot force-expire an entry; you wait for the TTL
Automatic caching not available on Bedrock / Vertex AI — on those platforms you must use explicit cache_control markers
Not a substitute for streaming — caching reduces time-to-first-token but the model still generates output sequentially; combine with streaming for perceived responsiveness
Prefix must be exact — even a single changed character in the cached portion causes a cache miss and a full re-write

FAQ

Does prompt caching change the model's response?

No. The cached KV tensors are mathematically identical to what the model would compute fresh — the output is the same. Caching only affects cost and speed, not quality or content.

How do I know if my cache is actually being hit?

Check the usage object in the API response. Anthropic returns cache_read_input_tokens (tokens served from cache) and cache_creation_input_tokens (tokens written to cache). If cache_read_input_tokens is greater than zero, you had a cache hit. OpenAI reports cached_tokens inside usage.prompt_tokens_details.
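A small helper can turn those Anthropic usage fields into a hit indicator and a read fraction. The sketch assumes `input_tokens` counts only the uncached tokens, as in the example earlier in this article, and uses `SimpleNamespace` as a stand-in for a real response's usage object. `cache_stats` is a hypothetical name.

```python
from types import SimpleNamespace


def cache_stats(usage) -> dict:
    """Summarise cache behaviour from an Anthropic-style usage object."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0
    total = read + written + fresh
    return {
        "hit": read > 0,
        "read_fraction": read / total if total else 0.0,
        "written": written,
    }


# Stand-ins for response.usage on a cache hit and on the first (write) request.
hit = cache_stats(SimpleNamespace(cache_read_input_tokens=9_000,
                                  cache_creation_input_tokens=0,
                                  input_tokens=1_000))
miss = cache_stats(SimpleNamespace(cache_read_input_tokens=0,
                                   cache_creation_input_tokens=9_000,
                                   input_tokens=1_000))
print(hit)
print(miss)
```

Logging the read fraction per request makes it easy to spot a prompt change that silently broke the prefix.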

What happens when the TTL expires?

The cached KV state is discarded. The next request recomputes the full prefix and pays the write cost again (1.25× for the 5-minute TTL) to repopulate the cache. The 5-minute TTL resets on every successful cache hit, so active sessions rarely see expiry.

Can I cache the middle of a prompt, not just the beginning?

No. Only prefixes can be cached. The KV tensors at each position depend on every token before it, so cached state is valid only if everything preceding it is unchanged. You can place multiple breakpoints to cache several contiguous sections from the start, but you cannot skip the beginning and cache only the middle.
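A loose analogy in code: if each position's state is derived from the previous state plus the current token, then changing an early token invalidates every state after it. The hash chain below is only an analogy for that dependency. Real KV tensors are not hashes, and the token lists are invented.

```python
import hashlib


def running_states(tokens):
    """Toy analogy: state i depends on token i and on every token before it."""
    states, prev = [], b""
    for tok in tokens:
        prev = hashlib.sha256(prev + tok.encode("utf-8")).digest()
        states.append(prev)
    return states


def reusable_positions(cached_tokens, new_tokens) -> int:
    """Count leading positions whose previously computed state is still valid."""
    n = 0
    for old, new in zip(running_states(cached_tokens), running_states(new_tokens)):
        if old != new:
            break
        n += 1
    return n


cached = ["SYSTEM", "DOCUMENT", "QUESTION_1"]
same_start = reusable_positions(cached, ["SYSTEM", "DOCUMENT", "QUESTION_2"])
changed_start = reusable_positions(cached, ["NEW_SYSTEM", "DOCUMENT", "QUESTION_1"])
print(same_start, changed_start)
```

Changing only the final element leaves the first two states reusable, while changing the first element leaves nothing reusable, even though the middle element is textually identical.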

Is prompt caching worth it for short prompts?

Usually not. Anthropic requires a minimum of 1,024 tokens for most models before a prefix is cached. Below that, a cache_control marker has no effect: the request is processed at the normal rate and nothing is written to the cache. Focus caching efforts on system prompts, documents, or tool schemas that exceed that threshold.

Does OpenAI charge for cache writes?

No — OpenAI's automatic caching has no cache-write surcharge. You simply pay 50% of the normal rate when a cache hit occurs, and the full normal rate when there is a miss. Anthropic charges 1.25× for a 5-minute write or 2× for a 1-hour write, offset by the 10× saving on reads.
