AI/TLDR

What Is Semantic Caching? Reusing Answers to Similar Questions

Understand how semantic caches match similar questions to stored answers, and the false-hit risks that come with them.

Beginner · 12 min read · Updated 2026-06-11

In plain English

Semantic caching is a way to reuse an LLM's answer when a new question means the same thing as one you've already answered — even if the wording is completely different. Instead of paying the model to think again, you hand back the stored answer in milliseconds.

Here's the everyday analogy. Imagine a help desk where the same handful of questions come in all day: "How do I reset my password?", "I forgot my password, what now?", "Can't log in — need a new password." Those are three different sentences asking one thing. A sharp support agent recognizes the intent instantly and reads back the same answer they've given fifty times today. They don't re-derive it from scratch each time. A semantic cache gives your app that same instinct: spot that two questions are really the same question, and reuse the work.

The trick is the word semantic — meaning-based, not character-based. A plain cache (the kind that runs the web) only fires on an exact match: the new request has to be byte-for-byte identical to a stored one. That's almost useless for natural language, because nobody phrases a question the same way twice. Semantic caching loosens the rule from "identical text" to "close enough in meaning," which is exactly what makes it work for human questions — and also exactly what makes it risky.
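To make the exact-match limitation concrete, here is a minimal sketch of a plain cache keyed on the literal request text. The stored answer is a made-up placeholder, and `exact_cache` and `exact_lookup` are illustrative names, not part of any library.

```python
# Toy exact-match cache: the key is the literal request text.
# The stored answer is a hypothetical placeholder.
exact_cache = {
    "How do I reset my password?": "Go to Settings > Security > Reset password.",
}

def exact_lookup(question: str):
    """Return the stored answer only when the text matches exactly, else None."""
    return exact_cache.get(question)

# Identical text: the cache fires.
print(exact_lookup("How do I reset my password?"))
# Same intent, different wording: the cache returns None.
print(exact_lookup("I forgot my password, what now?"))
```

The second lookup fails even though a human would treat both questions as the same request, which is the gap a semantic cache is meant to close.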

Why it matters

Two things hurt every production LLM app: it costs money per call, and it's slow. A round trip to a frontier model can take a couple of seconds and bills you for every token in and out (see LLM API pricing). When the same questions arrive over and over — and in most real apps they do — you're paying full price and full latency to answer something you already answered an hour ago. That's pure waste.

Semantic caching attacks both at once. A cache hit skips the model entirely: no tokens billed, no generation time. The answer comes back from a fast vector lookup instead of a slow language model. On a support bot or FAQ assistant where a chunk of traffic is repeated intent, that can mean a meaningful slice of requests served for near-zero cost and near-zero latency. It's one of the highest-leverage moves in the cost and latency optimization toolkit precisely because it cuts the bill and speeds up the app with one mechanism.
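A rough way to see the effect is to compute the average cost and latency per request for a given hit rate. The sketch below assumes every request pays for the cache lookup and only misses pay for the model call; all figures are hypothetical, not measured prices or timings.

```python
def expected_per_request(hit_rate, model_cost, model_latency_s,
                         lookup_cost=0.0, lookup_latency_s=0.0):
    """Average cost and latency per request with a cache in front of the model.

    Every request pays for the lookup (embedding plus vector search);
    only misses also pay for the model call.
    """
    miss_rate = 1.0 - hit_rate
    cost = lookup_cost + miss_rate * model_cost
    latency = lookup_latency_s + miss_rate * model_latency_s
    return cost, latency

# Hypothetical figures: $0.01 and 2.0 s per model call, 0.05 s per lookup,
# and a 30% hit rate.
cost, latency = expected_per_request(
    0.30, model_cost=0.01, model_latency_s=2.0, lookup_latency_s=0.05
)
print(f"${cost:.4f} per request, {latency:.2f} s average")
```

With a hit rate of zero the same function shows the downside: the lookup overhead is added to every request and nothing is saved.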

Who should care

  • Anyone running a high-traffic Q&A or support bot — the more repetitive your incoming questions, the bigger the win. A cache is only as valuable as your hit rate.
  • Teams watching the LLM bill climb — if you're paying to re-answer FAQs all day, a cache is the cheapest cut you'll find.
  • Latency-sensitive products — autocomplete, in-app help, anything where a two-second wait feels broken. A cached answer is effectively instant.
  • RAG and agent builders — caching common sub-questions trims both spend and the number of slow steps in a chain.

What did it replace? For most teams, nothing systematic — they either ate the cost or hand-rolled a brittle exact-match cache that almost never fired because users never type the same thing twice. Don't confuse this with the prompt caching some providers offer at the API level, which reuses the model's internal work on a repeated prefix (a long system prompt) but still runs the model. Semantic caching skips the model call altogether. They stack nicely — but they're different tools.

How it works

A semantic cache sits in front of your model. Every incoming question takes a detour through the cache first. The cache turns the question into an embedding — a list of numbers that captures its meaning — and searches a vector database of past questions for the nearest one. If the closest match is similar enough, the cache returns that stored answer and you never call the model. If nothing is close enough, you call the model as usual and save the new question-and-answer pair for next time.

"Close enough" is the heart of the whole thing, and it comes down to one number: the similarity threshold. Each question becomes a point in vector space; two questions that mean the same thing land near each other. The cache measures the distance (usually cosine similarity) between the new question and its nearest neighbor and compares it to your threshold. Above the line, it's a hit. Below, it's a miss.

| Threshold | Behavior | The risk |
| --- | --- | --- |
| Too strict (very high) | Almost nothing matches | Low hit rate: the cache barely earns its keep |
| Well-tuned | Genuine paraphrases hit; different questions miss | The sweet spot you're aiming for |
| Too loose (low) | Lots of hits, including wrong ones | False hits: a stale or off-topic answer served confidently |
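The threshold comparison can be sketched with plain Python and toy vectors. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions), and `is_hit` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_hit(new_vec, stored_vec, threshold):
    """A hit means the similarity score reaches the threshold."""
    return cosine_similarity(new_vec, stored_vec) >= threshold

# Made-up vectors standing in for embedded questions.
stored = [0.9, 0.1, 0.0]
paraphrase = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.2, 0.9]

print(is_hit(paraphrase, stored, 0.85))  # close in direction: hit
print(is_hit(unrelated, stored, 0.85))   # far apart: miss
print(is_hit(paraphrase, stored, 0.99))  # same pair, stricter threshold: miss
```

The last line shows the "too strict" row of the table: the same paraphrase stops matching once the threshold is raised far enough.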

The flow above is the read path. There's also a write path — what happens on a miss. After the model answers, you store the question's vector and the answer in the cache so the next paraphrase hits. Over time the cache fills with your real traffic and the hit rate climbs. Most caches also attach a time-to-live (TTL) so entries expire; you don't want to serve a six-month-old answer to a question about something that changed last week.
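Expiry can be sketched by stamping each entry at write time and checking its age at read time. The one-hour TTL and the helper names here are assumptions for illustration; a real vector store may offer its own expiry mechanism.

```python
import time

TTL_SECONDS = 3600  # hypothetical: entries live for one hour

def make_entry(question, answer, now=None):
    """Write path: record when the entry was stored."""
    now = time.time() if now is None else now
    return {"q": question, "a": answer, "stored_at": now}

def is_fresh(entry, now=None, ttl=TTL_SECONDS):
    """Read path: an entry is usable only while it is younger than the TTL."""
    now = time.time() if now is None else now
    return (now - entry["stored_at"]) < ttl

def drop_expired(cache, now=None, ttl=TTL_SECONDS):
    """Housekeeping: keep only the entries that are still fresh."""
    return [entry for entry in cache if is_fresh(entry, now, ttl)]
```

Passing `now` explicitly makes the expiry logic easy to test without waiting for real time to pass.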

A minimal example

You don't need a library to grasp this — the whole idea fits in a few lines. Below is a toy semantic cache: it embeds each question, compares it to everything stored, and returns the cached answer when the best match clears the threshold. In production you'd swap the brute-force loop for a real vector database, but the logic is identical.

tiny_semantic_cache.py (Python)
import numpy as np
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic(api_key="sk-...")          # placeholder
embedder = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.85                              # tune this on real traffic

cache = []  # list of {"vec": ndarray, "q": str, "a": str}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ask(question: str) -> str:
    vec = embedder.encode(question)
    # 1. Look for a close-enough past question.
    best, best_sim = None, -1.0
    for entry in cache:
        sim = cosine(vec, entry["vec"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best and best_sim >= THRESHOLD:
        print(f"[cache hit: {best_sim:.2f}]")
        return best["a"]                       # skip the model entirely
    # 2. Miss — call the model, then store the result.
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": question}],
    )
    answer = msg.content[0].text
    cache.append({"vec": vec, "q": question, "a": answer})
    return answer

print(ask("How do I reset my password?"))     # miss -> calls the model
print(ask("I forgot my login, how do I fix it?"))  # likely hit -> instant

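That's the entire concept. The first question is a miss and costs a real model call. The second is a near-paraphrase: if its similarity to the first clears 0.85, the answer comes straight back from the cache; if it scores lower, it is a miss and goes to the model as well. The actual score depends on the embedding model, so run the pair and check rather than assuming a hit. Notice the one knob that matters: THRESHOLD. Everything good and everything dangerous about semantic caching lives in that number. Set it too high and the second question misses; set it too low and a genuinely different question hits by mistake.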

The false-hit problem

Semantic caching's superpower, matching by meaning instead of exact text, is also its sharpest edge. The danger is the false hit: two questions look close in vector space but actually need different answers, and the cache serves the answer stored for the other question with full confidence. This is the failure mode that turns a money-saver into a bug factory if you're careless.

Classic traps where "similar" text means very different things:

  • Negation. "Can I cancel my order?" and "Can I not cancel my order?" sit close in embedding space but want opposite answers. Embeddings are famously weak at flipping on a single "not."
  • Specific entities. "What's the return policy for shoes?" vs "...for electronics?" — one word changes the correct answer, but the sentences are otherwise twins.
  • Numbers and dates. "Orders over $50" vs "over $500", or "my 2024 invoice" vs "2025" — tiny textual differences, completely different facts.
  • Personal or stateful questions. "What's my account balance?" must never be cached and reused across users — a false hit here is a data leak, not just a wrong answer.
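One inexpensive guard against the number and negation traps above is a text-level check that runs before a candidate hit is returned. This is a hypothetical heuristic, not something the article's cache or any library provides: it rejects a hit when the two questions contain different numbers or differ in whether they contain a negation word. It does nothing for the entity trap (shoes vs electronics) or for user-specific questions, which need scoping rather than string checks.

```python
import re

# Illustrative word list; a real deployment would tune this.
NEGATIONS = {"not", "no", "never", "cannot"}

def numbers(text):
    """All numeric tokens in the text, e.g. {'50'} for 'Orders over $50'."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def has_negation(text):
    """True if the text contains a negation word or an n't contraction."""
    words = re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))
    return any(w in NEGATIONS or w.endswith("n't") for w in words)

def safe_to_reuse(new_q, cached_q):
    """Reject a candidate hit when numbers or negation differ between questions."""
    if numbers(new_q) != numbers(cached_q):
        return False
    if has_negation(new_q) != has_negation(cached_q):
        return False
    return True

print(safe_to_reuse("Can I not cancel my order?", "Can I cancel my order?"))
print(safe_to_reuse("Orders over $500", "Orders over $50"))
print(safe_to_reuse("How can I reset my password?", "How do I reset my password?"))
```

A check like this only narrows the risk. It will also reject some legitimate paraphrases, so it trades a little hit rate for fewer wrong answers.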

How teams keep false hits in check:

  1. Tune the threshold on real data, not vibes. Collect pairs of questions, label which should share an answer, and pick the threshold that maximizes true hits while keeping false hits near zero. Lean strict by default.
  2. Scope the cache. Never cache user-specific or session-specific answers in a shared cache. Partition by user, tenant, or topic so a hit can only ever come from the right pool.
  3. Verify high-stakes hits. For anything sensitive, treat a cache hit as a candidate and have a cheap check (even a small model) confirm the stored answer still fits before returning it.
  4. Expire aggressively. Short TTLs for anything that can change — prices, policies, availability. A correct answer from last month can be a wrong answer today.
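The first step, tuning on labeled pairs, can be sketched as a small search over candidate thresholds. The sketch assumes you already have a similarity score and a human label for each pair; the scores below are invented, and `pick_threshold` is an illustrative name.

```python
def pick_threshold(scored_pairs, max_false_hit_rate=0.0):
    """Choose a threshold from labeled data.

    scored_pairs: list of (similarity, should_share_answer) tuples.
    Returns the threshold with the most true hits whose false-hit rate
    (wrong hits / all hits) stays within the limit, preferring the
    stricter threshold on ties. Returns None if no candidate qualifies.
    """
    best = None  # (true_hits, threshold)
    for threshold in sorted({sim for sim, _ in scored_pairs}):
        hits = [same for sim, same in scored_pairs if sim >= threshold]
        true_hits = sum(hits)
        false_hits = len(hits) - true_hits
        if false_hits / len(hits) > max_false_hit_rate:
            continue
        if best is None or true_hits >= best[0]:
            best = (true_hits, threshold)
    return None if best is None else best[1]

# Made-up scores: True means the two questions should share an answer.
labeled_pairs = [
    (0.95, True), (0.91, True), (0.88, True),
    (0.86, False), (0.80, True), (0.72, False),
]
print(pick_threshold(labeled_pairs))
```

On this toy data the pair scored 0.86 should not share an answer, so any threshold at or below 0.86 admits a false hit and the search settles just above it. The genuine paraphrase at 0.80 is sacrificed, which is the "lean strict" trade-off in practice.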

Going deeper

Once a basic cache is live, a set of harder questions appears — measurement, eviction, freshness, and how the cache interacts with the rest of the LLMOps stack.

Measuring whether it's even worth it

Two numbers decide a cache's value. Hit rate is the fraction of requests served from cache — too low and the embedding overhead isn't paying for itself. False-hit rate is the fraction of those hits that were actually wrong — the number that keeps you honest. You can't manage these by feel; you wire them into observability and watch them like any other production metric. A cache that quietly drifts to a 5% false-hit rate is a slow-motion outage, so this belongs in your eval suite, not just a dashboard.
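Both metrics fall out of a request log once hits have been reviewed for correctness. The log format below is an assumption for illustration: each record has a `hit` flag, and hits carry a `correct` flag from human or automated review.

```python
def cache_metrics(log):
    """Return (hit_rate, false_hit_rate) for a list of request records.

    hit_rate       = hits / all requests
    false_hit_rate = wrong hits / hits
    """
    total = len(log)
    hits = [record for record in log if record["hit"]]
    false_hits = [record for record in hits if not record["correct"]]
    hit_rate = len(hits) / total if total else 0.0
    false_hit_rate = len(false_hits) / len(hits) if hits else 0.0
    return hit_rate, false_hit_rate

# Hypothetical log: 10 requests, 4 served from cache, 1 of those wrong.
sample_log = (
    [{"hit": False}] * 6
    + [{"hit": True, "correct": True}] * 3
    + [{"hit": True, "correct": False}]
)
print(cache_metrics(sample_log))
```

Note that the false-hit rate is measured against hits, not against all traffic, so a cache with few hits can still have an alarming false-hit rate.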

Eviction and cache size

A cache can't grow forever. When it fills, something has to go — usually the least-recently-used or least-frequently-hit entries, the same eviction policies traditional caches use. A bigger cache means more potential hits but slower search and more memory; the right size is a tuning exercise driven by your traffic and your vector index. Approximate nearest-neighbor indexes (the indexing techniques behind vector search) keep lookups fast even as the cache grows into the millions of entries.
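Least-recently-used eviction can be sketched with the standard library's `OrderedDict`. In a semantic cache the key would typically be the ID of the entry the vector search returned; `LRUStore` is a hypothetical class, and a real vector database would handle eviction in its own way.

```python
from collections import OrderedDict

class LRUStore:
    """Fixed-size store that evicts the least-recently-used entry when full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        """Return the value and mark the entry as recently used."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        """Insert or update an entry, evicting the oldest ones if over capacity."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used

store = LRUStore(max_entries=2)
store.put("a", "answer A")
store.put("b", "answer B")
store.get("a")              # touching "a" makes "b" the oldest entry
store.put("c", "answer C")  # over capacity: "b" is evicted
print(store.get("b"))
```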

Choosing the embedding model

Your cache is only as good as its sense of "similar," and that comes entirely from the embedding model. A weak embedder blurs distinct questions together (more false hits) or fails to see that two paraphrases match (lower hit rate). A small, fast embedding model keeps lookups cheap but is coarser; a larger one is sharper but adds latency to every request — including misses. Picking the embedder is a real trade-off, not a default, and it's worth measuring on your own questions.

Where it sits in production

Mature setups put the semantic cache inside an LLM gateway — the proxy layer your app calls instead of the provider directly — alongside logging, rate limiting, and model routing. That keeps caching logic in one place and lets you flip it per-route: cache the FAQ endpoint hard, never cache the personalized one. It also composes with provider-side prompt caching: the gateway's semantic cache skips whole calls for repeated intent, while prompt caching discounts the calls that do go through. Production systems layer both, plus a plain exact-match cache for truly identical requests, into a tiered defense against cost and latency.
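The tiered arrangement can be sketched as a single function that tries each layer in order. Everything here is an assumption for illustration: `semantic_lookup` and `call_model` are stand-ins passed in by the caller, and the `cacheable` flag plays the role of a per-route switch.

```python
def tiered_answer(question, exact_cache, semantic_lookup, call_model,
                  cacheable=True):
    """Return (answer, tier), trying the cheapest tier first.

    exact_cache:     dict keyed on the literal question text
    semantic_lookup: callable returning a cached answer or None
    call_model:      callable that performs the real model call
    cacheable:       False for routes that must never be cached
    """
    if not cacheable:
        return call_model(question), "model"
    if question in exact_cache:
        return exact_cache[question], "exact"
    cached = semantic_lookup(question)
    if cached is not None:
        return cached, "semantic"
    answer = call_model(question)
    exact_cache[question] = answer  # identical repeats hit the exact tier next time
    return answer, "model"

# Stubs standing in for a real vector search and a real model client.
exact = {}
fake_semantic = lambda q: "cached reset steps" if "password" in q else None
fake_model = lambda q: "fresh answer"

print(tiered_answer("What are your hours?", exact, fake_semantic, fake_model))
print(tiered_answer("What are your hours?", exact, fake_semantic, fake_model))
print(tiered_answer("I lost my password", exact, fake_semantic, fake_model))
```

A real gateway would also write misses into the semantic cache and apply TTLs; those steps are left out to keep the ordering of the tiers visible.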

FAQ

What is semantic caching for LLMs?

Semantic caching stores past question-and-answer pairs and returns a stored answer when a new question means the same thing as an old one — even if it's worded differently. It matches by meaning using embeddings, not by exact text, so paraphrases of the same question can reuse one answer and skip the model call entirely.

How is semantic caching different from normal caching?

A normal cache only fires on an exact, byte-for-byte match, which almost never happens with natural-language questions because people phrase things differently every time. A semantic cache loosens the rule to "close enough in meaning," using embeddings and a similarity threshold, so it actually catches the repeated questions a normal cache would miss.

How does GPTCache work?

GPTCache is an open-source semantic cache for LLM apps. It embeds each incoming query into a vector, searches a vector store for the nearest past query, and if the match clears a similarity threshold it returns the stored response instead of calling the model. It handles the embedding, similarity search, storage, and eviction so you don't build them yourself.

What are the risks of semantic caching?

The main risk is a false hit: two questions look similar in vector space but need different answers, so the cache confidently returns the wrong one. Negation, specific entities, numbers, and user-specific questions are common traps. Caching personalized answers in a shared cache can even leak one user's data to another.

How do I set the similarity threshold for a semantic cache?

Tune it on real data, not by guessing. Collect pairs of questions, label which should share an answer, then pick the threshold that maximizes correct hits while keeping false hits near zero. When unsure, lean strict — a stingy cache that misses occasionally is far safer than a loose one that serves wrong answers.

Is semantic caching the same as prompt caching?

No. Prompt caching, offered by some providers, reuses the model's internal work on a repeated prompt prefix but still runs the model. Semantic caching skips the model call entirely when a similar question was already answered. They solve different problems and stack well together in production.

Further reading