What Is GPTCache? A Semantic Cache for LLM Calls

After reading, you'll understand what GPTCache is, how a semantic cache reuses answers to similar prompts, and how its similarity threshold trades hit rate against accuracy while cutting cost and latency.

Intermediate · 10 min read · Updated 2026-06-14

Repository: zilliztech/GPTCache

In plain English

Every time your app calls a large language model, you pay for it twice: once in money (every token is billed) and once in time (the user waits for the answer to be generated). If a hundred users ask the same question, a naive app pays both costs a hundred times over, even though the answer never changed.

GPTCache is an open-source library that sits in front of your LLM and remembers past answers. Before sending a new prompt to the model, it checks: have I already answered something close enough to this? If yes, it returns the stored answer instantly — no model call, no token bill, almost no wait. If no, it calls the model as usual and saves the result for next time.

The clever part is the phrase close enough. A plain cache only reuses an answer when the new request is byte-for-byte identical to an old one. But "What's your refund policy?" and "How do refunds work here?" are different strings, so an exact cache treats them as two separate questions and pays twice. GPTCache is a semantic cache: it matches on meaning, not exact text, so both of those questions can share one cached answer.

Why it matters

Real LLM traffic is repetitive. FAQ bots, documentation assistants, and product chat all get the same handful of questions asked thousands of times in slightly different words. Without a semantic cache, each rephrasing is a full, fresh model call. With one, the repeats collapse onto a small set of stored answers.

Cost. A cache hit costs essentially nothing — a quick embedding lookup instead of a full generation. On repetitive workloads that can remove a large slice of your token bill. See cutting LLM token costs for the bigger picture.
Latency. A model call can take seconds, especially for long answers. A cache hit returns in milliseconds, which is the difference between an answer that feels instant and one the user watches stream. This is closely tied to time to first token and reducing LLM latency.
Load and rate limits. Fewer model calls mean you stay under provider rate limits and put less pressure on any self-hosted serving stack during traffic spikes.
Provider independence. GPTCache wraps the call, so it works the same whether you're hitting a hosted API or your own model. The cache layer doesn't care who answers a miss.

Who should care? Anyone running an LLM feature with public, repeating questions and answers that don't need to be unique per call — support bots, help-center search, internal knowledge assistants. Who should not reach for it blindly? Anything where every answer must be personalized, time-sensitive, or legally fresh. Caching a generic FAQ answer is great; caching "what's my account balance?" is a bug waiting to happen.

How it works

GPTCache turns a cache lookup into a similarity search. Instead of comparing the raw text of two prompts, it converts each prompt into an embedding — a list of numbers that captures meaning — and then asks: is there a stored prompt whose embedding sits close to this one? Closeness is measured by a similarity score, and you set a threshold that decides how close is close enough to count as a hit.
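
To make "closeness" concrete, here is a minimal sketch of cosine similarity, one common similarity score. The three vectors are hand-made toy values chosen for illustration, not the output of a real embedding model (which would produce hundreds of dimensions), and the threshold is equally arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumption: invented values, not produced by a model).
refund_policy = [0.9, 0.1, 0.0]      # "What's your refund policy?"
how_refunds_work = [0.8, 0.2, 0.1]   # "How do refunds work here?"
shipping_time = [0.1, 0.2, 0.9]      # "How long does shipping take?"

THRESHOLD = 0.85  # illustrative value only

sim_rephrase = cosine_similarity(refund_policy, how_refunds_work)
sim_unrelated = cosine_similarity(refund_policy, shipping_time)

print(f"rephrasing: {sim_rephrase:.3f} hit={sim_rephrase >= THRESHOLD}")
print(f"unrelated:  {sim_unrelated:.3f} hit={sim_unrelated >= THRESHOLD}")
```

With these toy values the rephrased question clears the threshold and the shipping question does not, which is the behavior a semantic cache relies on.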

// A request through GPTCache

New prompt: user question
Embed: text → vector
Similarity search: find nearest stored prompt
Above threshold?: hit vs. miss decision
Return / generate: cached answer or call model

Hit: similar enough

If the nearest stored prompt scores above your threshold, that's a cache hit. GPTCache returns the answer it saved for that earlier prompt — no model call at all. This is the fast, free path.

Miss: nothing close enough

If nothing scores above the threshold, that's a cache miss. GPTCache forwards the prompt to the real LLM, gets a fresh answer, returns it to the user, and stores the new prompt embedding plus its answer. The next time a similar question arrives, it's a hit.

The moving parts

Under the hood GPTCache is modular — you can swap each stage — but conceptually there are four pieces working together on every request:

Component	Job
Embedding function	Turns each prompt into a vector that captures its meaning
Vector store	Holds past prompt embeddings and finds the nearest one fast
Similarity evaluator + threshold	Scores the match and decides hit vs. miss
Cache store + eviction	Keeps the prompt→answer pairs and drops old ones when full

Eviction is the last piece: a cache can't grow forever, so GPTCache applies a policy (such as least-recently-used) to drop the entries it judges least useful once the store fills up. That keeps memory bounded and clears out answers nobody asks for anymore.
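
The eviction idea can be sketched with the standard library. This is a generic least-recently-used store, not GPTCache's own implementation; the class name `LRUAnswerStore` and its methods are hypothetical.

```python
from collections import OrderedDict

class LRUAnswerStore:
    """Bounded key -> answer store that evicts the least-recently-used entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # a read counts as "recently used"
        return self._entries[key]

    def put(self, key, answer):
        self._entries[key] = answer
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._entries)

store = LRUAnswerStore(max_entries=2)
store.put("refunds", "Refunds take 5 days.")
store.put("shipping", "Shipping takes 3 days.")
store.get("refunds")                         # touch "refunds"; "shipping" is now oldest
store.put("returns", "Returns are free.")    # over capacity, so "shipping" is evicted
```

In a real semantic cache the evicted entry's embedding would also have to be removed from the vector store, otherwise a search could match a prompt whose answer no longer exists.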

The threshold tradeoff

The single most important knob in a semantic cache is the similarity threshold. It decides how loosely GPTCache is allowed to call two prompts "the same question," and it directly trades hit rate against correctness. Get it wrong in either direction and the cache hurts you.

// Tuning the similarity threshold

Threshold too loose

Matches questions that only look related
High hit rate, low cost
Returns wrong or stale answers
"Refund policy" answers a shipping question

Threshold too strict

Only near-identical prompts hit
Answers stay correct
Low hit rate, little savings
Most rephrasings still hit the model

There's no universal "right" threshold — it depends on your embedding model and how varied your questions are. The honest way to set it is empirical: collect a sample of real prompts, label which pairs should share an answer, then pick the threshold that maximizes hits without letting a wrong answer slip through. Watch two numbers as you tune: the hit rate (how often the cache answers) and the false-hit rate (how often a hit returns an answer that doesn't actually fit the new question).
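
The empirical tuning loop described above can be sketched in a few lines. The labeled sample is invented for illustration; each pair holds the similarity score of the nearest stored prompt and a human judgment of whether the two prompts should share an answer. The function names and the zero false-hit budget are assumptions.

```python
# (similarity score, should this pair share an answer?) -- invented sample data
labeled = [
    (0.97, True), (0.93, True), (0.91, True), (0.88, False),
    (0.86, True), (0.82, False), (0.79, True), (0.70, False),
]

def evaluate(threshold, pairs):
    """Return (hit_rate, false_hit_rate) at a given threshold.

    hit_rate: share of all prompts the cache would answer.
    false_hit_rate: share of those hits whose answer does not fit.
    """
    hits = [ok for score, ok in pairs if score >= threshold]
    hit_rate = len(hits) / len(pairs)
    false_hit_rate = hits.count(False) / len(hits) if hits else 0.0
    return hit_rate, false_hit_rate

def pick_threshold(pairs, candidates, max_false_hit_rate=0.0):
    """Loosest candidate whose false-hit rate stays within the budget, or None."""
    acceptable = [t for t in candidates
                  if evaluate(t, pairs)[1] <= max_false_hit_rate]
    return min(acceptable) if acceptable else None

candidates = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
for t in candidates:
    hit_rate, false_hit_rate = evaluate(t, labeled)
    print(f"threshold={t:.2f} hit_rate={hit_rate:.2f} false_hit_rate={false_hit_rate:.2f}")

best = pick_threshold(labeled, candidates)
print("chosen threshold:", best)
```

On this sample the loosest threshold with no false hits is 0.90, even though 0.85 would answer more prompts. Relaxing `max_false_hit_rate` is how you would express a willingness to accept some wrong answers in exchange for a higher hit rate.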

Where semantic caching fits among the others

"Caching" is an overloaded word in LLM land. Three very different techniques all carry the name, and they're not interchangeable — GPTCache does only the first. Knowing the difference keeps you from reaching for the wrong tool.

Technique	Matches on	Reuses	Saves
Semantic cache (GPTCache)	Meaning of the prompt	The whole final answer	An entire model call
Exact / prompt cache	Identical prompt text	The whole final answer	An entire model call (only on exact repeats)
Prefix / KV cache	A shared leading chunk of tokens	Internal compute for the shared prefix	Recomputation inside one call

Exact caching reuses an answer only when the prompt string is identical. It is simple and can never serve an answer to a different question, but it misses every rephrasing. Prefix caching (and the underlying KV cache) works inside a single model call: it reuses the model's computation for a shared opening section of the prompt, such as a long system message repeated across requests, but it still generates a fresh answer every time. See prefix caching explained and the full prompt caching vs semantic caching comparison.
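
For contrast, here is a minimal sketch of an exact cache, keyed on a hash of the literal prompt text. `fake_llm` is a stand-in for a real model call and the helper names are hypothetical. It shows why a rephrasing costs a second call.

```python
import hashlib

exact_cache = {}  # sha256(prompt) -> answer

def exact_key(prompt):
    """Key on the literal prompt text; any change in wording changes the key."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def ask_exact(prompt, call_llm):
    """Return (answer, was_hit)."""
    key = exact_key(prompt)
    if key in exact_cache:
        return exact_cache[key], True
    answer = call_llm(prompt)
    exact_cache[key] = answer
    return answer, False

model_calls = []

def fake_llm(prompt):
    """Stand-in for a real model call (assumption for this sketch)."""
    model_calls.append(prompt)
    return "Refunds are issued within 5 days."

_, hit1 = ask_exact("What's your refund policy?", fake_llm)  # first time: miss
_, hit2 = ask_exact("What's your refund policy?", fake_llm)  # identical text: hit
_, hit3 = ask_exact("How do refunds work here?", fake_llm)   # rephrased: miss again

print(hit1, hit2, hit3, "model calls:", len(model_calls))
```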

A minimal mental model in code

GPTCache itself wraps your LLM client so the cache is mostly invisible — you keep calling the model the way you already do, and hits get served transparently. But the whole idea fits in a few lines of plain Python, which makes the mechanism obvious: embed, search, compare to a threshold, then hit or miss.

the semantic-cache idea, stripped down (python)

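import numpy as np

# embed() and call_llm() are placeholders: supply your own embedding function
# (returning a unit-length NumPy vector) and your own model client call.
store = []          # list of (embedding, answer)
THRESHOLD = 0.85    # cosine similarity needed to count as a hit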

def ask(prompt):
    q = embed(prompt)                      # prompt -> normalized vector

    # 1) SEARCH: find the closest stored prompt, if any.
    best_score, best_answer = 0.0, None
    for vec, answer in store:
        score = float(q @ vec)             # cosine similarity
        if score > best_score:
            best_score, best_answer = score, answer

    # 2) DECIDE: above the threshold is a hit.
    if best_score >= THRESHOLD:
        return best_answer                 # cache hit: no model call

    # 3) MISS: call the real model, then store for next time.
    answer = call_llm(prompt)
    store.append((q, answer))
    return answer

That's the entire principle. Real GPTCache swaps the Python list for a proper vector store so search stays fast with millions of entries, adds eviction so the store doesn't grow without bound, and lets you choose the embedding model and similarity metric. But the four steps — embed, search, threshold, hit-or-store — never change.
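
To run the same four steps end to end without a model or an embedding service, the sketch below plugs in two stand-ins. `toy_embed` is a normalized bag-of-words vector, so it only captures word overlap rather than meaning, and `fake_llm` just echoes the prompt. Both, along with the `SemanticCache` class name and the 0.75 threshold, are assumptions for illustration and not GPTCache's API.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy stand-in for an embedding model: unit-length bag-of-words counts."""
    words = [w.strip("?.,!'").lower() for w in text.split()]
    counts = Counter(w for w in words if w)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())

class SemanticCache:
    def __init__(self, embed, call_llm, threshold):
        self.embed, self.call_llm, self.threshold = embed, call_llm, threshold
        self.store = []               # list of (embedding, answer)
        self.hits = self.misses = 0

    def ask(self, prompt):
        q = self.embed(prompt)                        # 1) embed
        best_score, best_answer = 0.0, None
        for vec, answer in self.store:                # 2) search
            score = dot(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            self.hits += 1                            # 3) threshold: hit
            return best_answer
        self.misses += 1                              # 4) miss: call and store
        answer = self.call_llm(prompt)
        self.store.append((q, answer))
        return answer

model_calls = []

def fake_llm(prompt):
    model_calls.append(prompt)
    return f"answer to: {prompt}"

cache = SemanticCache(toy_embed, fake_llm, threshold=0.75)
first = cache.ask("What is the refund policy?")
second = cache.ask("what is your refund policy")    # 4 of 5 words shared: hit
third = cache.ask("How long does shipping take?")   # no overlap: miss

print(cache.hits, cache.misses, len(model_calls))
```

Because the embedding and model functions are passed in, swapping the toy versions for real ones does not change the `ask` logic, which mirrors the modular design described earlier.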

Going deeper

Once the basic cache works, the interesting problems are about when not to trust it and how to keep it honest. A few directions worth knowing.

Freshness and invalidation. A cache is a copy, and copies go stale. If the underlying answer changes — a policy update, a new price, a corrected fact — the cached version is now wrong. Plan an invalidation strategy from the start: time-to-live (entries expire after N hours), versioned cache keys you bump when source documents change, or manual purges tied to your content pipeline. Caching is wonderful for evergreen answers and dangerous for volatile ones.
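
Two of the invalidation strategies above, time-to-live and a version you bump when source content changes, can be sketched together. The class `ExpiringCache` and its method names are hypothetical, and the injectable clock exists only so the example can simulate time passing.

```python
import time

class ExpiringCache:
    """Answer store whose entries expire after ttl_seconds or after a version bump."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.version = 1
        self._entries = {}  # key -> (answer, stored_at, version_when_stored)

    def put(self, key, answer):
        self._entries[key] = (answer, self.clock(), self.version)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        answer, stored_at, version = entry
        expired = self.clock() - stored_at > self.ttl_seconds
        if expired or version != self.version:
            del self._entries[key]  # stale: treat it as a miss
            return None
        return answer

    def bump_version(self):
        """Call when source documents change; invalidates every older entry."""
        self.version += 1

now = [0.0]  # fake clock so the example does not need to sleep
cache = ExpiringCache(ttl_seconds=3600, clock=lambda: now[0])

cache.put("refund-policy", "Refunds within 30 days.")
fresh = cache.get("refund-policy")       # still inside the TTL

now[0] = 7200.0                          # two hours later
after_ttl = cache.get("refund-policy")   # expired

cache.put("refund-policy", "Refunds within 30 days.")
cache.bump_version()                     # the policy document changed
after_bump = cache.get("refund-policy")  # invalidated by the version bump
```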

Context blindness. Two users can send the same words and still deserve different answers, because the context differs — their account, their conversation history, their permissions. A semantic cache keyed only on the prompt text will happily serve one user's answer to another. The fix is to fold the relevant context into the cache key (so different users or sessions don't collide) and to simply not cache anything personalized or sensitive.
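
One way to fold context into the cache key is to partition the cache by a scope derived from whatever changes the correct answer. In this sketch the scope fields (tenant, locale, permissions) and all helper names are assumptions, and a plain dictionary lookup stands in for the similarity search that would run inside each partition.

```python
import hashlib

def cache_scope(tenant_id, locale, permissions):
    """Stable scope id built from the context that changes the correct answer."""
    raw = "|".join([tenant_id, locale, ",".join(sorted(permissions))])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

partitions = {}  # scope -> {prompt_key: answer}

def save(scope, prompt_key, answer):
    partitions.setdefault(scope, {})[prompt_key] = answer

def lookup(scope, prompt_key):
    """Search only inside the caller's own partition."""
    return partitions.get(scope, {}).get(prompt_key)

scope_admin = cache_scope("acme", "en", ["billing", "read"])
scope_same = cache_scope("acme", "en", ["read", "billing"])  # same context, different order
scope_other = cache_scope("globex", "en", ["read"])

save(scope_admin, "plan limits", "Acme's plan allows 50 seats.")

served = lookup(scope_same, "plan limits")    # same scope: answer is reused
leaked = lookup(scope_other, "plan limits")   # different tenant: nothing to find

print(served, leaked)
```

Partitioning lowers the hit rate, because users in different scopes no longer share entries. That is the intended price for never serving one scope's answer to another.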

Embedding choice drives everything. The cache is only as good as its sense of "similar," which is the embedding model's job. A weak embedding model will both miss true matches and create false ones, no matter how carefully you tune the threshold. If your cache quality is poor, suspect the embeddings before the threshold.

Measure, don't assume. Treat the cache like any other production component. Track hit rate (is it actually saving anything?), false-hit rate (is it ever wrong?), and the latency of the lookup itself (a slow embedding step can erase the win on misses). A cache that hits 2% of the time while adding latency to the other 98% is a net loss — and you'd only know by measuring.
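
The net-loss claim follows from simple arithmetic: every request pays for the lookup, and only misses pay for the model. The sketch below works that through with illustrative numbers (a 40 ms lookup and a 1500 ms model call), which are assumptions and not measurements.

```python
def expected_latency_ms(hit_rate, lookup_ms, model_ms):
    """Average latency per request with the cache in front of the model."""
    return lookup_ms + (1.0 - hit_rate) * model_ms

def break_even_hit_rate(lookup_ms, model_ms):
    """Hit rate above which the cache lowers average latency."""
    return lookup_ms / model_ms

lookup_ms, model_ms = 40.0, 1500.0  # illustrative numbers only

for hit_rate in (0.02, 0.10, 0.40):
    with_cache = expected_latency_ms(hit_rate, lookup_ms, model_ms)
    verdict = "faster" if with_cache < model_ms else "slower"
    print(f"hit_rate={hit_rate:.2f} avg={with_cache:.0f} ms ({verdict} than no cache)")

print(f"break-even hit rate: {break_even_hit_rate(lookup_ms, model_ms):.3f}")
```

With these numbers a 2% hit rate makes the average request slower than calling the model directly, and the cache only pays off above roughly a 2.7% hit rate. The same reasoning applies to cost if you replace latencies with the price of an embedding lookup and of a generation.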

GPTCache vs. a gateway's built-in cache. Some managed AI gateways offer caching as a feature you turn on with a config flag, no library to run. GPTCache is the self-hosted alternative: more control over the embedding model, store, threshold, and eviction, at the cost of running it yourself. Choose the gateway for convenience, GPTCache (or a similar self-hosted layer) when you need to tune the semantics. Either way, the concepts in this article are the same.

FAQ

What is GPTCache used for?

GPTCache is an open-source semantic cache that sits in front of an LLM and reuses past answers for prompts that mean the same thing, even when worded differently. It's used to cut token cost and response latency on repetitive workloads like FAQ bots, documentation assistants, and support chat.

How is a semantic cache different from a normal cache?

A normal (exact) cache only reuses an answer when the new request is byte-for-byte identical to an old one. A semantic cache like GPTCache embeds the prompt and matches on meaning, so "How do refunds work?" and "What's the refund policy?" can share one cached answer even though the text differs.

What is the similarity threshold in GPTCache?

It's the score a new prompt must reach against the closest stored prompt to count as a cache hit. A loose threshold raises the hit rate but risks returning wrong or stale answers; a strict threshold keeps answers correct but saves less. There's no universal value — tune it on real prompts while watching both hit rate and false-hit rate.

Does GPTCache risk returning a wrong answer?

Yes. If the threshold is too loose, it can serve an answer from a prompt that only looked similar: a "false hit." A false hit is worse than a miss because it ships a confidently wrong answer. Start with a strict threshold, avoid caching personalized or time-sensitive content, and measure the false-hit rate in production.

How is GPTCache different from prompt caching or prefix caching?

GPTCache reuses the entire final answer when a new prompt is semantically similar to an old one, skipping the model call. Prefix/KV caching works inside a single call: it reuses the model's computation for a shared leading chunk of tokens (like a repeated system prompt) but still generates a fresh answer. They solve different problems and can be used together.

When should I not use a semantic cache?

Avoid caching answers that must be unique, personalized, or freshly accurate per request — account balances, live data, legal or medical specifics. Caching evergreen, public answers is the sweet spot; caching volatile or per-user answers invites stale and cross-user mistakes unless you key the cache on context and set short expirations.
