AI/TLDR

What Is a Retriever? How RAG Finds the Right Documents

You'll understand the retriever's job in a RAG system and the difference between dense, sparse, and hybrid retrieval.

BEGINNER · 11 MIN READ · UPDATED 2026-06-11

In plain English

A language model knows a lot, but it doesn't know your stuff — your company's docs, last week's support tickets, the PDF you uploaded an hour ago. To answer questions about that material, you have to hand the relevant pages to the model inside the request. The thing that finds those pages is the retriever.

Picture a research assistant standing in front of a library of ten thousand documents. You ask a question. The assistant doesn't read all ten thousand — they grab the five or six that look most relevant and slide them across the desk. That's the entire job of a retriever: given a question, return the handful of chunks most likely to contain the answer. It doesn't write the answer itself. It just decides what the model gets to read.

This is the "R" in RAG — Retrieval-Augmented Generation. RAG is the pattern of fetching relevant text and pasting it into the prompt so the model answers from real sources instead of guessing from memory. The retriever is the fetcher. The model is the writer. Get the retriever wrong and the best model in the world will confidently answer from the wrong pages.

Why it matters

Here's the uncomfortable truth about RAG: the retriever is usually where it breaks. If the right chunk never makes it into the prompt, the model can't answer correctly no matter how smart it is — it will either say "I don't know" or, worse, invent something plausible. Most "the bot gave a wrong answer" bugs are not generation failures. They're retrieval failures wearing a generation costume.

It matters because the context window is a limited, expensive slot. You can't dump all ten thousand documents into the prompt: it likely wouldn't fit, it would cost a fortune, and models tend to use information less reliably as the prompt grows longer. So the retriever's job is brutal triage: out of everything you know, pick the few pieces that earn a seat. Choosing those few well is the single highest-leverage thing in a RAG system.

Who should care:

  • Anyone building a docs chatbot or internal Q&A. The retriever decides whether "how do I reset 2FA?" surfaces the 2FA page or a random release note.
  • Anyone giving an agent access to a knowledge base. Agentic RAG lets the model issue its own searches — but those searches still go through a retriever, so its quality sets the ceiling.
  • Anyone trying to reduce hallucinations. Grounding only works if the grounding material is actually relevant. A bad retriever grounds the model in the wrong facts.

What did it replace? The old answer was "fine-tune the model on your data" or "stuff everything into a giant prompt." Fine-tuning is slow, expensive, and goes stale the moment your docs change. Stuffing everything doesn't scale. A retriever sidesteps both: keep the documents in a searchable store, fetch only what each question needs, and your knowledge updates the instant you re-index — no retraining required.

How it works

Retrieval happens in two phases. There's a one-time ingestion phase where you prepare your documents, and a per-query retrieval phase that runs every time someone asks something. Ingestion is where chunking lives: you split long documents into bite-sized passages, because retrieving a whole 80-page manual is useless — you want the one paragraph that answers the question.
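To make the chunking step concrete, here is a minimal sketch that splits text into fixed-size word windows with some overlap. The function name `chunk_words` and the default sizes are assumptions chosen for illustration; real pipelines often split on headings, paragraphs, or sentences instead.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# A fake 120-word "manual" so the sketch runs on its own
manual = " ".join(f"word{i}" for i in range(120))
pieces = chunk_words(manual, size=50, overlap=10)
print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap is there so a sentence that straddles a boundary still appears whole in at least one chunk.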

The interesting question is how the retriever scores chunks against the query. There are two fundamentally different strategies, and the difference between them is the most important thing to understand about retrieval.

Sparse retrieval (keyword matching)

The classic approach matches words. If the query says "2FA reset" and a chunk contains the words "2FA" and "reset," it scores high. The dominant algorithm is BM25, a refined version of keyword search that weights rare words more heavily, gives diminishing returns each time a word repeats, and normalizes for document length so long chunks don't win by sheer size. It's called "sparse" because each document is represented as a huge mostly-zero vector, one slot per vocabulary word. BM25 is fast, needs no machine learning, and is shockingly hard to beat on exact terms, product codes, error numbers, and acronyms.
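For a feel of what BM25 computes, here is a from-scratch sketch of one common textbook form of the formula. The helper names (`tokenize`, `bm25_scores`), the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions for illustration; libraries differ in the exact IDF variant they use, so treat this as a teaching aid rather than a reference implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of word characters only
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        length_norm = 1 - b + b * len(toks) / avg_len
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger IDF weight
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates: each repeat adds less than the last
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "Error E4012 means the license key expired.",
    "Restart the service if the dashboard shows an error.",
    "Billing questions go to the finance team.",
]
scores = bm25_scores("error E4012", docs)
print(scores)
```

The first document should score highest because it matches both query terms, including the rare code "E4012"; the third shares no terms and scores zero.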

Dense retrieval (meaning matching)

The modern approach matches meaning. An embedding model turns each chunk into a single dense vector (a list of a few hundred to a few thousand numbers) positioned so that texts with similar meanings land near each other. At query time you embed the question the same way and find the nearest vectors. This is semantic search, and it shines where words differ but meaning matches: a query for "can't sign in" finds a chunk about "authentication failures" even with zero shared words. Dense retrievers are sometimes called bi-encoders because the query and the document are encoded separately and only then compared, usually by cosine similarity or dot product.
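The "find the nearest vectors" step can be sketched without any model at all. The three-dimensional vectors below are made up by hand to stand in for real embeddings (an assumption for illustration); only the cosine-similarity ranking is the point.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for embedding-model output (not real embeddings)
chunk_vectors = {
    "authentication failures": [0.9, 0.1, 0.0],
    "refund policy": [0.0, 0.2, 0.9],
    "API rate limits": [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "can't sign in"

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

A vector database does the same comparison, but with an index that avoids scoring every stored vector on every query.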

Dense vs sparse vs hybrid

Neither strategy wins outright — they fail in opposite ways. Dense retrieval can miss an exact term it has never seen (a brand-new product SKU embeds to nothing useful). Sparse retrieval misses synonyms and paraphrases entirely. So the standard production answer is hybrid retrieval: run both, then merge the two ranked lists into one.

The catch when merging is that BM25 scores and cosine similarity scores live on totally different scales, so you can't just add them. The standard fix is Reciprocal Rank Fusion (RRF), which ignores the raw scores and combines the two lists by rank alone: a chunk that ranks #1 in either list gets a strong boost, and chunks that rank well in both rise to the top. It's a two-line formula with a single constant that rarely needs tuning, and it's a common default in hybrid search systems for good reason.
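Written out, RRF is usually stated as below, where R is the set of ranked lists, rank_r(d) is the 1-based position of chunk d in list r, and k is a smoothing constant (60 is the conventional choice). A chunk that is missing from a list simply contributes nothing for that list.

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}
```

With k = 60, a chunk ranked 1st in both lists scores 2/61 ≈ 0.0328. A chunk ranked 3rd in both scores 2/63 ≈ 0.0317, which beats a chunk ranked 1st in one list but 10th in the other (1/61 + 1/70 ≈ 0.0307). Consistent agreement between the two retrievers outweighs a single strong vote.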

Many teams add a final step after fusion: a reranker. The retriever casts a wide net — fetch the top 50 — and then a slower, more accurate cross-encoder model re-scores those 50 by reading the query and each chunk together, returning a tight top 5. Retrieve broad and cheap, then rerank narrow and precise. That two-stage shape is the backbone of strong RAG retrieval.
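The two-stage shape can be sketched with plain functions. Everything here is a stand-in: `retrieve_then_rerank`, the keyword-overlap first stage, and `toy_rerank_score` are hypothetical names, and the toy scorer only imitates what a cross-encoder does. In a real system the second stage would likely call a model, for example `CrossEncoder.predict` from sentence-transformers on (query, chunk) pairs.

```python
def retrieve_then_rerank(query, chunks, first_stage, rerank_score, wide_k=50, final_k=5):
    """Fetch a wide candidate set cheaply, then reorder it with a costlier scorer."""
    candidates = first_stage(query, chunks)[:wide_k]
    reranked = sorted(candidates, key=lambda chunk: rerank_score(query, chunk), reverse=True)
    return reranked[:final_k]

# --- Toy stand-ins so the sketch runs without downloading a model ---
def keyword_overlap_first_stage(query, chunks):
    # Cheap first stage: rank by how many query words each chunk shares
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)

def toy_rerank_score(query, chunk):
    # Placeholder for a cross-encoder: rewards chunks containing the exact query phrase
    return 1.0 if query.lower() in chunk.lower() else 0.0

chunks = [
    "2fa reset is not available on the legacy plan",
    "to reset 2fa open settings then security",
    "reset your password from the login page",
    "billing invoices are emailed monthly",
]
top = retrieve_then_rerank(
    "reset 2fa", chunks, keyword_overlap_first_stage, toy_rerank_score,
    wide_k=3, final_k=2,
)
print(top)
```

The first stage cannot tell the first two chunks apart (both share two words with the query), so the reranker's closer reading of each pair decides the final order.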

Build a retriever in code

Here's a minimal hybrid retriever in plain Python — sparse BM25, dense embeddings, and RRF to fuse them — with no framework, so you can see the moving parts. In a real app you'd store the dense vectors in a vector database instead of a list, but the logic is identical.

hybrid_retriever.py (Python)
import re

from rank_bm25 import BM25Okapi          # pip install rank-bm25
from sentence_transformers import SentenceTransformer, util

chunks = [
    "To reset two-factor authentication, open Settings > Security.",
    "Our refund policy allows returns within 30 days of purchase.",
    "If you are locked out, contact support to disable 2FA.",
    "The API rate limit is 1000 requests per minute per key.",
]

def tokenize(text):
    # Lowercase and keep only word characters, so "2FA." and "2FA?" both become "2fa"
    return re.findall(r"\w+", text.lower())

# --- Sparse index: BM25 over tokenized words ---
bm25 = BM25Okapi([tokenize(c) for c in chunks])

# --- Dense index: one embedding per chunk ---
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_tensor=True)

def retrieve(query, k=2, rrf_k=60):
    # Rank chunks by each method (best first); score once, then sort
    bm25_scores = bm25.get_scores(tokenize(query))
    sparse = sorted(range(len(chunks)), key=lambda i: bm25_scores[i], reverse=True)

    q_emb = model.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, embeddings)[0].tolist()
    dense = sorted(range(len(chunks)), key=lambda i: sims[i], reverse=True)

    # Reciprocal Rank Fusion: combine by RANK (1-based), not raw score
    scores = {}
    for ranked in (sparse, dense):
        for rank, idx in enumerate(ranked, start=1):
            scores[idx] = scores.get(idx, 0) + 1 / (rrf_k + rank)

    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [chunks[i] for i in best]

print(retrieve("how do I turn off 2FA?"))

Run it and the query "how do I turn off 2FA?" should surface both 2FA chunks, even though neither says "turn off": one says "disable 2FA" and the other spells out "two-factor authentication." Sparse retrieval catches the literal "2FA"; dense retrieval catches the paraphrase; RRF merges them. The 60 in the formula is the conventional RRF constant that softens how much rank-1 dominates. That's the whole idea behind production hybrid search, minus the scale.

Common pitfalls

Most retriever problems trace back to a short list of mistakes:

  • Bad chunks in, bad results out. If your chunks are too big, the relevant sentence is diluted by surrounding noise; too small, and they lose the context that made them meaningful. Retrieval can only return what ingestion prepared — fix chunking first.
  • Dense-only on technical content. Pure embedding search quietly fumbles part numbers, error codes, and exact names. If users search by identifiers, you need a sparse component. This is the most common reason a demo that worked on prose flops on a product catalog.
  • top-k set wrong. Too few and the answer chunk never makes the cut; too many and you flood the model with noise and burn tokens. Start around 5, then tune by measuring.
  • Embedding mismatch. You must embed the query with the same model you embedded the documents with. Mix two embedding models and the vectors live in different spaces — distances become meaningless.
  • No reranking on a noisy corpus. When chunks are similar to each other, first-stage retrieval gets the right ones into the top 50 but not the top 5. A cross-encoder reranker fixes the ordering cheaply.

Crucially, you can't fix what you don't measure. Treat retrieval as its own evaluation target — separate from the model's answers — and track whether the right chunk actually showed up. The standard metrics here (recall@k, precision, and friends) are covered in how to evaluate a RAG system.
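Measuring retrieval on its own can start very small. The sketch below computes recall@k and precision@k for one query; the chunk ids and the human-labeled relevant set are hypothetical data invented for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant_ids)
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant) / len(top_k)

# Hypothetical example: ids the retriever returned vs. ids a human marked relevant
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(recall_at_k(retrieved, relevant, k=3))     # 1 of the 3 relevant chunks is in the top 3
print(precision_at_k(retrieved, relevant, k=5))  # 2 of the 5 returned chunks are relevant
```

Averaging these over a few dozen labeled questions is usually enough to tell whether a chunking or top-k change helped or hurt.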

Going deeper

Once the basic pipeline works, the frontier is about closing the gap between "what the user typed" and "how the answer is phrased." Three techniques dominate. Query rewriting uses an LLM to expand or rephrase the question before retrieval, turning a terse "refunds?" into "What is the refund and return policy and time window?" so it embeds closer to the actual document. Multi-query retrieval generates several paraphrases of the question, retrieves for each, and unions the results, which can substantially improve recall on ambiguous queries at the cost of extra retrieval calls. And HyDE (Hypothetical Document Embeddings) has the model write a fake answer first, then retrieves using that, because a hypothetical answer often sits closer in embedding space to the real source than the question does.
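Multi-query retrieval is mostly plumbing, which the sketch below tries to show. The names are hypothetical: `fake_paraphrase` returns canned rewrites in place of a real LLM call, and `toy_retrieve` is a keyword-overlap stand-in for whatever retriever you already have.

```python
import re

def multi_query_retrieve(question, paraphrase, retrieve, k=3):
    """Retrieve for the question and each paraphrase, then union in first-seen order."""
    queries = [question] + paraphrase(question)
    seen, merged = set(), []
    for q in queries:
        for chunk in retrieve(q, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

# --- Hypothetical stand-ins: a canned "LLM" and a toy keyword retriever ---
def fake_paraphrase(question):
    return ["refund policy", "return window days"]

CORPUS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Items can be sent back for a return using the prepaid label.",
    "The API rate limit is 1000 requests per minute per key.",
]

def toy_retrieve(query, k):
    q = set(re.findall(r"\w+", query.lower()))
    scored = [(len(q & set(re.findall(r"\w+", c.lower()))), c) for c in CORPUS]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [c for score, c in ranked if score > 0][:k]

alone = toy_retrieve("refunds?", 3)
merged = multi_query_retrieve("refunds?", fake_paraphrase, toy_retrieve)
print(alone)
print(merged)
```

On its own the terse query matches nothing in this toy corpus, while the rewrites pull in both return-related chunks. A production version would likely fuse the per-query rankings (for example with RRF) instead of taking a plain union.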

There's also a middle path between sparse and dense called learned sparse retrieval — models like SPLADE produce sparse vectors (one weight per vocabulary term, like BM25) but learn which terms matter and even add related terms the document never literally contained. You get keyword-style exact matching with a dose of semantic expansion, and it indexes in the same inverted-index machinery BM25 uses. A heavier cousin, ColBERT, keeps a separate vector per token instead of one per chunk and matches them individually ("late interaction"), trading storage for noticeably better precision.
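The "late interaction" scoring that ColBERT popularized is often described as MaxSim: each query token finds its best-matching document token, and those best matches are summed. The sketch below uses made-up two-dimensional token vectors (an assumption for illustration); real token vectors come from a trained encoder and have far more dimensions.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token, take its best doc-token match, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Made-up token vectors, one per token (not output from a real model)
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # good match for both query tokens
doc_b_tokens = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first query token well

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

Document A wins because every query token finds a close partner, which is the kind of fine-grained evidence a single pooled vector per chunk tends to blur. Storing a vector per token is also where the extra storage cost comes from.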

Production retrieval has its own hard problems. Metadata filtering — restricting search to documents the user is allowed to see, or to a date range — interacts awkwardly with approximate vector indexes and is a frequent source of subtle bugs. Freshness matters: when documents change, stale embeddings linger until you re-index, so teams build incremental pipelines. And multi-tenancy (one index, many customers, strict isolation) forces design choices that pure-relevance benchmarks never reveal. Retrieval also doesn't have to stop at flat text: GraphRAG and agentic approaches let the system follow relationships between documents or issue multiple iterative searches, which helps on questions whose answer is scattered across several sources.
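One way metadata filtering goes wrong can be sketched with plain lists instead of a real vector index. The tenant names and the `post_filter` and `pre_filter` helpers are hypothetical; the point is only the order of operations.

```python
def post_filter(ranked_chunks, allowed, k):
    """Take the top-k first, then drop what the user may not see (can return fewer than k)."""
    return [c for c in ranked_chunks[:k] if c["tenant"] in allowed]

def pre_filter(ranked_chunks, allowed, k):
    """Drop disallowed chunks first, then take the top-k of what remains."""
    return [c for c in ranked_chunks if c["tenant"] in allowed][:k]

# Hypothetical search results, already sorted by relevance (best first)
ranked = [
    {"id": 1, "tenant": "acme"},
    {"id": 2, "tenant": "acme"},
    {"id": 3, "tenant": "globex"},
    {"id": 4, "tenant": "globex"},
]

after = post_filter(ranked, allowed={"globex"}, k=2)
before = pre_filter(ranked, allowed={"globex"}, k=2)
print(after)
print(before)
```

Filtering after the top-k cut leaves this user with nothing, even though two permitted chunks exist. Filtering first avoids that, but with an approximate index it may require the index itself to support filtered search, which is part of why this area produces subtle bugs.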

The honest open problem is that relevance is not the same as helpfulness. A retriever optimizes for similarity to the query, but the chunk most similar to the question isn't always the one that best supports a correct answer — and no offline metric fully captures that gap. That's why serious RAG teams measure the end-to-end answer quality, not just retrieval scores, and why the retriever, the reranker, and the generator are tuned together rather than in isolation.

FAQ

What is the difference between a retriever and vector search?

Vector search is one technique a retriever can use — finding the nearest embedding vectors to your query. A retriever is the broader component whose job is to return relevant chunks; it might use vector search, keyword search (BM25), or both fused together. So vector search is the engine; the retriever is the whole car.

What is the difference between dense and sparse retrieval?

Sparse retrieval (like BM25) matches the actual words in the query against the words in documents — great for exact terms, codes, and names. Dense retrieval uses embeddings to match meaning, so it finds relevant text even when the wording differs. Sparse is blind to synonyms; dense can miss rare exact terms. Most production systems run both.

Do I always need a vector database for retrieval?

No. For small or keyword-heavy corpora, BM25 alone (no vectors at all) is often enough and far simpler. You need a vector database once you want semantic matching at scale — when meaning matters more than exact words and you have too many chunks to brute-force compare every embedding.

What is hybrid retrieval and why do people use it?

Hybrid retrieval runs both keyword (sparse) and embedding (dense) search, then merges the two ranked lists — usually with Reciprocal Rank Fusion (RRF). People use it because the two methods fail in opposite ways: keyword search misses synonyms, semantic search misses exact terms. Combining them covers both, which is why it's the common default.

What is a good top-k value for a retriever?

There's no universal number, but starting around 5 chunks is a sensible default for most Q&A systems. Too few and the answer chunk may never make the cut; too many floods the model with noise and wastes tokens. A common pattern is to retrieve a wide top-k (say 50), then rerank down to the best 5.

Further reading