AI/TLDR

What Is Hybrid RAG?

You'll understand how running BM25 and vector search in parallel — then fusing their results — closes the gaps that either method leaves on its own.

INTERMEDIATE · 12 MIN READ · UPDATED 2026-06-12

In plain English

Imagine two librarians answering the same question. The first one has read every book cover to cover and understands meaning — she finds documents that are about your topic even if the exact words don't appear. The second librarian has memorised every word in the index and can instantly locate the one document that contains an unusual code, a version number, or a rare product name nobody else ever wrote. Neither librarian is wrong. But for most real questions you want both of them looking at the same time.

Hybrid RAG does exactly that. It runs two retrievers in parallel — a vector (dense) retriever that works by semantic meaning, and a keyword (sparse) retriever using the BM25 algorithm — then merges their result lists into one ranked list before passing the top chunks to the language model. The fusion step is usually Reciprocal Rank Fusion (RRF), which combines the two ranked lists without needing to reconcile their incompatible raw scores.

The result is a retriever that handles both the kinds of queries that break pure vector search (exact codes, version strings, feature flag names) and the kinds that break pure keyword search (conceptual questions where the right document uses different words than the query). Most production RAG systems at companies like Perplexity and Glean use this pattern.

Why it matters

Pure vector search has a quiet failure mode that only shows up on the long tail of real queries. Vector embeddings are approximation engines — they compress meaning into a point in high-dimensional space. That compression is excellent for capturing semantic intent, but it systematically blurs small distinguishing tokens.

  • Error codes. ERR_BLOCKED_BY_CLIENT and ERR_CONNECTION_REFUSED sit close in embedding space because they're both browser network errors. A vector retriever may surface the wrong one; BM25 returns an exact match because the token is rare (high IDF) and the query literally contains it.
  • Version numbers. v3.2 vs v3.3 may be semantically indistinguishable to an embedding model. If a user asks for the "v3.2 migration guide", vector search might return the v3.3 guide — same meaning, wrong document.
  • Feature flags and product identifiers. "Rollback runbook for payments-v2-rollout" requires both semantic understanding (what a runbook is) and exact token matching (the feature flag name). Either approach alone gets half the job done.
  • Rare proper nouns and domain jargon. Terms that rarely or never appeared in the embedding model's training data are poorly represented in its vector space. BM25's Inverse Document Frequency (IDF) mechanism gives rare terms high weight, which is the opposite behaviour and exactly what you need.

BM25 alone fails on the other side: conceptual queries where the right document never uses the user's exact words. "How do I cancel my account?" won't match a help article titled "Closing your subscription" via keyword search, but a vector retriever catches it through semantic similarity.

Most real user queries land somewhere in the middle — they have both an intent (vector territory) and specific identifiers (BM25 territory). Hybrid retrieval covers the whole space instead of optimising for one half.

How it works

A hybrid RAG pipeline has three mechanical stages: parallel retrieval, score fusion, and (optionally) reranking. Each stage solves a distinct problem.

Stage 1 — Parallel retrieval

The same query is sent simultaneously to two indexes. The dense index stores vector embeddings of your chunks — typically in a vector database. The query is embedded with the same model and the top-k nearest vectors are returned (usually k = 50–150 candidates, more than the final top-k you'll pass to the LLM). The sparse index is an inverted index — BM25 scores each document by how often query terms appear in it, weighted by how rare those terms are across the entire corpus. Both retrievers return their own ranked lists independently.
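To make the sparse side concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It assumes documents are already tokenised, uses the common defaults k1 = 1.5 and b = 0.75 (assumptions, not values from this article), and uses the non-negative IDF variant. The function name bm25_scores and the toy documents are illustrative only; a production system would use an inverted index rather than scanning every document.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a larger IDF (the +1 keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b normalises by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical): only the first document contains the exact error code.
docs = [
    "fix err_blocked_by_client in the browser".split(),
    "fix err_connection_refused in the browser".split(),
    "browser network errors overview".split(),
]
scores = bm25_scores("err_blocked_by_client browser".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The rare error code appears in only one document, so its IDF dominates the score, while the common word "browser" contributes little to any of them.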

Stage 2 — Reciprocal Rank Fusion (RRF)

You cannot simply average BM25 scores and cosine similarity scores, because they live on incompatible scales. BM25 produces unbounded scores whose magnitude depends on the query and the corpus; cosine similarity is bounded in [-1, 1] and in practice often clusters in a narrow band. Normalising them introduces assumptions that break on unusual query distributions. RRF sidesteps this entirely by operating on rank position, not raw scores.

The formula is: RRF_score(d) = Σ 1 / (k + rank_r(d)) where the sum is over each retriever r, and k is a smoothing constant (default 60). A document ranked #1 contributes roughly 0.0164; ranked #10 it contributes roughly 0.0143. A document that appears in both retrievers' top results accumulates points from both terms — RRF rewards consensus. The final list is sorted by descending RRF score.
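The formula above translates almost directly into code. The following is a minimal sketch, assuming each retriever returns a list of document IDs in rank order; the function name rrf_fuse and the document IDs are made up for illustration.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = rrf_fuse([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Here doc_a (ranked #1 and #2) and doc_c (ranked #3 and #1) both appear in the two lists, so they outrank doc_b and doc_d, which each appear in only one.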

Alternative: alpha weighting

Some systems use a weighted average instead: score = α × vector_score + (1 − α) × BM25_score where both scores are normalised to [0,1] first. Alpha = 0 means pure BM25; alpha = 1 means pure vector. This is simpler to reason about when you have a strong prior — for example, a code-search app might hardcode alpha = 0.3 to lean on keyword matching. The drawback is that the optimal alpha shifts per query type, making it less robust than RRF for general-purpose retrieval.
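As a rough illustration of alpha weighting, the sketch below min-max normalises each retriever's scores to [0, 1] and then blends them. Min-max is only one possible normalisation (an assumption here, not something this article prescribes), documents missing from one retriever are treated as scoring 0 on that side, and the names min_max and alpha_fuse and all scores are hypothetical.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} dict to [0, 1]; constant inputs map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def alpha_fuse(vector_scores, keyword_scores, alpha=0.5):
    """Blend normalised scores: alpha=1 is pure vector, alpha=0 is pure BM25."""
    v, s = min_max(vector_scores), min_max(keyword_scores)
    # A document missing from one retriever contributes 0 from that side.
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in v.keys() | s.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

vector_scores = {"doc_a": 0.82, "doc_b": 0.80, "doc_c": 0.40}  # cosine similarities
keyword_scores = {"doc_b": 12.0, "doc_c": 3.0, "doc_d": 7.5}   # raw BM25 scores
keyword_heavy = alpha_fuse(vector_scores, keyword_scores, alpha=0.3)
print(keyword_heavy)
```

Note how sensitive the result is to the normalisation: doc_c is the lowest scorer in both lists, so min-max maps it to 0 on both sides regardless of how relevant it actually is. This is one of the assumptions that rank-based fusion avoids.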

Stage 3 — Optional cross-encoder reranker

After fusion you have a merged list of, say, 20–50 candidate chunks. A cross-encoder reranker takes each (query, chunk) pair and scores them jointly — unlike embeddings, which encode query and document independently, a cross-encoder reads both at once and can reason about fine-grained relevance. This is the most expensive step (one forward pass per candidate) but it consistently lifts precision. The typical pipeline keeps the reranker's input small (top 20–50 from hybrid fusion) to keep latency manageable, then passes the reranker's top 3–8 chunks to the LLM.
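The reranking step can be sketched as a small function that scores (query, chunk) pairs and keeps the best few. In the sketch below the scorer is injected as a callable so the example runs without downloading a model. The toy_overlap_scorer is a deliberately crude stand-in and is not a cross-encoder; in a real pipeline you would pass something like the predict method of a cross-encoder model instead. All names here are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], List[float]],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score each (query, chunk) pair jointly and keep the best top_n."""
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)  # with a real cross-encoder: one forward pass per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

def toy_overlap_scorer(pairs):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in the chunk."""
    results = []
    for query, chunk in pairs:
        q_words, c_words = set(query.lower().split()), set(chunk.lower().split())
        results.append(len(q_words & c_words) / len(q_words))
    return results

candidates = [
    "Closing your subscription from the billing page",
    "Rollback runbook for payments-v2-rollout",
    "How to rollback a failed deploy",
]
top = rerank("rollback runbook payments-v2-rollout", candidates, toy_overlap_scorer, top_n=2)
print(top)
```

Keeping the scorer pluggable also makes it easy to measure how much a real reranker adds over the fused ranking before you pay its latency cost.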

Implementing hybrid retrieval

Most vector databases now support hybrid search natively, so you rarely need to wire the two indexes together yourself.

  • Qdrant. Native hybrid via query_points with Prefetch blocks for dense and sparse; RRF fusion built in.
  • Elasticsearch / OpenSearch. Combine a knn clause (dense) with a match clause (BM25) in a single query; in Elasticsearch, the results are merged via the rrf or linear retriever.
  • PostgreSQL. Use pgvector for the dense side and tsvector/ts_rank for the keyword side (ts_rank is a full-text ranking function, not true BM25), then join and rank in a CTE.
  • Weaviate. Hybrid query with a configurable alpha and built-in RRF or relative-score fusion.
  • LangChain / LlamaIndex. LangChain ships EnsembleRetriever and LlamaIndex ships QueryFusionRetriever, wrappers that combine multiple retrievers with RRF.
Minimal hybrid retrieval with Qdrant + RRF (Python)
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

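# embed() and bm25_encode() are placeholders you supply: embed() returns the dense
# query vector; bm25_encode() returns a dict with "indices" and "values" lists.
# The collection is assumed to define named vectors "dense" and "sparse".
# Run hybrid search: dense vector + sparse BM25, fused with RRF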
results = client.query_points(
    collection_name="docs",
    prefetch=[
        # Dense (vector) retrieval
        models.Prefetch(
            query=embed(user_query),        # your embedding function
            using="dense",
            limit=50,
        ),
        # Sparse (BM25-style) retrieval
        models.Prefetch(
            query=models.SparseVector(**bm25_encode(user_query)),
            using="sparse",
            limit=50,
        ),
    ],
    # Fuse both prefetch results with RRF
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,  # final top-k passed to the LLM
)

One practical note: the BM25 side needs its own tokenisation pipeline. Production systems typically use a tokeniser that lowercases, removes stopwords, and applies stemming, but keeps identifiers like v3.2 intact. If your corpus is code-heavy, skip stemming entirely: stemming payments-v2-rollout to payment-v2-rollout collapses distinct identifiers into one token, which destroys the exact match you were counting on.
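A tokeniser along those lines might look like the sketch below. It lowercases, drops a small illustrative stopword list, keeps identifiers joined by dots, underscores, or hyphens as single tokens, and applies no stemming. The regular expression and the stopword set are assumptions chosen for this example, not a recommended production configuration.

```python
import re

# Illustrative stopword list only; real pipelines use a fuller, language-specific set.
STOPWORDS = {"a", "an", "the", "for", "of", "to", "in", "is", "and"}

# One token = runs of letters/digits, optionally joined by ".", "_" or "-",
# so v3.2, err_blocked_by_client and payments-v2-rollout each stay whole.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase, split into identifier-preserving tokens, drop stopwords; no stemming."""
    tokens = TOKEN_RE.findall(text.lower())
    return [token for token in tokens if token not in STOPWORDS]

tokens = tokenize(
    "Rollback runbook for payments-v2-rollout in the v3.2 guide, see ERR_BLOCKED_BY_CLIENT."
)
print(tokens)
```

Whatever rules you choose, apply the same function at index time and at query time; a mismatch between the two silently breaks exact matching.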

When hybrid retrieval wins (and when it doesn't)

Hybrid retrieval is not always the right upgrade. The decision depends on your query distribution and corpus.

| Scenario | Best retrieval strategy | Why |
| --- | --- | --- |
| Help-desk docs, conceptual questions only | Pure vector | Queries are paraphrased; rare tokens not the issue |
| API docs, error codes, version-specific guides | Hybrid (BM25-heavy alpha) | Exact token matching is load-bearing |
| Mixed enterprise corpus (docs + runbooks + tickets) | Hybrid with RRF | Both semantic and exact-match queries arrive |
| Code search across a large repository | Hybrid + cross-encoder reranker | Symbol names need BM25; intent needs vectors; reranker lifts precision |
| Very small corpus (< 500 documents) | BM25 alone may suffice | Embedding overhead not worth it; BM25 + full rerank is fast |

The clearest signal to add hybrid: run a RAG evaluation and inspect the queries where retrieval returns zero relevant chunks. If a disproportionate share contain rare identifiers, version strings, or jargon, BM25 will likely recover them. If the failures look like "the user phrased it differently", a better embedding model or chunking strategy is the right lever.

Going deeper

Why BM25 is still competitive. BM25 dates from 1994 and remains the default baseline in information retrieval research. Its core insights (term frequency has diminishing returns, rare terms are more informative, and document length needs normalisation) have held up even as embeddings came to dominate. On BEIR (a standard multi-domain retrieval benchmark suite), BM25 alone outperforms many out-of-the-box embedding models in zero-shot settings, particularly on domains where precise terminology matters, such as biomedical retrieval. Hybrid retrieval with a good reranker typically tops both.

Sparse learned retrievers (SPLADE, SPLADEv2). An emerging alternative to classic BM25 is learned sparse retrieval: models like SPLADE learn to produce sparse token weight vectors rather than raw embedding coordinates. Unlike BM25, which weights terms purely by statistics, SPLADE can expand the query with related terms (it might add weight to "rollback" when it sees "undo") while keeping the inverted-index infrastructure. SPLADE often outperforms BM25 in hybrid pipelines at the cost of an additional model to deploy.

ColBERT and late interaction. ColBERT is a third retrieval paradigm that sits between dense and sparse: it stores one embedding vector per token in the document (not one per chunk) and scores using MaxSim across all token pairs. This gives it BM25-like sensitivity to individual tokens with near-dense-search semantic coverage. ColBERT is more expensive to store and slower to query than pure vector search, but some teams incorporate it as a third leg in a hybrid pipeline for code-heavy or terminology-dense corpora.
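The MaxSim scoring rule is simple enough to sketch in a few lines. The version below assumes each query and document has already been encoded into one vector per token; the two-dimensional vectors are invented for illustration, and a real ColBERT index uses learned, higher-dimensional embeddings plus an approximate search stage.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query token takes its best-matching document token."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical per-token embeddings (2-D for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_full = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_partial = [[1.0, 0.0], [1.0, 0.1]]           # covers only the first one

print(round(maxsim(query, doc_full), 4), round(maxsim(query, doc_partial), 4))
```

Because every query token must find its own match, a document that misses one distinctive token loses that token's whole contribution, which is where the BM25-like sensitivity comes from.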

Hybrid retrieval inside agentic RAG. Hybrid retrieval and agentic RAG are orthogonal improvements that stack. An agent loop that calls a retrieval tool multiple times benefits from having each call be a hybrid search — it combines the adaptive query-rewriting of the agentic loop with the full-coverage retrieval of hybrid. The agent can also choose to issue a BM25-only query when it knows the question is identifier-heavy, effectively steering the alpha at query time.

Evaluation is how you tune. The RRF k constant, the candidate pool size for each retriever, and the number of chunks passed to the reranker should all be set from offline evaluation data, not guessed. Build an eval harness with a representative question set and ground-truth relevant documents, measure recall@k and MRR for each configuration, and grid-search the two or three key parameters. Do not be surprised if the best RRF k for your corpus is not 60, or if a candidate pool of 100 beats 50.
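The two metrics named above are easy to compute yourself. Here is a minimal sketch, assuming each evaluation item is a pair of (retrieved IDs in rank order, set of ground-truth relevant IDs); the data is invented for illustration. To grid-search, you would produce one such list of runs per configuration and compare the resulting numbers.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(runs):
    """Mean reciprocal rank of the first relevant hit; a query with no hit scores 0."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Hypothetical eval set: (retrieved IDs in rank order, ground-truth relevant IDs).
runs = [
    (["d1", "d7", "d3"], {"d3"}),
    (["d2", "d5", "d9"], {"d2", "d9"}),
    (["d4", "d6", "d8"], {"d0"}),
]
mean_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in runs) / len(runs)
mean_rr = mrr(runs)
print(round(mean_recall_at_2, 4), round(mean_rr, 4))
```

The third query retrieves nothing relevant at any rank; queries like that are the ones worth reading by hand to decide whether keyword matching would have rescued them.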

The standardisation picture. Vector database vendors are converging on hybrid search as a first-class API primitive, which reduces the barrier significantly. Qdrant, Weaviate, Elasticsearch, and MongoDB Atlas all support hybrid queries natively as of 2025. LangChain and LlamaIndex both have retriever wrappers for the pattern. The main remaining friction is maintaining a separate BM25 tokenisation pipeline for corpora with mixed natural language and code — a problem sparse learned models like SPLADE address by handling both in a single model.

FAQ

What is hybrid RAG and how does it differ from standard RAG?

Standard RAG typically uses a single vector (dense) retriever: it embeds the query and fetches the nearest chunks by cosine similarity. Hybrid RAG runs two retrievers in parallel — a vector retriever for semantic similarity and a BM25 keyword retriever for exact token matching — then merges their result lists using a fusion algorithm like RRF before passing chunks to the LLM. The main difference is that hybrid covers both conceptual queries (vector's strength) and identifier-heavy queries like error codes and version numbers (BM25's strength).

What is Reciprocal Rank Fusion and why is it used in hybrid RAG?

Reciprocal Rank Fusion (RRF) is a score-free method of combining multiple ranked lists. Each document is scored as the sum of 1/(k + rank) across all retrievers, where k is a smoothing constant (default 60). RRF is preferred over weighted score averaging because BM25 scores and cosine similarity scores are on incompatible scales — normalising them introduces assumptions that often break in practice. RRF sidesteps this by only caring about rank position, rewarding documents that appear near the top in multiple retrievers.

When does BM25 beat vector search in a RAG pipeline?

BM25 wins when the query contains tokens that are rare in the corpus but critical for finding the right document: error codes (ERR_BLOCKED_BY_CLIENT), version strings (v3.2), feature flag names, product identifiers, or uncommon proper nouns. Vector embeddings compress these distinguishing tokens into a shared semantic space, causing similar-sounding documents to rank equally. BM25's Inverse Document Frequency (IDF) mechanism gives rare tokens high weight — the opposite behaviour — making it the right tool for exact-identifier retrieval.

How do I choose between RRF and alpha-weighted hybrid search?

Start with RRF. It requires no score normalisation, works robustly across varied query types, and has only one parameter (k). Switch to alpha weighting if you have a strong domain prior — for example, a code-search tool where you know identifier matching should always dominate — and you have enough evaluation data to tune the alpha reliably. In general-purpose or mixed-query environments, RRF is more stable and less likely to degrade when query distribution shifts.

Does adding hybrid retrieval require rebuilding my existing vector store?

No, you add a parallel BM25 index alongside your existing vector index — you don't replace it. Most modern vector databases (Qdrant, Weaviate, Elasticsearch) support both index types in the same collection, so the hybrid query is a single API call. If you're on a database that doesn't support BM25 natively (e.g. vanilla pgvector), you run a separate tsvector index in Postgres and join the results in application code before applying RRF.

Should I add a reranker on top of hybrid retrieval?

For most production systems: yes, if latency allows. A cross-encoder reranker re-scores each (query, chunk) pair jointly, catching relevance signals that independent embedding and BM25 scores miss. The typical pattern is to retrieve 20–50 candidates with hybrid fusion, rerank them with a cross-encoder, and pass the top 3–8 to the LLM. The reranker is the most expensive step — one forward pass per candidate — so keep its input small. If latency is tight, skip the reranker and rely on the hybrid fusion alone.

Further reading