AI/TLDR

What Is BM25? The Keyword Algorithm Every RAG Stack Still Uses

You'll understand how BM25 ranks documents by keywords and why it remains essential alongside embeddings in modern RAG.

BEGINNER · 12 MIN READ · UPDATED 2026-06-12

In plain English

BM25 is a keyword-based scoring algorithm that answers one question: given a search query, which documents in a collection are most likely to be relevant? It grew out of the probabilistic retrieval framework of Stephen Robertson and Karen Spärck Jones, and it took its familiar form in the 1990s in the Okapi system at City University London. Despite being over thirty years old, it remains the default ranking algorithm in Elasticsearch and Apache Lucene, and it underpins much of today's production search infrastructure.

Here's the analogy. Imagine you're a librarian asked to find the best books about "neural networks" in a library of ten thousand volumes. Your instinct would probably go something like this: books that mention "neural networks" more often are probably more relevant — but a book that uses the phrase on every page starts to feel repetitive, so there are diminishing returns. Meanwhile, a book that mentions a rare, specific term like "backpropagation" is more precisely on-topic than one that just mentions common words like "learning" or "data." And a short pamphlet that mentions "neural networks" three times probably covers it more densely than a thick textbook that mentions it three times across eight hundred pages. BM25 formalizes exactly those three librarian instincts — term frequency with saturation, rare-term weighting, and document-length normalization — into a single score.

The name stands for Best Matching 25. It is commonly described as the 25th refinement in a series of probabilistic ranking functions the researchers experimented with. "Best Matching" describes the goal: given a query, surface the best-matching documents.

Why it matters

When people discovered that embedding models could turn text into semantic vectors, many assumed keyword search was dead. If you can match meaning, why bother matching words? It turned out that vector search and keyword search fail in opposite and complementary ways — and BM25 is specifically excellent at the cases where embeddings stumble.

Embeddings struggle with exact, rare terms. A product SKU like T7-449XB, an error code like ERR_SSL_VERSION_OR_CIPHER_MISMATCH, a person's name, a model number, or a regulatory identifier like GDPR Article 17 — these are cases where the exact string is the meaning. A new or obscure term may have never appeared in the embedding model's training data, so its vector representation is vague or outright wrong. BM25 doesn't care about training data. If the string appears in the document, it's a match.

For a RAG (Retrieval-Augmented Generation) system, this matters enormously. Users querying a company's internal knowledge base often type exact product names, ticket IDs, API method names, or policy references. Those queries need exact-match retrieval, and embedding-only pipelines routinely fail them. BM25 catches what vectors miss.

  • Elastic / Lucene users: BM25 has been the default similarity since Elasticsearch 5.0. You're already using it, and understanding it tells you when and how to tune k1 and b.
  • RAG builders: Adding a BM25 pass alongside your embedding retriever is one of the cheapest and most reliable ways to improve recall on exact-term queries.
  • Anyone debugging search quality: When results look wrong for specific queries, the first diagnostic question is often "was this a keyword match problem or a semantic match problem?" Knowing BM25's mechanics helps you answer that.

BM25 also has hard practical advantages: it's fast, it needs no GPU, it needs no model to download or maintain, and it's been battle-tested in every major search system for decades. That's a compelling case for keeping it around even when you have a full embedding pipeline.

How it works

BM25 computes a score for each document in your collection relative to a query. The score is the sum over every query term of a per-term contribution. Each term's contribution comes from three components working together.

Component 1 — IDF: rare terms matter more

IDF stands for Inverse Document Frequency. The idea is that a word appearing in only 2 documents out of 10,000 is a strong signal of topic — it probably means the document is actually about that thing. A word appearing in 9,500 documents is noise — it's a generic word that reveals almost nothing. BM25's IDF formula boosts terms that appear in fewer documents and dampens terms that are everywhere.
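To make the rare-versus-common contrast concrete, here is a minimal sketch of the IDF term using the two document counts from the paragraph above. It assumes the smoothed form ln(1 + (N - n + 0.5) / (n + 0.5)) found in Lucene-style implementations; other libraries smooth slightly differently, so treat the exact numbers as illustrative.

```python
import math

def idf(num_docs: int, docs_with_term: int) -> float:
    """Smoothed IDF (Lucene-style variant, assumed here for illustration)."""
    return math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

N = 10_000
rare = idf(N, 2)        # term found in 2 of 10,000 documents
common = idf(N, 9_500)  # term found in 9,500 of 10,000 documents

print(f"rare term IDF:   {rare:.3f}")
print(f"common term IDF: {common:.3f}")
```

Under this variant the rare term is weighted more than a hundred times as heavily as the near-universal one, which is the "signal versus noise" distinction expressed as a number.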

Component 2 — TF with saturation: diminishing returns

TF stands for Term Frequency — how many times the query word appears in a specific document. More occurrences suggest a closer focus on the topic. But BM25 applies saturation: the score boost from the 50th occurrence of a word is nearly nothing compared to the first occurrence. This is the key improvement over simpler TF-IDF: a document can't game the ranking by keyword-stuffing, because returns diminish rapidly. The parameter k1 (typically set to 1.2–2.0) controls how quickly saturation kicks in. A higher k1 rewards more occurrences for longer before flattening.
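The saturation curve is easier to see with numbers. The sketch below evaluates only the term-frequency factor of BM25 and assumes a document of exactly average length, so the length-normalization term drops out. The function name tf_component is illustrative, not part of any library.

```python
def tf_component(tf: float, k1: float) -> float:
    """Saturating term-frequency factor, with length normalization left out."""
    return tf * (k1 + 1) / (tf + k1)

for k1 in (1.2, 2.0):
    values = [tf_component(tf, k1) for tf in (1, 2, 5, 10, 50)]
    print(f"k1={k1}: " + ", ".join(f"{v:.2f}" for v in values))
```

The factor is 1.0 at a single occurrence for any k1 and can never exceed k1 + 1, so going from 10 to 50 occurrences adds far less than going from 1 to 2. A larger k1 raises the ceiling, which is why it keeps rewarding repetition for longer.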

Component 3 — Length normalization: fair comparison

A long document naturally contains more words and will mention any given term more often simply because it has more text. BM25 corrects for this by comparing each document's length against the average document length in the collection. A short document that mentions the query term three times is more densely relevant than a long document that does the same. The parameter b (typically 0.75) controls the strength of this normalization. Setting b = 1 applies full normalization; b = 0 ignores length entirely.
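A small sketch shows what b does in practice. It compares a 100-word and a 10,000-word document that both contain the query term three times, in a collection whose average length is assumed to be 1,000 words. The function names and the lengths are made up for illustration.

```python
def length_norm(doc_len: int, avg_doc_len: float, b: float) -> float:
    """Length factor in the BM25 denominator; 1.0 means 'average length'."""
    return 1 - b + b * (doc_len / avg_doc_len)

def tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency factor including length normalization."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avg_doc_len, b))

avg_len = 1_000
for b in (0.0, 0.75, 1.0):
    short_doc = tf_component(3, 100, avg_len, b=b)
    long_doc = tf_component(3, 10_000, avg_len, b=b)
    print(f"b={b}: short={short_doc:.3f}  long={long_doc:.3f}")
```

With b = 0 the two documents score identically; as b rises, the short document pulls ahead and the long one is pushed down.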

The formula in one line

For a single query term t, a document d contributes:

score(t, d) = IDF(t) * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * (dl / avgdl)))

Here TF is the number of times t occurs in d, dl is the length of d, and avgdl is the average document length across the collection. Sum this contribution across all terms in the query and you have the full BM25 score for d.
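The formula above translates almost directly into code. What follows is a minimal from-scratch sketch, not a production implementation: the function name bm25_scores and the toy documents are invented for illustration, and because the text does not pin down an IDF variant, the code assumes Lucene-style smoothing.

```python
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query, one term contribution at a time."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        dl = len(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            # Assumed IDF variant (Lucene-style smoothing).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = 1 - b + b * (dl / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "bm25 ranks documents by keyword overlap".split(),
    "embeddings match meaning rather than exact words".split(),
    "hybrid search combines bm25 with embeddings".split(),
]
scores = bm25_scores(["bm25", "keyword"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The first document matches both query terms and scores highest, the third matches only "bm25", and the second matches nothing and scores zero.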

BM25 vs TF-IDF: what changed and why

BM25 is often described as "a better TF-IDF," and that's accurate. Both score documents by combining term frequency with inverse document frequency. But TF-IDF has two important flaws that BM25 fixes.

First, plain TF-IDF has no saturation. Every additional mention of a query term adds a fixed increment to the score, which means a document that repeats a word 500 times scores enormously higher than one that uses it 10 times — even if the 10-time document is clearly more focused and relevant. This makes TF-IDF easy to game and noisy in practice.

Second, plain TF-IDF ignores document length. A term appearing 10 times in a 50-word paragraph scores the same as a term appearing 10 times in a 5,000-word article, even though the paragraph is far more focused. BM25's length normalization corrects this.

| Feature | TF-IDF | BM25 |
| --- | --- | --- |
| Term frequency saturation | No (linear increase) | Yes (diminishing returns via k1) |
| Document length normalization | No | Yes (controlled by b) |
| Tunable parameters | None | k1 and b |
| Performance on real-world search | Baseline | Consistently better |
| Complexity | Very simple | Slightly more complex |
| Default in Elasticsearch | Before v5.0 | Since v5.0 (current default) |

For practical purposes, if you are building a new system, use BM25. TF-IDF is mostly relevant today as a conceptual building block that explains what BM25 improves on.

Using BM25 in Python

The most common Python library for BM25 is rank-bm25, which implements Okapi BM25 (and several variants: BM25+, BM25L) with a minimal API. LangChain's BM25 retriever wraps it, and LlamaIndex ships a comparable BM25 retriever whose backing library depends on the version, so you may be using BM25 without realizing it. Here is a direct, minimal example so you can see exactly what happens.

bash
pip install rank-bm25

bm25_example.py (python)
from rank_bm25 import BM25Okapi

# Your document corpus (in practice these would be chunked text passages)
corpus = [
    "BM25 is a keyword ranking algorithm used in search engines.",
    "Elasticsearch uses BM25 as its default similarity scoring.",
    "Vector search uses embeddings to match meaning, not just words.",
    "Hybrid search combines BM25 keyword matching with vector retrieval.",
    "The k1 parameter in BM25 controls term frequency saturation.",
]

# Tokenise each document. A lowercased whitespace split leaves punctuation attached
# ("engines." will not match "engines"), so use a real tokeniser in production.
tokenized_corpus = [doc.lower().split() for doc in corpus]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

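# Tokenise the query the same way as the corpus
query = "BM25 keyword search"
tokenized_query = query.lower().split()

# Score all documents against the query (one score per document, in corpus order)
scores = bm25.get_scores(tokenized_query)
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {doc}")

# Retrieve the top-2 results
top_2 = bm25.get_top_n(tokenized_query, corpus, n=2)
for i, doc in enumerate(top_2, start=1):
    print(f"Rank {i}: {doc}")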

In a real RAG pipeline you would combine this with an embedding retriever and merge the two ranked lists, typically with Reciprocal Rank Fusion (RRF). The code above shows the BM25 half. LangChain's BM25Retriever wraps rank-bm25 under the hood, and LlamaIndex offers a comparable BM25 retriever, so switching from manual BM25 to a framework is straightforward once you understand the primitives.
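The merging step can be sketched in a few lines. Reciprocal Rank Fusion gives each document 1 / (k + rank) for every list it appears in and sums those contributions, with k conventionally set to 60. The document IDs and the two rankings below are placeholders standing in for real retriever output, and the function name is illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs; each list is ordered best-first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs from the two retrievers (IDs are placeholders).
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

merged = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(merged)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because only ranks are used, the BM25 scores and the vector similarities never need to be put on a common scale. Documents found by both retrievers (doc_a and doc_b here) end up ahead of those found by only one.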

Going deeper

BM25's two tunable parameters give you meaningful levers once you have enough query traffic to evaluate.

k1 controls term frequency saturation. If your corpus is full of dense technical documents where repeated mentions genuinely signal stronger relevance, push k1 up toward 2.0. If your corpus is conversational or users type long naturalistic queries, pull it down toward 1.2 so that saturation kicks in faster.

b controls length normalization. If your documents are all roughly the same length (chunked uniformly, for example), b doesn't matter much. If document lengths vary widely, a lower b (say 0.5) stops short fragments from being artificially promoted.

Most teams start with k1=1.2, b=0.75 and tune empirically with an evaluation set.

There is a family of BM25 variants worth knowing. BM25+ adds a small constant lower bound to the term-frequency component of every query term a document contains, so that very long documents are not penalized so heavily that they score about the same as documents missing the term entirely. BM25L adjusts the normalization to better handle very long documents in collections with high length variance. The rank-bm25 Python library implements all three. Elasticsearch's implementation is Okapi BM25 with a slightly modified IDF formula that avoids negative scores on very common terms.
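The IDF modification is easy to check numerically. The sketch below compares the classic Robertson/Spärck Jones form with the "add 1 inside the logarithm" form used by Lucene-style implementations. The function names and the collection size of 1,000 documents are assumptions made for illustration.

```python
import math

def idf_classic(n_docs: int, df: int) -> float:
    """Classic form; turns negative once a term is in more than half the documents."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def idf_lucene_style(n_docs: int, df: int) -> float:
    """Adding 1 inside the log keeps the value positive for any document frequency."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

n_docs = 1_000
for df in (10, 500, 900):
    print(f"df={df}: classic={idf_classic(n_docs, df):.3f}  "
          f"lucene-style={idf_lucene_style(n_docs, df):.3f}")
```

For a term that appears in 900 of 1,000 documents the classic form goes negative, which would make a match lower the score, while the modified form stays small but positive.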

A newer class of methods called learned sparse retrieval extends BM25's conceptual model with neural training. SPLADE (Sparse Lexical and Expansion Model) learns to produce sparse term-weight vectors where the weights are not simple frequencies — the model can add terms the document never contained if they are semantically related. The result is a sparse vector that can be stored and searched in a standard inverted index (the same infrastructure BM25 uses) while capturing some of the semantic breadth of dense embeddings. SPLADE often outperforms both pure BM25 and pure dense retrieval on benchmark datasets, at the cost of requiring a neural model at both index time and query time.

Even in 2025, papers evaluating RAG systems frequently find that naive BM25 retrieval outperforms, or is competitive with, much more expensive embedding-based retrieval — especially on domain-specific corpora where the embedding model was not pretrained on similar text. The lesson is not that BM25 is better than embeddings, but that the best architecture almost always involves both. Think of BM25 as the precision anchor in a retrieval pipeline: fast, interpretable, and reliably excellent on the class of queries that embeddings struggle with most.

FAQ

What does BM25 stand for?

BM25 stands for Best Matching 25. It is commonly described as the 25th iteration in a series of probabilistic ranking functions that grew out of the work of Stephen Robertson and Karen Spärck Jones, and it was implemented in the Okapi system at City University London in the 1990s. The "Best Matching" part describes its goal: surface the most relevant documents for a given query.

Is BM25 still used in 2025 or has it been replaced by vector search?

BM25 is very much still in use. Elasticsearch, Apache Lucene, and most major search platforms use it as their default ranking algorithm. In RAG systems, BM25 is commonly paired with vector search in hybrid pipelines because it handles exact-term queries — error codes, product IDs, proper names — far better than embeddings alone.

What are the k1 and b parameters in BM25?

k1 controls term frequency saturation: how quickly the score boost from repeated occurrences flattens out. Typical values are 1.2 to 2.0. b controls document length normalization: how strongly long documents are penalized relative to short ones, from 0 (no adjustment) to 1 (full normalization). The conventional default is b=0.75. Both can be tuned empirically for your specific corpus.

What is the difference between BM25 and TF-IDF?

BM25 and TF-IDF both score documents using term frequency and inverse document frequency, but BM25 adds two key improvements: term frequency saturation (repeated words have diminishing returns) and document length normalization (short documents aren't penalized against long ones for having fewer raw occurrences). BM25 consistently outperforms plain TF-IDF on real-world search tasks and has been the default in Elasticsearch since version 5.0.

How do I combine BM25 with vector search in a RAG pipeline?

Run both retrievers in parallel against the same query, then merge the two ranked lists using Reciprocal Rank Fusion (RRF). RRF ignores raw scores, which live on incompatible scales. Instead, each document receives 1 / (k + rank) for every list it appears in, summed across lists, with k conventionally set to 60. Documents that rank highly in either list are promoted; those that rank highly in both rise to the top. Many search platforms (Elasticsearch, Weaviate, Qdrant, Redis) have built-in hybrid search modes that do this automatically.

Does BM25 work with non-English text?

Yes, BM25 is language-agnostic — it works on any text as long as you tokenize it consistently. The tokenizer (how you split text into terms) matters more than the algorithm itself. For non-English languages, use a language-aware tokenizer that handles stemming, character normalization, and stop words for that language, and apply the same tokenizer at both index time and query time.
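As a rough illustration of "tokenize consistently", here is a minimal standard-library sketch that normalizes Unicode, casefolds, and splits on runs of word characters. It is an assumption-laden baseline rather than a language-aware tokenizer: it does no stemming or stop-word removal, and it will not segment languages written without spaces, such as Chinese or Japanese.

```python
import re
import unicodedata

def tokenize(text: str) -> list[str]:
    """Normalize Unicode, casefold, and split on runs of word characters."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    return re.findall(r"\w+", normalized)

# Use the same function at index time and at query time.
print(tokenize("Straße ÜBER Café"))     # ['strasse', 'über', 'café']
print(tokenize("BM25 Keyword-Search"))  # ['bm25', 'keyword', 'search']
```

The important design choice is that documents and queries both pass through this one function, so "Straße" in a document and "strasse" in a query produce the same term.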

Further reading