What Is BM25? The Keyword Algorithm Every RAG Stack Still Uses

You'll understand how BM25 ranks documents by keywords and why it remains essential alongside embeddings in modern RAG.

Beginner | 12 min read | Updated 2026-06-12

In plain English

BM25 is a keyword-based scoring algorithm that answers one question: given a search query, which documents in a collection are most likely to be relevant? It grew out of the probabilistic retrieval work of Stephen Robertson, Karen Spärck Jones, and colleagues, and took its familiar form in the Okapi search system at City University London in the 1990s. Despite being roughly thirty years old, it remains the default ranking algorithm in Elasticsearch, Apache Lucene, and most production search infrastructure.

Here's the analogy. Imagine you're a librarian asked to find the best books about "neural networks" in a library of ten thousand volumes. Your instinct would probably go something like this: books that mention "neural networks" more often are probably more relevant — but a book that uses the phrase on every page starts to feel repetitive, so there are diminishing returns. Meanwhile, a book that mentions a rare, specific term like "backpropagation" is more precisely on-topic than one that just mentions common words like "learning" or "data." And a short pamphlet that mentions "neural networks" three times probably covers it more densely than a thick textbook that mentions it three times across eight hundred pages. BM25 formalizes exactly those three librarian instincts — term frequency with saturation, rare-term weighting, and document-length normalization — into a single score.

The name stands for Best Matching 25. The number is usually explained as marking the 25th refinement in a series of probabilistic ranking experiments the researchers ran. "Best Matching" describes the goal: given a query, surface the best-matching documents.

Why it matters

When people discovered that embedding models could turn text into semantic vectors, many assumed keyword search was dead. If you can match meaning, why bother matching words? It turned out that vector search and keyword search fail in opposite and complementary ways — and BM25 is specifically excellent at the cases where embeddings stumble.

Embeddings struggle with exact, rare terms. A product SKU like T7-449XB, an error code like ERR_SSL_VERSION_OR_CIPHER_MISMATCH, a person's name, a model number, or a regulatory identifier like GDPR Article 17 — these are cases where the exact string is the meaning. A new or obscure term may have never appeared in the embedding model's training data, so its vector representation is vague or outright wrong. BM25 doesn't care about training data. If the string appears in the document, it's a match.

For a RAG (Retrieval-Augmented Generation) system, this matters enormously. Users querying a company's internal knowledge base often type exact product names, ticket IDs, API method names, or policy references. Those queries need exact-match retrieval, and embedding-only pipelines routinely fail them. BM25 catches what vectors miss.

Elastic / Lucene users: BM25 has been the default since Elasticsearch 5.0. You're already using it, and understanding it tells you when and how to tune k1 and b.
RAG builders: Adding a BM25 pass alongside your embedding retriever is one of the cheapest and most reliable ways to improve recall on exact-term queries.
Anyone debugging search quality: When results look wrong for specific queries, the first diagnostic question is often "was this a keyword match problem or a semantic match problem?" Knowing BM25's mechanics helps you answer that.

BM25 also has hard practical advantages: it's fast, it needs no GPU, it needs no model to download or maintain, and it's been battle-tested in every major search system for decades. That's a compelling case for keeping it around even when you have a full embedding pipeline.

How it works

BM25 computes a score for each document in your collection relative to a query. The score is the sum over every query term of a per-term contribution. Each term's contribution comes from three components working together.

// How BM25 scores a document

Query terms: split the query into individual words
IDF weight: rare terms score higher than common ones
TF saturation: more occurrences help, but with diminishing returns
Length normalization: short docs score higher per occurrence than long ones
BM25 score: sum across all query terms, then rank by this

Component 1 — IDF: rare terms matter more

IDF stands for Inverse Document Frequency. The idea is that a word appearing in only 2 documents out of 10,000 is a strong signal of topic — it probably means the document is actually about that thing. A word appearing in 9,500 documents is noise — it's a generic word that reveals almost nothing. BM25's IDF formula boosts terms that appear in fewer documents and dampens terms that are everywhere.
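To make the rare-versus-common contrast concrete, here is a minimal sketch using the document counts from the paragraph above. It assumes the Lucene-style IDF variant (with a `+ 1` inside the logarithm so the value never goes negative); other implementations use slightly different formulas, and the function name `bm25_idf` is ours, not from any library.

```python
import math

def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF; the +1 inside the log keeps it non-negative."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

N = 10_000                   # documents in the collection
rare = bm25_idf(N, 2)        # term found in 2 of 10,000 documents
common = bm25_idf(N, 9_500)  # term found in 9,500 of 10,000 documents

print(f"rare term IDF:   {rare:.3f}")
print(f"common term IDF: {common:.3f}")
```

The rare term's weight comes out far larger than the common term's, which is the "rare terms matter more" instinct expressed as a number.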

Component 2 — TF with saturation: diminishing returns

TF stands for Term Frequency — how many times the query word appears in a specific document. More occurrences suggest a closer focus on the topic. But BM25 applies saturation: the score boost from the 50th occurrence of a word is nearly nothing compared to the first occurrence. This is the key improvement over simpler TF-IDF: a document can't game the ranking by keyword-stuffing, because returns diminish rapidly. The parameter k1 (typically set to 1.2–2.0) controls how quickly saturation kicks in. A higher k1 rewards more occurrences for longer before flattening.
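The saturation curve is easy to see numerically. The sketch below assumes the standard Okapi term-frequency factor and, to isolate the effect of k1, a document of exactly average length (so length normalization drops out). The helper name `tf_component` is illustrative only.

```python
def tf_component(tf: int, k1: float) -> float:
    """BM25 term-frequency factor for a document of exactly average length."""
    return tf * (k1 + 1) / (tf + k1)

for k1 in (1.2, 2.0):
    row = [round(tf_component(tf, k1), 3) for tf in (1, 2, 5, 10, 50)]
    # The factor can approach, but never exceed, k1 + 1
    print(f"k1={k1}: {row}  (ceiling = {k1 + 1})")
```

Each row climbs quickly for the first few occurrences and then flattens toward its ceiling of k1 + 1; with the higher k1, the climb lasts longer before flattening.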

Component 3 — Length normalization: fair comparison

A long document naturally contains more words and will mention any given term more often simply because it has more text. BM25 corrects for this by comparing each document's length against the average document length in the collection. A short document that mentions the query term three times is more densely relevant than a long document that does the same. The parameter b (typically 0.75) controls the strength of this normalization. Setting b = 1 applies full normalization; b = 0 ignores length entirely.
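A small sketch of how b changes the picture, assuming the standard Okapi normalization factor (1 - b + b * dl / avgdl) and made-up lengths: a 100-token passage and a 5,000-token document in a collection averaging 500 tokens, each mentioning the term three times. The function names are ours.

```python
def length_norm(doc_len: int, avg_doc_len: float, b: float) -> float:
    """The (1 - b + b * dl / avgdl) factor that scales k1 in the denominator."""
    return 1 - b + b * (doc_len / avg_doc_len)

def normalized_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency factor including length normalization."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avg_doc_len, b))

avg_len = 500  # illustrative average document length, in tokens
for b in (0.0, 0.75, 1.0):
    short = normalized_tf(3, doc_len=100, avg_doc_len=avg_len, b=b)
    long_ = normalized_tf(3, doc_len=5000, avg_doc_len=avg_len, b=b)
    print(f"b={b}: short doc {short:.3f}, long doc {long_:.3f}")
```

With b = 0 the two documents score identically; as b rises toward 1, the gap in favor of the short document widens.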

The formula in one line

The contribution of a single query term t to the score of document d is: IDF(t) * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * (dl / avgdl))). Here TF is the number of times t occurs in d, dl is the length of d in tokens, and avgdl is the average document length across the collection. Sum this across all terms in the query and you have the full BM25 score.
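Putting the three components together, a from-scratch scorer fits in about twenty lines. This is a teaching sketch, not a replacement for a real search library: it assumes pre-tokenized documents, uses the Lucene-style IDF with `+ 1` inside the log, and the function name `bm25_scores` and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = 1 - b + b * len(d) / avgdl  # length normalization factor
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "bm25 ranks documents by keyword overlap".split(),
    "vector search matches meaning with embeddings".split(),
    "hybrid search combines bm25 and vector search".split(),
]
scores = bm25_scores("bm25 keyword".split(), docs)
for doc, s in sorted(zip(docs, scores), key=lambda p: p[1], reverse=True):
    print(f"{s:.3f}  {' '.join(doc)}")
```

The first document matches both query terms and ranks highest, the third matches only "bm25", and the second matches nothing and scores zero.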

BM25 vs TF-IDF: what changed and why

BM25 is often described as "a better TF-IDF," and that's accurate. Both score documents by combining term frequency with inverse document frequency. But TF-IDF has two important flaws that BM25 fixes.

First, textbook TF-IDF has no saturation. Every additional mention of a query term adds a fixed increment to the score, which means a document that repeats a word 500 times scores enormously higher than one that uses it 10 times, even if the 10-time document is clearly more focused and relevant. Some implementations dampen this with a logarithm or square root of the term frequency, but the score still grows without a ceiling. This makes TF-IDF easy to game and noisy in practice.

Second, textbook TF-IDF ignores document length. A term appearing 10 times in a 50-word paragraph scores the same as a term appearing 10 times in a 5,000-word article, even though the paragraph is far more focused. BM25's length normalization corrects this.

Feature	TF-IDF	BM25
Term frequency saturation	No — linear increase	Yes — diminishing returns via k1
Document length normalization	No	Yes — controlled by b
Tunable parameters	None	k1 and b
Performance on real-world search	Baseline	Consistently better
Complexity	Very simple	Slightly more complex
Default in Elasticsearch	Before v5.0	Since v5.0 (current default)

For practical purposes, if you are building a new system, use BM25. TF-IDF is mostly relevant today as a conceptual building block that explains what BM25 improves on.

BM25 vs vector search: different jobs

BM25 and vector (embedding) search are not competitors in the sense that one replaces the other. They are complementary tools that excel in opposite situations.

// BM25 vs vector search

BM25 (keyword / sparse)

Matches exact words in the query
No model — no training, no GPU needed
Excellent for error codes, IDs, names, acronyms
Blind to synonyms and paraphrases
Fast, cheap, and zero inference cost
Deterministic — same query, same results

Vector search (semantic / dense)

Matches meaning even if words differ
Requires an embedding model + vector index
Excellent for questions phrased differently from the answer
Can miss rare or unseen exact terms
Slower, higher cost, needs model maintenance
Sensitive to embedding model quality and domain

The production answer in most RAG stacks is to run both in parallel and merge the results. This is called hybrid search. You can't simply add the raw scores: a BM25 score of 14.3 and a cosine similarity of 0.87 are on incompatible scales. The standard solution is Reciprocal Rank Fusion (RRF): each retriever returns a ranked list, and each document earns 1 / (60 + rank) from every list it appears in, with rank starting at 1. Its fused score is the sum of those contributions. A document that ranks highly in either list gets boosted; one that ranks highly in both rises to the very top. The constant 60 is the value recommended in the original RRF paper, and it softens how much the top rank dominates.
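RRF is short enough to write by hand. The sketch below assumes each retriever hands back a best-first list of document IDs; the function name `reciprocal_rank_fusion` and the document IDs are hypothetical, and the two rankings are invented stand-ins for real retriever output.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; each list is ordered best-first."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank) to the fused score
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc_sku", "doc_faq", "doc_guide"]      # hypothetical BM25 output
vector_ranking = ["doc_guide", "doc_sku", "doc_intro"]  # hypothetical vector output

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Here doc_sku (ranks 1 and 2) edges out doc_guide (ranks 3 and 1), and both finish ahead of the documents that appear in only one list.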

Most major platforms offer some form of hybrid search: Elasticsearch, OpenSearch, Weaviate, Qdrant, Meilisearch, Redis, and MongoDB Atlas all have built-in hybrid modes, though the keyword scorer and the fusion method (RRF or a score-normalization scheme) vary by platform and version. The practical guidance is to default to hybrid unless you have a specific reason not to: the added cost is usually small, and it reliably catches queries that would slip through a single-method setup.

Using BM25 in Python

The most common Python library for BM25 is rank-bm25, which implements Okapi BM25 and two variants (BM25+ and BM25L) with a minimal API. LangChain's BM25Retriever wraps it, and LlamaIndex offers a comparable BM25 retriever, so you may be using BM25 without realizing it. Here is a direct, minimal example so you can see exactly what happens.

```bash
pip install rank-bm25
```

bm25_example.py

```python
from rank_bm25 import BM25Okapi

# Your document corpus (in practice these would be chunked text passages)
corpus = [
    "BM25 is a keyword ranking algorithm used in search engines.",
    "Elasticsearch uses BM25 as its default similarity scoring.",
    "Vector search uses embeddings to match meaning, not just words.",
    "Hybrid search combines BM25 keyword matching with vector retrieval.",
    "The k1 parameter in BM25 controls term frequency saturation.",
]

# Tokenise each document (lowercased whitespace split; punctuation stays
# attached to words, so use a real tokeniser in prod)
tokenized_corpus = [doc.lower().split() for doc in corpus]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Tokenise the query exactly the same way as the corpus
query = "BM25 keyword search"
tokenized_query = query.lower().split()

# Score all documents against the query (one score per document, in corpus order)
scores = bm25.get_scores(tokenized_query)
for doc, score in zip(corpus, scores):
    print(f"{score:.3f}  {doc}")

# Retrieve the top-2 results
top_2 = bm25.get_top_n(tokenized_query, corpus, n=2)
for i, doc in enumerate(top_2, start=1):
    print(f"Rank {i}: {doc}")
```

In a real RAG pipeline you would combine this with an embedding retriever and use RRF to merge the two ranked lists. The code above shows the BM25 half. LangChain's BM25Retriever wraps rank-bm25 under the hood, and LlamaIndex provides its own BM25 retriever with a similar interface, so switching from manual BM25 to a framework is straightforward once you understand the primitives.

Going deeper

BM25's two tunable parameters give you meaningful levers once you have enough query traffic to evaluate. k1 controls term frequency saturation. If your corpus is full of dense technical documents where repeated mentions genuinely signal stronger relevance, push k1 up toward 2.0. If your corpus is conversational or users type long naturalistic queries, pull it down toward 1.2 — you want saturation to kick in faster. b controls length normalization. If your documents are all roughly the same length (chunked uniformly, for example), b doesn't matter much. If document lengths vary widely, a lower b (say 0.5) stops short fragments from being artificially promoted. Most teams start with k1=1.2, b=0.75 and tune empirically with an evaluation set.
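"Tune empirically with an evaluation set" can be as simple as a grid search. The sketch below is one way to do it under heavy assumptions: a toy corpus, three hand-labeled queries, mean reciprocal rank as the metric, and a small from-scratch scorer so it runs without dependencies. All names and data here are hypothetical, and a real evaluation needs far more queries before the winning pair means anything.

```python
import math
from collections import Counter
from itertools import product

def bm25_rank(query, docs, k1, b):
    """Return document indices sorted best-first by BM25 score."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))

    def score(d):
        tf = Counter(d)
        norm = 1 - b + b * len(d) / avgdl
        return sum(
            math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
            for t in query if tf[t]
        )

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

def mean_reciprocal_rank(eval_set, docs, k1, b):
    """eval_set: list of (tokenized query, index of the relevant document)."""
    total = 0.0
    for query, relevant in eval_set:
        ranking = bm25_rank(query, docs, k1, b)
        total += 1.0 / (ranking.index(relevant) + 1)
    return total / len(eval_set)

# Hypothetical toy corpus and relevance labels
docs = [d.split() for d in (
    "reset your password from the account settings page",
    "password policy requires twelve characters and a password manager "
    "is recommended for every password",
    "billing settings let you update the payment method",
)]
eval_set = [
    ("reset password".split(), 0),
    ("payment method".split(), 2),
    ("password policy".split(), 1),
]

grid = product((0.8, 1.2, 1.6, 2.0), (0.0, 0.5, 0.75, 1.0))
results = {(k1, b): mean_reciprocal_rank(eval_set, docs, k1, b) for k1, b in grid}
best = max(results, key=results.get)
print("best (k1, b):", best, "MRR:", round(results[best], 3))
```

On a corpus this small many settings will tie, which is itself a useful reminder: parameter differences only become visible with a realistically sized evaluation set.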

There is a family of BM25 variants worth knowing. BM25+ adds a small constant (delta) to the term frequency component of every matching term, so a very long document that does contain a query term is never scored almost the same as one that lacks it; plain BM25's length normalization can over-penalize long documents in exactly that way. BM25L adjusts the normalization to better handle very long documents in collections with high length variance. The rank-bm25 Python library implements all three. Elasticsearch's implementation is Okapi BM25 with a slightly modified IDF formula that avoids negative scores on very common terms.
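For reference, the BM25+ adjustment as published is a constant delta added to the term-frequency component of terms that actually occur in the document. The sketch below compares the two components for a single occurrence in progressively longer documents. It assumes delta = 1.0 and an average length of 500 tokens, and the function names are ours rather than any library's.

```python
def tf_bm25(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Plain Okapi BM25 term-frequency component."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def tf_bm25_plus(tf, doc_len, avg_len, k1=1.2, b=0.75, delta=1.0):
    """BM25+ component: delta is added only when the term occurs (tf > 0)."""
    return tf_bm25(tf, doc_len, avg_len, k1, b) + delta if tf > 0 else 0.0

for doc_len in (500, 5_000, 50_000):
    plain = tf_bm25(1, doc_len, avg_len=500)
    plus = tf_bm25_plus(1, doc_len, avg_len=500)
    print(f"len={doc_len}: BM25 {plain:.3f}, BM25+ {plus:.3f}")
```

As the document grows, the plain component shrinks toward zero while the BM25+ component stays above delta, preserving a clear gap between "contains the term" and "does not".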

A newer class of methods called learned sparse retrieval extends BM25's conceptual model with neural training. SPLADE (Sparse Lexical and Expansion Model) learns to produce sparse term-weight vectors where the weights are not simple frequencies — the model can add terms the document never contained if they are semantically related. The result is a sparse vector that can be stored and searched in a standard inverted index (the same infrastructure BM25 uses) while capturing some of the semantic breadth of dense embeddings. SPLADE often outperforms both pure BM25 and pure dense retrieval on benchmark datasets, at the cost of requiring a neural model at both index time and query time.

Even in 2025, papers evaluating RAG systems frequently find that naive BM25 retrieval outperforms, or is competitive with, much more expensive embedding-based retrieval — especially on domain-specific corpora where the embedding model was not pretrained on similar text. The lesson is not that BM25 is better than embeddings, but that the best architecture almost always involves both. Think of BM25 as the precision anchor in a retrieval pipeline: fast, interpretable, and reliably excellent on the class of queries that embeddings struggle with most.

FAQ

What does BM25 stand for?

BM25 stands for Best Matching 25. The number is usually explained as marking the 25th iteration in a series of probabilistic ranking experiments by Stephen Robertson, Karen Spärck Jones, and colleagues; the function took its familiar form at City University London in the 1990s. The "Best Matching" part describes its goal: surface the most relevant documents for a given query.

Is BM25 still used in 2025 or has it been replaced by vector search?

BM25 is very much still in use. Elasticsearch, Apache Lucene, and most major search platforms use it as their default ranking algorithm. In RAG systems, BM25 is commonly paired with vector search in hybrid pipelines because it handles exact-term queries — error codes, product IDs, proper names — far better than embeddings alone.

What are the k1 and b parameters in BM25?

k1 controls term frequency saturation — how quickly the score boost from repeated occurrences flattens out. Typical values are 1.2 to 2.0. b controls document length normalization — how much shorter documents are favored. The conventional default is b=0.75. Both can be tuned empirically for your specific corpus.

What is the difference between BM25 and TF-IDF?

BM25 and TF-IDF both score documents using term frequency and inverse document frequency, but BM25 adds two key improvements: term frequency saturation (repeated words have diminishing returns) and document length normalization (short documents aren't penalized against long ones for having fewer raw occurrences). BM25 consistently outperforms plain TF-IDF on real-world search tasks and has been the default in Elasticsearch since version 5.0.

How do I combine BM25 with vector search in a RAG pipeline?

Run both retrievers in parallel against the same query, then merge the two ranked lists using Reciprocal Rank Fusion (RRF). RRF ignores raw scores, which live on incompatible scales, and instead gives each document 1 / (60 + rank) for every list it appears in, summing the contributions. Documents that rank highly in either list are promoted; those that rank highly in both rise to the top. Many search platforms (Elasticsearch, Weaviate, Qdrant, Redis) have built-in hybrid search modes that handle the fusion for you.

Does BM25 work with non-English text?

Yes, BM25 is language-agnostic — it works on any text as long as you tokenize it consistently. The tokenizer (how you split text into terms) matters more than the algorithm itself. For non-English languages, use a language-aware tokenizer that handles stemming, character normalization, and stop words for that language, and apply the same tokenizer at both index time and query time.
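As a small illustration of "same tokenizer at index time and query time", here is a sketch that uses only the standard library: Unicode normalization, case folding, and a split on non-word characters. It does no stemming or stop-word removal, and it will not segment languages written without spaces, so treat it as a baseline rather than a language-aware tokenizer. The function name `tokenize` is ours.

```python
import re
import unicodedata

TOKEN_RE = re.compile(r"\w+")

def tokenize(text: str) -> list[str]:
    """Normalize Unicode, case-fold, and split on non-word characters."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    return TOKEN_RE.findall(normalized)

# Use the same function for documents and for queries
doc_tokens = tokenize("Die Straße ist gesperrt.")
query_tokens = tokenize("STRASSE gesperrt")

print(doc_tokens)
print(query_tokens)
print(set(query_tokens) <= set(doc_tokens))
```

Because both sides go through the same case folding, the query "STRASSE" matches the document's "Straße"; a plain `lower()` on only one side would have missed it.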
