AI/TLDR

BM25S

Ultra-fast BM25 keyword search in pure Python, powered by NumPy and SciPy

Overview

BM25S (short for BM25-Sparse) is a Python library that ranks documents against a query using the BM25 ranking function, the same lexical scoring method behind search services like Elasticsearch. It is written in pure Python on top of NumPy and SciPy, with no dependency on Java or PyTorch. Scores are computed eagerly at indexing time and stored in sparse matrices, so scoring at query time is fast.
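
To make the idea of eagerly computed sparse scores concrete, here is a minimal pure-Python sketch. It is not BM25S's actual implementation: it stands in a dictionary of dictionaries for a sparse matrix, uses a naive whitespace tokenizer, and assumes a Lucene-style BM25 variant with k1 = 1.5 and b = 0.75. The function names `build_index` and `search` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(corpus, k1=1.5, b=0.75):
    """Precompute a BM25 score for every (term, document) pair that co-occurs."""
    docs = [text.lower().split() for text in corpus]  # naive tokenizer (assumption)
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    # Sparse "matrix": term -> {doc_id: score}. Pairs that never co-occur are not stored.
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        for term, tf in Counter(tokens).items():
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            index[term][doc_id] = idf * tf / (tf + length_norm)
    return index


def search(index, query, k=2):
    """Sum the stored scores of the query terms and return the top-k (doc_id, score) pairs."""
    totals = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
index = build_index(corpus)
top = search(index, "fish purr")
print(top)
```

All the arithmetic happens in `build_index`; `search` only looks up and adds stored numbers, which is why the eager approach keeps query time low.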

It is aimed at developers who need keyword (lexical) search without running a separate search server. In a RAG or hybrid-search setup, BM25S handles the keyword side: you pair its exact-term matches with a dense vector retriever, so documents can be found both by meaning and by the literal words in the query.

You install it with pip and index a corpus in a few lines. Optional lightweight dependencies add stemming (via PyStemmer) and a numba JIT backend for extra speed on larger datasets.

What it does

  • Pure-Python BM25 built on NumPy with no Java or PyTorch dependency
  • Sparse-matrix score storage for fast scoring at query time
  • Built-in tokenizer with stopword removal and optional stemming via PyStemmer
  • Save and load indexes to disk, optionally bundling the corpus with the model
  • Corpus entries can be plain strings or dictionaries, so retrieval returns your own metadata objects
  • Optional numba backend (v0.2.0+) for added speedup on larger datasets

Getting started

Install the package, then tokenize a corpus, index it, and retrieve the top matches for a query.

Install BM25S

Install from PyPI. The optional core extras add JSON loading, a progress bar, stemming, and JIT compilation.

pip install bm25s
# optional, recommended extras:
pip install "bm25s[core]"

Index a corpus and retrieve

Tokenize your documents (optionally with a stemmer), build the BM25 index, then tokenize a query and retrieve the top-k results as (doc ids, scores) arrays.

import bm25s
import Stemmer  # optional: PyStemmer, used for stemming

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Tokenize the corpus, dropping English stopwords and stemming each token
stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Build the BM25 index
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Tokenize the query the same way as the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer)

# Both arrays have shape (n_queries, k); results holds doc ids here
results, scores = retriever.retrieve(query_tokens, k=2)

for i in range(results.shape[1]):
    doc_id, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): doc {doc_id}")

Save and reload the index

Persist the index to a directory (optionally with the corpus) and load it later instead of re-indexing.

retriever.save("animal_index_bm25", corpus=corpus)

reloaded = bm25s.BM25.load("animal_index_bm25", load_corpus=True)

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Add a keyword-search retriever to a RAG pipeline so literal terms and IDs are matched, not just semantic similarity
  • Build the lexical half of a hybrid-search system, combining BM25S scores with a dense vector retriever
  • Run BM25 ranking locally without standing up an Elasticsearch or other search server
  • Return your own metadata objects from search by indexing dictionary corpus entries with id, title, and text fields
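
For the hybrid-search case, one common way to merge a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below is one possible approach, not something BM25S provides: the function name, the constant k = 60, and both input rankings are illustrative assumptions. RRF works on rank positions rather than raw scores, which sidesteps the fact that BM25 scores and vector similarities live on different scales.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of doc ids into one ranking.

    Each document earns 1 / (k + rank) for every list it appears in (rank starts at 1).
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs: doc ids ordered best-first by each retriever.
bm25_ranking = [3, 0, 2]   # e.g. from a BM25S retriever
dense_ranking = [0, 1, 3]  # e.g. from a vector index

hybrid = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(hybrid)
```

Documents that appear near the top of both lists rise to the front, while a document found by only one retriever still stays in the fused ranking.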

How BM25S compares

BM25S alongside other open-source rerank, search & hybrid tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Elasticsearch | ★ 77.1k | Distributed search and analytics engine with a built-in vector database for dense/sparse embeddings and hybrid keyword-plus-semantic retrieval.
Meilisearch Cloud | ★ 58.2k | Managed cloud for the Meilisearch engine, combining fast full-text search with hybrid, semantic, and multimodal vector search.
Typesense Cloud | ★ 26.1k | Managed hosting for the Typesense search engine, offering typo-tolerant keyword search plus vector and semantic search via a simple API.
Tantivy | ★ 15.4k | A fast full-text search engine library in Rust that provides BM25 keyword search for the lexical half of hybrid retrieval.
FlagEmbedding | ★ 11.8k | BAAI's retrieval toolkit that provides the BGE embedding and cross-encoder reranker models used widely in RAG pipelines.
Vespa | ★ 7k | A search and serving engine that natively combines vector, keyword (BM25), and structured search with built-in ranking for large-scale retrieval.
RAGatouille | ★ 3.9k | A wrapper that makes it easy to train and use ColBERT late-interaction retrieval inside RAG pipelines.
BM25S | ★ 1.7k | Ultra-fast BM25 keyword search in pure Python, powered by NumPy and SciPy