BM25S

Ultra-fast BM25 keyword search in pure Python, powered by NumPy and SciPy

github.com/xhluca/bm25s · ★ 1.7k · bm25s.github.io

Overview

BM25S (short for BM25-Sparse) is a Python library that ranks documents against a query using the BM25 ranking function, the same lexical scoring method behind search services like Elasticsearch. It is written in pure Python on top of NumPy and stores eagerly computed scores in sparse matrices, so scoring at query time is fast and has no dependency on Java or PyTorch.
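To make the idea of eagerly computed scores concrete, here is a minimal pure-Python sketch. It is not BM25S's implementation: it uses one common BM25 variant (a Lucene-style IDF with the classic `k1 + 1` numerator), assumed defaults of `k1=1.5` and `b=0.75`, and a plain dict of dicts standing in for a sparse matrix. The function names `build_eager_index` and `search` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_eager_index(tokenized_docs, k1=1.5, b=0.75):
    """Precompute a BM25 score for every (term, document) pair that occurs."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    # term -> {doc_id: score}; only nonzero entries are stored (the sparse part).
    index = defaultdict(dict)
    for doc_id, doc in enumerate(tokenized_docs):
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for term, tf in Counter(doc).items():
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            index[term][doc_id] = idf * tf * (k1 + 1) / (tf + length_norm)
    return index


def search(index, query_tokens, k=2):
    """Sum the precomputed per-term scores and return the top-k (doc_id, score) pairs."""
    totals = defaultdict(float)
    for term in query_tokens:
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)[:k]


docs = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend and loves to play".split(),
    "a fish is a creature that lives in water and swims".split(),
]
index = build_eager_index(docs)
top = search(index, ["fish", "purr"], k=2)
```

All BM25 arithmetic happens once at indexing time, so a query only looks up and adds stored numbers. That is the trade-off the paragraph above describes: more work up front in exchange for cheap scoring per query.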

It is aimed at developers who need keyword (lexical) search without running a separate search server. In a RAG or hybrid-search setup, BM25S handles the keyword side: you pair its exact-term matches with a dense vector retriever so that results match a query both by meaning and by its literal words.
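One common way to combine a keyword ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below is an illustration only, not part of BM25S: the function name, the document ids, and the constant `k=60` (a conventional default for RRF) are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; a higher fused score means a better rank."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


keyword_ranking = ["doc3", "doc1", "doc7"]  # hypothetical BM25 output
dense_ranking = ["doc1", "doc5", "doc3"]    # hypothetical vector-search output
fused_ranking = reciprocal_rank_fusion([keyword_ranking, dense_ranking])
```

RRF works on ranks rather than raw scores, so it avoids having to normalize BM25 scores against cosine similarities, which live on different scales.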

You install it with pip and index a corpus in a few lines. Optional lightweight dependencies add stemming (via PyStemmer) and a numba JIT backend for extra speed on larger datasets.

What it does

Pure-Python BM25 built on NumPy with no Java or PyTorch dependency
Sparse-matrix score storage for fast scoring at query time
Built-in tokenizer with stopword removal and optional stemming via PyStemmer
Save and load indexes to disk, optionally bundling the corpus with the model
Corpus entries can be plain strings or dictionaries, so retrieval returns your own metadata objects
Optional numba backend (v0.2.0+) for added speedup on larger datasets

Getting started

Install the package, then tokenize a corpus, index it, and retrieve the top matches for a query.

Install BM25S

Install from PyPI. The optional core extras add JSON loading, a progress bar, stemming, and JIT compilation.

bash

pip install bm25s
# optional, recommended extras:
pip install "bm25s[core]"

Index a corpus and retrieve

Tokenize your documents (optionally with a stemmer), build the BM25 index, then tokenize a query and retrieve the top-k results as (doc ids, scores) arrays.

python

import bm25s
import Stemmer  # optional: for stemming (provided by the PyStemmer package)

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Tokenize the corpus, dropping English stopwords and stemming each token
stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Build the BM25 index
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Tokenize the query with the same stemmer, then fetch the top 2 matches
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)
results, scores = retriever.retrieve(query_tokens, k=2)

# results holds document ids (indices into corpus), one row per query
for i in range(results.shape[1]):
    doc_id, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc_id}")

Save and reload the index

Persist the index to a directory (optionally with the corpus) and load it later instead of re-indexing.

python

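# Write the index, plus the corpus, to the animal_index_bm25 directory
retriever.save("animal_index_bm25", corpus=corpus)

# Later: load the index and the saved corpus instead of re-indexing
reloaded = bm25s.BM25.load("animal_index_bm25", load_corpus=True)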

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Add a keyword-search retriever to a RAG pipeline so literal terms and IDs are matched, not just semantic similarity
Build the lexical half of a hybrid-search system, combining BM25S scores with a dense vector retriever
Run BM25 ranking locally without standing up an Elasticsearch or other search server
Return your own metadata objects from search by indexing dictionary corpus entries with id, title, and text fields
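As a rough illustration of the last use case, the sketch below pairs retrieved document ids with dictionary records by hand. It assumes retrieval returns ids and scores shaped (n_queries, k), as in the Getting started example. The helper `attach_metadata`, the sample records, and the hard-coded ids and scores are hypothetical and do not come from BM25S.

```python
def attach_metadata(doc_ids, scores, records):
    """Pair each retrieved doc id with its record and score, one list per query."""
    return [
        [
            {**records[doc_id], "score": float(score)}
            for doc_id, score in zip(ids_row, score_row)
        ]
        for ids_row, score_row in zip(doc_ids, scores)
    ]


records = [
    {"id": "a1", "title": "Cats", "text": "a cat is a feline and likes to purr"},
    {"id": "b2", "title": "Fish", "text": "a fish is a creature that lives in water and swims"},
]
# Only the text field would be tokenized and indexed; the rest rides along as metadata.
texts = [record["text"] for record in records]

# Hypothetical retrieval output for one query with k=2.
hits = attach_metadata([[1, 0]], [[0.9, 0.4]], records)
```

Keeping the records in the same order as the indexed texts is what makes the id lookup valid, so any filtering or reordering should happen before indexing rather than after.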

How BM25S compares

BM25S alongside the other open-source reranking, search, and hybrid-retrieval tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Elasticsearch	★ 77.1k	Distributed search and analytics engine with a built-in vector database for dense/sparse embeddings and hybrid keyword-plus-semantic retrieval.
Meilisearch Cloud	★ 58.2k	Managed cloud for the Meilisearch engine, combining fast full-text search with hybrid, semantic, and multimodal vector search.
Typesense Cloud	★ 26.1k	Managed hosting for the Typesense search engine, offering typo-tolerant keyword search plus vector and semantic search via a simple API.
Tantivy	★ 15.4k	A fast full-text search engine library in Rust that provides BM25 keyword search for the lexical half of hybrid retrieval.
FlagEmbedding	★ 11.8k	BAAI's retrieval toolkit that provides the BGE embedding and cross-encoder reranker models used widely in RAG pipelines.
Vespa	★ 7k	A search and serving engine that natively combines vector, keyword (BM25), and structured search with built-in ranking for large-scale retrieval.
RAGatouille	★ 3.9k	A wrapper that makes it easy to train and use ColBERT late-interaction retrieval inside RAG pipelines.
BM25S	★ 1.7k	Ultra-fast BM25 keyword search in pure Python, powered by NumPy and SciPy
