Overview
ColBERT is a retrieval model from Stanford that scores how well a passage matches a query using token-level vectors instead of a single embedding per document. It encodes each passage into a matrix of token embeddings, encodes the query the same way at search time, and compares them with a MaxSim operator. This fine-grained, "late interaction" matching lets it rank results more accurately than single-vector models while still scaling to large corpora.
It is aimed at developers and researchers building search and retrieval-augmented generation (RAG) pipelines who need ranking quality beyond a plain vector similarity lookup. You index a collection once, then issue queries to retrieve the top-k passages for each one. ColBERTv2 ships a checkpoint trained on the MS MARCO passage ranking task that you can use directly.
Within the rerankers and hybrid search space, ColBERT sits between dense bi-encoders and slower cross-encoders: it keeps token-level detail like a cross-encoder but precomputes passage representations so retrieval stays fast. If you prefer a higher-level wrapper, the README points to the RAGatouille library, which builds on ColBERT.
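The MaxSim operator mentioned in the overview can be illustrated with a small, dependency-free sketch. This is not ColBERT's implementation: the function name `maxsim_score` and the two-dimensional toy embeddings are assumptions made for illustration, and a real model produces much higher-dimensional token vectors on a GPU. The idea is that each query token picks its best-matching passage token, and those per-token maxima are summed.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_emb, passage_emb):
    """Hypothetical helper: sum, over query tokens, of the highest cosine
    similarity between that query token and any passage token."""
    q_rows = [_normalize(v) for v in query_emb]
    p_rows = [_normalize(v) for v in passage_emb]
    total = 0.0
    for q in q_rows:
        # Best match for this query token across all passage tokens.
        total += max(sum(a * b for a, b in zip(q, p)) for p in p_rows)
    return total

# Toy token embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a match for both query tokens
passage_b = [[1.0, 0.0], [1.0, 0.1]]              # matches only the first query token well

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
ranked = sorted([("a", score_a), ("b", score_b)], key=lambda item: item[1], reverse=True)
```

Because the passage matrices do not depend on the query, they can be computed once at indexing time, leaving only the query encoding and the MaxSim comparison for search time.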
What it does
- Late-interaction scoring: encodes passages and queries into token-level embedding matrices and ranks with a MaxSim operator
- Ships a pre-trained ColBERTv2 checkpoint trained on MS MARCO passage ranking
- Python API with Indexer and Searcher classes built around Run, RunConfig, and ColBERTConfig
- PLAID-based indexing precomputes passage representations and stores them on disk for fast top-k retrieval
- Works with a simple tab-separated (TSV) file format for queries, passages, and ranked lists
- Optional training of your own ColBERT model and support for additional Hugging Face models
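As a rough illustration of the TSV inputs mentioned in the list above, the sketch below writes a tiny collection and query file using only the standard library. The column layout (an integer ID, a tab, then the text, one record per line) is an assumption here, as are the helper names `write_tsv` and `read_tsv` and the sample passages; check the official repo for the exact format it expects.

```python
import tempfile
from pathlib import Path

def write_tsv(path, rows):
    """Hypothetical helper: write (id, text) pairs, one tab-separated record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item_id, text in rows:
            # Collapse tabs and newlines inside the text so each record stays on one line.
            clean = " ".join(text.split())
            f.write(f"{item_id}\t{clean}\n")

def read_tsv(path):
    """Hypothetical helper: read the file back as a list of (id, text) string tuples."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 1)) for line in f]

workdir = Path(tempfile.mkdtemp())
collection_path = workdir / "collection.tsv"
queries_path = workdir / "queries.tsv"

# Made-up sample data; IDs are assumed to start at 0 and follow line order.
write_tsv(collection_path, [
    (0, "Late interaction compares token-level vectors."),
    (1, "A passage with a stray\ttab and\nnewline."),
])
write_tsv(queries_path, [(0, "what is late interaction?")])

collection = read_tsv(collection_path)
queries = read_tsv(queries_path)
```

Cleaning embedded tabs and newlines before writing avoids records that silently split into extra columns or lines.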
Getting started
Install ColBERT, then index a collection and search it with the Python API. A GPU is required for training and indexing.
Install ColBERT
Install the package with pip. The README notes that conda is more reliable for the faiss and torch dependencies if you hit issues. ColBERT requires Python 3.7+ and PyTorch 1.9+.
    pip install colbert-ai[torch,faiss-gpu]

Index a collection
Point the Indexer at a trained checkpoint (such as the ColBERTv2 checkpoint) and a TSV collection of passages to build the index.
    from colbert.infra import Run, RunConfig, ColBERTConfig
    from colbert import Indexer

    if __name__ == '__main__':
        # nranks is the number of GPUs to use.
        with Run().context(RunConfig(nranks=1, experiment="msmarco")):
            config = ColBERTConfig(
                nbits=2,  # bits used to encode each embedding dimension
                root="/path/to/experiments",
            )
            indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
            indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")

Search the index
Load the index with a Searcher and run your queries to retrieve the top-k passages for each one.
    from colbert.data import Queries
    from colbert.infra import Run, RunConfig, ColBERTConfig
    from colbert import Searcher

    if __name__ == '__main__':
        with Run().context(RunConfig(nranks=1, experiment="msmarco")):
            config = ColBERTConfig(
                root="/path/to/experiments",
            )
            # The index name must match the one used at indexing time.
            searcher = Searcher(index="msmarco.nbits=2", config=config)
            queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
            # Retrieve the top 100 passages for every query, then save the ranked list.
            ranking = searcher.search_all(queries, k=100)
            ranking.save("msmarco.nbits=2.ranking.tsv")

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
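Once a ranking has been saved, it can be post-processed with ordinary Python. The sketch below groups ranked rows by query and keeps the top-k passage IDs. It assumes each row starts with a query ID, a passage ID, and a rank, separated by tabs, and ignores any further columns; that layout, the helper name `load_ranking`, and the sample rows are all assumptions for illustration, not confirmed details of the saved file.

```python
from collections import defaultdict

def load_ranking(lines, k=None):
    """Hypothetical helper: map each query ID to its passage IDs in rank order.

    Assumes tab-separated rows beginning with qid, pid, rank.
    """
    per_query = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        qid, pid, rank = fields[0], fields[1], int(fields[2])
        per_query[qid].append((rank, pid))

    result = {}
    for qid, pairs in per_query.items():
        pairs.sort()  # order by rank, lowest first
        result[qid] = [pid for _, pid in pairs][:k]  # k=None keeps every row
    return result

# Made-up rows standing in for the contents of a saved ranking file.
sample_rows = [
    "q1\t17\t2\t21.5\n",
    "q1\t4\t1\t24.0\n",
    "q2\t9\t1\t19.3\n",
]
top1 = load_ranking(sample_rows, k=1)
full = load_ranking(sample_rows)
```

To use it on a real file, pass an open file handle instead of the list, for example `load_ranking(open(path, encoding="utf-8"), k=10)`.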
When to use it
- Build the retrieval stage of a RAG pipeline where ranking quality matters more than a plain vector lookup
- Rerank or retrieve passages over a large corpus when single-vector embeddings miss relevant results
- Run passage search experiments on MS MARCO or your own TSV collection
- Train or fine-tune a domain-specific late-interaction retriever on your own data
How ColBERT compares
How ColBERT compares with other open-source reranking, search, and hybrid retrieval tools tracked by AI/TLDR, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Elasticsearch | ★ 77.1k | Distributed search and analytics engine with a built-in vector database for dense/sparse embeddings and hybrid keyword-plus-semantic retrieval. |
| Meilisearch Cloud | ★ 58.2k | Managed cloud for the Meilisearch engine, combining fast full-text search with hybrid, semantic, and multimodal vector search. |
| Typesense Cloud | ★ 26.1k | Managed hosting for the Typesense search engine, offering typo-tolerant keyword search plus vector and semantic search via a simple API. |
| Tantivy | ★ 15.4k | A fast full-text search engine library in Rust that provides BM25 keyword search for the lexical half of hybrid retrieval. |
| FlagEmbedding | ★ 11.8k | BAAI's retrieval toolkit that provides the BGE embedding and cross-encoder reranker models used widely in RAG pipelines. |
| Vespa | ★ 7k | A search and serving engine that natively combines vector, keyword (BM25), and structured search with built-in ranking for large-scale retrieval. |
| RAGatouille | ★ 3.9k | A wrapper that makes it easy to train and use ColBERT late-interaction retrieval inside RAG pipelines. |
| ColBERT | ★ 3.9k | Token-level late-interaction retrieval for accurate search over large text collections. |