ColBERT

Token-level late-interaction retrieval for accurate search over large text collections

github.com/stanford-futuredata/ColBERT ★ 3.9k

Overview

ColBERT is a retrieval model from Stanford that scores how well a passage matches a query using token-level vectors instead of a single embedding per document. It encodes each passage into a matrix of token embeddings, encodes the query the same way at search time, and compares them with a MaxSim operator. This fine-grained, "late interaction" matching lets it rank results more accurately than single-vector models while still scaling to large corpora.
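To make the MaxSim idea concrete, here is a minimal pure-Python sketch, not ColBERT's actual implementation. It assumes cosine similarity between token vectors and sums, over the query tokens, the best match found among the passage tokens. The function names and the toy 2-dimensional "embeddings" are made up for illustration.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_tokens, passage_tokens):
    """Sum, over query tokens, of the highest cosine similarity to any passage token."""
    q = [normalize(v) for v in query_tokens]
    p = [normalize(v) for v in passage_tokens]
    score = 0.0
    for q_vec in q:
        # Best-matching passage token for this query token.
        best = max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in p)
        score += best
    return score


# Hypothetical token embeddings: one row per token.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for both query tokens
passage_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query token

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

Because each query token looks for its own best match, a passage that covers every query term scores higher than one that matches a single term strongly, which is the fine-grained behavior a single pooled vector cannot express.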

It is aimed at developers and researchers building search and retrieval-augmented generation (RAG) pipelines who need ranking quality beyond a plain vector similarity lookup. You index a collection once, then issue queries to retrieve the top-k passages for each one. ColBERTv2 ships a checkpoint trained on the MS MARCO passage ranking task that you can use directly.

Within the rerankers and hybrid search space, ColBERT sits between dense bi-encoders and slower cross-encoders: it keeps token-level detail like a cross-encoder but precomputes passage representations so retrieval stays fast. If you prefer a higher-level wrapper, the README points to the RAGatouille library, which builds on ColBERT.

What it does

Late-interaction scoring: encodes passages and queries into token-level embedding matrices and ranks with a MaxSim operator
Ships a pre-trained ColBERTv2 checkpoint trained on MS MARCO passage ranking
Python API with Indexer and Searcher classes built around Run, RunConfig, and ColBERTConfig
PLAID-based indexing precomputes passage representations on disk for fast top-k retrieval
Works with a simple tab-separated (TSV) file format for queries, passages, and ranked lists
Optional training of your own ColBERT model and support for additional Hugging Face models
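The TSV layout can be sketched with the standard library alone. This example assumes the convention the ColBERT README describes, one `id<TAB>text` record per line, and numbers passages sequentially from 0, which is an assumption about what the loader expects. The helper names, file name, and sample passages are hypothetical.

```python
import tempfile
from pathlib import Path


def write_tsv(path, texts):
    """Write one `id<TAB>text` line per item, with ids counting up from 0."""
    with open(path, "w", encoding="utf-8") as f:
        for item_id, text in enumerate(texts):
            # Collapse tabs and newlines inside the text so each record stays on one line.
            clean = " ".join(text.split())
            f.write(f"{item_id}\t{clean}\n")


def read_tsv(path):
    """Read `id<TAB>text` lines back into a list of (id, text) tuples."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item_id, text = line.rstrip("\n").split("\t", 1)
            rows.append((int(item_id), text))
    return rows


workdir = Path(tempfile.mkdtemp())
collection_path = workdir / "collection.tsv"  # hypothetical file name
write_tsv(collection_path, [
    "ColBERT scores passages with token-level embeddings.",
    "BM25 is a lexical\tranking function.",  # the embedded tab is collapsed to a space
])
rows = read_tsv(collection_path)
print(rows)
```

The same two-column layout would serve for a queries file, with query ids and query text in place of passage ids and passages.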

Getting started

Install ColBERT, then index a collection and search it with the Python API. A GPU is required for training and indexing.

Install ColBERT

Install the package with pip. The README notes that conda is more reliable for the faiss and torch dependencies if you hit issues. ColBERT requires Python 3.7+ and PyTorch 1.9+.

bash

pip install "colbert-ai[torch,faiss-gpu]"

Index a collection

Point the Indexer at a trained checkpoint (such as the ColBERTv2 checkpoint) and a TSV collection of passages to build the index.

python

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

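if __name__ == "__main__":
    # nranks is the number of GPUs to use; experiment names this run under root.
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            nbits=2,  # bits per embedding dimension in the compressed index
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        # The index name given here is what the Searcher refers to later.
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")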

Search the index

Load the index with a Searcher and run your queries to retrieve the top-k passages for each one.

python

from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        # Load the index built in the previous step by its name.
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        # Retrieve the top 100 passages for every query in the file.
        ranking = searcher.search_all(queries, k=100)
        # Write the ranked lists out as a TSV file.
        ranking.save("msmarco.nbits=2.ranking.tsv")
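Once a ranking file exists, it can be read back without ColBERT installed. The sketch below assumes each line begins with `qid`, `pid`, and `rank` columns separated by tabs, as the README describes for ranked lists, and ignores any further columns. The `load_ranking` helper and the sample rows, including the trailing score column, are made up for illustration.

```python
import io
from collections import defaultdict


def load_ranking(lines):
    """Group `qid<TAB>pid<TAB>rank[...]` rows into {qid: [pid, ...]} ordered by rank."""
    per_query = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        qid, pid, rank = int(fields[0]), int(fields[1]), int(fields[2])
        per_query[qid].append((rank, pid))
    # Sort by rank so the best passage for each query comes first.
    return {qid: [pid for _, pid in sorted(pairs)] for qid, pairs in per_query.items()}


# Hypothetical ranking rows; in practice pass an open file handle instead.
sample = io.StringIO("0\t17\t1\t25.3\n0\t4\t2\t21.9\n1\t8\t1\t30.1\n")
top = load_ranking(sample)
print(top)
```

The resulting dictionary maps each query id to its passage ids in rank order, which is a convenient shape for feeding the top passages into a downstream RAG prompt or an evaluation script.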

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Build the retrieval stage of a RAG pipeline where ranking quality matters more than a plain vector lookup
Rerank or retrieve passages over a large corpus when single-vector embeddings miss relevant results
Run passage search experiments on MS MARCO or your own TSV collection
Train or fine-tune a domain-specific late-interaction retriever on your own data

How ColBERT compares

ColBERT alongside the other open-source reranking, search, and hybrid-retrieval tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Elasticsearch	★ 77.1k	Distributed search and analytics engine with a built-in vector database for dense/sparse embeddings and hybrid keyword-plus-semantic retrieval.
Meilisearch Cloud	★ 58.2k	Managed cloud for the Meilisearch engine, combining fast full-text search with hybrid, semantic, and multimodal vector search.
Typesense Cloud	★ 26.1k	Managed hosting for the Typesense search engine, offering typo-tolerant keyword search plus vector and semantic search via a simple API.
Tantivy	★ 15.4k	A fast full-text search engine library in Rust that provides BM25 keyword search for the lexical half of hybrid retrieval.
FlagEmbedding	★ 11.8k	BAAI's retrieval toolkit that provides the BGE embedding and cross-encoder reranker models used widely in RAG pipelines.
Vespa	★ 7k	A search and serving engine that natively combines vector, keyword (BM25), and structured search with built-in ranking for large-scale retrieval.
RAGatouille	★ 3.9k	A wrapper that makes it easy to train and use ColBERT late-interaction retrieval inside RAG pipelines.
ColBERT	★ 3.9k	Token-level late-interaction retrieval for accurate search over large text collections
