RAGatouille

Use and train ColBERT late-interaction retrieval in any RAG pipeline

github.com/AnswerDotAI/RAGatouille ★ 3.9k

Overview

RAGatouille is a Python library that makes state-of-the-art retrieval models easy to use in a RAG pipeline. Right now it focuses on ColBERT, a late-interaction model that matches queries and documents at the token level instead of comparing single dense embeddings.
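To make "token level" concrete, here is a minimal pure-Python sketch of late-interaction scoring: each query token takes its best similarity against any document token, and those maxima are summed. The toy 2-d vectors, the function names, and the mean-pooled baseline are assumptions made for illustration; real ColBERT embeddings come from a trained encoder, and this is not RAGatouille's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def late_interaction_score(query_vecs, doc_vecs):
    """For each query token keep its best match in the document, then sum."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


def mean_vector(vectors):
    """Average a list of vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def single_vector_score(query_vecs, doc_vecs):
    """Simple stand-in for dense retrieval: pool each side to one vector, compare once."""
    return cosine(mean_vector(query_vecs), mean_vector(doc_vecs))


# Made-up token embeddings, one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # matches both query tokens exactly
doc_b = [[0.6, 0.8]]                           # one token, partly similar to both

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name,
          "late interaction:", round(late_interaction_score(query, doc), 3),
          "single vector:", round(single_vector_score(query, doc), 3))
```

In this toy setup the two scoring schemes rank the documents differently, because pooling lets the unrelated third token in doc_a dilute its exact matches.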

It is aimed at developers who already run dense embedding retrieval and want stronger results without reading years of information-retrieval research. The README notes that ColBERT models generalise well to new domains and are data-efficient, and that the pretrained ColBERTv2 model often works well zero-shot, so you can prototype before training anything of your own.

Within the rerankers and hybrid-search space, RAGatouille offers strong but adjustable defaults: a few lines of code get you started, while components such as the training data processor and negative miners can be reused or swapped on their own.

What it does

Wraps ColBERT late-interaction retrieval behind a small Python API (RAGPretrainedModel and RAGTrainer)
Works zero-shot with the pretrained colbert-ir/colbertv2.0 model, so no training is required to start
Converts pairs, labelled pairs, and triplets into training triplets with the built-in RAGTrainer.prepare_training_data()
Automatically mines hard negatives and removes duplicates during data preparation
Supports both training a new ColBERT from a transformer and fine-tuning an existing ColBERT model
Exposes modular components (data processor, negative miners) that can be used stand-alone or replaced with your own

Getting started

Install the package with pip, then either use a pretrained ColBERT model or prepare data to train one. Windows is not supported: the README reports that RAGatouille only works there under WSL2. When you use it in a script, run your code inside an if __name__ == "__main__" guard.

Install

Install RAGatouille from PyPI. It supports Python 3.9, 3.10, and 3.11.

pip install ragatouille
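
Once the package is installed, the zero-shot path can be sketched as a small script that follows the __main__ guard noted above. RAGPretrainedModel.from_pretrained(), index(), and search() are the calls shown in the project's README; the sample documents, the index name "demo_index", and the format_results helper are assumptions added for illustration. The first run downloads the pretrained checkpoint, and behaviour on very small collections may vary between versions.

```python
import importlib.util


def format_results(results):
    """Render search hits as 'rank. (score) content' lines."""
    return [f"{hit['rank']}. ({hit['score']:.2f}) {hit['content']}" for hit in results]


def main():
    # Exit quietly when the library is missing instead of raising ImportError.
    if importlib.util.find_spec("ragatouille") is None:
        print("ragatouille is not installed; run: pip install ragatouille")
        return

    from ragatouille import RAGPretrainedModel

    # Load the pretrained ColBERTv2 checkpoint; no training involved.
    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    # Illustrative documents; replace with your own corpus.
    documents = [
        "ColBERT is a late-interaction retrieval model.",
        "Dense retrievers compare one embedding per text.",
        "BM25 is a classic keyword ranking function.",
    ]
    rag.index(collection=documents, index_name="demo_index")

    # Each hit is a dict that includes 'content', 'score', and 'rank'.
    results = rag.search(query="What is late interaction?", k=2)
    print("\n".join(format_results(results)))


if __name__ == "__main__":
    main()
```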

Prepare training data

Pass query/passage pairs to RAGTrainer.prepare_training_data(), which builds training triplets and mines hard negatives by default. You need many more pairs than shown to actually train.

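from ragatouille import RAGTrainer

# Unlabelled (query, relevant passage) pairs; real training needs far more.
my_data = [
    ("What is the meaning of life?", "The meaning of life is 42"),
    ("What is Neural Search?", "Neural Search is a term referring to a family of ..."),
]
# RAGTrainer needs a name for the new model and a base checkpoint.
trainer = RAGTrainer(
    model_name="MyFineTunedColBERT",
    pretrained_model_name="colbert-ir/colbertv2.0",
)
# Builds training triplets and mines hard negatives by default.
trainer.prepare_training_data(raw_data=my_data)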
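
prepare_training_data() is described as accepting pairs, labelled pairs, and triplets. The sketch below illustrates, under stated assumptions, how those three shapes could be normalised into deduplicated training triplets. It is not RAGatouille's implementation: to_triplets is a hypothetical helper, the label convention (1 for relevant, 0 for irrelevant) is an assumption, and negatives are sampled at random from other queries' passages where the library would mine hard negatives with a retrieval model.

```python
import random


def to_triplets(raw_data, negatives_per_query=1, seed=0):
    """Turn pairs, labelled pairs, or triplets into unique (query, positive, negative) tuples."""
    rng = random.Random(seed)
    positives = {}  # query -> relevant passages
    negatives = {}  # query -> passages known to be irrelevant

    for row in raw_data:
        if len(row) == 2:
            # (query, passage): unlabelled pair, treated as relevant
            query, passage = row
            positives.setdefault(query, []).append(passage)
        elif isinstance(row[2], int):
            # (query, passage, label): assumed 1 = relevant, 0 = irrelevant
            query, passage, label = row
            target = positives if label == 1 else negatives
            target.setdefault(query, []).append(passage)
        else:
            # (query, positive, negative): already a triplet
            query, pos, neg = row
            positives.setdefault(query, []).append(pos)
            negatives.setdefault(query, []).append(neg)

    all_passages = sorted({p for plist in positives.values() for p in plist})
    triplets, seen = [], set()
    for query, pos_list in positives.items():
        # Passages relevant to other queries serve as fallback negatives.
        candidates = [p for p in all_passages if p not in pos_list]
        for pos in pos_list:
            negs = list(negatives.get(query, []))
            if not negs and candidates:
                count = min(negatives_per_query, len(candidates))
                negs = rng.sample(candidates, count)
            for neg in negs:
                triplet = (query, pos, neg)
                if triplet not in seen:  # drop exact duplicates
                    seen.add(triplet)
                    triplets.append(triplet)
    return triplets


pairs = [("q1", "p1"), ("q2", "p2"), ("q1", "p1")]  # last row is a duplicate
labelled = [("q1", "p1", 1), ("q1", "p3", 0)]
ready = [("q", "pos", "neg")]

print(to_triplets(pairs))
print(to_triplets(labelled))
print(to_triplets(ready))
```

Random negatives are usually easy for a model to reject, which is why the mined hard negatives that RAGatouille produces by default tend to be the more useful training signal.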

Train or fine-tune ColBERT

Instantiate RAGTrainer with a model_name and a pretrained_model_name. Passing an existing ColBERT puts the trainer in fine-tuning mode; passing another transformer trains a new ColBERT from its weights.

from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="MyFineTunedColBERT",
    pretrained_model_name="colbert-ir/colbertv2.0",
)

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Add a stronger retriever to an existing RAG pipeline where dense embeddings underperform on your domain
Run zero-shot retrieval with pretrained ColBERTv2 while prototyping, before committing to any training
Fine-tune ColBERT on your own query/passage pairs to improve retrieval for a specific corpus
Reuse the data processor or hard-negative miner as stand-alone components in a custom training pipeline

How RAGatouille compares

RAGatouille alongside other open-source rerank, search & hybrid tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Elasticsearch	★ 77.1k	Distributed search and analytics engine with a built-in vector database for dense/sparse embeddings and hybrid keyword-plus-semantic retrieval.
Meilisearch Cloud	★ 58.2k	Managed cloud for the Meilisearch engine, combining fast full-text search with hybrid, semantic, and multimodal vector search.
Typesense Cloud	★ 26.1k	Managed hosting for the Typesense search engine, offering typo-tolerant keyword search plus vector and semantic search via a simple API.
Tantivy	★ 15.4k	A fast full-text search engine library in Rust that provides BM25 keyword search for the lexical half of hybrid retrieval.
FlagEmbedding	★ 11.8k	BAAI's retrieval toolkit that provides the BGE embedding and cross-encoder reranker models used widely in RAG pipelines.
Vespa	★ 7k	A search and serving engine that natively combines vector, keyword (BM25), and structured search with built-in ranking for large-scale retrieval.
RAGatouille	★ 3.9k	Use and train ColBERT late-interaction retrieval in any RAG pipeline.
ColBERT	★ 3.9k	The reference implementation of ColBERT late-interaction retrieval, which ranks passages using token-level vector matching.
