Overview
RAGatouille is a Python library that makes state-of-the-art retrieval models easy to use in a RAG pipeline. Right now it focuses on ColBERT, a late-interaction model that matches queries and documents at the token level instead of comparing single dense embeddings.
It is aimed at developers who already run dense embedding retrieval and want stronger results without reading years of information-retrieval research. The README notes that ColBERT models generalise well to new domains and are data-efficient, and that the pretrained ColBERTv2 model often works well zero-shot, so you can prototype before training anything of your own.
Within the rerankers and hybrid-search space, RAGatouille offers strong but adjustable defaults: a few lines of code get you started, while components such as the training data processor and negative miners can be reused or swapped on their own.
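To make the token-level matching described above concrete, here is a minimal pure-Python sketch of late-interaction ("MaxSim") scoring: each query token keeps only its best match among the document's tokens, and those best matches are summed. The 2-dimensional embeddings and the helper names (`dot`, `maxsim_score`) are made up for illustration; a real ColBERT model produces learned, normalised token embeddings and searches a compressed index, which this sketch does not attempt.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # matches only the first query token

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
# Rank documents from highest to lowest late-interaction score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A single-vector dense retriever would instead compare one embedding per query with one embedding per document, so a document that matches only part of the query cannot be told apart at the token level.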
What it does
- Wraps ColBERT late-interaction retrieval behind a small Python API (RAGPretrainedModel and RAGTrainer)
- Works zero-shot with the pretrained colbert-ir/colbertv2.0 model, so no training is required to start
- Converts pairs, labelled pairs, and triplets into training triplets with the built-in RAGTrainer.prepare_training_data()
- Automatically mines hard negatives and removes duplicates during data preparation
- Supports both training a new ColBERT from a transformer and fine-tuning an existing ColBERT model
- Modular components (data processor, negative miners) are usable stand-alone or replaceable with your own
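The data-preparation step listed above (turning query/passage pairs into deduplicated training triplets) can be pictured with a small stand-alone sketch. This is not RAGatouille's implementation: the function `pairs_to_triplets` and its parameters are hypothetical, and random sampling stands in for real hard-negative mining, which picks negatives that look similar to the positive passage rather than arbitrary ones.

```python
import random
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]          # (query, relevant passage)
Triplet = Tuple[str, str, str]  # (query, positive passage, negative passage)


def pairs_to_triplets(
    pairs: List[Pair], negatives_per_pair: int = 1, seed: int = 0
) -> List[Triplet]:
    """Deduplicate pairs, then attach randomly sampled negatives to each one."""
    rng = random.Random(seed)
    unique_pairs = list(dict.fromkeys(pairs))  # drop exact duplicates, keep order

    # Record every passage known to be relevant for each query.
    positives: Dict[str, Set[str]] = {}
    for query, passage in unique_pairs:
        positives.setdefault(query, set()).add(passage)

    all_passages = list(dict.fromkeys(p for _, p in unique_pairs))
    triplets: List[Triplet] = []
    for query, positive in unique_pairs:
        # A negative is any passage not labelled relevant for this query.
        candidates = [p for p in all_passages if p not in positives[query]]
        k = min(negatives_per_pair, len(candidates))
        for negative in rng.sample(candidates, k):
            triplets.append((query, positive, negative))
    return triplets


# Toy data, made up for illustration; the third pair is an exact duplicate.
pairs = [
    ("what is colbert", "ColBERT is a late-interaction retrieval model."),
    ("what is bm25", "BM25 is a lexical ranking function."),
    ("what is colbert", "ColBERT is a late-interaction retrieval model."),
]
triplets = pairs_to_triplets(pairs)
for triplet in triplets:
    print(triplet)
```

Swapping the `rng.sample` call for a similarity-based lookup is where a hard-negative miner would plug in.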
Getting started
Install the package with pip, then either use a pretrained ColBERT model or prepare data to train one. Two caveats: Windows is not supported natively (the README reports that it only works under WSL2), and when you run RAGatouille from a script, the code must sit inside an if __name__ == "__main__": guard.
Install
Install RAGatouille from PyPI. It supports Python 3.9, 3.10, and 3.11.
pip install ragatouille

Prepare training data
Pass query/passage pairs to RAGTrainer.prepare_training_data(), which builds training triplets and mines hard negatives by default. You need many more pairs than shown to actually train.
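from ragatouille import RAGTrainer

my_data = [
    ("What is the meaning of life ?", "The meaning of life is 42"),
    ("What is Neural Search?", "Neural Search is a term referring to a family of ..."),
]  # Unlabelled pairs here

# RAGTrainer is instantiated with a model_name and a pretrained_model_name.
trainer = RAGTrainer(
    model_name="MyFineTunedColBERT",
    pretrained_model_name="colbert-ir/colbertv2.0",
)
# Builds training triplets and mines hard negatives by default.
trainer.prepare_training_data(raw_data=my_data)

Train or fine-tune ColBERT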
Instantiate RAGTrainer with a model_name and a pretrained_model_name. Passing an existing ColBERT puts the trainer in fine-tuning mode; passing another transformer trains a new ColBERT from its weights.
from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="MyFineTunedColBERT",
    pretrained_model_name="colbert-ir/colbertv2.0",
)

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Add a stronger retriever to an existing RAG pipeline where dense embeddings underperform on your domain
- Run zero-shot retrieval with pretrained ColBERTv2 while prototyping, before committing to any training
- Fine-tune ColBERT on your own query/passage pairs to improve retrieval for a specific corpus
- Reuse the data processor or hard-negative miner as stand-alone components in a custom training pipeline
How RAGatouille compares
RAGatouille alongside the other open-source reranking, search, and hybrid-retrieval tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Elasticsearch | ★ 77.1k | Distributed search and analytics engine with a built-in vector database for dense/sparse embeddings and hybrid keyword-plus-semantic retrieval. |
| Meilisearch Cloud | ★ 58.2k | Managed cloud for the Meilisearch engine, combining fast full-text search with hybrid, semantic, and multimodal vector search. |
| Typesense Cloud | ★ 26.1k | Managed hosting for the Typesense search engine, offering typo-tolerant keyword search plus vector and semantic search via a simple API. |
| Tantivy | ★ 15.4k | A fast full-text search engine library in Rust that provides BM25 keyword search for the lexical half of hybrid retrieval. |
| FlagEmbedding | ★ 11.8k | BAAI's retrieval toolkit that provides the BGE embedding and cross-encoder reranker models used widely in RAG pipelines. |
| Vespa | ★ 7k | A search and serving engine that natively combines vector, keyword (BM25), and structured search with built-in ranking for large-scale retrieval. |
| RAGatouille | ★ 3.9k | A library for using and training ColBERT late-interaction retrieval in any RAG pipeline. |
| ColBERT | ★ 3.9k | The reference implementation of ColBERT late-interaction retrieval, which ranks passages using token-level vector matching. |