Overview
ColPali is a document retrieval model that embeds whole page images instead of extracted text. A vision-language model (PaliGemma-3B in the original version) encodes the page's image patches, a linear projection maps each patch output to an embedding vector, and the resulting multi-vector page representation is scored against the query's token embeddings using ColBERT-style late interaction.
It is meant for developers building search or retrieval over PDFs and scanned documents who want to avoid the usual OCR and layout-detection steps. Because the model looks at the rendered page, it can take both the text and the visual content (tables, charts, layout) into account in a single pass.
Within the embeddings category, ColPali sits among document-focused embedding models. The repository also ships related ColVision variants such as ColQwen2 and ColSmol, which trade off backbone size and accuracy on the ViDoRe retrieval benchmark.
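The late-interaction scoring mentioned above can be illustrated with a minimal NumPy sketch. It is not the library's implementation (colpali-engine exposes its own scoring through the processor); the function name `late_interaction_score` and the toy vectors are hypothetical, and the sketch assumes plain dot-product similarity between one query and one page.

```python
import numpy as np


def late_interaction_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim (hypothetical helper, for illustration only).

    For each query vector, take the similarity of its best-matching page
    vector, then sum those maxima over all query vectors.
    """
    # sim[i, j] = dot product between query vector i and page vector j
    sim = query_vecs @ page_vecs.T
    return float(sim.max(axis=1).sum())


# Toy data: 2 query-token vectors and 3 page-patch vectors of dimension 2
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
page_b = np.array([[0.5, 0.5], [0.5, 0.5], [0.0, 0.0]])

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
```

Here `page_a` contains an exact match for each query vector, so it scores higher than `page_b`, which only partially matches both. Keeping one vector per patch is what lets each query token find its own best region of the page.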
What it does
- Embeds full page images directly, removing the need for OCR and layout-recognition pipelines
- Produces multi-vector embeddings scored with ColBERT-style late interaction for fine-grained matching
- Offers several ready-to-use checkpoints on Hugging Face, from ColSmol (256M/500M) to ColQwen2 and ColQwen2.5
- Captures both text and visual elements like tables and charts in one model
- Ships as a single colpali-engine Python package with model and processor classes
- Benchmarked on the public ViDoRe leaderboard for visual document retrieval
Getting started
Install the package, load a checkpoint with its processor, then embed page images and queries and score them.
Install
Install the colpali-engine package from PyPI.
pip install colpali-engine
Embed pages and score a query
Load a model and processor, embed a batch of page images and queries, then compute multi-vector similarity scores. This example uses the ColQwen2 v1.0 checkpoint.
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models import ColQwen2, ColQwen2Processor
model_name = "vidore/colqwen2-v1.0"
model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2Processor.from_pretrained(model_name)
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)
# Forward passes produce one multi-vector embedding per image and per query
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)
# Late-interaction similarity between every query and every page image
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
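Once scores are computed, the usual next step is to pick the best pages for each query. The following is a minimal sketch of that ranking step, assuming the score matrix has one row per query and one column per page image and has already been converted to a NumPy array. The helper `top_pages` and the numbers in `example_scores` are hypothetical, not output from the model.

```python
import numpy as np


def top_pages(scores: np.ndarray, k: int = 1) -> list[list[int]]:
    """Return, for each query (row), the indices of the k highest-scoring pages."""
    # Sorting the negated scores ascending yields descending order by score
    order = np.argsort(-scores, axis=1)
    return order[:, :k].tolist()


# Hypothetical score matrix: 2 queries x 3 pages (made-up values)
example_scores = np.array([
    [12.5, 18.0, 9.1],
    [7.2, 6.4, 15.3],
])

ranking = top_pages(example_scores, k=2)
```

Each inner list in `ranking` holds page indices ordered from best to worst for one query, which can then be mapped back to the original page images or file names.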
When to use it
- Search over PDFs and scanned documents without first running OCR or layout detection
- Retrieve pages that rely on visual structure such as tables, charts, and forms
- Build the retrieval step of a document RAG system that works directly on page images
- Compare visual document retrievers against the ViDoRe benchmark before picking a checkpoint
How ColPali compares
ColPali alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | Indexes whole document pages as images for retrieval, with no OCR pipeline needed. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |