AI/TLDR

ColPali

Index whole document pages as images for retrieval, no OCR pipeline needed

Overview

ColPali is a document retrieval model that embeds whole page images instead of extracted text. It runs each page image through a vision-language model (PaliGemma-3B in the original version), passes the resulting patch embeddings through a linear projection to build a multi-vector representation, and scores that representation against query embeddings using ColBERT-style late interaction.
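
To make the late-interaction scoring concrete, here is a minimal pure-Python sketch of a ColBERT-style MaxSim score: for each query vector, take its best dot product against any page vector, then sum those maxima. The function names and the toy 2-dimensional vectors are illustrative assumptions, not part of colpali-engine, whose own scoring runs on batched tensors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for every query vector, keep the highest
    similarity to any page vector, then sum over the query vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy multi-vector embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]               # two query-token vectors
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three patch vectors
page_b = [[1.0, 0.0], [0.2, 0.1]]              # two patch vectors

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)
```

Because every query vector is matched independently, a page only needs one well-matching patch per query token to score well, which is what allows fine-grained matching against small regions such as a table cell or chart label.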

It is meant for developers building search or retrieval over PDFs and scanned documents who want to avoid the usual OCR and layout-detection steps. Because the model looks at the rendered page, it can take both the text and the visual content (tables, charts, layout) into account in a single pass.

Within the embeddings category, ColPali sits among document-focused embedding models. The repository also ships related ColVision variants such as ColQwen2 and ColSmol, which trade off backbone size and accuracy on the ViDoRe retrieval benchmark.

What it does

  • Embeds full page images directly, removing the need for OCR and layout-recognition pipelines
  • Multi-vector embeddings scored with ColBERT-style late interaction for fine-grained matching
  • Several ready-to-use checkpoints on Hugging Face, from ColSmol (256M/500M) to ColQwen2 and ColQwen2.5
  • Captures both text and visual elements like tables and charts in one model
  • Single colpali-engine Python package with model and processor classes
  • Benchmarked on the public ViDoRe leaderboard for visual document retrieval

Getting started

Install the package, load a checkpoint with its processor, then embed page images and queries and score them.

Install

Install the colpali-engine package from PyPI.

bash
pip install colpali-engine

Embed pages and score a query

Load a model and processor, embed a batch of page images and queries, then compute multi-vector similarity scores. This example uses the ColQwen2 v1.0 checkpoint.

python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0"

# Load the model in bfloat16 on the first CUDA GPU (this assumes a GPU is available).
# FlashAttention 2 is used only if it is installed.
model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Placeholder inputs: replace the blank images with rendered document pages.
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

# Preprocess both batches and move them to the model's device.
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward passes produce multi-vector embeddings; no gradients are needed for inference.
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scores: one row per query, one column per page image.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
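
Once scores are computed, the retrieval step comes down to ranking pages for each query. The following is a minimal sketch of that ranking, assuming the scores form a matrix with one row per query and one column per page. The helper name `top_k_pages` and the example numbers are hypothetical and do not come from the project's documentation.

```python
def top_k_pages(score_rows, k=3):
    """For each query row, return the k best (page_index, score) pairs, best first."""
    ranked = []
    for row in score_rows:
        # Sort page indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append([(i, row[i]) for i in order[:k]])
    return ranked


# Hypothetical score matrix: 2 queries x 3 pages (values made up for illustration).
example_scores = [
    [12.5, 7.1, 9.8],
    [3.2, 15.0, 14.1],
]

best = top_k_pages(example_scores, k=2)
for query_index, hits in enumerate(best):
    print(query_index, hits)
```

With the tensor from the example above, something like `top_k_pages(scores.float().tolist(), k=2)` should yield the same structure, on the assumption that `scores` is a 2-D PyTorch tensor.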

When to use it

  • Search over PDFs and scanned documents without first running OCR or layout detection
  • Retrieve pages that rely on visual structure such as tables, charts, and forms
  • Build the retrieval step of a document RAG system that works directly on page images
  • Compare visual document retrievers against the ViDoRe benchmark before picking a checkpoint

How ColPali compares

ColPali alongside other open-source embedding models & inference tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | Index whole document pages as images for retrieval, no OCR pipeline needed |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |