ColPali

Index whole document pages as images for retrieval, no OCR pipeline needed

github.com/illuin-tech/colpali ★ 2.7k
huggingface.co/vidore

Overview

ColPali is a document retrieval model that embeds whole page images instead of extracted text. It runs each page image through a vision-language model (PaliGemma-3B in the original version), passes the resulting patch embeddings through a linear projection to build a multi-vector representation, and scores that representation against query embeddings using ColBERT-style late interaction.
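To make the late-interaction scoring concrete, here is a minimal pure-Python sketch of the MaxSim rule as it is usually described for ColBERT-style models: each query token takes its best dot-product match among the page's patch vectors, and those maxima are summed. The function name `maxsim_score` and the toy 2-dimensional vectors are assumptions made for illustration; this is not the library's implementation, which works on batched tensors.

```python
def maxsim_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score between one query and one page.

    query_vecs: list of query-token embeddings (each a list of floats)
    page_vecs:  list of page-patch embeddings with the same dimensionality
    """
    total = 0.0
    for q in query_vecs:
        # Best-matching page patch for this query token (dot-product similarity).
        best = max(sum(qi * pi for qi, pi in zip(q, p)) for p in page_vecs)
        total += best
    return total


# Toy 2-dimensional embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = maxsim_score(query, page_a)  # higher: page_a has a close match per token
score_b = maxsim_score(query, page_b)
```

Because every query token is matched independently, a page can score well when different regions of it answer different parts of the query, which is what makes the matching fine-grained.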

It is meant for developers building search or retrieval over PDFs and scanned documents who want to avoid the usual OCR and layout-detection steps. Because the model looks at the rendered page, it can take both the text and the visual content (tables, charts, layout) into account in a single pass.

Within the embeddings category, ColPali sits among document-focused embedding models. The repository also ships related ColVision variants such as ColQwen2 and ColSmol, which trade off backbone size and accuracy on the ViDoRe retrieval benchmark.

What it does

Embeds full page images directly, removing the need for OCR and layout-recognition pipelines
Produces multi-vector embeddings scored with ColBERT-style late interaction for fine-grained matching
Offers several ready checkpoints on Hugging Face, from ColSmol (256M/500M) to ColQwen2 and ColQwen2.5
Captures both text and visual elements like tables and charts in one model
Ships as a single colpali-engine Python package with model and processor classes
Is benchmarked on the public ViDoRe leaderboard for visual document retrieval

Getting started

Install the package, load a checkpoint with its processor, then embed page images and queries and score them.

Install

Install the colpali-engine package from PyPI.

bash

pip install colpali-engine

Embed pages and score a query

Load a model and processor, embed a batch of page images and queries, then compute multi-vector similarity scores. This example uses the ColQwen2 v1.0 checkpoint.

python

import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0"

model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Placeholder inputs; replace these with your rendered page images.
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year's financial performance?",
]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# scores[i, j] is the late-interaction score of query i against image j.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
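Once a score matrix is available, the remaining retrieval step is to rank pages per query. The helper below is a hypothetical sketch, not part of colpali-engine: `rank_pages` is an invented name, and it assumes the scores have been converted to a nested list shaped [n_queries][n_pages] (for example via `scores.tolist()` on the tensor from the example above). The numbers are made up for illustration.

```python
def rank_pages(score_matrix, k=3):
    """Return, for each query, the top-k (page_index, score) pairs, best first.

    score_matrix: nested list shaped [n_queries][n_pages].
    """
    rankings = []
    for row in score_matrix:
        # Sort page indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append([(i, row[i]) for i in order[:k]])
    return rankings


# Made-up scores for 2 queries over 3 pages (illustrative values only).
example_scores = [
    [12.5, 7.1, 9.8],
    [3.2, 14.0, 13.9],
]
top_pages = rank_pages(example_scores, k=2)
```

In a retrieval pipeline, the returned page indices would map back to the source documents and page numbers that were embedded.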

When to use it

Search over PDFs and scanned documents without first running OCR or layout detection
Retrieve pages that rely on visual structure such as tables, charts, and forms
Build the retrieval step of a document RAG system that works directly on page images
Compare visual document retrievers against the ViDoRe benchmark before picking a checkpoint

How ColPali compares

ColPali alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Sentence Transformers	★ 18.8k	The standard Python framework for loading, training, and computing embeddings with sentence and reranking models.
EmbeddingGemma (Gemma)	★ 5.5k	Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search.
Text Embeddings Inference (TEI)	★ 4.9k	Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU.
Infinity (Embeddings)	★ 2.8k	A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API.
ColPali	★ 2.7k	Index whole document pages as images for retrieval, no OCR pipeline needed
Model2Vec	★ 2.1k	A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference.
Instructor Embedding	★ 2k	Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction.
Qwen3-Embedding	★ 2k	Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages.
