Overview
Infinity is a REST API server for running embedding, reranking, CLIP, CLAP, and ColPali models. You point it at a model from HuggingFace and it serves that model over HTTP, with an API aligned to OpenAI's embeddings spec so existing clients work with minimal changes.
It is built for teams that need to host their own embedding or reranking models instead of calling a hosted API. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia, and Apple MPS, using inference backends like PyTorch, optimum (ONNX/TensorRT), and CTranslate2, with dynamic batching to keep throughput high.
In the embedding-serving category, Infinity focuses on mixing and matching multiple models behind one server. You can launch several models at once and let Infinity route requests to each, which suits retrieval and RAG pipelines that need both an embedder and a reranker.
What it does
- Serves any embedding, reranking, sentence-transformer, CLIP, or ColPali model from HuggingFace
- OpenAI-compatible API built on FastAPI, so existing embedding clients work with small changes
- Multiple inference backends: PyTorch, optimum (ONNX/TensorRT), and CTranslate2, using FlashAttention
- Runs on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia (INF2), or Apple MPS
- Dynamic batching with tokenization handled in dedicated worker threads
- Launch and orchestrate multiple models in one server via the v2 CLI
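To make the dynamic batching bullet concrete, here is a toy sketch of the general idea: requests that arrive close together are drained from a queue and grouped, so one model forward pass can serve several callers. This is not Infinity's implementation; `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names chosen for illustration.

```python
import queue
import time


def collect_batch(pending, max_batch_size=4, max_wait_s=0.05):
    """Drain up to max_batch_size items, waiting at most max_wait_s in total.

    Toy illustration of dynamic batching; not Infinity's actual code.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough, ship whatever we have
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


# Six requests are already waiting when the batcher wakes up.
pending = queue.Queue()
for text in ["a", "bb", "ccc", "dddd", "eeeee", "ffffff"]:
    pending.put(text)

first = collect_batch(pending)   # fills up to the size limit
second = collect_batch(pending)  # takes the rest, then times out
print(first)   # ['a', 'bb', 'ccc', 'dddd']
print(second)  # ['eeeee', 'ffffff']
```

The two limits trade latency against throughput: a larger `max_batch_size` uses the accelerator better, while a shorter `max_wait_s` keeps a lone request from waiting on traffic that never comes.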
Getting started
Install the CLI with pip and launch a model, or run the prebuilt Docker image with GPU access.
Install via pip
Install the package with the all extra to pull in the inference backends.
pip install infinity-emb[all]
Launch a model
With your virtualenv active, start the server for a HuggingFace model using the v2 CLI.
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
Or run with Docker (recommended)
Use the prebuilt image and expose your GPU to the container with --gpus all. You can pass multiple --model-id flags to serve several models at once.
port=7997
docker run -it --gpus all \
-v $PWD/data:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest \
v2 \
--model-id michaelfeil/bge-small-en-v1.5 \
--model-id mixedbread-ai/mxbai-rerank-xsmall-v1 \
--port $port
See all options
Inspect every available parameter for the v2 command.
infinity_emb v2 --help
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
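Once a server is running, a client can call it over HTTP. The sketch below uses only the Python standard library. It assumes the server listens on `http://localhost:7997` (the port from the Docker example) and exposes an OpenAI-style `POST /embeddings` route with no URL prefix; both are assumptions, so confirm the routes against your running server before relying on them. `build_embedding_request`, `parse_embeddings`, and `embed` are hypothetical helper names, and `sample` is a hand-written response, not real server output.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"  # assumption: port from the Docker example


def build_embedding_request(model, texts, base_url=BASE_URL):
    """Build a POST request whose body follows the OpenAI embeddings shape."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",  # assumption: no URL prefix configured
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload):
    """Return vectors ordered by each item's 'index' field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(model, texts):
    """Send the request and return one vector per input text."""
    with urllib.request.urlopen(build_embedding_request(model, texts)) as resp:
        return parse_embeddings(json.load(resp))


# Hand-written response in the OpenAI shape, used to show the parsing step.
sample = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}
print(parse_embeddings(sample))  # [[0.1, 0.2], [0.3, 0.4]]
```

With the Docker example running, `embed("michaelfeil/bge-small-en-v1.5", ["hello"])` would send one text to the embedder. The `model` field is what selects among the models launched with `--model-id`.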
When to use it
- Self-host an embedding model for a RAG or semantic search pipeline behind an OpenAI-compatible endpoint
- Serve an embedder and a reranker together to power two-stage retrieval
- Run multi-modal models like CLIP or ColPali for image and document retrieval
- Deploy embedding inference across CUDA, ROCm, CPU, Inferentia, or Apple MPS hardware
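The embedder-plus-reranker use case follows a two-stage pattern: a cheap vector similarity pass narrows the corpus, then a reranker rescores the survivors. The sketch below shows only the client-side logic in plain Python. The vectors and rerank scores are made-up stand-ins for what an embedding endpoint and a reranking endpoint would return, and `top_k_by_cosine` and `reorder_by_scores` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, doc_vecs, k):
    """Stage 1: indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def reorder_by_scores(candidates, scores):
    """Stage 2: reorder candidate indices by reranker score, best first."""
    pairs = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [idx for idx, _ in pairs]


# Made-up embeddings standing in for the embedder's output.
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]

candidates = top_k_by_cosine(query_vec, doc_vecs, k=2)
rerank_scores = [0.2, 0.9]  # made-up reranker scores, aligned with candidates
final = reorder_by_scores(candidates, rerank_scores)
print(candidates, final)  # [1, 2] [2, 1]
```

In this toy data the reranker flips the order of the two candidates, which is the point of the second stage: it can correct a ranking that vector similarity alone got wrong.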
How Infinity compares
Infinity alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | High-throughput REST API for serving embeddings, rerankers, CLIP and ColPali models. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |