AI/TLDR

Infinity

High-throughput REST API for serving embeddings, rerankers, CLIP and ColPali models

Overview

Infinity is a REST API server for running embedding, reranking, CLIP, CLAP, and ColPali models. You point it at a model from HuggingFace and it serves that model over HTTP, with an API aligned to OpenAI's embeddings spec so existing clients work with minimal changes.

It is built for teams that need to host their own embedding or reranking models instead of calling a hosted API. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia, and Apple MPS, using inference backends like PyTorch, optimum (ONNX/TensorRT), and CTranslate2, with dynamic batching to keep throughput high.

In the embedding-serving category, Infinity focuses on mixing and matching multiple models behind one server. You can launch several models at once and let Infinity route requests to each, which suits retrieval and RAG pipelines that need both an embedder and a reranker.

What it does

  • Serves any embedding, reranking, sentence-transformer, CLIP, or ColPali model from HuggingFace
  • OpenAI-compatible API built on FastAPI, so existing embedding clients work with small changes
  • Multiple inference backends: PyTorch, optimum (ONNX/TensorRT), and CTranslate2, with FlashAttention support
  • Runs on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia (INF2), or Apple MPS
  • Dynamic batching with tokenization handled in dedicated worker threads
  • Launch and orchestrate multiple models in one server via the v2 CLI

Getting started

Install the CLI with pip and launch a model, or run the prebuilt Docker image with GPU access.

Install via pip

Install the package with the [all] extra to pull in the inference backends. The quotes keep shells such as zsh from treating the brackets as a glob pattern.

```bash
pip install "infinity-emb[all]"
```

Launch a model

With your virtualenv active, start the server for a HuggingFace model using the v2 CLI.

```bash
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
```
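
Once the server is running, a request could look like the sketch below. It is an illustration under assumptions, not the project's documented client: the base URL (localhost on port 7997, the port used in the Docker example), the /embeddings route, and the OpenAI-style `model` and `input` fields are all assumed, and `json_escape`, `embeddings_payload`, and `embed` are hypothetical helper names. If your deployment serves the API under a path prefix, adjust the URL accordingly.

```bash
#!/usr/bin/env bash
# Sketch only: the base URL and the /embeddings route are assumptions.
INFINITY_URL="${INFINITY_URL:-http://localhost:7997}"

# Escape backslashes and double quotes so a string can sit inside JSON.
# (Simplified: control characters such as newlines are not handled.)
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# Build an OpenAI-style embeddings body: {"model": ..., "input": [...]}
#   $1 = model id, remaining arguments = texts to embed
embeddings_payload() {
  local model=$1
  shift
  local items="" text
  for text in "$@"; do
    items+="${items:+,}\"$(json_escape "$text")\""
  done
  printf '{"model":"%s","input":[%s]}' "$(json_escape "$model")" "$items"
}

# POST the payload to the (assumed) /embeddings route.
embed() {
  curl -sS "$INFINITY_URL/embeddings" \
    -H 'Content-Type: application/json' \
    -d "$(embeddings_payload "$@")"
}

# Usage, with a server running:
#   embed BAAI/bge-small-en-v1.5 "What is dynamic batching?"
```

Keeping the payload builder separate from the curl call makes it easy to inspect the JSON before sending anything over the network.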

Or run with Docker (recommended)

Use the prebuilt image with GPU access. The volume mount keeps the container's cache directory in ./data so it persists between runs, and repeating the --model-id flag serves several models at once.

```bash
port=7997
docker run -it --gpus all \
  -v $PWD/data:/app/.cache \
  -p $port:$port \
  michaelf34/infinity:latest \
  v2 \
  --model-id michaelfeil/bge-small-en-v1.5 \
  --model-id mixedbread-ai/mxbai-rerank-xsmall-v1 \
  --port $port
```
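
A freshly started container may need time to load its models before it answers requests. The sketch below shows one way to wait for it, assuming the server exposes a /health route at the base URL shown; both are assumptions rather than facts stated above, and `wait_until` is a hypothetical helper that simply retries whatever command it is given.

```bash
#!/usr/bin/env bash
# Sketch only: the /health route and base URL are assumptions.
INFINITY_URL="${INFINITY_URL:-http://localhost:7997}"

# Retry a command until it succeeds.
#   $1 = maximum attempts, $2 = seconds between attempts, rest = command to run
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage, after the container has been started:
#   wait_until 30 2 curl -sf "$INFINITY_URL/health" && echo "server is ready"
```

Because the probe is passed in as a command, the same helper works with any readiness check your deployment actually provides.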

See all options

Inspect every available parameter for the v2 command.

```bash
infinity_emb v2 --help
```

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Self-host an embedding model for a RAG or semantic search pipeline behind an OpenAI-compatible endpoint
  • Serve an embedder and a reranker together to power two-stage retrieval
  • Run multi-modal models like CLIP or ColPali for image and document retrieval
  • Deploy embedding inference across CUDA, ROCm, CPU, Inferentia, or Apple MPS hardware
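
For the two-stage retrieval case, the second stage sends a query and its candidate documents to the reranker. The sketch below shows what such a request might look like. The /rerank route, the `model`, `query`, and `documents` field names, and the base URL are assumptions, and `json_escape`, `rerank_payload`, and `rerank` are hypothetical helper names; confirm the actual schema against the running server's API docs.

```bash
#!/usr/bin/env bash
# Sketch only: the /rerank route and its field names are assumptions.
INFINITY_URL="${INFINITY_URL:-http://localhost:7997}"

# Escape backslashes and double quotes so a string can sit inside JSON.
# (Simplified: control characters such as newlines are not handled.)
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# Build a rerank body: {"model": ..., "query": ..., "documents": [...]}
#   $1 = reranker model id, $2 = query, remaining arguments = candidates
rerank_payload() {
  local model=$1 query=$2
  shift 2
  local docs="" doc
  for doc in "$@"; do
    docs+="${docs:+,}\"$(json_escape "$doc")\""
  done
  printf '{"model":"%s","query":"%s","documents":[%s]}' \
    "$(json_escape "$model")" "$(json_escape "$query")" "$docs"
}

# POST the payload to the (assumed) /rerank route.
rerank() {
  curl -sS "$INFINITY_URL/rerank" \
    -H 'Content-Type: application/json' \
    -d "$(rerank_payload "$@")"
}

# Usage, with the embedder's top candidates as the documents:
#   rerank mixedbread-ai/mxbai-rerank-xsmall-v1 "what is infinity?" \
#     "Infinity serves embedding models." "Bananas are yellow."
```

In a full pipeline, the candidate documents would come from a nearest-neighbor search over embeddings produced by the embedder served alongside the reranker.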

How Infinity compares

Infinity alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | GitHub stars | What it does |
| --- | --- | --- |
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | High-throughput REST API for serving embeddings, rerankers, CLIP and ColPali models. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |