Infinity

High-throughput REST API for serving embeddings, rerankers, CLIP and ColPali models

github.com/michaelfeil/infinity · ★ 2.8k · michaelfeil.eu/infinity

Overview

Infinity is a REST API server for running embedding, reranking, CLIP, CLAP, and ColPali models. You point it at a model from HuggingFace and it serves that model over HTTP, with an API aligned to OpenAI's embeddings spec so existing clients work with minimal changes.

It is built for teams that need to host their own embedding or reranking models instead of calling a hosted API. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia, and Apple MPS, using inference backends like PyTorch, optimum (ONNX/TensorRT), and CTranslate2, with dynamic batching to keep throughput high.

In the embedding-serving category, Infinity focuses on mixing and matching multiple models behind one server. You can launch several models at once and let Infinity route requests to each, which suits retrieval and RAG pipelines that need both an embedder and a reranker.

What it does

Serves any embedding, reranking, sentence-transformer, CLIP, or ColPali model from HuggingFace
OpenAI-compatible API built on FastAPI, so existing embedding clients work with small changes
Multiple inference backends: PyTorch, optimum (ONNX/TensorRT), and CTranslate2, using FlashAttention
Runs on NVIDIA CUDA, AMD ROCm, CPU, AWS Inferentia (INF2), or Apple MPS
Dynamic batching with tokenization handled in dedicated worker threads
Launch and orchestrate multiple models in one server via the v2 CLI

Getting started

Install the CLI with pip and launch a model, or run the prebuilt Docker image with GPU access.

Install via pip

Install the package with the [all] extra, which pulls in the optional inference backends.


pip install "infinity-emb[all]"

Launch a model

With your virtualenv active, start the server for a HuggingFace model using the v2 CLI.


infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
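
Once the server is up, you can check it with a plain HTTP request. The sketch below assumes the server listens on localhost port 7997 (the port used in the Docker example) and exposes an OpenAI-style POST /embeddings route with no URL prefix; confirm both against your running instance. The helper names embed_body and embed are made up for illustration and are not part of Infinity.

```shell
# Base URL of the server; localhost:7997 is an assumption, override as needed.
INFINITY_URL="${INFINITY_URL:-http://localhost:7997}"

# Build an OpenAI-style embeddings request body for a single input string.
# The text is inserted verbatim, so it must not contain quotes or backslashes.
embed_body() {
  printf '{"model": "%s", "input": ["%s"]}\n' "$1" "$2"
}

# POST the body to the assumed /embeddings route and print the raw JSON reply.
embed() {
  curl -sS -X POST "$INFINITY_URL/embeddings" \
    -H "Content-Type: application/json" \
    -d "$(embed_body "$1" "$2")"
}

# Print the payload that would be sent; this line runs without a server.
embed_body "BAAI/bge-small-en-v1.5" "What is dynamic batching?"
```

With the server running, embed "BAAI/bge-small-en-v1.5" "some text" should return a JSON document containing the embedding vector, provided the response follows OpenAI's embeddings format.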

Or run with Docker (recommended)

Use the prebuilt image, expose your GPUs with --gpus all, and mount a local directory as the model cache so downloads persist between runs. You can pass multiple --model-id flags to serve several models at once.


port=7997
docker run -it --gpus all \
 -v "$PWD/data":/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest \
 v2 \
 --model-id michaelfeil/bge-small-en-v1.5 \
 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1 \
 --port $port
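
With two models loaded, each request names the model it wants. The sketch below builds a rerank request for the second model. It assumes a POST /rerank route that accepts model, query, and documents fields, and that the model field selects among the loaded models; check the server's own API docs before relying on either. rerank_body and rerank are hypothetical helper names.

```shell
# Base URL of the server; matches port=7997 above, override as needed.
INFINITY_URL="${INFINITY_URL:-http://localhost:7997}"

# Build a rerank body: arg 1 is the model, arg 2 the query, the rest are documents.
# Strings are inserted verbatim, so they must not contain quotes or backslashes.
# Variables are global because POSIX sh has no local keyword.
rerank_body() {
  model="$1"; query="$2"; shift 2
  docs=""
  for doc in "$@"; do
    # Append each document as a JSON string, comma-separated after the first.
    docs="${docs:+$docs, }\"$doc\""
  done
  printf '{"model": "%s", "query": "%s", "documents": [%s]}\n' "$model" "$query" "$docs"
}

# POST the body to the assumed /rerank route and print the raw JSON reply.
rerank() {
  curl -sS -X POST "$INFINITY_URL/rerank" \
    -H "Content-Type: application/json" \
    -d "$(rerank_body "$@")"
}

# Print the payload that would be sent; this line runs without a server.
rerank_body "mixedbread-ai/mxbai-rerank-xsmall-v1" \
  "Where is Munich?" "Munich is in Germany." "The sky is blue."
```

In a two-stage retrieval setup, the embedder request would fetch candidates first and a call like rerank with the same query would then reorder them.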

See all options

Inspect every available parameter for the v2 command.


infinity_emb v2 --help

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Self-host an embedding model for a RAG or semantic search pipeline behind an OpenAI-compatible endpoint
Serve an embedder and a reranker together to power two-stage retrieval
Run multi-modal models like CLIP or ColPali for image and document retrieval
Deploy embedding inference across CUDA, ROCm, CPU, Inferentia, or Apple MPS hardware

How Infinity compares

Infinity alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Sentence Transformers	★ 18.8k	The standard Python framework for loading, training, and computing embeddings with sentence and reranking models.
EmbeddingGemma (Gemma)	★ 5.5k	Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search.
Text Embeddings Inference (TEI)	★ 4.9k	Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU.
Infinity (Embeddings)	★ 2.8k	High-throughput REST API for serving embeddings, rerankers, CLIP and ColPali models
ColPali	★ 2.7k	A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first.
Model2Vec	★ 2.1k	A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference.
Instructor Embedding	★ 2k	Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction.
Qwen3-Embedding	★ 2k	Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages.
