Text Embeddings Inference (TEI)

Serve open-source embedding and reranking models over a fast HTTP API

GitHub: github.com/huggingface/text-embeddings-inference (★ 4.9k)
Docs: huggingface.co/docs/text-embeddings-inference

Overview

Text Embeddings Inference (TEI) is a Hugging Face toolkit for deploying and serving open-source text embedding and sequence-classification models. You point it at a model from the Hugging Face Hub, and it exposes an HTTP API that turns text into vectors. It also handles re-ranking and classification models, so the same server software covers several common tasks, with each running instance serving the model it was started with.

TEI is written in Rust and aimed at developers who need to run embedding models themselves instead of calling a hosted API. It supports many popular model families, including BERT, XLM-RoBERTa, GTE, E5, Nomic, ModernBERT, and Qwen3 embedding models. There is no separate graph-compilation step, so the server boots quickly and is straightforward to deploy as a service.

As an embedding-serving tool, TEI sits between your application and the raw model weights. It loads Safetensors or ONNX weights, batches incoming requests by token count, and returns vectors you can store in a vector database for search, retrieval, or RAG. It runs on NVIDIA GPUs, AMD Instinct GPUs (experimental), Apple Silicon via Metal, and CPU.

What it does

Serves text-embedding, re-ranking, and sequence-classification models from one server
Supports many model families: BERT, CamemBERT, XLM-RoBERTa, JinaBERT, GTE, E5, Nomic, MPNet, ModernBERT, Qwen3, and Gemma3
Token-based dynamic batching, plus fast boot times because there is no model graph compilation step
Loads Safetensors and ONNX weights; uses Flash Attention, Candle, and cuBLASLt for inference
Runs on NVIDIA GPU, AMD Instinct (ROCm, experimental), Apple Silicon (Metal), and CPU
Production features: OpenTelemetry distributed tracing, Prometheus metrics, and a gRPC option

Getting started

The quickest way to run TEI is the official Docker image, which starts a server you can call over HTTP. You can also install it locally with Cargo or Homebrew.

Start the server with Docker

Run the CUDA image and pass a Hugging Face model id. The -p 8080:80 flag publishes the container's port 80 on host port 8080, and the -v flag mounts $PWD/data at /data so downloaded weights are cached between runs.

docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
  --model-id Qwen/Qwen3-Embedding-0.6B
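
On first start the server has to download and load the model, so a script that calls the API right after docker run may want to wait until the server answers. The sketch below polls a health route until it responds or a retry budget runs out. It assumes TEI exposes GET /health as described in its API documentation and that the port mapping matches the command above; wait_for_tei and its two arguments are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: poll the server until it answers or we give up.
# Assumption: TEI exposes GET /health (check the API docs for your version).
# $1 = base URL, $2 = maximum number of attempts (roughly one per second).
wait_for_tei() {
  base_url=$1
  attempts=$2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: exit non-zero on HTTP errors; -s: silent; --max-time: cap each try.
    if curl -fs --max-time 2 -o /dev/null "$base_url/health"; then
      return 0
    fi
    i=$((i + 1))
    # Pause only when another attempt is still coming.
    if [ "$i" -le "$attempts" ]; then
      sleep 1
    fi
  done
  return 1
}
```

For example, wait_for_tei http://127.0.0.1:8080 60 && echo "TEI is ready" gives the server about a minute to come up before the script continues.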

Get embeddings from the API

Send text to the /embed endpoint as JSON. The response is a JSON array containing one embedding vector per input.

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
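
The same endpoint also accepts a list of strings in "inputs", which lets one request embed several texts at once. The sketch below builds that body from shell arguments and posts it to the server from the Docker example. The helper names (json_escape, embed_payload, embed_batch) are hypothetical, and the escaping is deliberately minimal: it covers backslashes and double quotes only, so a real script would more likely use a JSON tool.

```shell
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Sketch only: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build an /embed body whose "inputs" field is an array, one entry per argument.
embed_payload() {
  body='{"inputs":['
  sep=''
  for text in "$@"; do
    body="$body$sep\"$(json_escape "$text")\""
    sep=','
  done
  printf '%s]}' "$body"
}

# Post the batch to a TEI server (address assumed from the Docker example).
embed_batch() {
  curl -s 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d "$(embed_payload "$@")"
}
```

With the server running, embed_batch "What is Deep Learning?" "What is a vector database?" prints a JSON array with one vector per input, in the order the inputs were given.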

Install locally (optional)

On Apple Silicon you can install TEI with Homebrew and launch the router directly. For other platforms, install Rust and build the router with Cargo.

brew install text-embeddings-inference
text-embeddings-router --model-id Qwen/Qwen3-Embedding-0.6B --port 8080

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Self-host an embedding endpoint to feed a vector database for semantic search or RAG
Run a re-ranker to reorder retrieved documents before sending them to an LLM
Serve a sequence-classification model for tasks like sentiment or intent detection
Replace a paid embedding API with a model you control on your own GPU or CPU
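
The re-ranking and classification use cases go through their own routes rather than /embed. The sketch below builds request bodies for /rerank (a query plus candidate texts) and /predict (a single input), following the request shapes in TEI's API documentation. It assumes the server was started with a re-ranker or a sequence-classification model respectively, and that it listens on the address from the Docker example. The helper names and the TEI_URL variable are hypothetical, and the JSON escaping covers backslashes and double quotes only.

```shell
# Base URL of a running TEI server (assumption: the Docker example's mapping).
TEI_URL=${TEI_URL:-http://127.0.0.1:8080}

# Hypothetical helper: minimal JSON string escaping (backslash and quote only).
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a /rerank body: first argument is the query, the rest are candidates.
rerank_payload() {
  query=$(json_escape "$1")
  shift
  texts=''
  sep=''
  for doc in "$@"; do
    texts="$texts$sep\"$(json_escape "$doc")\""
    sep=','
  done
  printf '{"query":"%s","texts":[%s]}' "$query" "$texts"
}

# Build a /predict body for a single input string.
predict_payload() {
  printf '{"inputs":"%s"}' "$(json_escape "$1")"
}

# Score candidate texts against a query (needs a re-ranker model loaded).
rerank() {
  curl -s "$TEI_URL/rerank" -X POST \
    -H 'Content-Type: application/json' \
    -d "$(rerank_payload "$@")"
}

# Classify one text (needs a sequence-classification model loaded).
predict() {
  curl -s "$TEI_URL/predict" -X POST \
    -H 'Content-Type: application/json' \
    -d "$(predict_payload "$1")"
}
```

For example, rerank "What is Deep Learning?" "Deep Learning is not..." "Deep learning is..." returns a score for each candidate, which an application can use to reorder retrieved documents before passing them to an LLM.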

How Text Embeddings Inference (TEI) compares

How TEI compares with the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Sentence Transformers	★ 18.8k	The standard Python framework for loading, training, and computing embeddings with sentence and reranking models.
EmbeddingGemma (Gemma)	★ 5.5k	Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search.
Text Embeddings Inference (TEI)	★ 4.9k	Serve open-source embedding and reranking models over a fast HTTP API
Infinity (Embeddings)	★ 2.8k	A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API.
ColPali	★ 2.7k	A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first.
Model2Vec	★ 2.1k	A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference.
Instructor Embedding	★ 2k	Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction.
Qwen3-Embedding	★ 2k	Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages.
