Overview
Text Embeddings Inference (TEI) is a Hugging Face toolkit for deploying and serving open-source text embedding and sequence-classification models. You point it at a model from the Hugging Face Hub, and it exposes an HTTP API that turns text into vectors. It also handles re-ranking and classification models, so one server can cover several common tasks.
TEI is written in Rust and aimed at developers who need to run embedding models themselves instead of calling a hosted API. It supports many popular model families, including BERT, XLM-RoBERTa, GTE, E5, Nomic, ModernBERT, and Qwen3 embedding models. There is no separate graph-compilation step, so the server boots quickly and is straightforward to put behind a service.
As an embedding-serving tool, TEI sits between your application and the raw model weights. It loads Safetensors or ONNX weights, batches incoming requests by token count, and returns vectors you can store in a vector database for search, retrieval, or RAG. It runs on NVIDIA GPUs, AMD Instinct GPUs (experimental), Apple Silicon via Metal, and CPU.
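To make "batches incoming requests by token count" concrete, here is a minimal sketch of greedy token-budget batching. It is an illustration of the general idea only, not TEI's actual scheduler; the function name and the `max_batch_tokens` parameter are hypothetical names chosen for this example.

```python
from typing import Iterable, List


def batch_by_token_budget(token_counts: Iterable[int], max_batch_tokens: int) -> List[List[int]]:
    """Group request indices into batches whose summed token counts fit a budget.

    Illustrative only: a simple greedy, arrival-order policy, not TEI's real scheduler.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for index, count in enumerate(token_counts):
        if count > max_batch_tokens:
            # A single request larger than the budget can never be scheduled.
            raise ValueError(
                f"request {index} has {count} tokens, over the budget of {max_batch_tokens}"
            )
        # Close the current batch when adding this request would exceed the budget.
        if current and current_tokens + count > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += count
    if current:
        batches.append(current)
    return batches
```

Batching by total tokens rather than by request count keeps the work per batch roughly constant, whether the queue holds many short inputs or a few long ones.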
What it does
- Serves text-embedding, re-ranking, and sequence-classification models from one server
- Supports many model families: BERT, CamemBERT, XLM-RoBERTa, JinaBERT, GTE, E5, Nomic, MPNet, ModernBERT, Qwen3, and Gemma3
- Batches requests dynamically by token count and skips model graph compilation, so the server boots quickly
- Loads Safetensors and ONNX weights; uses Flash Attention, Candle, and cuBLASLt for inference
- Runs on NVIDIA GPU, AMD Instinct (ROCm, experimental), Apple Silicon (Metal), and CPU
- Production features: OpenTelemetry distributed tracing, Prometheus metrics, and a gRPC option
Getting started
The quickest way to run TEI is the official Docker image, which starts a server you can call over HTTP. You can also install it locally with Cargo or Homebrew.
Start the server with Docker
Run the CUDA image and pass a Hugging Face model id. This serves the model on port 8080 and caches weights in a local data directory.
docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always \
ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
--model-id Qwen/Qwen3-Embedding-0.6B
Get embeddings from the API
Send text to the /embed endpoint as JSON and receive the vector in the response.
curl 127.0.0.1:8080/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
Install locally (optional)
On Apple Silicon you can install TEI with Homebrew and launch the router directly. For other platforms, install Rust and build the router with Cargo.
brew install text-embeddings-inference
text-embeddings-router --model-id Qwen/Qwen3-Embedding-0.6B --port 8080
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
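From application code, the /embed request shown earlier with curl can be issued with Python's standard library. This is a minimal sketch, not an official client: the helper names `build_embed_request` and `embed` are hypothetical, and the assumption that the response body is a JSON list of vectors (one per input) should be checked against the TEI API documentation for your version.

```python
import json
import urllib.request
from typing import List, Union


def build_embed_request(base_url: str, inputs: Union[str, List[str]]) -> urllib.request.Request:
    """Build the POST request for /embed with the same JSON body as the curl call."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, inputs: Union[str, List[str]], timeout: float = 30.0) -> List[List[float]]:
    """Send the request and decode the JSON response.

    Assumption: the server answers with a list of embedding vectors.
    """
    request = build_embed_request(base_url, inputs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server listening on port 8080, `embed("http://127.0.0.1:8080", "What is Deep Learning?")` sends the same payload as the curl example. Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a running server.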
When to use it
- Self-host an embedding endpoint to feed a vector database for semantic search or RAG
- Run a re-ranker to reorder retrieved documents before sending them to an LLM
- Serve a sequence-classification model for tasks like sentiment or intent detection
- Replace a paid embedding API with a model you control on your own GPU or CPU
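The first two use cases are often combined: retrieve candidates by vector similarity, then reorder them with a re-ranker. The sketch below shows only the client-side arithmetic under stated assumptions. A real deployment would delegate `top_k` to a vector database, and the `{"index": ..., "score": ...}` result shape passed to `reorder_by_score` is an assumption about what a re-ranking route (documented as `/rerank` in the TEI README, worth confirming for your version) returns. All function names are hypothetical.

```python
import math
from typing import Any, List, Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k documents most similar to the query (brute force)."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def reorder_by_score(texts: Sequence[str], results: Sequence[Mapping[str, Any]]) -> List[str]:
    """Reorder texts by descending score.

    Assumption: each result carries the original position under "index"
    and a relevance value under "score".
    """
    ranked = sorted(results, key=lambda item: item["score"], reverse=True)
    return [texts[int(item["index"])] for item in ranked]
```

The brute-force `top_k` is fine for a few thousand vectors; beyond that, an approximate nearest-neighbour index in a vector database is the usual choice.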
How Text Embeddings Inference (TEI) compares
How TEI compares with the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Serve open-source embedding and reranking models over a fast HTTP API. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |