AI/TLDR

Text Embeddings Inference (TEI)

Serve open-source embedding and reranking models over a fast HTTP API

Overview

Text Embeddings Inference (TEI) is a Hugging Face toolkit for deploying and serving open-source text embedding and sequence-classification models. You point it at a model from the Hugging Face Hub, and it exposes an HTTP API that turns text into vectors. It also handles re-ranking and classification models, so one server can cover several common tasks.

TEI is written in Rust and aimed at developers who need to run embedding models themselves instead of calling a hosted API. It supports many popular model families, including BERT, XLM-RoBERTa, GTE, E5, Nomic, ModernBERT, and Qwen3 embedding models. There is no separate graph-compilation step, so the server boots quickly and is straightforward to put behind a service.

As an embedding-serving tool, TEI sits between your application and the raw model weights. It loads Safetensors or ONNX weights, batches incoming requests by token count, and returns vectors you can store in a vector database for search, retrieval, or RAG. It runs on NVIDIA GPUs, AMD Instinct GPUs (experimental), Apple Silicon via Metal, and CPU.
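To make the "store vectors, then search" step concrete, here is a minimal sketch of ranking stored vectors against a query vector by cosine similarity. The vectors are made-up three-dimensional toy values rather than TEI output, and the helper names `cosine_similarity` and `top_k` are hypothetical. A real deployment would normally delegate this step to a vector database.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "index": in practice these vectors would come from the embedding server.
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.1, 0.0], index, k=2)
print([doc_id for doc_id, _ in results])  # ['doc-a', 'doc-c']
```

A linear scan like this is fine for a few thousand vectors; beyond that, an approximate nearest-neighbour index is the usual choice.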

What it does

  • Serves text-embedding, re-ranking, and sequence-classification models from one server
  • Supports many model families: BERT, CamemBERT, XLM-RoBERTa, JinaBERT, GTE, E5, Nomic, MPNet, ModernBERT, Qwen3, and Gemma3
  • Token-based dynamic batching, plus fast boot times because there is no model graph compilation step
  • Loads Safetensors and ONNX weights; uses Flash Attention, Candle, and cuBLASLt for inference
  • Runs on NVIDIA GPU, AMD Instinct (ROCm, experimental), Apple Silicon (Metal), and CPU
  • Production features: OpenTelemetry distributed tracing, Prometheus metrics, and a gRPC option
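The token-based batching item above can be illustrated with a simplified sketch: requests are grouped so that the total token count of each batch stays within a budget, instead of capping the number of requests per batch. This is not TEI's actual scheduler, which is written in Rust. The function name `batch_by_tokens` and the budget of 512 are hypothetical, and token counts are passed in directly rather than computed by a tokenizer.

```python
from typing import List, Sequence


def batch_by_tokens(token_counts: Sequence[int], max_batch_tokens: int) -> List[List[int]]:
    """Greedily group request indices so each batch stays within the token budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for i, n in enumerate(token_counts):
        if n > max_batch_tokens:
            raise ValueError(
                f"request {i} has {n} tokens, over the budget of {max_batch_tokens}"
            )
        # Close the current batch if adding this request would exceed the budget.
        if current and current_tokens + n > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


# Five queued requests with hypothetical token counts.
batches = batch_by_tokens([120, 300, 80, 500, 40], max_batch_tokens=512)
print(batches)  # [[0, 1, 2], [3], [4]]
```

Budgeting by tokens rather than by request count keeps the work per batch roughly constant, whether the queue holds many short queries or a few long documents.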

Getting started

The quickest way to run TEI is the official Docker image, which starts a server you can call over HTTP. You can also install it locally with Cargo or Homebrew.

Start the server with Docker

Run the CUDA image and pass a Hugging Face model id. The -p 8080:80 mapping exposes the server on host port 8080, and the mounted volume caches downloaded weights in ./data so they are not fetched again on restart.

```bash
docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
  --model-id Qwen/Qwen3-Embedding-0.6B
```

Get embeddings from the API

Send text to the /embed endpoint as JSON and receive the vector in the response.

```bash
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```
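
The same request can be made from application code. The sketch below uses only the Python standard library and assumes the server from the Docker step is reachable at http://127.0.0.1:8080. The helper names `build_embed_request` and `embed` are hypothetical. It also assumes that /embed accepts a list of strings under `inputs` and returns one vector per input as a JSON array of arrays; verify both against the API documentation for the version you run.

```python
import json
import urllib.request
from typing import List

TEI_URL = "http://127.0.0.1:8080"  # assumed address of the local server


def build_embed_request(texts: List[str], base_url: str = TEI_URL) -> urllib.request.Request:
    """Build the POST request for /embed without sending it."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: List[str], base_url: str = TEI_URL, timeout: float = 30.0) -> List[List[float]]:
    """Send texts to the server and return one vector per input text."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
#   vectors = embed(["What is Deep Learning?", "What is an embedding?"])
#   print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Sending several texts in one call lets the server batch them together, which is usually cheaper than one HTTP round trip per text.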

Install locally (optional)

On Apple Silicon you can install TEI with Homebrew and launch the router directly. For other platforms, install Rust and build the router with Cargo.

```bash
brew install text-embeddings-inference
text-embeddings-router --model-id Qwen/Qwen3-Embedding-0.6B --port 8080
```

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Self-host an embedding endpoint to feed a vector database for semantic search or RAG
  • Run a re-ranker to reorder retrieved documents before sending them to an LLM
  • Serve a sequence-classification model for tasks like sentiment or intent detection
  • Replace a paid embedding API with a model you control on your own GPU or CPU
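
For the re-ranking use case, the client side can be sketched as follows: build a request body for a /rerank endpoint, then reorder the candidate documents using the scores in the response. The sketch assumes a request shape of `{"query": ..., "texts": [...]}` and a response made of `{"index": ..., "score": ...}` objects, as the project's documentation describes. The helper names and the sample response values are made up for illustration, and no server is contacted.

```python
import json
from typing import Any, Dict, List, Tuple


def build_rerank_body(query: str, texts: List[str]) -> bytes:
    """JSON body for a POST to /rerank (assumed request shape)."""
    return json.dumps({"query": query, "texts": texts}).encode("utf-8")


def apply_rerank(texts: List[str], response: List[Dict[str, Any]]) -> List[Tuple[str, float]]:
    """Map (index, score) results back to the texts, best score first."""
    ordered = sorted(response, key=lambda item: item["score"], reverse=True)
    return [(texts[int(item["index"])], float(item["score"])) for item in ordered]


texts = [
    "The weather is mild today.",
    "Deep learning is a subset of machine learning.",
]
# Hypothetical response; a real one would come from the re-ranker model.
sample_response = [{"index": 0, "score": 0.03}, {"index": 1, "score": 0.98}]

ranked = apply_rerank(texts, sample_response)
print(ranked[0][0])  # Deep learning is a subset of machine learning.
```

Sorting on the client keeps the helper correct even if the server returns results in input order instead of score order.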

How Text Embeddings Inference (TEI) compares

The table below lists Text Embeddings Inference (TEI) alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Serve open-source embedding and reranking models over a fast HTTP API. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |