AI/TLDR

Model2Vec

Distill any sentence transformer into a tiny, fast static embedding model

Overview

Model2Vec is a Python library that turns a regular sentence transformer into a small, static embedding model. Instead of running a neural network for every input, a static model looks up pre-computed token embeddings, which makes it much smaller and faster while keeping most of the quality. The maintainers report size reductions of up to 50x and speed-ups of up to 500x, with a small drop in performance.
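To make the lookup idea concrete, here is a minimal pure-Python sketch of a static embedding model: a table of per-token vectors that is averaged into one sentence vector. The toy vocabulary, the whitespace tokenizer, and the names `EMBEDDING_TABLE` and `embed` are illustrative assumptions, not Model2Vec's actual tokenizer or internals.

```python
# Toy static embedding model: no neural network runs at inference time.
# The vocabulary, vectors, and whitespace tokenizer are made up for illustration.
EMBEDDING_TABLE = {
    "fast": [1.0, 0.0],
    "small": [0.0, 1.0],
    "model": [1.0, 1.0],
}
UNKNOWN = [0.0, 0.0]  # fallback vector for tokens missing from the table


def embed(text: str) -> list[float]:
    """Look up each token's vector and mean-pool them into one sentence vector."""
    tokens = text.lower().split()
    if not tokens:
        return list(UNKNOWN)
    vectors = [EMBEDDING_TABLE.get(token, UNKNOWN) for token in tokens]
    dim = len(UNKNOWN)
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


print(embed("fast model"))  # [1.0, 0.5]
```

The only work at inference is a dictionary lookup and an average, which is why a model of this kind can run quickly on a CPU.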

It is aimed at developers who need text embeddings but want to run them on CPU without the overhead of a full transformer. You can use the ready-made Potion models from the Hugging Face hub, or distill your own static model from a sentence transformer in about 30 seconds on a CPU.

The embeddings it produces are standard vectors, so they fit tasks such as text classification, retrieval, clustering, and building RAG systems, and you can feed them into any of those pipelines.

What it does

  • Loads pre-trained Potion static models directly from the Hugging Face hub, ready to use
  • Distills your own static model from any sentence transformer in about 30 seconds on a CPU
  • Runs on CPU without a neural network at inference, so embeddings are small and fast
  • Produces both pooled embeddings and per-token embedding sequences via encode and encode_as_sequence
  • Optional training extra lets you fine-tune classification models on top of a distilled or pre-trained model
  • Supports BPE and Unigram tokenizer backends, plus quantization and dimensionality reduction to shrink models further
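The quantization point can be illustrated with a generic sketch: storing each value as an 8-bit integer plus one shared scale factor, instead of a 32-bit float, cuts storage to roughly a quarter. This is a simplified symmetric int8 scheme with hypothetical helper names (`quantize_int8`, `dequantize`), assumed for illustration. It is not taken from Model2Vec's code, and the library's own quantization options may work differently.

```python
def quantize_int8(vector: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


vector = [0.6, -1.0, 0.2]
quantized, scale = quantize_int8(vector)
restored = dequantize(quantized, scale)
print(quantized)
print(restored)
```

Each restored value differs from the original by at most half of `scale`, which is the precision traded for the smaller size.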

Getting started

Install the base package, then load a Potion model and create embeddings. If you want to build your own static model, install the distillation extra.

Install the base package

Install the lightweight base package with pip.

bash
pip install model2vec

Load a model and make embeddings

Load a pre-trained Potion model from the Hugging Face hub and encode some text.

python
from model2vec import StaticModel

# Load a model from the Hugging Face hub (here, potion-base-32M)
model = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

# Make sequences of token embeddings
token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."])

Distill your own model (optional)

Install the distillation extra, then distill a static model from a sentence transformer in about 30 seconds on a CPU.

bash
pip install "model2vec[distill]"

Run the distillation

Distill a sentence transformer and save the resulting static model.

python
from model2vec.distill import distill

# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5")

# Save the model
m2v_model.save_pretrained("m2v_model")

Commands and code are taken from the project's own documentation; always check the official repository for the latest version.

When to use it

  • Add fast, CPU-only text embeddings to a retrieval or RAG pipeline without hosting a full transformer
  • Shrink an existing sentence transformer into a smaller static model for resource-constrained or edge deployments
  • Generate vectors for text classification and clustering at high throughput
  • Embed multilingual text using the pre-trained potion-multilingual-128M model
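As a rough illustration of the retrieval use case, the sketch below ranks documents by cosine similarity to a query. It assumes only that the encoder object has an `encode` method that takes a list of strings and returns one vector per string. The helper names `cosine` and `top_k` and the `KeywordEncoder` stand-in are hypothetical and exist only so the example runs without downloading a model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(encoder, query, documents, k=3):
    """Return (score, document) pairs for the k best matches, best first."""
    vectors = encoder.encode([query] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


class KeywordEncoder:
    """Stand-in encoder for the demo: counts two keywords per text."""

    def encode(self, texts):
        return [
            [float(t.lower().count("cat")), float(t.lower().count("dog"))]
            for t in texts
        ]


docs = ["cat videos", "dog training tips", "cat and dog care"]
results = top_k(KeywordEncoder(), "cat toys", docs, k=2)
print([doc for _, doc in results])  # ['cat videos', 'cat and dog care']
```

A `StaticModel` loaded as shown in Getting started should be usable in place of `KeywordEncoder`, since it exposes an `encode` method that accepts a list of strings. For large collections, a vector index would replace the linear scan.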

How Model2Vec compares

Model2Vec alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models.
EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search.
Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU.
Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API.
ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first.
Model2Vec | ★ 2.1k | Distill any sentence transformer into a tiny, fast static embedding model.
Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction.
Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages.