Overview
LLM2Vec is a Python toolkit and recipe for turning decoder-only large language models, such as Llama 3 or Mistral, into text encoders that produce sentence and document embeddings. The recipe has three steps: enabling bidirectional attention, training with masked next token prediction (MNTP), and applying unsupervised contrastive learning. The resulting model can be fine-tuned further for stronger retrieval and similarity performance.
It is aimed at engineers and researchers who already work with HuggingFace models and want embeddings from an LLM they trust, rather than a separate embedding model. The library wraps a base model so you can load it, optionally attach LoRA weights, and call a single encode method to get vectors.
As an embedding-model tool, LLM2Vec fits retrieval, clustering, classification, and semantic-similarity pipelines. The McGill-NLP team also publishes pre-converted checkpoints on HuggingFace, so you can use the embeddings without running the full training recipe yourself.
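The first step of the recipe, enabling bidirectional attention, is easiest to picture as a change to the attention mask. The sketch below is not LLM2Vec's implementation; it is a dependency-free illustration that contrasts the causal mask a decoder-only model normally uses with the all-ones mask that bidirectional attention implies. The helper names `causal_mask` and `bidirectional_mask` are hypothetical and exist only for this example.

```python
def causal_mask(n):
    """Causal mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


# Print both masks for a 4-token sequence (rows = query position).
print("causal:")
for row in causal_mask(4):
    print(row)

print("bidirectional:")
for row in bidirectional_mask(4):
    print(row)
```

In the causal case the last token is the only one that sees the whole sequence, whereas with the bidirectional mask every token representation can draw on the full input.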
What it does
- Converts decoder-only LLMs (Llama 3, Mistral, Llama 2, Sheared-LLaMA, plus Gemma and Qwen-2) into text encoders
- Three-step recipe: bidirectional attention, masked next token prediction (MNTP), and unsupervised contrastive learning
- Loads base models and optional PEFT/LoRA weights through a from_pretrained wrapper over HuggingFace models
- Supports instruction-prefixed queries for asymmetric retrieval tasks
- Configurable pooling strategy (default mean) and max sequence length (default 512)
- Pre-trained supervised and unsupervised checkpoints published on HuggingFace
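The default mean pooling mentioned in the list above can be sketched in a few lines. This is an illustration of the general technique, assuming padding positions are marked with 0 in an attention mask; it is not the library's own code, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Excluding padded positions matters when texts of different lengths are batched together; otherwise shorter texts would be pulled toward the padding vectors.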
Getting started
Install the package from PyPI, then load a converted model and call encode to get embeddings.
Install LLM2Vec
Install the package from PyPI, followed by flash-attention.
pip install llm2vec
pip install flash-attn --no-build-isolation
Load a converted model
Use from_pretrained with a base MNTP model and optional LoRA weights. By default the model loads with bidirectional connections enabled.
import torch
from llm2vec import LLM2Vec
l2v = LLM2Vec.from_pretrained(
    # Base model trained with MNTP
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    # Optional LoRA weights from unsupervised contrastive (SimCSE) training
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)
Encode text
Pass plain texts, or [instruction, text] pairs for queries, then compute cosine similarity.
# Queries are [instruction, text] pairs; documents are plain strings.
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)
documents = [
    "The CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
    "The summit is the highest point of a mountain.",
]
d_reps = l2v.encode(documents)
# L2-normalize so the matrix product below yields cosine similarities.
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
# cos_sim[i][j] is the similarity between query i and document j.
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Generate query and document embeddings for semantic search and retrieval
- Build text classification or clustering pipelines on top of LLM-based embeddings
- Measure sentence similarity using cosine similarity between encoded texts
- Reuse an existing decoder-only LLM as an encoder instead of adding a separate embedding model
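For the retrieval use case above, ranking usually comes down to sorting documents by cosine similarity to the query embedding. The sketch below uses toy 2-D vectors in place of real embeddings and plain Python instead of PyTorch so that it runs anywhere; `normalize`, `cosine_similarity`, and `top_k` are hypothetical helper names, not part of LLM2Vec.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=1):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy vectors standing in for document and query embeddings.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.9, 0.1]
for index, score in top_k(query, docs, k=2):
    print(index, round(score, 3))
```

Cosine distance, where a pipeline needs it, is simply one minus the similarity computed here. For large collections a vector index would replace the linear scan, but the scoring is the same.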
How LLM2Vec compares
LLM2Vec alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction. |
| LLM2Vec | ★ 1.7k | A toolkit and recipe for turning decoder-only LLMs into text embedding models. |