AI/TLDR

LLM2Vec

Turn decoder-only LLMs into strong text embedding models

Overview

LLM2Vec is a Python toolkit and recipe for turning decoder-only large language models, such as Llama 3 or Mistral, into text encoders that produce sentence and document embeddings. It does this in three steps: enabling bidirectional attention, training with masked next token prediction (MNTP), and applying unsupervised contrastive learning. The resulting model can be fine-tuned further for stronger retrieval and similarity performance.
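
To make the first step concrete, the sketch below contrasts the causal attention mask a decoder-only model normally uses with the all-ones mask that bidirectional attention implies. It is a toy illustration in plain Python, not LLM2Vec's implementation; the function names causal_mask and bidirectional_mask are hypothetical.

```python
def causal_mask(n):
    """Row i marks the positions token i may attend to: only 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every position, as in an encoder."""
    return [[1] * n for _ in range(n)]


# Print both masks for a 4-token sequence (1 = may attend, 0 = blocked).
for name, mask in [("causal", causal_mask(4)),
                   ("bidirectional", bidirectional_mask(4))]:
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

Under the causal mask, early tokens never see later ones, so their representations summarize only a prefix of the text. Removing that restriction lets every token representation draw on the whole input.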

It is aimed at engineers and researchers who already work with HuggingFace models and want embeddings from an LLM they trust, rather than a separate embedding model. The library wraps a base model so you can load it, optionally attach LoRA weights, and call a single encode method to get vectors.

As an embedding-model tool, LLM2Vec fits retrieval, clustering, classification, and semantic-similarity pipelines. The McGill-NLP team also publishes pre-converted checkpoints on HuggingFace, so you can use the embeddings without running the full training recipe yourself.

What it does

  • Converts decoder-only LLMs (Llama 3, Mistral, Llama 2, Sheared-LLaMA, plus Gemma and Qwen-2) into text encoders
  • Three-step recipe: bidirectional attention, masked next token prediction (MNTP), and unsupervised contrastive learning
  • Loads base models and optional PEFT/LoRA weights through a from_pretrained wrapper over HuggingFace models
  • Supports instruction-prefixed queries for asymmetric retrieval tasks
  • Configurable pooling strategy (default mean) and max sequence length (default 512)
  • Pre-trained supervised and unsupervised checkpoints published on HuggingFace
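
As a rough picture of what mean pooling with a maximum sequence length involves, the sketch below averages per-token vectors while ignoring padding. It assumes plain Python lists instead of tensors, and the helper name mean_pool is hypothetical; the library's own pooling works on batched model outputs and may handle instruction tokens differently.

```python
def mean_pool(token_vectors, attention_mask, max_length=512):
    """Average token vectors over real (non-padding) tokens.

    token_vectors: list of per-token vectors (lists of floats).
    attention_mask: list of 1 (real token) or 0 (padding), same length.
    max_length: tokens beyond this position are dropped before pooling.
    """
    vectors = token_vectors[:max_length]
    mask = attention_mask[:max_length]
    kept = [vec for vec, m in zip(vectors, mask) if m == 1]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding row is excluded, so the result is the average of the first two token vectors only.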

Getting started

Install the package from PyPI, then load a converted model and call encode to get embeddings.

Install LLM2Vec

Install llm2vec from PyPI, then install flash-attn.

bash
pip install llm2vec
pip install flash-attn --no-build-isolation

Load a converted model

Use from_pretrained with a base MNTP model and optional LoRA weights. By default the model loads with bidirectional connections enabled.

python
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

Encode text

Pass plain texts, or [instruction, text] pairs for queries, then compute cosine similarity.

python
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

documents = [
    "The CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
    "The summit is the highest point of a mountain.",
]
d_reps = l2v.encode(documents)

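# L2-normalize each embedding so the dot product equals cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
# Rows are queries, columns are documents
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)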
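
A similarity matrix is usually a means to an end: picking the best document for each query. The sketch below shows that step in plain Python, using toy 2-dimensional vectors as stand-ins for q_reps and d_reps; the helpers cosine and best_match are hypothetical names, not part of LLM2Vec.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vecs, doc_vecs):
    """Return, for each query, the index of the highest-scoring document."""
    results = []
    for q in query_vecs:
        scores = [cosine(q, d) for d in doc_vecs]
        results.append(max(range(len(scores)), key=scores.__getitem__))
    return results


# Toy vectors standing in for real embeddings.
toy_queries = [[1.0, 0.1], [0.2, 1.0]]
toy_docs = [[0.9, 0.0], [0.0, 0.8]]
print(best_match(toy_queries, toy_docs))  # [0, 1]
```

With the tensors from the encode example, cos_sim.argmax(dim=1) performs the same selection in a single call.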

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Generate query and document embeddings for semantic search and retrieval
  • Build text classification or clustering pipelines on top of LLM-based embeddings
  • Measure sentence similarity using cosine similarity between encoded texts
  • Reuse an existing decoder-only LLM as an encoder instead of adding a separate embedding model
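
For the classification use case, one simple approach is a nearest-centroid classifier over embeddings. The sketch below is a minimal illustration under stated assumptions: the 2-dimensional vectors are made up to stand in for encoded texts, and the labels and helper names (centroid, cosine, classify) are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def classify(vec, centroids):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Made-up embeddings for a few labelled example texts per class.
labelled = {
    "sports": [[1.0, 0.0], [0.8, 0.2]],
    "finance": [[0.0, 1.0], [0.1, 0.9]],
}
centroids = {label: centroid(vecs) for label, vecs in labelled.items()}
print(classify([0.7, 0.1], centroids))  # sports
```

In practice the vectors would come from encoding labelled texts, and a trained classifier on top of the embeddings may work better than centroids when classes overlap.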

How LLM2Vec compares

LLM2Vec alongside other open-source embedding models & inference tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models.
EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search.
Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU.
Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API.
ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first.
Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference.
Instructor Embedding | ★ 2k | Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction.
LLM2Vec | ★ 1.7k | Turn decoder-only LLMs into strong text embedding models.