LLM2Vec

Turn decoder-only LLMs into strong text embedding models

github.com/McGill-NLP/llm2vec · ★ 1.7k · mcgill-nlp.github.io/llm2vec

Overview

LLM2Vec is a Python toolkit and recipe for turning decoder-only large language models, such as Llama 3 or Mistral, into text encoders that produce sentence and document embeddings. The recipe has three steps: enabling bidirectional attention, training with masked next token prediction (MNTP), and applying unsupervised contrastive learning. The resulting model can be fine-tuned further for stronger retrieval and similarity performance.
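
To make the first step concrete, the sketch below contrasts the causal attention mask a decoder-only model normally uses with the bidirectional mask an encoder uses. It is a plain-Python illustration only; the function names are hypothetical and are not part of the llm2vec package, which changes attention inside the model itself.

```python
def causal_mask(n):
    """Position i may attend only to positions j <= i (decoder-only default)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position (encoder-style)."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as text: 'x' where attention is allowed, '.' where it is blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(bidirectional_mask(4)))
```

With the causal mask, the first token never sees the rest of the sentence, so its representation cannot reflect later context. Removing that restriction lets every token's representation draw on the whole input.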

It is aimed at engineers and researchers who already work with HuggingFace models and want embeddings from an LLM they trust, rather than a separate embedding model. The library wraps a base model so you can load it, optionally attach LoRA weights, and call a single encode method to get vectors.

As an embedding-model tool, LLM2Vec fits retrieval, clustering, classification, and semantic-similarity pipelines. The McGill-NLP team also publishes pre-converted checkpoints on HuggingFace, so you can use the embeddings without running the full training recipe yourself.
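
The second step of the recipe, masked next token prediction, can be sketched as a data-preparation routine. The assumption here is that the masked token at position i is predicted from the model output at position i - 1, which keeps the objective close to ordinary next-token prediction. The mask symbol, the masking probability, and the function name are all hypothetical choices for illustration, not values taken from the project.

```python
import random

MASK = "<mask>"  # placeholder symbol; real tokenizers define their own mask token


def make_mntp_example(tokens, mask_prob=0.2, seed=0):
    """Mask some tokens and record which position's output predicts each one.

    Returns (masked_tokens, targets), where targets maps the position whose
    output is scored (i - 1) to the original token at the masked position i.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    # Simplification: position 0 is never masked because it has no previous position.
    for i in range(1, len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i - 1] = tokens[i]
    return masked, targets


tokens = ["the", "summit", "is", "the", "highest", "point"]
masked, targets = make_mntp_example(tokens, mask_prob=0.5, seed=1)
print(masked)
print(targets)
```

Because attention is bidirectional at this stage, the model can use tokens on both sides of each masked position when making the prediction.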

What it does

Converts decoder-only LLMs (Llama 3, Mistral, Llama 2, Sheared-LLaMA, plus Gemma and Qwen-2) into text encoders
Three-step recipe: bidirectional attention, masked next token prediction (MNTP), and unsupervised contrastive learning
Loads base models and optional PEFT/LoRA weights through a from_pretrained wrapper over HuggingFace models
Supports instruction-prefixed queries for asymmetric retrieval tasks
Configurable pooling strategy (default mean) and max sequence length (default 512)
Pre-trained supervised and unsupervised checkpoints published on HuggingFace
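
Mean pooling, the default strategy listed above, reduces a sequence of per-token vectors to a single embedding by averaging them. The following is a minimal plain-Python sketch that assumes padding positions are marked with 0 in an attention mask; the function name is hypothetical, and the library's own implementation may differ in detail.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask value 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: three real tokens followed by one padding position.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(hidden_states, mask))  # [3.0, 4.0]
```

Skipping padding matters when texts of different lengths are batched together, since averaging over padded positions would pull short texts toward the padding vector.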

Getting started

Install the package from PyPI, then load a converted model and call encode to get embeddings.

Install LLM2Vec

Install llm2vec from PyPI, then install flash-attn.

bash

pip install llm2vec
pip install flash-attn --no-build-isolation

Load a converted model

Use from_pretrained with a base MNTP model and optional LoRA weights. By default the model loads with bidirectional connections enabled.

python

import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

Encode text

Pass plain texts, or [instruction, text] pairs for queries, then compute cosine similarity.

python

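# Assumes `torch` and `l2v` from the loading step above.
# Queries are [instruction, text] pairs; the instruction describes the retrieval task.
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Documents are encoded as plain strings, without an instruction.
documents = [
    "The CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
    "The summit is the highest point of a mountain.",
]
d_reps = l2v.encode(documents)

# L2-normalize both sets so the matrix product below yields cosine similarities.
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
# cos_sim[i][j] is the similarity between query i and document j.
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)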

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Generate query and document embeddings for semantic search and retrieval
Build text classification or clustering pipelines on top of LLM-based embeddings
Measure sentence similarity using cosine similarity between encoded texts
Reuse an existing decoder-only LLM as an encoder instead of adding a separate embedding model
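
For the retrieval use case, similarity scores are typically turned into a ranking of documents per query. The sketch below shows that step in plain Python, with toy 3-dimensional vectors standing in for real embeddings. The function names are hypothetical and not part of the llm2vec package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors: document 1 points in the same direction as the query.
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

Cosine similarity ignores vector length, which is why document 1 scores highest even though it is twice as long as the query vector.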

How LLM2Vec compares

LLM2Vec alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Sentence Transformers	★ 18.8k	The standard Python framework for loading, training, and computing embeddings with sentence and reranking models.
EmbeddingGemma (Gemma)	★ 5.5k	Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search.
Text Embeddings Inference (TEI)	★ 4.9k	Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU.
Infinity (Embeddings)	★ 2.8k	A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API.
ColPali	★ 2.7k	A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first.
Model2Vec	★ 2.1k	A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference.
Instructor Embedding	★ 2k	Instruction-tuned text embedding models that let you tailor embeddings to a task by prepending a natural-language instruction.
LLM2Vec	★ 1.7k	Turn decoder-only LLMs into strong text embedding models
