Overview
Instructor is an instruction-finetuned text embedding model. Instead of training a separate embedder for each job, you write a short natural-language instruction (for example, "Represent the Science title:") and prepend it to your text. The same model then produces embeddings tuned to that task and domain without any extra finetuning.
It is aimed at developers and researchers who need embeddings for several different tasks — retrieval, classification, clustering, or text similarity — and don't want to maintain one model per task. You install a small Python package, load a pretrained checkpoint from Hugging Face, and call a single encode function.
As an embedding model, it slots into the same place as any other sentence encoder in a search or RAG pipeline. The difference is the instruction prefix, which lets you steer the output vectors toward a specific task and subject area.
What it does
- Instruction-conditioned embeddings: prepend a task instruction to steer the output vectors, no finetuning required
- One model covers many tasks — classification, retrieval, clustering, and text evaluation — across domains like science and finance
- Simple Python API: load a checkpoint and call a single encode function
- Pretrained checkpoints published on Hugging Face (for example, hkunlp/instructor-large)
- encode supports batching, progress bars, numpy or PyTorch tensor output, device selection, and optional normalization
- Reports state-of-the-art results across 70 diverse embedding tasks in the accompanying paper
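The "optional normalization" item above can be illustrated in isolation. Normalizing an embedding usually means scaling it to unit Euclidean length, after which a plain dot product equals cosine similarity. The sketch below is a plain-Python illustration of that idea, not the library's implementation. The helper names `l2_normalize` and `dot` are hypothetical, and the toy 2-d vectors stand in for real embeddings. The exact name of the encode option that turns normalization on is not given here, so check the project's documentation.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d vectors standing in for real embeddings (hypothetical values).
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])

# Because both vectors now have unit length, this dot product is the
# cosine similarity of the original, unnormalized vectors.
score = dot(u, v)
```

If embeddings are normalized once at encode time, later similarity scoring needs only dot products, which is cheaper when comparing one query against many stored vectors.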
Getting started
Install the package, load a pretrained checkpoint, then call encode with instruction/text pairs.
Install the package
Install the InstructorEmbedding package from PyPI.
pip install InstructorEmbedding
Load a pretrained model
Import INSTRUCTOR and load a checkpoint from Hugging Face. See the project's model list for other available checkpoints.
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
Encode text with an instruction
Pass a list of [instruction, text] pairs to encode. The result contains one embedding vector per pair, returned as a NumPy array or a PyTorch tensor depending on the output options.
sentence = 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'
instruction = 'Represent the Science title:'
embedding = model.encode([[instruction, sentence]])
print(embedding)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
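The example above encodes a single pair. To embed many texts under the same instruction, the pairs can be built in one step and passed to encode together. The following is a minimal sketch under stated assumptions: `build_pairs` and `embed_texts` are hypothetical helpers, not part of the package, and `encode` is any callable that accepts a list of [instruction, text] pairs, such as the `model.encode` method shown earlier.

```python
from typing import Any, Callable, List, Sequence


def build_pairs(instruction: str, texts: Sequence[str]) -> List[List[str]]:
    """Pair one instruction with every text, in the [instruction, text] shape."""
    return [[instruction, text] for text in texts]


def embed_texts(
    encode: Callable[[List[List[str]]], Any],
    instruction: str,
    texts: Sequence[str],
) -> Any:
    """Encode all texts under a single instruction with the supplied encode callable.

    The return value is whatever `encode` produces; this helper does not
    assume a particular array or tensor type.
    """
    return encode(build_pairs(instruction, texts))
```

With the model loaded as above, a call such as `embed_texts(model.encode, 'Represent the Science title:', titles)` would produce one embedding per entry in `titles`, in the same order as the input.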
When to use it
- Power a retrieval or RAG pipeline by encoding queries and documents with a retrieval-specific instruction
- Generate features for text classification by representing each input with a domain-aware instruction
- Cluster documents (for example, by topic) using embeddings steered toward the clustering task
- Measure semantic similarity between texts with instruction-conditioned embeddings and dot-product or cosine scoring
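For the retrieval and similarity use cases above, scoring comes down to comparing a query vector against document vectors. The sketch below ranks documents by cosine similarity in plain Python. It assumes the embeddings have already been computed and are available as sequences of floats. `cosine_similarity` and `rank_documents` are hypothetical helper names, and the toy vectors in the usage note are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot_product / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scored = [
        (index, cosine_similarity(query_vec, doc_vec))
        for index, doc_vec in enumerate(doc_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_documents([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])` places document 1 first, then document 2, then document 0. For large collections a vector index or a matrix library would normally replace this loop; the pure-Python version only shows the scoring logic.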
How Instructor Embedding compares
Instructor Embedding alongside the other open-source embedding models and inference tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Sentence Transformers | ★ 18.8k | The standard Python framework for loading, training, and computing embeddings with sentence and reranking models. |
| EmbeddingGemma (Gemma) | ★ 5.5k | Google DeepMind's Gemma repo, home to EmbeddingGemma, a 308M multilingual embedding model small enough to run on-device for RAG and semantic search. |
| Text Embeddings Inference (TEI) | ★ 4.9k | Hugging Face's Rust-based server for deploying embedding, reranking, and sequence-classification models with high throughput on GPU or CPU. |
| Infinity (Embeddings) | ★ 2.8k | A high-throughput serving engine for text embeddings, rerankers, CLIP, and ColPali models, exposing an OpenAI-compatible API. |
| ColPali | ★ 2.7k | A vision-language embedding model that indexes whole document page images for retrieval, avoiding the need to parse PDFs into text first. |
| Model2Vec | ★ 2.1k | A tool that distills any sentence transformer into a tiny, fast static embedding model (the Potion models) that runs on CPU without a neural network at inference. |
| Instructor Embedding | ★ 2k | One embedding model for any task: just prepend a natural-language instruction. |
| Qwen3-Embedding | ★ 2k | Alibaba's open embedding and reranking models built on the Qwen3 base, available in 0.6B/4B/8B sizes and covering over 100 languages. |