In plain English
An embedding model is a machine that accepts a sentence and returns a list of numbers. That list — the embedding vector — is a precise location in a high-dimensional space of meaning. Feed in "The kitten slept" and you get something like [0.12, -0.87, 0.41, ...] stretched across 768 or 1,024 slots. Feed in "The cat napped" and you get a very similar list, because the two sentences mean almost the same thing. Feed in "quarterly earnings forecast" and the numbers land somewhere completely different on the map.
Here is the clearest analogy: think of a massive library where every book has a physical address (aisle, shelf, slot). Books on the same topic are shelved near each other. The librarian doesn't memorize which books are related; she just walks to a shelf and grabs whatever is within arm's reach. An embedding model is the system that assigns those addresses. It places cat, kitten, tabby, and feline all within a few steps of each other, and clear on the other side of the library from spreadsheet or mortgage rate.
Unlike a real library, the embedding space has hundreds of dimensions, not three. You can't picture it, but you don't need to. The only question that ever matters is: how close are two addresses? Close means similar in meaning; far means unrelated. Every application built on embeddings — semantic search, RAG pipelines, recommendations — is ultimately just computing that distance over and over again, very quickly.
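To make "how close are two addresses?" concrete, here is a minimal sketch of cosine similarity, the closeness measure used throughout this article. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions and come from a model, not from hand-picked numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example.
kitten = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
earnings = [0.1, 0.0, 0.9]

print("kitten vs cat:     ", round(cosine_similarity(kitten, cat), 3))
print("kitten vs earnings:", round(cosine_similarity(kitten, earnings), 3))
```

The first pair points in nearly the same direction and scores close to 1; the second pair scores much lower.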
Why it matters
Before embedding models existed, computers compared text by counting shared characters or words. A search for puppy training would miss every article that used dog obedience instead — different letters, no match. Embeddings fixed this by translating meaning into geometry: two semantically similar phrases land near each other in vector space, so a distance check finds them even when not a single word overlaps.
For anyone building AI-powered features, text embeddings are the foundation of several essential capabilities:
- Semantic search — rank results by meaning, not keyword overlap, so users find what they're looking for even when they phrase it differently than the document author did.
- RAG (retrieval-augmented generation) — before an LLM can answer a question about your private documents, you need a fast way to pull the right documents. Embedding the question and every document chunk, then finding the nearest vectors, is how that retrieval step works.
- Recommendations — embed every product, article, or video, then surface items whose vectors sit nearest to what a user already liked.
- Clustering and deduplication — group support tickets by topic, spot near-duplicate entries in a database, or tag unlabeled content — all by looking at which vectors clump together.
- Semantic caching — avoid calling an LLM for a question that is nearly identical to one already answered, by checking if the new question's vector is close to a cached question's vector.
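As one example of how these capabilities reduce to a distance check, the semantic caching idea can be sketched as below. The `SemanticCache` class, the 0.9 threshold, and the toy vectors are all assumptions made for illustration; a real system would get its vectors from an embedding model and tune the threshold on its own traffic.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """Hypothetical cache: reuse an answer when a new question is close enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (question_vector, answer)

    def add(self, vector, answer):
        self.entries.append((vector, answer))

    def lookup(self, vector):
        """Return the closest cached answer, or None if nothing clears the threshold."""
        best_score, best_answer = -1.0, None
        for cached_vector, answer in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

cache = SemanticCache(threshold=0.9)
cache.add([0.9, 0.1, 0.0], "Use the 'Forgot password' link on the sign-in page.")

hit = cache.lookup([0.88, 0.15, 0.02])   # nearly the same direction: cache hit
miss = cache.lookup([0.0, 0.2, 0.9])     # unrelated direction: fall through to the LLM
print(hit)
print(miss)
```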
What did embeddings replace? Brittle synonym dictionaries, TF-IDF keyword matching, and hand-curated ontologies that took teams of humans to maintain. One embedding model, trained once, captures all of that structure automatically — and generalizes to new text it has never seen.
How it works: three steps
Every text embedding model processes your input through three stages: tokenization, transformer layers, and pooling. Understanding each stage is what turns the magic box into something predictable and debuggable.
Step 1: Tokenization
The model cannot read characters or spaces directly. It first breaks your text into tokens — chunks that sit between whole words and individual characters. Most modern embedding models use WordPiece (BERT family) or Byte-Pair Encoding (many newer models) to build these chunks. Common English words like the, cat, run typically stay as single tokens. Rare words and compound words get split: tokenization might become token + ##ization in WordPiece notation. The ## prefix marks a continuation piece, not the start of a word.
Two special tokens are injected around the real content. [CLS] is placed at the very start of the sequence; it stands for classification and was originally the slot where BERT stored a summary of the whole input. [SEP] marks the end of a segment. So "The kitten napped." might become [CLS], the, kit, ##ten, napped, ., [SEP] (the exact split depends on the model's vocabulary). Each token is then looked up in the model's fixed vocabulary table, yielding a sequence of integer IDs, one per token.
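The splitting rule can be sketched as a greedy longest-match search over a vocabulary, which is the general approach WordPiece tokenizers take at inference time. The tiny `TOY_VOCAB` below is invented for illustration (real vocabularies hold tens of thousands of entries), and the punctuation handling is deliberately simplistic.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into pieces by greedy longest-match-first."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece, not a word start
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary mapping token -> integer ID.
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "the": 3, "kit": 4,
             "##ten": 5, "napped": 6, ".": 7, "token": 8, "##ization": 9}

def encode(text, vocab):
    """Lowercase, split on whitespace, tokenize each word, wrap in [CLS]/[SEP]."""
    words = text.lower().replace(".", " .").split()
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("The kitten napped.", TOY_VOCAB)
print(tokens)
print(ids)
```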
Step 2: Transformer layers and self-attention
Once the text is a sequence of token IDs, the model converts each ID into a static token vector via a lookup table (the embedding matrix). These initial vectors carry no context — bank in river bank and bank in central bank start out identical. The transformer's job is to fix that.
Each transformer layer runs self-attention: every token looks at every other token in the sequence and computes how much each one should influence its own updated vector. This is done through three learned projections called Query, Key, and Value (Q, K, V). Conceptually, each token asks "which other tokens are relevant to understanding my meaning right now?" and re-weights its vector accordingly. After 12 or 24 such layers (depending on model size), bank in a financial context has a vector that has absorbed signals from interest, rate, and loan, placing it far from the same word used beside river and boat.
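A single attention head can be sketched in a few lines of plain Python. This is a simplified illustration, not a full transformer layer: it omits multiple heads, residual connections, layer normalization, and the feed-forward block, and it uses identity matrices as stand-ins for the learned Q, K, V projections.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over token vectors x (one row per token)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # how relevant is token j to token i?
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # each output row is a weighted mix of values

# Three toy token vectors; identity projections stand in for learned weights.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and token 0 gives more weight to token 2 (which shares a component with it) than to token 1 (which does not).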
Probing studies of transformer models have observed a broad pattern across layers: lower layers tend to capture surface and syntactic structure (word order, parts of speech), middle layers capture phrase-level relationships, and upper layers carry more of the semantic meaning. Embedding models are pre-trained on huge text corpora to develop these representations, then fine-tuned, often with contrastive learning, to make the final vectors useful for similarity tasks specifically.
Step 3: Pooling
After the transformer finishes, you have one vector per token — a whole matrix, not the single vector you need. Pooling collapses that matrix into one fixed-length vector representing the entire input.
The two most common pooling strategies are:
- Mean pooling: average all token vectors dimension by dimension, usually skipping padding tokens. This is what Sentence-BERT (SBERT), E5, and many other embedding models use. Averaging lets every token contribute to the final representation.
- CLS pooling: use only the [CLS] token's final-layer vector, since it was designed as a whole-sequence summary slot. Vanilla BERT exposed this slot for classification, and some embedding models (the BGE family, for example) are fine-tuned to use it. Without that fine-tuning, mean pooling generally produces better sentence embeddings.
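The two strategies can be compared side by side with a small sketch. The token vectors and the attention mask below are made-up values; the mask convention (1 for real tokens, 0 for padding) is the common one, assumed here for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors dimension by dimension, skipping padding (mask == 0)."""
    dims = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]

def cls_pool(token_vectors):
    """Use only the first position, which holds the [CLS] token."""
    return list(token_vectors[0])

# Hypothetical final-layer outputs, one row per token.
token_vectors = [
    [0.2, 0.4],  # [CLS]
    [1.0, 0.0],  # the
    [0.0, 2.0],  # kitten
    [9.0, 9.0],  # [PAD], must not influence the mean
]
attention_mask = [1, 1, 1, 0]

mean_vector = mean_pool(token_vectors, attention_mask)
cls_vector = cls_pool(token_vectors)
print([round(v, 3) for v in mean_vector])
print(cls_vector)
```

Forgetting the mask is a common bug: padding rows would otherwise drag the average toward whatever values the padding positions happen to hold.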
After pooling you often see a final L2 normalization step: the vector is scaled so its length equals exactly 1. For unit-length vectors the dot product and cosine similarity are the same number, so you can skip the norm computation in the cosine formula and use a plain dot product, which is cheaper and matters at scale.
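A short sketch of that normalization step, using two hand-picked vectors, shows that the dot product of the normalized vectors matches the cosine similarity of the originals.

```python
import math

def l2_normalize(v):
    """Scale v so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(round(math.sqrt(dot(a_unit, a_unit)), 6))  # length after normalization
print(round(dot(a_unit, b_unit), 6))             # dot product of unit vectors
print(round(cosine(a, b), 6))                    # cosine of the raw vectors
```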
Why 'cat' and 'kitten' end up as neighbors
The transformer's learned weights determine where vectors land, and those weights were shaped by how words and sentences are used together across billions of examples. During pre-training on large text corpora, the model sees cat and kitten appearing in nearly identical sentence patterns: "The kitten meowed" and "The cat meowed", "my kitten is playful" and "my cat is playful". Self-attention links them to the same surrounding context, and training nudges their vectors closer together.
Fine-tuning on contrastive pairs sharpens this further. A contrastive training example might say: the question "How do I care for a young cat?" and the document "Kitten feeding schedule and veterinary tips" should be close; both of them should be far from "How do I replace a bicycle tyre?". By seeing millions of such pairs, the model learns to place semantically equivalent sentences at near-identical addresses — even when they share no words at all.
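One common way to turn "these should be close, those should be far" into a number the optimizer can minimize is an InfoNCE-style loss. The sketch below is an assumption about the general shape of such an objective, not a description of how any particular model was trained; the vectors and the temperature of 0.05 are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Negative log-probability of picking the positive among all candidates.

    Assumes every vector is already unit length, so dot product = cosine.
    """
    sims = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in sims]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]

query = [1.0, 0.0]        # "How do I care for a young cat?"
near_query = [0.8, 0.6]   # a vector close to the query
far_query = [0.0, 1.0]    # a vector far from the query

# Well-placed vectors: the kitten document is near, the bicycle document is far.
loss_good = info_nce_loss(query, positive=near_query, negatives=[far_query])
# Badly placed vectors: the roles are swapped, so the loss is large.
loss_bad = info_nce_loss(query, positive=far_query, negatives=[near_query])

print(f"{loss_good:.6f}")
print(f"{loss_bad:.6f}")
```

Training repeatedly adjusts the model's weights in the direction that lowers this loss, which is what pulls matching pairs together and pushes mismatched ones apart.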
The result is measurable. A well-trained embedding model typically gives cat and kitten a high cosine similarity and cat and mortgage a much lower one; the exact values depend on the model, so treat the figures in the table below as illustrative rather than guaranteed. This is why semantic search built on embeddings finds "puppy training guide" when a user types "how to train a dog": the vector for the query lands close to the vector for the document, even though no word matches exactly.
| Pair | Expected cosine similarity | Why |
|---|---|---|
| "cat" vs "kitten" | ~0.87 | Near-synonyms trained on identical contexts |
| "How do I reset my password?" vs "I forgot my login" | ~0.78 | Same intent, different phrasing — intent survives paraphrase |
| "The bank raised interest rates" vs "I sat by the river bank" | ~0.35 | Polysemy: context shifts the vector for 'bank' in each case |
| "cat" vs "mortgage rate" | ~0.08 | Unrelated topics land far apart |
Try it yourself in Python
You can run a complete embedding pipeline locally in minutes. The open-source Sentence Transformers library wraps the tokenizer, transformer, and pooling step into a single encode() call. The small all-MiniLM-L6-v2 model (about 22 million parameters) runs on CPU and produces 384-dimensional vectors:
```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The kitten napped on the warm windowsill.",
    "A young cat was sleeping in a sunny spot.",
    "Q3 earnings beat analyst expectations by 12%.",
]

# encode() runs tokenization + transformer + mean-pooling internally
vectors = model.encode(sentences, normalize_embeddings=True)
print(f"Vector shape: {vectors.shape}")  # (3, 384)

# Cosine similarity (= dot product for normalized vectors)
scores = util.cos_sim(vectors[0], vectors)
for sent, score in zip(sentences, scores[0]):
    print(f"{score:.3f} {sent}")

# 1.000 The kitten napped on the warm windowsill. <- itself
# 0.832 A young cat was sleeping in a sunny spot. <- near-synonym
# 0.071 Q3 earnings beat analyst expectations... <- unrelated
```

Notice the second sentence scores 0.832 even though it shares almost no words with the first. The model has learned that kitten/cat, napped/sleeping, and warm windowsill/sunny spot all pull in the same direction in vector space, so the aggregated vectors land very close together. The financial sentence ends up on a completely different part of the map.
For production use you would typically call a hosted embedding API rather than running the model yourself. The call pattern is identical — send text, receive a vector — but the computation is offloaded. OpenAI's text-embedding-3-small produces 1,536-dimensional vectors; Cohere Embed v4 and Google Gemini Embedding are strong alternatives with good multilingual coverage. At scale those per-token API costs add up, which is why many teams deploy open-source models like BGE-M3 or nomic-embed-text on their own infrastructure.
Popular embedding models and what their dimensions mean
The number of dimensions in an embedding vector is one of its most consequential properties. More dimensions allow the model to capture finer distinctions between meanings, but every dimension adds storage cost and slows similarity search. Here is a snapshot of widely-used models as of mid-2026:
| Model | Dimensions | Max tokens | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | 8,191 | Supports Matryoshka truncation to 512 |
| OpenAI text-embedding-3-large | 3,072 | 8,191 | Highest OpenAI accuracy; supports truncation |
| Cohere embed-v4 | 1,536 (default; 256, 512, and 1,024 also available) | 128,000 | Strong multilingual and long-context retrieval |
| BAAI/BGE-M3 | 1,024 | 8,192 | 100+ languages; multi-retrieval modes |
| nomic-embed-text | 768 | 8,192 | Open-source; MoE architecture in v2 |
| all-MiniLM-L6-v2 | 384 | 256 | Tiny and fast; great for local experimentation |
Matryoshka representation learning (MRL) is a training technique used by OpenAI and others that makes it safe to truncate vectors. Instead of spreading information evenly across dimensions, the model front-loads the most important information into the first ones. This means a 3,072-dimensional vector can be cut to its first 512 dimensions, trading a small accuracy loss for a large speed and storage gain, which is very useful at billion-vector scale. Stored vectors and query vectors must be truncated to the same length, and truncated vectors should be re-normalized before dot-product comparison.
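Truncating a vector by hand is a two-step operation: keep the leading dimensions, then rescale to unit length. The sketch below uses a made-up 8-dimensional vector; the helper name is hypothetical, and the result is only meaningful for models that were actually trained with MRL.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` values, then rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Hypothetical full-size embedding (8 dimensions instead of thousands).
full = [0.50, 0.40, 0.30, 0.20, 0.05, 0.04, 0.03, 0.02]

short = truncate_and_renormalize(full, 4)
print(len(short))
print(round(math.sqrt(sum(x * x for x in short)), 6))
```

Some hosted APIs can return shortened vectors directly, in which case this step happens on the provider's side.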
Model performance is tracked on the MTEB (Massive Text Embedding Benchmark) leaderboard, which evaluates models across dozens of retrieval, clustering, and classification tasks in multiple languages. Before choosing a model, check the MTEB scores for the task type and language closest to your use case — headline accuracy numbers can be misleading if they come from a task distribution that doesn't match yours.
Going deeper
Bi-encoders vs. cross-encoders. The architecture described above — embed query once, embed every document once, compare vectors — is a bi-encoder. It is fast because you pre-compute and store document vectors, then only embed the query at runtime. A cross-encoder is different: it takes a (query, document) pair as a single input and scores them jointly, allowing each token to attend to every token in both texts at once. Cross-encoders are much more accurate but too slow to scan millions of documents — they are used as rerankers after a bi-encoder retrieves the top 50–200 candidates. Most serious retrieval pipelines use both: bi-encoder for speed, cross-encoder for precision.
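The two-stage pipeline can be sketched as follows. The document vectors are invented, and `toy_rerank_score` is only a placeholder that counts shared words; a real cross-encoder is a neural model that reads the query and document together.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, query_text, docs, rerank_score, top_n=2):
    """docs is a list of (text, precomputed_vector) pairs."""
    # Stage 1 (bi-encoder): cheap vector comparison against every stored vector.
    candidates = sorted(docs, key=lambda d: dot(query_vec, d[1]), reverse=True)[:top_n]
    # Stage 2 (reranker): expensive joint scoring, only on the survivors.
    return sorted(candidates, key=lambda d: rerank_score(query_text, d[0]), reverse=True)

def toy_rerank_score(query, doc):
    """Placeholder for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)

docs = [
    ("kitten feeding schedule", [0.9, 0.1]),
    ("adult cat feeding tips", [0.8, 0.2]),
    ("mortgage rate forecast", [0.0, 1.0]),
]

results = retrieve_then_rerank([1.0, 0.0], "cat feeding tips", docs, toy_rerank_score)
for text, _ in results:
    print(text)
```

The first stage drops the mortgage document without ever running the expensive scorer on it; the second stage reorders the two survivors.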
Dense vs. sparse embeddings. The vectors described in this article are dense — every dimension carries a fraction of the meaning. Classical information retrieval used sparse vectors (BM25 / TF-IDF), where each dimension represents one vocabulary word and most dimensions are zero. Dense models dominate paraphrase and intent matching; sparse models win on exact keyword and named-entity matching. Modern production systems often run hybrid search — both pipelines in parallel — then merge the ranked lists with a fusion algorithm like Reciprocal Rank Fusion. The best retrieval quality almost always comes from the hybrid approach, not either alone.
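Reciprocal Rank Fusion is simple enough to show in full: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k = 60 is a commonly used default, assumed here; the document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_a", "doc_b", "doc_c"]   # from the embedding search
sparse_results = ["doc_c", "doc_a", "doc_d"]  # from BM25 / keyword search

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, and the method needs only ranks, so the two pipelines' raw scores never have to be put on a common scale.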
Exact vs. approximate search. Once you have millions of vectors, comparing a query against every stored vector one at a time (exact kNN) is too slow. Real systems use approximate nearest neighbor (ANN) algorithms that index the space so they only check a small fraction of candidates. The most common index types are HNSW (hierarchical navigable small-world graphs) and IVF (inverted file index, often combined with product quantization). HNSW delivers fast queries at high recall but is memory-hungry; IVF is more memory-efficient and suits larger corpora. These indexes are the core offering of vector databases, which handle this tradeoff for you.
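The IVF idea can be illustrated with a toy index: group vectors into buckets around centroids, then search only the buckets nearest the query. In this sketch the centroids are fixed by hand, which is an assumption for brevity; real implementations learn them with k-means and usually compress the stored vectors as well.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector ID to the bucket of its most similar centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, n_probe=1, top_k=1):
    """Scan only the n_probe buckets whose centroids best match the query."""
    by_match = sorted(range(len(centroids)),
                      key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [vid for c in by_match[:n_probe] for vid in buckets[c]]
    ranked = sorted(candidates, key=lambda vid: dot(query, vectors[vid]), reverse=True)
    return ranked[:top_k], len(candidates)

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked; normally learned by k-means

buckets = build_ivf(vectors, centroids)
top_ids, checked = ivf_search([0.95, 0.05], vectors, centroids, buckets)
print(buckets)
print(top_ids, "found after checking", checked, "of", len(vectors), "vectors")
```

Raising `n_probe` checks more buckets, which improves recall at the cost of speed; that knob is the approximate-search tradeoff in miniature.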
Fine-tuning for your domain. A general-purpose embedding model was trained on web text and may place domain-specific jargon imprecisely. If your documents use specialized vocabulary — medical records, legal contracts, internal product names — fine-tuning on a small set of domain-specific positive and negative pairs can dramatically improve retrieval accuracy. Libraries like Sentence Transformers make it straightforward to continue training a pre-trained model on your own triplet data.
Embeddings encode training-data biases. Because vectors are learned from human-generated text, they absorb the associations present in that data. Models trained predominantly on English may handle other languages poorly. Models trained on web text may encode social stereotypes in their geometry — placing gender-coded terms near career terms in ways that reflect historical bias rather than desired behavior. Awareness of this is increasingly important in applications that use embeddings to make decisions about people.
Bi-encoder:
- Query and documents encoded separately
- Document vectors computed once, stored
- Query vector computed at runtime only
- Sub-millisecond nearest-neighbor lookup
- Slight accuracy loss vs cross-encoder
Cross-encoder:
- Query + document fed as one input
- Full mutual attention between both
- Cannot be precomputed; scores at query time
- Much slower; only viable on top-N candidates
- Higher accuracy for final ranking
FAQ
What is the difference between tokenization and embedding?
Tokenization is the step that splits raw text into integer IDs (tokens) the model can process. Embedding is the broader process of converting those tokens — and ultimately the whole sentence — into a floating-point vector that captures meaning. Tokenization happens first and is just a lookup; embedding is what the transformer layers do on top of that.
Why does the same word get different vectors in different sentences?
Because embedding models use contextual representations. After self-attention, each token's vector is influenced by all the other tokens around it. The word bank in "central bank rate" will end up with a very different vector than bank in "fishing on the river bank" — the surrounding words pull the representation toward the relevant meaning.
What does pooling do and why is mean pooling better than CLS pooling?
Pooling collapses the per-token matrix the transformer outputs into a single fixed-length vector for the whole input. Mean pooling averages all token vectors, letting every word contribute to the final representation. CLS pooling uses only the [CLS] token's vector. On a BERT model that has not been fine-tuned for similarity, mean pooling tends to give more accurate sentence embeddings, which is why Sentence-BERT and many later models default to it. The gap is not universal: some models, such as the BGE family, are fine-tuned specifically for CLS pooling, so always use the pooling method stated on the model card.
How many dimensions does a text embedding have?
It depends on the model. Small, fast local models like all-MiniLM-L6-v2 use 384 dimensions. Mid-range models like nomic-embed-text and BGE-M3 use 768 to 1,024. Large hosted models like OpenAI text-embedding-3-large use 3,072. Higher dimensions can capture finer distinctions but cost more to store and search, especially at scale.
Can I use embeddings from one model to search embeddings from a different model?
No. Each model maps text to its own unique vector space, and those spaces are incompatible. Comparing a vector from OpenAI's model with a vector from a BERT-based model produces meaningless distances. You must use the same model — and the same model version — for both the stored embeddings and the query.
What is the maximum text length an embedding model can handle?
Each model has a token limit, typically stated in its documentation. Older BERT-style models cap at 512 tokens (roughly 380 words). Newer models like BGE-M3 and nomic-embed-text support up to 8,192 tokens, and Cohere Embed v4 is listed at 128,000. Text exceeding the limit is usually truncated, often silently, which is why long documents are split into chunks before embedding in RAG systems.
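A minimal chunking sketch with overlapping windows is shown below. Whitespace-separated words stand in for model tokens here, which is an assumption made to keep the example dependency-free; a real pipeline should count tokens with the embedding model's own tokenizer, and the helper name is hypothetical.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split tokens into windows of at most max_tokens, sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, max_tokens=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary from being cut off from its context in both neighboring chunks.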