In plain English
An embedding is a list of numbers that stands in for a piece of meaning. Feed an embedding model a word, a sentence, a photo, or a snippet of code, and it hands back a fixed-length list of numbers — say 384, 768, or 1,536 of them. That list is the embedding, also called a vector. The whole point: two things that mean something similar get lists of numbers that are close together, and two things that mean different things get lists that are far apart.
Here is the everyday analogy. Imagine a giant map where every possible idea has an address. Coffee shops cluster in one neighborhood, hardware stores in another, jazz clubs in a third. Streets and avenues are just numbers — a latitude and a longitude. On this map, cappuccino lands right next to latte, a couple of blocks from espresso, and clear across town from wrench. An embedding is that address: a precise location in a space of meaning. The model's only job is to put related things at nearby addresses.
A real map has two numbers per address. An embedding has hundreds — it lives in a high-dimensional space your eyes can't picture. That's fine. You never need to look at the raw numbers. You only ever ask one question of two embeddings: how close are they? Close means similar in meaning. Far means unrelated. Everything embeddings are used for — search, recommendations, Retrieval-Augmented Generation — is built on that single comparison.
Why it matters
For decades, computers compared text by matching characters. Search for the word car and you'd miss every document that said automobile, vehicle, or sedan: different letters, so no match. That keyword approach is blind to meaning. Embeddings fix exactly that. Because car and automobile land at nearly the same address in meaning-space, a search built on embeddings can find the right document even when not a single word overlaps. This is the engine behind semantic search.
Embeddings matter most right now because they are half of how you give a large language model knowledge it wasn't trained on. An LLM can't read your private documents: they weren't in its training data, and a large collection won't fit in one prompt. The fix is RAG: embed all your documents once, embed the user's question at query time, and pull back the handful of chunks whose embeddings sit closest to the question. Those chunks get pasted into the prompt. Without embeddings, there is no efficient way to ask "which of my ten thousand documents is this question about?"
Who should care?
- Anyone building a chatbot over their own data. RAG, the standard pattern, runs on embeddings start to finish.
- Anyone building search or recommendations. "More like this", "related products", and "find duplicate support tickets" are all nearest-neighbor lookups in embedding space.
- Anyone doing classification or clustering at scale. Embeddings turn messy text into tidy numeric features that ordinary algorithms can sort, group, and label.
- Anyone cutting LLM costs. Semantic caching reuses a stored answer when a new question is close enough to an old one — measured by embeddings.
What did embeddings replace? Brittle keyword search, hand-built synonym lists, and elaborate rule systems that tried to guess when two phrases meant the same thing. One model, trained once, now captures that similarity for free.
How it works
An embedding model is a neural network trained for one purpose: read an input and output a vector whose position encodes meaning. Modern text embedding models are usually transformers, the same family of architecture behind chat models, but with a pooling step in place of the text-generating head. The network reads your whole input, and the pooling step (often a simple average of the per-token outputs) compresses the result into a single fixed-length vector.
How meaning gets baked in
The model isn't told what words mean. It learns from examples of what goes with what. A common training recipe shows the model pairs that should be close — a question and its correct answer, a sentence and its paraphrase, two captions of the same photo — and pairs that should be far apart. The model nudges its internal weights so the "should be close" pairs land near each other and the rest drift away. Repeat across billions of pairs and the model generalizes: it places any new input at an address consistent with everything it saw. This is the same statistical idea behind the original word2vec embeddings, scaled up to whole sentences.
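The "pull close pairs together, push the rest apart" recipe can be written down as a loss function. The sketch below uses an InfoNCE-style contrastive loss, which is one common formulation and an assumption here, since the text above doesn't name a specific loss. The function name, the `temperature` value, and the toy two-number vectors are all illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed formulation, not a specific model's recipe).

    Small when the anchor is more similar to the positive than to any
    negative. Vectors are assumed unit-length, so dot product == cosine.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Softmax cross-entropy with the positive at index 0,
    # computed with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Toy unit-length vectors standing in for model outputs.
question = [1.0, 0.0]
good_answer = [0.8, 0.6]
unrelated = [0.0, 1.0]

# Low loss: the correct answer already sits near the question.
well_placed = contrastive_loss(question, good_answer, [unrelated])
# High loss: the "correct" pair is far apart, so training would move it.
badly_placed = contrastive_loss(question, unrelated, [good_answer])
print(well_placed < badly_placed)
```

Training amounts to adjusting the model's weights so this number shrinks across a very large set of pairs.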
Measuring closeness
Once you have two vectors, you need a number for how close they are. The standard choice for text is cosine similarity: it measures the angle between the two vectors, ignoring their length. The result runs from 1.0 (pointing the same way, so nearly identical meaning) through 0.0 (perpendicular, so unrelated) toward -1.0 (opposite). In practice, the range of scores you actually see depends on the model, so check a few known-similar and known-unrelated pairs from your own data before you pick a threshold for your task. Some models are trained for dot-product or Euclidean (straight-line) distance instead, so always check the model card for which one to use.
| Pair of sentences | Roughly expect | Why |
|---|---|---|
| "How do I reset my password?" vs "I forgot my login" | high (~0.7+) | Same intent, different words — embeddings see past the vocabulary. |
| "How do I reset my password?" vs "What's the weather today?" | low (~0.1) | Unrelated topics land far apart in meaning-space. |
| "The bank raised rates" vs "I sat on the river bank" | low–medium | Good models use context to tell the two senses of bank apart. |
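Cosine similarity itself is only a few lines of arithmetic. Here is a minimal sketch in plain Python using made-up two-number vectors rather than real embeddings, to show the three landmark values described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (length is ignored)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: score is (very close to) 1.0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular: score is 0.0
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: score is (very close to) -1.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Real embeddings work the same way, only with hundreds of numbers per vector instead of two.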
See it yourself in Python
You don't need an API key or a GPU to feel how this works. The open-source Sentence Transformers library downloads a small embedding model and runs it on your laptop. Here it turns four sentences into vectors and ranks them by similarity:
```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A small, fast, free model that runs locally
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What time does the store open?",
    "The cat sat on the warm windowsill.",
]

# Each sentence becomes a 384-number vector
vectors = model.encode(sentences)
print(vectors.shape)  # (4, 384)

# Cosine similarity of sentence 0 against all four
query = vectors[0]
for sentence, vec in zip(sentences, vectors):
    score = util.cos_sim(query, vec).item()
    print(f"{score:.2f}  {sentence}")

# Example output (exact scores can differ slightly between versions):
# 1.00  How do I reset my password?          <- itself
# 0.74  I forgot my login credentials.       <- same intent, no shared words
# 0.18  What time does the store open?
# 0.07  The cat sat on the warm windowsill.
```

Notice the second sentence scores high even though it shares almost no words with the first (reset/forgot, password/credentials). That gap between keyword overlap and meaning overlap is exactly what embeddings close. To turn this into a real search system, you'd store all your document vectors in a vector database and let it find the nearest neighbors for you in milliseconds, even across millions of documents.
Where embeddings show up
Once you can turn anything into a vector and measure closeness, a surprising number of problems collapse into the same shape: embed everything, then find what's near.
- Semantic search — rank documents by how close their vectors sit to the query's vector instead of by keyword match.
- RAG — retrieve the document chunks closest to a user's question and hand them to an LLM as context. Embeddings are the retrieval half of retrieval-augmented generation.
- Recommendations — "users who liked this also liked…" becomes nearest-neighbor lookup in an embedding space of items.
- Clustering & deduplication — group support tickets by topic, or flag two articles that say the same thing, by spotting vectors that bunch together.
- Classification — embed your examples, embed a new item, and label it by whichever known group it lands nearest to — often without training a classifier at all.
- Multimodal matching — models that embed text and images into the same space let you search photos with a text query, because both land at comparable addresses.
The same primitive underpins all of it. That's why embeddings are foundational infrastructure for building AI apps: get the meaning-into-vectors step right and a dozen downstream features fall out of it.
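The "embed everything, then find what's near" shape can be sketched as a brute-force nearest-neighbor lookup. The three-number vectors below are invented stand-ins for real embeddings, and the `nearest` helper is a hypothetical name; a real system would get its vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest(query_vec, index, k=2):
    """Return the k labels whose vectors are most similar to query_vec."""
    ranked = sorted(
        index,
        key=lambda label: cosine_similarity(query_vec, index[label]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors: label -> pretend embedding
index = {
    "reset your password": [0.9, 0.1, 0.0],
    "recover a locked account": [0.7, 0.4, 0.1],
    "store opening hours": [0.1, 0.9, 0.2],
    "cat care tips": [0.0, 0.2, 0.9],
}

# Pretend embedding of the query "I forgot my login"
query_vec = [0.85, 0.2, 0.05]
top_two = nearest(query_vec, index, k=2)
print(top_two)
```

Swap the labels for document chunks and this is the retrieval step of RAG; swap them for products and it is a "more like this" feature.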
Common pitfalls
- Mixing models. Vectors from different models live in different, incompatible spaces. If you embed your documents with one model and your queries with another, the distances are meaningless. Use the exact same model (and version) on both sides, and re-embed everything if you switch.
- Forgetting embeddings have no memory. An embedding model reads one input at a time with no awareness of your other inputs. It is not a chat model — it doesn't reason, follow instructions, or answer questions. It only places text in space.
- Ignoring the context limit. Each model has a maximum input length (often 512 tokens for small local models). Text past the limit is silently truncated, so a long document gets embedded from just its opening. This is why RAG pipelines do chunking — splitting documents into model-sized pieces before embedding.
- Embedding the wrong thing. Vectors capture overall meaning, not exact details. Two product specs that differ only by a model number can embed almost identically. For precise-match needs (IDs, dates, prices), pair embeddings with traditional filters rather than relying on similarity alone.
- Skipping the cost math. Re-embedding your entire corpus every time content changes adds up. Embed once, store the vectors, and only re-embed what actually changed.
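The chunking step mentioned under the context-limit pitfall can be sketched as a sliding window. This version counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: a real pipeline would count tokens with the embedding model's own tokenizer. The function name and the overlap parameter are illustrative.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into overlapping windows of at most max_words words.

    Words approximate tokens here; the overlap keeps a sentence that
    straddles a boundary visible in both neighboring chunks.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with a 10-word "document"
document = " ".join(f"word{i}" for i in range(10))
chunks = chunk_words(document, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk is then embedded and stored separately, so no part of a long document is lost to truncation.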
Going deeper
Dimensions are a real trade-off. A 1,536-dimension vector captures finer distinctions than a 384-dimension one, but it costs four times the storage and slows every distance calculation. At scale — tens of millions of vectors — that difference decides your hardware bill. Some newer models support Matryoshka representation learning, where a single vector is trained so its first N dimensions still work on their own. You store the full vector but can truncate it on the fly to trade a little accuracy for a lot of speed.
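To make the storage trade-off concrete, here is a back-of-the-envelope sketch. It assumes uncompressed 32-bit floats and ignores index overhead, so treat the numbers as a lower bound. The truncation helper shows the Matryoshka idea of keeping only the first N dimensions; it is only meaningful for models trained that way, and the helper name is hypothetical.

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32


def storage_gb(num_vectors, dimensions):
    """Raw vector storage in gigabytes (10^9 bytes), excluding any index."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9


def truncate_and_renormalize(vector, keep):
    """Keep the first `keep` dimensions and rescale to unit length."""
    head = vector[:keep]
    length = math.sqrt(sum(x * x for x in head))
    return [x / length for x in head]


full_gb = storage_gb(10_000_000, 1536)
small_gb = storage_gb(10_000_000, 384)
print(f"{full_gb:.2f} GB vs {small_gb:.2f} GB")

# A toy 4-number vector cut down to its first 2 dimensions
short_vector = truncate_and_renormalize([0.6, 0.0, 0.8, 0.0], keep=2)
print(short_vector)
```

Re-normalizing after truncation matters when the search uses dot products, because a shortened vector no longer has length 1.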
Normalization and the cosine/dot-product equivalence. Many libraries return unit-length (L2-normalized) vectors — scaled so their length is exactly 1. For normalized vectors, cosine similarity and the dot product give identical rankings, which lets vector databases use the faster dot-product math. If you ever compute similarity by hand, check whether your vectors are normalized; mixing normalized and raw vectors silently corrupts your scores.
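A quick check of the equivalence, using toy two-number vectors: once both vectors are scaled to length 1, the plain dot product gives the same value as cosine similarity, while the dot product of the raw vectors does not.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length (L2 normalization)."""
    length = norm(a)
    return [x / length for x in a]


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


raw_a = [3.0, 4.0]
raw_b = [4.0, 3.0]

cosine = cosine_similarity(raw_a, raw_b)
dot_of_units = dot(normalize(raw_a), normalize(raw_b))
dot_of_raw = dot(raw_a, raw_b)

# The first two agree; the third is on a completely different scale.
print(round(cosine, 6), round(dot_of_units, 6), dot_of_raw)
```

That mismatch on the raw vectors is the silent corruption to watch for when some of your vectors are normalized and others are not.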
Exact nearest-neighbor search doesn't scale, so production cheats. Comparing a query against every stored vector is fine for thousands of items but too slow and costly for hundreds of millions. Real systems use approximate nearest neighbor (ANN) indexes (HNSW graphs and IVF clustering are the common ones) that trade a sliver of accuracy for massive speed by not checking every vector. This indexing is the core job of a vector database, and it shapes the decision of which one to choose.
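The IVF idea can be shown with a toy sketch: group vectors into buckets around a few centroids, then search only the bucket or buckets nearest the query. Everything here is simplified and assumed for illustration. The centroids are picked by hand (real indexes learn them, typically with k-means), distance is plain Euclidean, and the function names are hypothetical.

```python
import math


def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        closest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        buckets[closest].append(vec_id)
    return buckets


def ivf_search(query, vectors, centroids, buckets, n_probe=1):
    """Compare the query only against vectors in the n_probe nearest buckets."""
    probe = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vec_id for i in probe[:n_probe] for vec_id in buckets[i]]
    # Returns None if the probed buckets are empty.
    return min(candidates, key=lambda v: math.dist(query, vectors[v]), default=None)


# Hand-picked centroids and toy 2-D vectors
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [1.0, 1.0],
    "b": [0.0, 2.0],
    "c": [9.0, 10.0],
    "d": [11.0, 9.0],
}

buckets = build_ivf(vectors, centroids)
result = ivf_search([8.0, 9.0], vectors, centroids, buckets, n_probe=1)
print(buckets, result)
```

With `n_probe=1` the query is compared against two vectors instead of four. The accuracy cost appears when the true nearest neighbor sits just across a bucket boundary, which is why raising `n_probe` trades speed back for recall.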
Dense vs. sparse, and hybrid search. The embeddings described here are dense: every dimension carries a fraction of the meaning. The old keyword world used sparse vectors (one dimension per vocabulary word, mostly zeros), scored with ranking functions like BM25. Each wins different cases: dense nails paraphrase and intent; sparse nails exact terms, names, and codes. The strongest retrieval systems run hybrid search, meaning both at once, then merge the rankings, often with a reranker on top. Embeddings rarely work alone in serious production setups.
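One simple way to merge a dense ranking with a sparse one is reciprocal rank fusion (RRF). The text above doesn't say which merge method to use, so RRF is an assumption here, chosen because it needs only rank positions, not comparable scores. The document ids are placeholders, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one.

    Each list contributes 1 / (k + rank) per document, so items ranked
    well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder results from a dense (embedding) and a sparse (keyword) search
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents found by both retrievers (`doc_a` and `doc_b`) outrank those found by only one, which is the behavior hybrid search is after.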
Embeddings inherit their training data's blind spots. A model trained mostly on English handles other languages worse; one trained on web text may encode social biases in its geometry; one that never saw legal or medical jargon places those terms imprecisely. There is no single "best" embedding model — the right pick depends on your language, domain, length, and budget, which is why public leaderboards like MTEB exist to compare them. Choosing the embedding model is as consequential as choosing the LLM it feeds.
FAQ
What is an embedding in simple terms?
It's a list of numbers that represents the meaning of something — a word, sentence, image, or piece of code. Things that mean similar things get number-lists that are close together, so a computer can compare meaning by comparing numbers.
What is the difference between an embedding and a vector?
In this context they're used interchangeably. A vector is just a list of numbers; an embedding is a vector that a model produced to capture the meaning of some input. Every embedding is a vector, but not every vector is an embedding.
How do embeddings capture meaning if they're just numbers?
The model is trained on millions of examples of what goes with what — questions with their answers, sentences with their paraphrases. It learns to place related inputs at nearby positions, so the numbers end up encoding meaning by where they sit relative to everything else.
What's the difference between an embedding model and an LLM?
An LLM reads text and writes text — it reasons, follows instructions, and answers. An embedding model reads text and outputs a single vector — it doesn't generate anything or follow instructions. They're often used together: embeddings retrieve the right context, then the LLM writes the answer.
How big is an embedding vector?
It depends on the model. Small local models often output 384 or 768 numbers; larger hosted models output 1,024, 1,536, or more. Higher dimensions can capture finer distinctions but cost more to store and search.
Do I need a vector database to use embeddings?
Not for a few thousand items — you can compare vectors in plain Python. But once you have hundreds of thousands or millions, you need a vector database with an approximate-nearest-neighbor index to search them fast enough.