In plain English
Imagine you have a library with a million books and you want the five most similar to the one in your hand. The exact approach: compare your book to every other book, one by one. Accurate, but slow — and once your library grows to a billion books, it becomes completely unusable at real-time speed.
Approximate nearest neighbor (ANN) search solves this with a clever shortcut: instead of checking every book, it uses a pre-built index that tells you which neighborhood of the library to walk into. You skip 99% of the shelves, land near the right section, and find books that are almost certainly the closest matches — in milliseconds instead of minutes. The price is a tiny risk of missing one or two ideal matches, but in practice that tradeoff is almost always worth it.
In the world of AI, "books" are embeddings — dense vectors of numbers that encode the meaning of text, images, or audio. When a vector database runs a semantic search or a recommendation query, ANN search is the engine underneath it. It's what lets a system with 100 million product embeddings return the ten most relevant results in under 20 milliseconds.
Why it matters
Modern AI applications embed everything — every product, every document, every user action — and then search those embeddings millions of times per day. Exact search simply doesn't scale to that world.
A single 1,536-dimension OpenAI embedding takes about 6 KB. A corpus of 10 million documents holds 60 GB of vectors. An exact search computes the distance from your query to each of the 10 million vectors — roughly 10 million floating-point dot products — for every single query. At any meaningful request rate, you'd need racks of hardware just to keep up. With an ANN index, the same search visits only a few thousand vectors, and a mid-range server handles hundreds of queries per second.
Who needs to care:
- RAG (retrieval-augmented generation) builders. The retrieval step — finding the right context chunks for your LLM — runs an ANN query on every user message. Latency here adds directly to your chatbot's response time.
- Recommendation system engineers. "Users like you also bought" is an ANN lookup against millions of user-preference vectors.
- Image and multimodal search teams. Reverse image search, visual similarity, and content moderation all search embedding spaces at scale.
- Vector database users. Every managed vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector, etc.) uses ANN indexes internally. Knowing the knobs means tuning them for your workload instead of accepting defaults.
The core tradeoff is recall vs. latency. Recall@10 means: of the true 10 nearest neighbors, what fraction did the ANN search actually return? A recall of 0.95 means you got 9 or 10 of the real top-10 results in a fraction of the time it would take to guarantee all 10. For most applications, 0.90–0.95 recall is indistinguishable in quality from perfect recall — and the latency difference is the difference between a usable and an unusable product.
How it works
There are two dominant ANN algorithm families used in production vector databases today: HNSW (graph-based) and IVF (cluster-based). They take very different approaches to the same problem.
HNSW — Hierarchical Navigable Small World
HNSW builds a multi-layer proximity graph over your vectors. Think of it like a highway system: the top layer has a few nodes connected by very long-range links (the expressways), and each lower layer adds more nodes with shorter connections, until the bottom layer contains every vector with dense local links. To find nearest neighbors, you enter at the top, greedily hop toward your query's neighborhood on the fast highway, descend to the next layer, refine your position, and keep descending until the dense bottom layer — where you do a final local sweep.
HNSW has three key parameters you'll encounter in every vector database:
- M — the number of bidirectional links each node keeps in the graph. Higher M improves recall and handles high-dimensional data better, but the index uses more RAM and builds more slowly. Typical range: 8–64.
- efConstruction — controls candidate list size during index build. Higher values produce a better-connected graph (higher recall at query time) at the cost of slower indexing. Typical range: 100–500.
- efSearch — controls candidate list size during queries. The main dial for the recall-vs-latency tradeoff at query time — you can tune it without rebuilding the index. Typical range: 50–200.
IVF — Inverted File Index
IVF takes a clustering approach. During index build, it runs k-means to partition all vectors into nlist clusters (called Voronoi cells). Each cluster has a centroid. At query time, you compute the distance from your query to all centroids (fast — there are only nlist of them), pick the nprobe closest centroids, and then do exact search only within those clusters.
The key parameters for IVF:
- nlist — the number of clusters. More clusters = finer partitioning = faster search per query, but the index requires more training time and memory for centroids. A rule of thumb:
nlist ≈ sqrt(n)where n is your dataset size. - nprobe — how many clusters to search at query time. The main recall-vs-speed dial: nprobe=1 is fastest (lowest recall), nprobe=nlist is exact search. Most production deployments use 8–64.
IVF requires a training phase before it can index vectors — it needs a representative sample to build the cluster centroids. This makes it less convenient for datasets that grow continuously, but it is highly memory-efficient and works well when your data can be loaded in batch.
IVF + PQ: compression for huge datasets
Product Quantization (PQ) is often combined with IVF (as IVF-PQ or IVFADC) to compress vectors before storing them. PQ splits each vector into sub-vectors and replaces each with a short code from a learned codebook, reducing memory by 4–32× at the cost of some recall. This is how systems search billions of vectors on a single server: the centroid lookup is still exact, but the within-cluster comparison uses compressed codes instead of full floats.
Flat (exact) vs indexed: when to use each
Not every use case needs an ANN index. Choosing the right approach depends on dataset size, latency requirements, and how much you care about guaranteed accuracy.
- Checks every vector
- 100% recall guaranteed
- O(n) — slow at scale
- No index build time
- No extra memory overhead
- Good for < ~100k vectors
- Visits only a small fraction
- 95–99% recall typical
- Sub-linear — fast at scale
- One-time build cost
- Extra RAM for graph/centroids
- Essential above ~500k vectors
Most vector databases offer a Flat index type alongside their ANN options. Use it for:
- Small corpora (under ~100,000 vectors) where brute-force is fast enough and you want perfect accuracy.
- Gold-standard evaluation — running flat search offline to measure what 100% recall looks like, then choosing ANN parameters that get you to 95–99% of that quality.
- Reranking pools — after ANN retrieves a top-50 candidate set, flat search over just those 50 vectors to reorder them precisely.
Tuning recall vs. speed in practice
Every vector database exposes the recall-vs-speed knobs. Here's how to use them in Python with Faiss, the open-source ANN library that underlies many vector databases internally.
# pip install faiss-cpu numpy
import faiss
import numpy as np
import time
dim = 768 # embedding dimension (e.g. sentence-transformers)
n_vectors = 500_000
k = 10 # top-k neighbors to retrieve
# Simulate a corpus of embeddings
np.random.seed(42)
corpus = np.random.rand(n_vectors, dim).astype("float32")
query = np.random.rand(1, dim).astype("float32")
# --- Flat (exact) index ---
flat = faiss.IndexFlatL2(dim)
flat.add(corpus)
t0 = time.perf_counter()
_, ids_flat = flat.search(query, k)
print(f"Flat: {(time.perf_counter()-t0)*1000:.1f} ms ids={ids_flat[0][:3]}")
# --- HNSW index ---
hnsw = faiss.IndexHNSWFlat(dim, 32) # M=32
hnsw.hnsw.efConstruction = 200
hnsw.add(corpus) # one-time build
hnsw.hnsw.efSearch = 64 # tune at query time, no rebuild
t0 = time.perf_counter()
_, ids_hnsw = hnsw.search(query, k)
print(f"HNSW: {(time.perf_counter()-t0)*1000:.1f} ms ids={ids_hnsw[0][:3]}")
# Recall@10: how many of the exact top-10 did HNSW return?
recall = len(set(ids_hnsw[0]) & set(ids_flat[0])) / k
print(f"Recall@10: {recall:.2f}")On a typical run with 500k vectors, the flat search might take 300–500 ms while HNSW returns results in 2–5 ms with recall above 0.95. To push recall toward 0.99, raise efSearch; to drop latency further, lower it. You find the right point by measuring your own data — there is no universal "best" setting.
The same pattern applies in managed vector databases. In Pinecone, recall is controlled by the top_k and metadata filtering settings. In Weaviate, you set efConstruction, maxConnections (M), and ef (efSearch) per collection. In Qdrant, the equivalent parameters are m, ef_construct, and query-time ef. The names differ; the concepts are identical.
Going deeper
HNSW and IVF are the workhorses, but they are not the whole story. Several forces are reshaping ANN search in 2025–2026.
Filtered ANN search
Most real applications don't just search all vectors — they apply metadata filters ("only products in category X, priced under $50"). Naively filtering after ANN retrieval fails when the filter is highly selective: you retrieve 100 candidates, filter them down to 2, and miss the 8 that were close but outside the ANN window. Vector databases now offer native filtered ANN, either by pre-filtering before graph traversal or by building per-filter sub-indexes. This is an active research area, with IVF sometimes outperforming HNSW under very low-selectivity filters.
Quantization: shrinking vectors
Storing 1 billion float32 vectors at 768 dimensions requires 3 TB of RAM — impractical. Scalar quantization (SQ8) compresses each float to an 8-bit integer, cutting memory by 4× with minimal recall loss. Binary quantization goes further, compressing each dimension to a single bit (32× compression) at a more significant recall cost. Matryoshka embeddings (like OpenAI's text-embedding-3 models) allow truncating a 3,072-dim vector to 256 dims at index time, scaling the recall-storage tradeoff continuously.
GPU-native ANN
GPU-accelerated ANN (IVF-RaBitQ, FAISS GPU, RAPIDS cuVS) can build and query indexes orders of magnitude faster than CPU-based implementations. This matters for systems that need to rebuild indexes frequently (streaming data) or serve extremely high query throughput. As of 2025–2026, GPU-native ANN is moving from research into production deployments.
The recall floor problem at scale
HNSW's recall degrades as the dataset grows beyond the parameters it was tuned for. An index built for 1 million vectors with M=16 may drop from 0.97 recall to 0.85 when the corpus reaches 100 million — because the graph's local neighborhoods become worse approximations of the true nearest-neighbor structure. Production RAG systems that grow continuously must monitor recall over time and schedule periodic index rebuilds with recalibrated M and efConstruction values.
Understanding ANN search at this level is what separates a vector database user from a vector database operator. The defaults in every managed service are tuned for a median workload. If your application demands sub-10 ms p99 latency at 0.99 recall over 50 million vectors, you need to measure, tune, and monitor — not just accept the defaults. The how to choose a vector database guide covers how these performance characteristics map to specific database choices.
FAQ
What is the difference between approximate and exact nearest neighbor search?
Exact (flat) search computes the distance from your query to every vector in the dataset and is guaranteed to return the true nearest neighbors, but it scales as O(n) — too slow for large corpora. Approximate nearest neighbor search uses an index structure (like HNSW or IVF) to skip most vectors, returning results that are almost always the true nearest neighbors in sub-linear time. The tradeoff is a small, tunable probability of missing one or two ideal matches.
What is HNSW and why do vector databases use it?
HNSW (Hierarchical Navigable Small World) is a multi-layer graph index where each node connects to its nearest neighbors across sparse upper layers and a dense bottom layer. Queries navigate from coarse upper layers down to the dense bottom layer, finding approximate nearest neighbors in near-logarithmic time. Vector databases favor HNSW because it supports incremental inserts (no full rebuild), delivers high recall at low latency, and has only three tuning parameters (M, efConstruction, efSearch).
What does recall mean in vector search?
Recall@k measures the fraction of the true k nearest neighbors that an ANN search actually returned. For example, recall@10 = 0.95 means the ANN search returned 9 or 10 of the 10 true nearest neighbors. Higher recall costs more latency. Most production applications target 0.90–0.97 recall, where the quality difference from perfect recall is imperceptible to users.
What is the difference between HNSW and IVF?
HNSW builds a navigable graph over all vectors and traverses it at query time — no training needed, supports live inserts, uses more RAM. IVF clusters vectors using k-means into partitions (cells) and at query time only searches the nprobe closest clusters — requires a training phase, more memory-efficient, works well for batch-loaded datasets. IVF-PQ extends IVF by compressing vectors with Product Quantization for very large datasets.
When should I use flat (exact) search instead of ANN?
Use flat search when your dataset is small (under roughly 100,000 vectors), when you need guaranteed 100% recall for correctness-critical applications, or as an offline benchmark to measure what perfect recall looks like before choosing ANN parameters. For datasets above ~500,000 vectors or latency requirements under 20 ms, an ANN index is almost always necessary.
How do I tune ANN parameters in a vector database?
The workflow is: (1) run flat search on a sample of real queries to get the ground-truth top-k results; (2) run your ANN index with default parameters and compute recall@k against the ground truth; (3) raise or lower the query-time parameter (efSearch for HNSW, nprobe for IVF) until you hit your recall target; (4) measure p50/p99 query latency at that setting. Only rebuild the index (changing M or efConstruction/nlist) if you can't reach your recall target by adjusting the query-time knob.