In plain English
FAISS (Facebook AI Similarity Search) is an open-source library from Meta's Fundamental AI Research group. It does one thing: given a large collection of vectors, find the ones closest to a query vector — fast. It is not a database, not a server, and has no REST API. It is a C++ library with Python bindings that lives inside your process, reads your vectors from memory, and hands back results as arrays of indices and distances.

A useful analogy: imagine a phone book printed on paper. A database is the whole phone book plus a librarian who files new entries, handles deletions, enforces permissions, and can answer questions like "show me everyone whose last name starts with S." FAISS is only the alphabetical ordering inside the book, the part that organizes names so you can find one fast. It gives you raw lookup speed at a fraction of the overhead, but you are responsible for everything the librarian normally handles: keeping the book on a shelf, adding new pages, handling concurrent readers, and knowing which page maps to which entry.
This design is a feature, not a bug. Many production systems (recommendation engines at Netflix scale, image deduplication pipelines, research prototypes, and real-time RAG services) need maximum search throughput with minimum overhead, and running a separate database server alongside the model is more machinery than the problem needs. FAISS gives you control.
Why it matters
The core problem FAISS solves is nearest-neighbor search at scale. If you have a thousand vectors, comparing your query to all of them is trivial: even a plain Python loop finishes quickly. Stretch that to ten million vectors of 768 dimensions each (about 30 GB of float32 data) and a naive brute-force scan can take seconds per query, far too slow for real-time use. FAISS is built around specialized indexes and C++/BLAS kernels that can bring this down to milliseconds, and its compressed indexes extend to billion-vector collections.
Four specific scenarios where FAISS wins over a full vector database:
- You already manage infrastructure yourself. FAISS is a library dependency (pip install faiss-cpu), not a service to run and monitor. If your system already manages its own state (a compiled model server, a Lambda function, an offline batch job), adding a managed database is unnecessary complexity.
- You need raw throughput. FAISS indexes live in process memory; a managed database adds a network round-trip, serialization, and connection overhead. For high-QPS search services where latency budgets are tight, in-process search can be up to an order of magnitude faster, depending on payload size and network distance.
- You need GPU acceleration. FAISS ships GPU-native implementations of its key indexes. You can move a flat or IVF index onto a GPU in one call, as long as it fits in GPU memory, something almost no database product exposes directly.
- You are building a vector database. Milvus builds on FAISS directly, and engines such as Weaviate and Vespa implement the same families of algorithms (HNSW graphs, quantization). Understanding FAISS means understanding what is happening underneath the managed layer.
The tradeoff is real: you get speed and flexibility, but you give up metadata filtering, automatic persistence, horizontal sharding, real-time updates, and authentication. FAISS does not know what a "document" is. It stores vectors and their integer IDs, nothing else. Any time you need to filter by date, author, or document type, you maintain a separate lookup table yourself and post-filter the FAISS results.
How it works
Every FAISS workflow follows the same three steps: choose an index, train it on your data (some indexes require this), and add vectors. Then at query time you call index.search(query_vectors, k) and receive two arrays: distances and integer IDs. That's the whole API.
The index zoo
FAISS ships a large family of index types. They all expose the same search interface but differ in how they trade off speed, memory, and recall (the fraction of true nearest neighbors actually returned). Understanding four of them covers the large majority of real use cases.
| Index | Strategy | Memory | Speed | Recall | Best for |
|---|---|---|---|---|---|
| IndexFlatL2 / IndexFlatIP | Brute-force exact | High (full float32) | Slow at scale | 100% | < 1M vectors; ground-truth baseline |
| IndexIVFFlat | Cluster + search top nprobe clusters | Medium | Fast | ~95-99% | 1M – 100M vectors, tunable accuracy |
| IndexHNSWFlat | Navigable graph traversal | High (graph + vectors) | Very fast | ~98-99% | Latency-sensitive; no training needed |
| IndexIVFPQ | IVF clustering + product quantization | Low (compressed) | Fast | ~85-95% | Billion-scale; memory-constrained |
IndexFlatL2 compares the query to every stored vector using exact L2 (Euclidean) distance. It always returns the true nearest neighbors. Use it to verify correctness of a faster index, or when your collection is small enough that brute force is acceptable. IndexFlatIP is the inner-product variant — use it when your embedding model was trained with dot-product similarity.
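To make the brute-force strategy concrete, here is a minimal NumPy sketch of an exact search that returns results in the same shape FAISS does: one array of distances and one of indices. The helper name brute_force_l2 is hypothetical, and the sketch illustrates the computation only, not how FAISS implements it internally. Like IndexFlatL2, it reports squared L2 distances.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Exact k-nearest-neighbor search by squared L2 distance.

    xb: (N, d) stored vectors, xq: (nq, d) queries.
    Returns (distances, indices), each of shape (nq, k).
    """
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    sq_dists = (
        (xq ** 2).sum(axis=1, keepdims=True)
        - 2.0 * xq @ xb.T
        + (xb ** 2).sum(axis=1)
    )
    idx = np.argsort(sq_dists, axis=1)[:, :k]      # k smallest per query
    return np.take_along_axis(sq_dists, idx, axis=1), idx

rng = np.random.default_rng(0)
xb = rng.random((1000, 16), dtype=np.float32)
xq = xb[:3].copy()             # query with three of the stored vectors
D, I = brute_force_l2(xb, xq, k=5)
print(I[:, 0])                 # each query's nearest neighbor is itself
```

The cost is one distance computation per stored vector per query, which is exactly the work the approximate indexes below try to avoid.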
IndexIVFFlat partitions the vector space into nlist Voronoi cells using k-means clustering during training. At search time it only visits the nprobe closest cells instead of the entire dataset, dramatically reducing comparisons. Raising nprobe toward nlist approaches brute-force recall; lowering it toward 1 maximizes speed. A common starting point: nlist = 4 * sqrt(N) where N is the number of vectors, and nprobe = 16 for a first benchmark.
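The train, add, and probe steps can be sketched in plain NumPy. This is a toy illustration under simplifying assumptions (a few rounds of Lloyd's k-means, one query at a time); the function names are hypothetical and it is not FAISS's implementation.

```python
import numpy as np

def l2sq(a, b):
    """Pairwise squared L2 distances between rows of a and rows of b."""
    return (a ** 2).sum(1, keepdims=True) - 2.0 * a @ b.T + (b ** 2).sum(1)

def train_centroids(x, nlist, iters=10, seed=0):
    """A few rounds of Lloyd's k-means; returns (nlist, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), nlist, replace=False)].copy()
    for _ in range(iters):
        assign = l2sq(x, centroids).argmin(1)
        for c in range(nlist):
            members = x[assign == c]
            if len(members):           # keep the old centroid if a cell is empty
                centroids[c] = members.mean(0)
    return centroids

def ivf_search(x, centroids, lists, q, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to q."""
    cells = l2sq(q[None, :], centroids)[0].argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in cells])
    dist = l2sq(q[None, :], x[cand])[0]
    order = dist.argsort()[:k]
    return dist[order], cand[order]

rng = np.random.default_rng(1)
x = rng.random((5000, 16), dtype=np.float32)
nlist = 32
centroids = train_centroids(x, nlist)                  # "train"
assign = l2sq(x, centroids).argmin(1)                  # "add": file each vector in its cell
lists = [np.flatnonzero(assign == c) for c in range(nlist)]

q = rng.random(16, dtype=np.float32)
exact = l2sq(q[None, :], x)[0].argsort()[:10]
recalls = {}
for nprobe in (1, 4, nlist):
    _, ids = ivf_search(x, centroids, lists, q, 10, nprobe)
    recalls[nprobe] = len(set(ids) & set(exact)) / 10  # recall vs. exact search
    print(nprobe, recalls[nprobe])
```

Probing every cell (nprobe equal to nlist) scans the whole dataset and recovers the exact result; smaller values trade recall for fewer comparisons.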
IndexHNSWFlat builds a Hierarchical Navigable Small World graph over the vectors. Search starts at the top of the hierarchy and greedily walks toward the query through progressively denser layers. HNSW requires no training step and delivers excellent recall, but it stores the full graph structure in memory on top of the raw vectors, making it the most RAM-heavy option.
IndexIVFPQ combines IVF partitioning with Product Quantization (PQ) compression. PQ splits each vector into M sub-vectors and replaces each sub-vector with an 8-bit code pointing to the nearest centroid in a learned codebook for that sub-vector. The result: a 768-dimension float32 vector (3,072 bytes) can be compressed to as few as 64 bytes — a 48x reduction — while remaining searchable.
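The compression figure follows from simple arithmetic, sketched below. The helper names are hypothetical, and the totals count only the vector payload; they leave out the stored IDs, the IVF centroids, and the PQ codebooks.

```python
def bytes_per_vector_float32(d):
    return d * 4                       # 4 bytes per float32 component

def bytes_per_vector_pq(m, nbits=8):
    return m * nbits // 8              # one nbits-wide code per sub-vector

d, m = 768, 64
raw = bytes_per_vector_float32(d)
pq = bytes_per_vector_pq(m)
print(raw, pq, raw // pq)              # 3072 64 48

n = 1_000_000_000                      # one billion vectors
print(f"{n * raw / 1e12:.2f} TB raw vs {n * pq / 1e9:.0f} GB as PQ codes")
```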
Python quickstart
The CPU package installs in one command. The GPU package is distributed through conda only (pip GPU builds were discontinued after version 1.7.3).
# CPU (pip)
pip install faiss-cpu
# CPU (conda, recommended)
conda install -c pytorch -c conda-forge faiss-cpu
# GPU (conda only)
conda install -c pytorch -c nvidia -c conda-forge faiss-gpu
Below is a complete example that creates three index types, indexes the same random data into each, and compares recall against the brute-force baseline.
import numpy as np
import faiss
# ---------------------------------------------------------------
# Setup: 100 000 vectors of dimension 128, float32
# ---------------------------------------------------------------
d = 128 # dimension
N = 100_000 # number of stored vectors
k = 10 # nearest neighbors to retrieve
np.random.seed(42)
vectors = np.random.random((N, d)).astype("float32")
query = np.random.random((1, d)).astype("float32")
# ---------------------------------------------------------------
# 1. Exact baseline: IndexFlatL2
# ---------------------------------------------------------------
flat = faiss.IndexFlatL2(d)
flat.add(vectors)
D_exact, I_exact = flat.search(query, k)
print("Exact top-10 IDs:", I_exact[0])
# ---------------------------------------------------------------
# 2. IVF with nlist=256, nprobe=8
# ---------------------------------------------------------------
quantizer = faiss.IndexFlatL2(d) # coarse quantizer
ivf = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_L2)
ivf.train(vectors) # required: learns cluster centroids
ivf.add(vectors)
ivf.nprobe = 8 # search 8 of 256 clusters
D_ivf, I_ivf = ivf.search(query, k)
# ---------------------------------------------------------------
# 3. HNSW (no training needed)
# ---------------------------------------------------------------
hnsw = faiss.IndexHNSWFlat(d, 32) # 32 = M (graph degree)
hnsw.hnsw.efConstruction = 200
hnsw.add(vectors)
hnsw.hnsw.efSearch = 64
D_hnsw, I_hnsw = hnsw.search(query, k)
# ---------------------------------------------------------------
# Recall vs baseline
# ---------------------------------------------------------------
exact_set = set(I_exact[0])
print(f"IVF recall: {len(set(I_ivf[0]) & exact_set) / k:.0%}")
print(f"HNSW recall: {len(set(I_hnsw[0]) & exact_set) / k:.0%}")
# ---------------------------------------------------------------
# Save / load (manual persistence)
# ---------------------------------------------------------------
faiss.write_index(hnsw, "hnsw.index")
hnsw2 = faiss.read_index("hnsw.index")
GPU in four extra lines
Moving an index to a GPU is a near-drop-in replacement. FAISS handles transfers to and from GPU memory automatically.
import faiss
# Reuses vectors and query (dimension 128) from the quickstart above
res = faiss.StandardGpuResources()
cpu_index = faiss.IndexFlatL2(128)
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index) # 0 = first GPU
gpu_index.add(vectors)
D, I = gpu_index.search(query, 10)
GPU FAISS supports GpuIndexFlatL2, GpuIndexFlatIP, GpuIndexIVFFlat, GpuIndexIVFScalarQuantizer, and GpuIndexIVFPQ. HNSW does not have a GPU implementation, as its pointer-chasing graph traversal does not parallelize well on GPU hardware.
FAISS vs. managed vector databases
The question "should I use FAISS or a vector database?" comes down to what you are willing to build yourself. FAISS gives you the search kernel; a vector database wraps that kernel in a durable, queryable, scalable product.
| FAISS | Managed vector database |
|---|---|
| In-process, no server | Standalone server or managed cloud |
| No metadata filtering built-in | Built-in metadata + hybrid search |
| Manual save/load for persistence | Automatic durability + WAL |
| Limited real-time add/delete (index-dependent) | Real-time upserts and deletes |
| GPU-native acceleration | Vendor-managed hardware |
| Zero network overhead | Network latency per query |
| You own the glue code | API handles the glue |
The sweet spot for FAISS: batch pipelines and research experiments where you build the index once (perhaps nightly), load it into a serving process, and query it at high throughput with no mutations. Classic examples include offline recommendation engines, document deduplication jobs, and RAG services where the knowledge base changes slowly.
Vector databases earn their keep when your data changes continuously. FAISS indexes are not designed for constant insertions; IVF-based indexes (IVFFlat, IVFPQ) can lose recall after many add() calls without retraining when the new vectors drift away from the distribution the cluster centroids were trained on. HNSW supports incremental adds without retraining, but FAISS provides no delete operation for it: once a vector is in, it stays. High-churn data almost always warrants a database.
Going deeper
Once you have basic search working, the next layer of FAISS expertise is tuning the accuracy-speed-memory triangle. Every index exposes knobs, and the defaults are conservative. Typical production iterations:
- Establish a recall baseline using IndexFlatL2 on a sample. Record the ground-truth neighbors; you will compare every other index against these.
- Choose nlist for IVF using the rule of thumb nlist ≈ 4 * sqrt(N). For 10M vectors that's roughly 12,649; common values are the nearest power of two or multiple of 256.
- Sweep nprobe from 1 to nlist/4. Plot recall vs. query latency. A recall of 95% at nprobe=16 is a typical sweet spot; 99% at nprobe=64 costs roughly 4x more CPU.
- Add PQ compression if memory is the bottleneck. M=64 sub-vectors with 8-bit codes compress a 768-dim float32 vector from 3,072 bytes to 64 bytes (48x). Recall typically drops 3-8% depending on the data distribution, which is acceptable for many applications.
- Use index_factory for composite indexes: faiss.index_factory(d, "IVF4096,PQ64") combines IVF clustering with PQ compression in one call.
Index selection decision tree
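The selection guidance given throughout this article can be condensed into a small helper that returns an index_factory string. The function name suggest_index and the 10M cutoff are assumptions made for illustration; the 100K threshold and the nlist rule of thumb come from the text. Note that PQ64 requires the vector dimension to be divisible by 64.

```python
import math

def suggest_index(n_vectors, memory_tight=False, can_train=True):
    """Return an index_factory string following this article's rules of thumb.

    The cutoffs (100K and 10M) are illustrative, not FAISS constants.
    """
    if n_vectors < 100_000:
        return "Flat"                        # exact search, no setup
    nlist = int(4 * math.sqrt(n_vectors))    # rule of thumb: nlist ≈ 4 * sqrt(N)
    if n_vectors < 10_000_000:
        if memory_tight:
            return f"IVF{nlist},SQ8"         # scalar quantization, 4x smaller
        if not can_train:
            return "HNSW32"                  # graph index, no training step
        return f"IVF{nlist},Flat"
    # Tens of millions and up: compress when memory is the constraint
    return f"IVF{nlist},PQ64" if memory_tight else f"IVF{nlist},Flat"

for n in (50_000, 1_000_000, 50_000_000):
    print(n, suggest_index(n), suggest_index(n, memory_tight=True))
```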
Scalar quantization as a middle ground
Between full float32 and the aggressive compression of PQ sits scalar quantization (IndexIVFScalarQuantizer). It stores each vector component as an 8-bit integer instead of a 32-bit float, reducing memory by 4x with only a 1-2% recall penalty — often the best first step when memory is tight but you want to avoid the training complexity of PQ. It also has a GPU implementation, making it attractive for GPU-accelerated serving.
The index_factory string DSL
FAISS exposes a small domain-specific language for building composite indexes without chaining constructors. The string "IVF4096,PQ64" means: cluster into 4,096 cells, then compress each vector with 64 PQ sub-vectors. "IVF8192,Flat" means: cluster into 8,192 cells, store full float32 vectors in each cell. "HNSW32" means: HNSW graph with M=32. You can chain preprocessing transforms too — "OPQ64_128,IVF4096,PQ64" applies Optimized Product Quantization before indexing, which can boost recall by rotating the vector space to minimize PQ quantization error.
import faiss
d = 128
# Assumes train_vectors, vectors, and query are float32 NumPy arrays of shape (n, d)
# IVF with PQ compression
index = faiss.index_factory(d, "IVF4096,PQ64")
index.train(train_vectors) # train centroids for IVF cells AND PQ codebooks
index.add(vectors)
index.nprobe = 32
D, I = index.search(query, 10)
# HNSW (no training)
hnsw = faiss.index_factory(d, "HNSW32")
hnsw.add(vectors) # no .train() needed
# OPQ pre-processing + IVF + PQ
opq = faiss.index_factory(d, "OPQ64_128,IVF4096,PQ64")
opq.train(train_vectors)
opq.add(vectors)
Sharding at scale
FAISS itself has no built-in distributed mode, but its IndexShards and IndexReplicas classes let you spread an index across multiple GPUs (or CPUs) in one process. For truly distributed search across machines, you shard the dataset manually: assign each node a partition of the vectors, fan out the query to all shards in parallel, and merge the top-k results. This is the same fan-out-and-merge pattern that distributed engines such as Milvus use. At the billion-vector scale described in the original FAISS paper, a shard of 100M vectors on one GPU with IVF-PQ can return top-100 results in under 10ms.
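The merge step of manual sharding is small enough to sketch with the standard library. The function name and the sample numbers are hypothetical; the sketch assumes each shard returns its own top-k as (distance, id) pairs and that IDs are unique across shards.

```python
import heapq

def merge_shard_results(shard_results, k):
    """Merge per-shard top-k lists into one global top-k.

    shard_results: one list of (distance, global_id) pairs per shard.
    Smaller distance means closer (L2). Returns the k closest pairs overall.
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nsmallest(k, all_hits)

shard_a = [(0.12, 7), (0.40, 3), (0.95, 11)]
shard_b = [(0.05, 1042), (0.33, 1007), (0.80, 1999)]
merged = merge_shard_results([shard_a, shard_b], k=3)
print(merged)   # [(0.05, 1042), (0.12, 7), (0.33, 1007)]
```

For the merged result to match a single-index search, every shard has to return at least k candidates, because the global top-k could in principle come entirely from one shard.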
FAQ
Is FAISS a vector database?
No. FAISS is a library for similarity search — it runs inside your process and has no server, no REST API, no metadata storage, and no built-in persistence. Many vector databases use FAISS algorithms internally, but a database adds durability, filtering, sharding, and an access layer on top of the raw search kernel.
Does FAISS support GPU acceleration?
Yes. FAISS ships GPU implementations of IndexFlatL2, IndexFlatIP, IndexIVFFlat, IndexIVFScalarQuantizer, and IndexIVFPQ. You move a CPU index to the GPU with faiss.index_cpu_to_gpu(res, device_id, index). The GPU package is distributed through conda only; pip GPU builds were discontinued after version 1.7.3.
What is nlist and nprobe in FAISS IVF indexes?
nlist is the number of Voronoi clusters the vector space is partitioned into during training — a larger value gives finer granularity but requires more training time and memory. nprobe is how many of those clusters are searched at query time. Higher nprobe improves recall at the cost of speed; the ratio nprobe / nlist controls the accuracy-speed tradeoff.
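As a rough worked example of that ratio, the sketch below estimates how many stored vectors a query touches. It assumes evenly sized cells, which real data rarely produces, and the helper name is hypothetical.

```python
import math

def ivf_scan_estimate(n_vectors, nlist, nprobe):
    """Rough number of vectors compared per query, assuming evenly sized cells."""
    return n_vectors * nprobe / nlist

n = 10_000_000
nlist = round(4 * math.sqrt(n))          # 12649, from the 4 * sqrt(N) rule of thumb
for nprobe in (1, 16, 64):
    scanned = ivf_scan_estimate(n, nlist, nprobe)
    print(f"nprobe={nprobe:>2}: ~{scanned:,.0f} vectors ({nprobe / nlist:.2%} of the data)")
```

Each query also pays for nlist centroid comparisons up front, so very large nlist values are not free.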
How do I save and load a FAISS index?
Use faiss.write_index(index, "path.index") to save and faiss.read_index("path.index") to reload. FAISS does not auto-save; if your process exits without calling write_index, the index is lost. For durable, always-on systems consider wrapping FAISS in a service that saves on shutdown and loads on startup.
Which FAISS index should I start with?
For fewer than 100K vectors use IndexFlatL2 — it is exact and requires no setup. For 100K to a few million vectors, IndexIVFFlat or IndexHNSWFlat both work well; HNSW skips the training step and tends to have higher out-of-the-box recall. For tens of millions or more vectors with memory constraints, IndexIVFPQ compresses vectors dramatically and is the standard production choice.
Can FAISS do metadata filtering?
Not natively. FAISS stores only vectors and integer IDs; it has no concept of fields, tags, or document types. The standard pattern is to run a FAISS search for a generous top-k (say, 200), then post-filter those IDs against a separate metadata store (Postgres, SQLite, a dict) to get the candidates that satisfy your filter. For complex filtering at scale, a dedicated vector database is often a better fit.
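A minimal sketch of the post-filter pattern, written against one row of search output. The metadata dictionary, the sample IDs and distances, and the function name post_filter are hypothetical stand-ins; in a real service the lookup would hit your own metadata store.

```python
def post_filter(ids, distances, metadata, predicate, k):
    """Keep the first k search hits whose metadata satisfies predicate.

    ids, distances: one row of index.search output, already sorted by distance.
    metadata: mapping from integer ID to a record, maintained outside FAISS.
    """
    kept = []
    for vec_id, dist in zip(ids, distances):
        if vec_id == -1:                  # FAISS pads missing results with -1
            continue
        record = metadata.get(int(vec_id))
        if record is not None and predicate(record):
            kept.append((int(vec_id), float(dist)))
            if len(kept) == k:
                break
    return kept

# Hypothetical metadata store and a stand-in for one row of search results
metadata = {
    0: {"type": "pdf", "year": 2021},
    1: {"type": "html", "year": 2023},
    2: {"type": "pdf", "year": 2024},
    3: {"type": "pdf", "year": 2023},
}
ids = [1, 3, 0, 2, -1]
distances = [0.10, 0.25, 0.31, 0.47, 3.4e38]

hits = post_filter(ids, distances, metadata,
                   lambda r: r["type"] == "pdf" and r["year"] >= 2023, k=2)
print(hits)   # [(3, 0.25), (2, 0.47)]
```

If the filter is very selective, the over-fetched candidates may all be rejected; a common workaround is to retry with a larger top-k when fewer than k hits survive.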