How to Choose an Embedding Model: OpenAI, Voyage, or Open Source

Get a decision framework — quality benchmarks, dimensions, price, and hosting — for picking the right embedding model for your project.

INTERMEDIATE | 13 MIN READ | UPDATED 2026-06-12

In plain English

Choosing an embedding model is a lot like choosing a translator. Two translators might both speak French, but one handles legal contracts beautifully while the other excels at casual conversation. An embedding model does exactly one job: turn a piece of text into a list of numbers that represents its meaning. The question is which model to trust with that job — and the answer depends on your quality requirements, your budget, how much data you'll embed, and whether you can accept an external API dependency.

The market breaks into three camps. API-hosted commercial models — OpenAI, Voyage AI, Cohere — give you one line of code and a monthly bill. Open-source models — BGE-M3, Nomic Embed, Qwen3-Embedding — run on your own hardware with no per-token cost and no data leaving your network. And a growing hybrid tier sits in between: open-weight models with commercial-grade MTEB scores that you can self-host or rent by the GPU-hour. There is no single winner. The right model is the cheapest one that clears your quality bar, fits your latency budget, and doesn't lock you into a dependency you'll regret.

Why the choice matters more than you'd expect

Most teams pick an embedding model in five minutes — grab the OpenAI default, add it to the LangChain boilerplate, ship. That works fine for prototypes. But at production scale or production quality, the model choice ripples through every layer of your system in ways that are expensive to reverse.

Quality: not all vectors are equal

A weak embedding model produces vectors where semantically related documents don't actually land near each other. Your RAG pipeline retrieves the wrong chunks, the LLM answers with irrelevant context, and users notice that the system sounds confident but wrong. Retrieval quality is often the single biggest lever on end-to-end RAG accuracy: bigger than chunk size tuning, bigger than prompt engineering, bigger than reranker selection. Choosing a model that scores 5 MTEB retrieval points higher can be worth more than a week of pipeline tuning.

Cost: the math compounds fast

Pricing spans about two orders of magnitude once self-hosting is in the picture. text-embedding-3-small costs $0.02 per million tokens. voyage-4-large costs $0.12 per million tokens, 6x more. Self-hosted BGE-M3 on a spot A100 runs roughly $0.001 per million tokens, 20x cheaper than the OpenAI budget option. If you're embedding 100 million tokens a month, that is roughly $12 (voyage-4-large) vs $2 (text-embedding-3-small) vs $0.10 (self-hosted, GPU time only). The gap is large in relative terms but small in absolute dollars until volume grows. The crossover where self-hosting becomes economically rational is roughly 500 million to 1 billion tokens per month, assuming you already have MLOps infrastructure.
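The arithmetic is easy to rerun for your own volume. The sketch below uses only the per-million-token prices quoted in this section; the function name and the dictionary keys are illustrative, and the self-hosted figure is the rough spot-GPU estimate from the text, not a measured number.

```python
# Prices in USD per million tokens, as quoted in this article.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "voyage-4-large": 0.12,
    "bge-m3-self-hosted": 0.001,  # rough spot-A100 estimate, GPU time only
}


def monthly_cost(tokens_per_month: int, price_per_m: float) -> float:
    """Return the monthly embedding bill in USD for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_m


if __name__ == "__main__":
    volume = 100_000_000  # tokens per month; change to your own estimate
    for name, price in PRICE_PER_M_TOKENS.items():
        print(f"{name:26s} ${monthly_cost(volume, price):,.2f}/month")
```

Swapping in your own volume makes it clear whether the price difference between models is worth optimizing at all.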

Lock-in: dimensions are a contract

Once you embed a corpus with a model that outputs 1,536-dimensional vectors, every vector in your database is in that model's specific coordinate system. If you switch to a different model, the old vectors become meaningless — a query embedded with model B will point to a completely different neighborhood in space than the same query embedded with model A. Migration means re-embedding your entire corpus from scratch. Pick a model you're willing to live with for at least 12-18 months.
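One cheap safeguard is to record which model and dimension count produced an index, and to refuse queries embedded any other way. The following is a minimal sketch of that idea; `EmbeddingSpace` and `assert_same_space` are hypothetical names, not part of any vector database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpace:
    """Identifies the coordinate system a set of vectors lives in."""
    model: str
    dimensions: int


def assert_same_space(index_space: EmbeddingSpace, query_space: EmbeddingSpace) -> None:
    """Raise if a query vector comes from a different model or dimension count."""
    if index_space != query_space:
        raise ValueError(
            f"Index built with {index_space}, query embedded with {query_space}; "
            "re-embed the corpus before switching models."
        )
```

Storing this metadata next to the index turns a silent quality failure into an immediate, explicit error.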

How to read the benchmarks

MTEB (Massive Text Embedding Benchmark) is the industry-standard scorecard for embedding models. It evaluates models across up to 56 tasks — retrieval, classification, clustering, semantic similarity, reranking, and more — and reports an average score. The headline number you see on leaderboards is that average. Higher is better, but the fine print matters.

nDCG@10: the metric that matters for RAG

For RAG and semantic search specifically, focus on the Retrieval subtask score, measured with nDCG@10 (Normalized Discounted Cumulative Gain at rank 10). This metric asks: of the 10 documents your model retrieved, how many were actually relevant, and were the most relevant ones at the top? A model that retrieves 8 relevant documents but buries the best one at position 9 scores lower than a model that puts the best result first. An nDCG@10 score of 0.60 is a useful threshold — below that, retrieval quality tends to feel noticeably weak in practice.
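To make the metric concrete, here is a small sketch of nDCG@k in plain Python. It assumes linear gain (the relevance label itself) and computes the ideal ordering from the labels you pass in, so the input should contain the labels of all judged documents in ranked order. Some benchmark implementations use an exponential gain instead.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Same single relevant document, ranked first vs. ranked third.
best_first = ndcg_at_k([1, 0, 0])
best_buried = ndcg_at_k([0, 0, 1])
```

The two example rankings contain the same relevant document, but the one that places it first scores higher because lower positions are discounted logarithmically.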

// From MTEB score to production decision

1. MTEB Retrieval score: nDCG@10 on your domain's language
2. Domain fit check: code, legal, medical, multilingual?
3. Latency budget: API (~50-200ms) vs local (~5-20ms)
4. Cost at your scale: $0.02-$0.18/M tokens vs self-host
5. Licensing & data privacy: can data leave your network?
6. Model selected: lock in dimensions and re-embed cadence

Why the average score can mislead you

A model that performs consistently across all 56 tasks can outscore a specialized retrieval model in the headline average, while being weaker at the one task you care about. Conversely, a model optimized for English retrieval may tank on the clustering and classification tasks that inflate the average. Before trusting any headline number, drill into the Retrieval column and — if your use case is multilingual — the Multilingual Retrieval subtask separately.

The model landscape: who makes what

OpenAI: the safe default

OpenAI offers two embedding models: text-embedding-3-small (1,536 dimensions, $0.02/M tokens) and text-embedding-3-large (3,072 dimensions, $0.13/M tokens). Both support Matryoshka Representation Learning (MRL): you can request fewer dimensions (e.g., 256 or 512) and the truncated vector still performs well, saving storage and speeding up similarity search. text-embedding-3-small is one of the most widely deployed embedding models in production and a sensible starting point for most teams. text-embedding-3-large trades cost for meaningfully better retrieval quality and is worth the premium when your corpus is long, nuanced, or highly technical.

Voyage AI: highest retrieval quality per dollar

Voyage AI launched the Voyage 4 family in January 2026, built on a Mixture-of-Experts (MoE) architecture. The flagship voyage-4-large costs $0.12/M tokens, supports 2,048/1,024/512/256 dimensions via MRL, has a 32K token context window, and consistently ranks near the top of MTEB retrieval subtasks. The budget option voyage-4-lite matches OpenAI at $0.02/M tokens and approaches the retrieval quality of the older voyage-3.5. Voyage also offers specialized models: voyage-code-3 for code retrieval and voyage-context-3 for long-document tasks. All new accounts get 200 million free tokens to evaluate the family before committing.

Cohere: multimodal and multilingual

Cohere's embed-v3 (English and Multilingual variants) outputs 1,024 dimensions at $0.10/M tokens. It is notably a multimodal model — it can embed both text and images in the same vector space, making it a natural fit for product search where users submit photo queries against text-described inventory. The English variant has a 512-token context window, which limits it for long-document retrieval unless you pre-chunk aggressively. The multilingual variant covers 100+ languages and is one of the stronger options for global applications that need semantic search to work equally well in English, French, Spanish, and Mandarin.

Open source: maximum control, zero per-token cost

The open-source tier has closed the gap with commercial models substantially. Qwen3-Embedding-8B (Apache 2.0, Alibaba) scores approximately 70.6 on the MTEB composite and ranked No. 1 on the MTEB Multilingual Leaderboard as of June 2025 — better than every commercial API option at time of writing, at zero per-token cost once hosted. It comes in 0.6B, 4B, and 8B parameter sizes; the 0.6B fits on any GPU with 8GB VRAM and still outperforms models twice its size on multilingual retrieval. BGE-M3 (Apache 2.0, BAAI) supports retrieval in 100+ languages and excels at long-context tasks with up to 8,192 input tokens. Nomic Embed v2 uses a Mixture-of-Experts architecture trained on 1.6 billion contrastive pairs and delivers strong multilingual quality. The standard Python library to run these locally is sentence-transformers.

Model	Dimensions	Context	Price/M tokens	License
text-embedding-3-small	1,536 (MRL)	8,192	$0.02	Proprietary
text-embedding-3-large	3,072 (MRL)	8,192	$0.13	Proprietary
voyage-4-large	2,048 (MRL)	32,000	$0.12	Proprietary
voyage-4-lite	1,024 (MRL)	32,000	$0.02	Proprietary
Cohere embed-v3	1,024	512	$0.10	Proprietary
Qwen3-Embedding-8B	4,096 (user-defined)	32,768	self-host	Apache 2.0
BGE-M3	1,024	8,192	self-host	Apache 2.0
Nomic Embed v2	768	512	self-host	Apache 2.0

A practical decision framework

Run through these questions in order. Each one can eliminate a whole category of models before you spend time benchmarking.

1. Can your data leave your network?

If you work with sensitive documents — medical records, legal briefs, proprietary code, PII — and your organization prohibits sending data to third-party APIs, you are in the open-source lane regardless of benchmark scores. Deploy BGE-M3, Qwen3-Embedding, or Nomic Embed on your own infrastructure via sentence-transformers or Hugging Face's Text Embeddings Inference (TEI) server. This constraint alone eliminates the entire commercial API tier.

2. What is your volume?

Below 50 million tokens per month, the API cost difference is negligible: at 50 million tokens, text-embedding-3-small costs about $1/month and voyage-4-large about $6/month. Use the API. Between 50 million and 500 million tokens per month, do the math: roughly $1 to $10/month on text-embedding-3-small or $6 to $60/month on voyage-4-large, versus the GPU cost to self-host a small model. Above 500 million tokens per month, self-hosting almost always wins economically if you have MLOps capacity.

3. Is your content domain-specific?

General-purpose MTEB scores are measured on Wikipedia, news, and Common Crawl data, not your proprietary corpus. A model that ranks second on MTEB might outperform the top-ranked model on your domain because its training data happened to include more similar text. For code retrieval, consider voyage-code-3, which is built for that task, or text-embedding-3-large with code-specific chunking. For legal or medical text, run a small offline benchmark on 50-100 representative queries from your actual corpus before committing.

4. Do you need multilingual support?

If users query in multiple languages or your documents span languages, multilingual performance matters more than the English-only MTEB retrieval score. The open-source Qwen3-Embedding family leads the MTEB Multilingual Leaderboard. Among APIs, Cohere embed-v3 Multilingual and voyage-4-large both handle 100+ languages well. Avoid text-embedding-3-small for mixed-language corpora — it was optimized primarily for English.

5. What is your latency requirement?

OpenAI, Voyage, and Cohere APIs typically return in 50-200ms per batch request over the network. A well-provisioned self-hosted model on a T4 GPU delivers 10-20ms per request. If you are embedding live user queries at sub-100ms end-to-end targets, network round trips to an external API eat most of that budget. In latency-critical stacks, run the embedding model on the same server as your vector index — often the strongest argument for self-hosting even at low volume.

// API vs. self-hosted embedding tradeoffs

API (OpenAI / Voyage / Cohere)

Zero setup, one API key
Pay per million tokens
50-200ms network latency
Data leaves your network
Model updates handled for you
Switching models is a config change (plus a corpus re-embed)

Self-hosted (BGE / Qwen3 / Nomic)

Requires GPU + MLOps
~$0.001/M tokens at scale
5-20ms local latency
Data stays in your network
You manage model updates
Switching requires re-embedding
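The hosting side of the framework can be written down as a few ordered checks. The sketch below is one possible encoding, not a definitive rule set: `choose_hosting` and its parameters are hypothetical, and the two thresholds are taken from the figures quoted in this article (500 million tokens per month, and the 50ms low end of typical API latency). Questions 3 and 4 affect which model you pick within a lane, so they are left out here.

```python
API_MIN_LATENCY_MS = 50          # low end of the 50-200ms API range quoted above
SELF_HOST_VOLUME = 500_000_000   # tokens/month threshold quoted above


def choose_hosting(
    data_can_leave_network: bool,
    tokens_per_month: int,
    embed_latency_budget_ms: float,
    has_mlops: bool,
) -> str:
    """Return "self-hosted" or "api" by applying the checks in order."""
    if not data_can_leave_network:
        return "self-hosted"  # privacy constraint overrides everything else
    if embed_latency_budget_ms < API_MIN_LATENCY_MS:
        return "self-hosted"  # a network round trip alone would blow the budget
    if tokens_per_month > SELF_HOST_VOLUME and has_mlops:
        return "self-hosted"  # volume justifies running your own GPUs
    return "api"
```

For example, `choose_hosting(True, 10_000_000, 200, False)` returns `"api"`, while the same call with `data_can_leave_network=False` returns `"self-hosted"`.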

Dimensions, Matryoshka, and storage math

Every embedding model outputs a fixed-size vector. A 1,536-dimension float32 vector takes 6 KB of memory. That sounds tiny, but at 10 million documents it becomes 60 GB — which determines whether your vector index fits in RAM (fast) or spills to disk (slow). Choosing a model with fewer dimensions, or using a model that supports Matryoshka Representation Learning (MRL), can cut storage by 4-8x with minimal quality loss.
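The storage figures above follow from a single multiplication. As a quick sketch (the function name is illustrative, and the result covers raw vectors only, ignoring index overhead and metadata):

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes (1 GB = 10**9 bytes); float32 by default."""
    return num_vectors * dimensions * bytes_per_value / 1e9


full = index_size_gb(10_000_000, 1536)                      # float32, 1,536 dims
small = index_size_gb(10_000_000, 256, bytes_per_value=2)   # float16, 256 dims
```

With 10 million vectors, the full-size float32 index comes to about 61 GB, while 256 dimensions stored as float16 comes to about 5 GB.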

How Matryoshka dimensions work

MRL is a training technique where the model is explicitly trained so that the first N dimensions of a full vector already form a useful, lower-dimensional embedding. With OpenAI's text-embedding-3-large (3,072 dims), you can request just the first 256 dimensions, saving 12x storage, and the truncated vector still performs remarkably well for retrieval. Voyage 4 models also support MRL with dimension options of 2,048/1,024/512/256, and Qwen3-Embedding accepts user-defined output dimensions.

The practical rule of thumb: for most English-only RAG use cases, 768-1,024 dimensions hits the quality-storage sweet spot. Going below 512 starts showing retrieval quality degradation. Going above 1,536 shows diminishing returns — you're adding storage cost without proportional quality gains on standard benchmarks.
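If you already have full-size vectors from an MRL-trained model, truncation can be done client-side: keep the first N values and re-normalize so cosine similarity still behaves. The following NumPy sketch assumes the model was trained with MRL; truncating vectors from a model without that training generally hurts quality much more. The function name is illustrative.

```python
import numpy as np


def truncate_embedding(vec, dims: int) -> np.ndarray:
    """Keep the first `dims` values of one vector (or a batch) and re-normalize to unit length."""
    head = np.asarray(vec, dtype=np.float32)[..., :dims]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.clip(norm, 1e-12, None)  # guard against all-zero vectors
```

The same function works on a single vector or on a `(num_vectors, dims)` matrix, so an existing corpus can be shrunk without calling the embedding model again.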

Quantization: a free 2x storage reduction

Beyond dimensionality, quantization lets you shrink each number in the vector. The default float32 uses 4 bytes per dimension. Switching to float16 or bfloat16 halves storage with essentially zero quality loss, a free win you should almost always take. The Voyage 4 series and BGE-M3 also support int8 quantization (1 byte per dimension, 4x smaller than float32) and binary quantization (1 bit per dimension, 32x smaller). Both come with modest quality trade-offs that are often acceptable for first-pass retrieval when a reranker handles final ranking.
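The storage ratios can be checked with a few lines of NumPy. This is a simplified sketch of the three schemes: the int8 version uses symmetric per-vector scaling, and the binary version keeps only the sign of each dimension. Production systems often calibrate ranges across the whole corpus instead, so treat the function names and the scaling choice as assumptions.

```python
import numpy as np


def to_float16(v: np.ndarray) -> np.ndarray:
    """Half precision: 2 bytes per dimension."""
    return v.astype(np.float16)


def to_int8(v: np.ndarray):
    """Symmetric scalar quantization: 1 byte per dimension plus one scale per vector."""
    scale = np.abs(v).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.round(v / scale).astype(np.int8), scale.astype(np.float32)


def to_binary(v: np.ndarray) -> np.ndarray:
    """Sign-bit quantization: 1 bit per dimension, packed 8 to a byte."""
    return np.packbits(v > 0, axis=-1)


rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)
sizes = {
    "float32": vec.nbytes,
    "float16": to_float16(vec).nbytes,
    "int8": to_int8(vec)[0].nbytes,
    "binary": to_binary(vec).nbytes,
}
```

For a 1,024-dimension vector, `sizes` holds 4,096, 2,048, 1,024, and 128 bytes respectively, which matches the 2x, 4x, and 32x reductions.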

Going deeper

Fine-tuning for domain adaptation

The largest quality gains for specialized corpora come not from choosing a bigger model, but from fine-tuning an existing model on your domain. Contrastive fine-tuning on even 1,000-5,000 (query, positive document, hard negative) triplets from your corpus can improve retrieval nDCG@10 by 5-15 points on your specific task — more than the difference between any two commercial API models. The sentence-transformers library makes this straightforward with its MultipleNegativesRankingLoss. This is the path most teams take once they've validated pipeline quality on a general model.
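To show what this loss optimizes, here is a NumPy restatement of the idea: each query should score its own positive document higher than every other document in the batch, which serve as negatives. This is an illustration only, not the library's implementation. It covers in-batch negatives but not explicit hard negatives, and the scale of 20 on cosine similarity is assumed to mirror the library default.

```python
import numpy as np


def multiple_negatives_ranking_loss(query_embs, doc_embs, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities; row i's positive is doc i."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                           # (batch, batch) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # diagonal = correct pairs


queries = np.eye(4)
aligned_loss = multiple_negatives_ranking_loss(queries, np.eye(4))
shuffled_loss = multiple_negatives_ranking_loss(queries, np.roll(np.eye(4), 1, axis=0))
```

When every query matches its own document the loss is near zero, and when the pairs are shuffled it is large. Fine-tuning pushes the model from the second situation toward the first on your domain's data.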

Late interaction: ColBERT and beyond

Standard bi-encoder embeddings compress an entire document into a single vector. ColBERT (Contextualized Late Interaction over BERT) keeps one vector per token and computes similarity at query time by matching query token vectors against document token vectors. This multi-vector approach is significantly more accurate but also significantly more expensive to store and query. It is worth considering for high-value retrieval tasks (legal discovery, medical literature) where a reranker isn't enough and you can afford the storage overhead.
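The scoring step described above is often called MaxSim: for each query token, take its best-matching document token, then sum those maxima. A minimal NumPy sketch of the scoring function only follows. Producing the per-token vectors requires a ColBERT-style model, which is out of scope here, and the function name is illustrative.

```python
import numpy as np


def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction score: sum over query tokens of the max cosine similarity to any doc token."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed
```

The storage cost follows directly from the input shapes: a document is stored as a `(num_tokens, dims)` matrix, not as a single `dims`-length vector.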

Shared embedding spaces

A 2026 development worth watching: Voyage AI's Voyage 4 series introduced shared embedding spaces, where multiple model variants (large, base, lite) map text into the same coordinate system. This means you can embed your corpus with the cheap voyage-4-lite model, then query with the more expensive voyage-4-large at retrieval time — getting quality improvements on queries without re-embedding millions of documents. It is the first API offering to solve the ingestion-vs-query quality trade-off directly.

Benchmarking on your own data

No external benchmark substitutes for measuring on your real corpus. A minimal offline evaluation: take 50-100 queries you know the answer to (question + correct document), embed both with each candidate model, run cosine similarity retrieval, and compute recall@5 and nDCG@10. Even this quick test on your own data is more predictive than MTEB scores from Wikipedia and news corpora. Budget 2-3 hours and $10-20 in API fees for this evaluation — it will pay back in weeks.

Quick offline benchmark with sentence-transformers (Python):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Replace with your model of choice
model = SentenceTransformer("BAAI/bge-m3")

# Your test set: queries, a corpus, and the index of the correct document for each query
queries = ["What is our refund policy?"]
corpus = [
    "Refunds are accepted within 30 days of purchase.",
    "We ship worldwide via FedEx and DHL.",
    "Contact support at help@example.com.",
]
correct_idx = [0]  # query 0 should retrieve corpus[0]

q_embs = model.encode(queries, normalize_embeddings=True)
c_embs = model.encode(corpus, normalize_embeddings=True)

# Rows are queries, columns are corpus documents
scores = cosine_similarity(q_embs, c_embs)
for i, (q, expected) in enumerate(zip(queries, correct_idx)):
    ranked = np.argsort(scores[i])[::-1]  # document indices, best match first
    rank = list(ranked).index(expected) + 1  # 1-based rank of the correct document
    print(f"Query: {q!r}  |  Correct doc rank: {rank}/{len(corpus)}")
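The script above prints one rank per query. To compare candidate models you will usually want a single number per model. The sketch below aggregates ranks into recall@k and mean reciprocal rank, assuming a `scores` matrix shaped like the one above (queries by documents) and exactly one correct document per query. The helper names are illustrative.

```python
import numpy as np


def ranks_of_correct(scores: np.ndarray, correct_idx) -> list:
    """1-based rank of each query's correct document, given a (queries x docs) score matrix."""
    order = np.argsort(-scores, axis=1)  # document indices, best match first
    return [int(np.where(order[i] == c)[0][0]) + 1 for i, c in enumerate(correct_idx)]


def recall_at_k(ranks, k: int = 5) -> float:
    """Fraction of queries whose correct document appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks) -> float:
    """Average of 1/rank; rewards putting the correct document near the top."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

Run the same query set through each candidate model and compare these aggregates side by side.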

FAQ

What is a good MTEB score for a production embedding model?

For English retrieval tasks, an nDCG@10 score above 0.60 on the MTEB Retrieval subtask is generally acceptable for production use. Scores above 0.65 are strong. The absolute numbers shift as the benchmark evolves — what matters more is your model's relative score versus the current top performers on the specific subtask (Retrieval, Multilingual, or both) that matches your use case.

Is text-embedding-3-small good enough for RAG, or should I use text-embedding-3-large?

text-embedding-3-small is good enough for most RAG applications on general English text. Upgrade to text-embedding-3-large if your corpus is highly technical (code, science, law), if queries are long and nuanced, or if you're seeing noticeable retrieval misses at the small model's quality level. The 6.5x price difference only becomes material above roughly 50 million tokens per month.

Can I mix embeddings from different models in the same vector database?

No. Embeddings from different models live in different coordinate systems — mixing them produces meaningless similarity scores. Your corpus and all query embeddings must use exactly the same model. If you switch models, you must re-embed your entire corpus before querying.

How do Matryoshka embeddings work and should I use them?

Matryoshka Representation Learning (MRL) trains a model so the first N dimensions of a full vector are already a useful lower-dimensional embedding. You can request 256 dimensions from a 1,536-dim model and get a 6x storage reduction with modest quality loss. Use MRL dimensions when storage or query latency is a bottleneck. Start at 512 or 768 — going below 256 starts hurting retrieval quality noticeably.

When does self-hosting an open-source embedding model make sense?

Self-hosting makes sense in three scenarios: (1) your data cannot leave your network due to security or compliance requirements, (2) you're embedding more than 500 million tokens per month and the GPU cost undercuts API fees, or (3) you need sub-20ms embedding latency and can't afford the network round trip to an external API. For everything else, start with an API and revisit at scale.

Does a higher-dimensional embedding model always give better retrieval?

Not necessarily. Benchmark results generally show diminishing returns above roughly 768 dimensions for most tasks: doubling dimensions from 768 to 1,536 typically yields less than one MTEB point of improvement while doubling storage costs. What matters more than raw dimensionality is the model's training quality and domain fit. A 768-dim model that was fine-tuned on your domain will often beat a 3,072-dim general model.
