What Is MTEB? Reading the Embedding Leaderboard

Q: What does MTEB stand for?

MTEB stands for **Massive Text Embedding Benchmark**. It's an open benchmark and public leaderboard that scores text embedding models across many task types — retrieval, classification, clustering, reranking, semantic similarity, and more — so they can be compared fairly. Its multilingual extension is called MMTEB.

Q: Should I just use the #1 model on the MTEB leaderboard?

Usually not blindly. The headline rank is an *average across many tasks and domains*, so it doesn't tell you how a model performs on *your* task or *your* data. Use the leaderboard to build a shortlist — sorted by the task type you actually need — then test those few candidates on your own examples to pick the winner.

Q: Does a high MTEB score mean a model is fast or cheap?

No. MTEB measures embedding *quality* on its tasks — not inference cost, latency, rate limits, maximum input length, or vector size. Those production factors often matter more than a point or two of benchmark score, so weigh them separately when you choose a model.

You will understand what MTEB measures, how to read its leaderboard, and why the top model is rarely the right pick for your use case.

INTERMEDIATE10 MIN READUPDATED 2026-06-14

HUGGING FACEhuggingface.co/spaces/mteb embeddings-benchmark/mteb3.3k

In plain English

An embedding model turns text into a list of numbers — a vector — so a computer can compare meaning by measuring how close two vectors sit. But there are hundreds of embedding models, open and closed, big and tiny. Which one should you actually use? You can't eyeball a vector. You need a scoreboard.

MTEB Leaderboard — illustration — MTEB Leaderboard — kaszkowiak.org

MTEB — the Massive Text Embedding Benchmark — is that scoreboard. It runs a large batch of embedding models through dozens of standardized tasks (search, classification, clustering, and more), scores each one, and stacks them in a public leaderboard you can sort and filter. When someone says "this model is #3 on MTEB," this is what they mean.

Think of it like a decathlon, not a 100-metre sprint. A sprint crowns whoever is fastest in one race. A decathlon scores ten very different events and reports an average — so the overall winner might not be the best at any single event you personally care about. MTEB is the decathlon of embeddings: a strong all-rounder tops the table, but the model that wins your specific event may be sitting several rows down.

Why it matters

If you're building semantic search, a RAG pipeline, or a recommendation feature, the embedding model is the quiet component that decides whether the right documents ever reach your application. A bad embedding model means the right answer never gets retrieved — and no amount of clever prompting downstream can fix that. Picking well is high-leverage.

Before MTEB, comparing models was a mess. Every vendor published numbers on a different task, with a different dataset and a different metric, so the claims weren't comparable. MTEB fixed that by giving everyone one shared, reproducible test harness. That's its real contribution: not crowning a single "best" model, but making fair, apples-to-apples comparison possible at all.

It narrows the field fast. Hundreds of models become a sortable, filterable shortlist in minutes.
It exposes tradeoffs. The leaderboard shows score next to model size, embedding dimensions, and license — so you can weigh quality against cost and speed, not just chase the top score.
It's reproducible. Because the tasks and code are public, you can re-run the evaluation yourself, or test a model on the specific tasks that match your use case.

How it works

MTEB evaluates a model the same way for everyone, so the numbers are comparable. The model under test is treated as a black box: it only has to turn text into vectors. MTEB takes those vectors, runs them through a fixed set of tasks with fixed datasets and metrics, and records the scores.

// How a model gets onto the leaderboard

Embedding modelthe thing being testedEncode datasetstext → vectorsRun task suiteretrieval, classification…Score each taskper-task metricAverage + rankleaderboard row

The task categories

The key thing to understand is that MTEB is not one test — it's a bundle of task types, each measuring a different ability. A model can be excellent at one and mediocre at another. These are the main families you'll see as columns on the leaderboard:

Task type	What it measures	Who cares most
Retrieval	Find the right passages for a query	RAG, search, Q&A
Reranking	Re-order candidate results by relevance	Two-stage search
Classification	Group text into labels via the vectors	Intent, sentiment, routing
Clustering	Bundle similar texts with no labels	Topic discovery, dedup
Semantic similarity (STS)	Rate how alike two sentences are	Paraphrase, matching
Summarization / pair tasks	Score relatedness of text pairs	Niche pipelines

The headline number on the leaderboard is an average across all these tasks. That average is what sorts the table — and it's exactly what can mislead you. A model that's average everywhere can outrank a model that's outstanding at retrieval but weak at clustering. If you only do retrieval, the second model is the better pick even though it sits lower overall.

How to read the leaderboard

The leaderboard is a big sortable table. Reading it well is mostly about ignoring the headline number and looking at the columns next to it. Here is a practical order of operations.

Pick the right board first. MTEB has language- and domain-specific splits (English, multilingual/MMTEB, code, legal, and more). Choose the one that matches your data before you read any scores. A great English model can be useless on your German support tickets.
Sort by your task, not the average. Re-sort by the column that matches your use case (Retrieval for RAG, Classification for routing, and so on). This is the single most important habit.
Read score next to size and dimensions. A tiny gain in score for a model that's 5× larger and produces 4× bigger vectors is usually a bad trade. Bigger embedding dimensions cost more storage and slower search at every query, forever.
Check the license and access. Some top entries are open-weight and self-hostable; others are paid API-only. That single column can rule a model out regardless of its score.
Treat small gaps as noise. A fraction-of-a-point difference between two rows is rarely meaningful. Within the top cluster, model size, latency, cost, and language fit should decide — not the third decimal place.

// What the headline rank hides

Headline average says

One number per model
Higher = better, full stop
Top of the list wins

What actually matters

Score on YOUR task type
Language + domain coverage
Size, dimensions, latency, cost
Open vs API-only license
Gap size — is it even real?

Common traps and pitfalls

A public leaderboard creates incentives, and incentives create distortions. Knowing the traps keeps you from over-trusting a number.

Benchmark overfitting. When everyone optimizes for one scoreboard, models start being tuned to do well on MTEB's specific datasets rather than to embed text well in general. A high score can partly reflect practice on the test, not real-world skill.
Contamination. If a model was trained on text that overlaps with MTEB's evaluation datasets, it gets an unfair boost — it has effectively seen the answers. This is hard to detect from the outside and inflates scores quietly.
Average-chasing. Climbing the overall ranking rewards being decent at everything. That can mean a model is tuned to never be bad at any task, rather than to be excellent at the one task you actually need.
Stale assumptions. Rankings churn constantly as new models land. A blog post that named "the best embedding model" months ago is probably out of date — always check the live board.
Domain mismatch. MTEB's datasets are mostly general-purpose. If your text is medical notes, code, or legal contracts, general scores may not transfer. The board is a filter, not a guarantee.

From leaderboard to a real decision

Use MTEB to narrow, then test to decide. The leaderboard gives you a shortlist of three or four candidates that fit your task, language, size budget, and license. The actual winner is whichever one performs best on your own data — which only your own small evaluation can tell you.

// The right way to use MTEB

Filter the boardtask + language + sizeShortlist 3–4candidates that fitTest on your datayour own eval setPick the winnerfor YOUR use case

Your "own eval" doesn't have to be fancy. A few dozen real queries from your domain, each paired with the document that should be retrieved, is enough to rank your shortlist meaningfully. Embed the documents with each candidate, run your real queries, and measure how often the correct document comes back near the top.

compare_shortlist.pypython

# A tiny, framework-free eval: does the right doc come back on top?
# Run this for each model on your SHORTLIST, not all of MTEB.
import numpy as np

# (query, index of the doc that SHOULD rank first)
gold = [("how do refunds work?", 0), ("what are your support hours?", 2)]
docs = [
    "Refunds on physical items are accepted within 30 days.",
    "Digital goods are non-refundable once downloaded.",
    "Support is open 9am to 6pm Eastern, Monday to Friday.",
]

def hit_rate(embed):
    """embed(list[str]) -> np.ndarray of L2-normalized row vectors."""
    doc_vecs = embed(docs)
    hits = 0
    for query, want in gold:
        q = embed([query])[0]
        scores = doc_vecs @ q          # cosine sim (vectors normalized)
        if int(np.argmax(scores)) == want:
            hits += 1
    return hits / len(gold)

# Compare candidates on YOUR data, then pick the best — not the
# highest MTEB row.
# print("model A:", hit_rate(embed_with_model_a))
# print("model B:", hit_rate(embed_with_model_b))

Going deeper

Once the basics click, a few finer points separate a careful read of the leaderboard from a naive one.

MMTEB and language splits. The original MTEB was English-heavy. The community-driven MMTEB extension broadened it to a large set of languages and many more tasks, and the leaderboard now lets you view results per language or language group. If you serve non-English users, the global English ranking is the wrong board to read — switch to the split that matches your audience before drawing any conclusion.

Dimensions, truncation, and Matryoshka. Some models support Matryoshka embeddings, where you can truncate the vector to a shorter length and keep most of the quality. That changes the leaderboard math: a model might score near the top at full size and still be very good at a quarter of the dimensions, which can be the better production choice once you factor in storage and search speed. Always read the score in the context of the embedding dimension it was produced at.

Prompts and asymmetry. Many modern embedding models behave differently when you tell them whether a piece of text is a query or a document, often via a short instruction prefix. MTEB accounts for this, but it means a model's real-world quality depends on you using it the way it was evaluated. Skip the recommended query/document prompting and you may not reproduce the leaderboard's results at all.

The honest limit. A benchmark can only measure what it chose to measure, on the data it happened to include. MTEB is the best shared yardstick the field has, and that's genuinely valuable — but a yardstick is not a fitting room. The durable habit is simple: use the leaderboard to shortlist, then trust a small evaluation on your own data to decide. The rank tells you which models are worth testing; your data tells you which one to ship.

FAQ

What does MTEB stand for?

MTEB stands for Massive Text Embedding Benchmark. It's an open benchmark and public leaderboard that scores text embedding models across many task types — retrieval, classification, clustering, reranking, semantic similarity, and more — so they can be compared fairly. Its multilingual extension is called MMTEB.

Should I just use the #1 model on the MTEB leaderboard?

Usually not blindly. The headline rank is an average across many tasks and domains, so it doesn't tell you how a model performs on your task or your data. Use the leaderboard to build a shortlist — sorted by the task type you actually need — then test those few candidates on your own examples to pick the winner.

How do I read the MTEB leaderboard for retrieval or RAG?

Pick the board that matches your language, then re-sort the table by the Retrieval column instead of the overall average. Look at the score alongside the model's size, embedding dimensions, and license, and treat tiny score gaps as noise. That gives you a much better RAG shortlist than the headline ranking.

Why do MTEB rankings keep changing?

New embedding models are released constantly and submitted to the public leaderboard, so the order churns. That's why you should always check the live board rather than trusting a months-old blog post that named a single "best" model — that claim is probably already stale.

Can MTEB scores be gamed or inflated?

Yes, which is why you shouldn't over-trust them. Two known issues are overfitting (tuning a model specifically to do well on MTEB's datasets rather than to embed text well in general) and contamination (a model having been trained on text that overlaps with the evaluation data). Both can inflate a score without reflecting real-world quality.

Does a high MTEB score mean a model is fast or cheap?

No. MTEB measures embedding quality on its tasks — not inference cost, latency, rate limits, maximum input length, or vector size. Those production factors often matter more than a point or two of benchmark score, so weigh them separately when you choose a model.

// In plain English

// Why it matters

// How it works

The task categories

// How to read the leaderboard

// Common traps and pitfalls

// From leaderboard to a real decision

// Going deeper

// FAQ

// Further reading

// Related