AI/TLDR

What Is MMLU? The Classic Knowledge Benchmark

You will understand what MMLU measures, why it shaped benchmark culture, and why top models have largely saturated it.

INTERMEDIATE · 10 MIN READ · UPDATED 2026-06-14

In plain English

MMLU stands for Massive Multitask Language Understanding. It is a giant multiple-choice quiz for AI models. The quiz covers 57 different subjects — everything from elementary mathematics and US history to college-level physics, law, medicine, and moral philosophy. Each question gives the model four answer choices (A, B, C, D), and the model has to pick the right one. Score the model on thousands of these questions and you get a single accuracy number: the share it got correct.

[Figure: MMLU illustration (source: borealisai.com)]

Think of it like the standardized exam a student sits to prove they have broad general knowledge — an SAT or a bar exam, but spanning dozens of fields at once. You can't cram one subject and pass; you need a little of everything. That breadth was the whole point. MMLU was built to measure how much a language model actually knows across the map of human knowledge, not just whether it can finish a sentence or answer trivia about one narrow topic.

Why it matters

Before MMLU, most language-model tests were narrow. A model might be scored on sentiment classification, or on filling in a missing word, or on one type of reading comprehension. Those tests were useful but small — a model could look strong on a narrow task while being shallow overall. MMLU changed the culture by asking one blunt question across 57 fields at once: how much does this thing really know?

That made it the canonical knowledge benchmark for years. When a new model launched, its MMLU score was almost always in the headline numbers, right next to its name. Press releases, model cards, and leaderboards all quoted it. For a long stretch, "what's its MMLU?" was shorthand for "how smart is it?" It became the common yardstick everyone could point at to compare one model against another.

Why a builder should still care

  • It's a shared baseline. Almost every notable model from the last several years reports an MMLU score, so it's one of the few numbers you can use to roughly line up old and new models on the same axis.
  • It shaped how benchmarks are reported. The conventions MMLU popularized — broad multi-subject coverage, multiple-choice scoring, few-shot prompting — set the template that later benchmarks copied or reacted against. Understanding MMLU is how you understand the ones that came after it.
  • It teaches the saturation lesson. MMLU's rise and fall is the clearest case study in why a benchmark stops being useful once everyone aces it. That pattern repeats with every benchmark, so learning it here pays off everywhere.

How it works

Mechanically, MMLU is simple, and that simplicity is part of why it caught on. There's no human grader, no fuzzy judgment call, no second model scoring the answer. Every question has exactly one correct letter, so grading is just string matching: did the model pick the right option, yes or no?

The shape of a question

Each item is a question stem plus four labelled choices. The model is asked to output the letter of the best answer. Here is the format (this is an illustrative example, not a real MMLU item):

One MMLU-style multiple-choice item (text):
The following are multiple choice questions (with answers)
about college biology.

Which molecule carries amino acids to the ribosome during
translation?
A. mRNA
B. tRNA
C. rRNA
D. DNA polymerase
Answer:

The model is expected to produce a single letter — here, B. The grader compares that letter to the known answer key. Run this across the whole dataset and the model's MMLU score is simply the percentage of items it answered correctly, often reported per-subject and as one overall average.
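The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular evaluation harness: the function name `score_mmlu`, the tuple layout, and the four sample records are all made up for the example. It also shows two ways an "overall average" can be computed (pooled over questions, or averaged over subjects), since reports do not always say which one they use.

```python
from collections import defaultdict

def score_mmlu(records):
    """Score (subject, predicted_letter, gold_letter) tuples.

    Returns per-subject accuracy, the question-weighted (micro) average,
    and the subject-weighted (macro) average.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Grading is plain string matching on the answer letter.
        if predicted.strip().upper() == gold:
            correct[subject] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / sum(total.values())
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro

# Hypothetical model outputs, invented for illustration only.
records = [
    ("college_biology", "B", "B"),
    ("college_biology", "A", "C"),
    ("college_biology", "d", "D"),
    ("high_school_us_history", "C", "C"),
]
per_subject, micro, macro = score_mmlu(records)
print(per_subject)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The two averages differ whenever subjects have different numbers of questions, which is one more detail to check before comparing two published scores.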

Few-shot prompting and the random-guess floor

MMLU is traditionally run few-shot: before the real question, the prompt includes a handful of solved example questions from the same subject (the classic setup uses five examples, called 5-shot). This shows the model the expected answer format so it doesn't ramble instead of emitting a clean letter. Because there are four choices, a model that knows nothing and guesses at random scores about 25% — that's the floor. Useful knowledge is anything meaningfully above that baseline.
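To make the few-shot layout concrete, here is a rough sketch of how such a prompt could be assembled. The helper names `format_item` and `build_prompt` are hypothetical, the solved examples are invented rather than taken from MMLU, and only two shots are used to keep the output short (the classic setup would pass five).

```python
LETTERS = "ABCD"

def format_item(question, choices, answer=None):
    """Render one question; leave 'Answer:' open when no answer is given."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)

def build_prompt(subject, shots, question, choices):
    """Header, then solved examples, then the unanswered target question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_item(q, c, a) for q, c, a in shots]
    blocks.append(format_item(question, choices))
    return header + "\n\n" + "\n\n".join(blocks)

# Invented solved examples (not real MMLU items).
shots = [
    ("Which organelle is the main site of ATP production in eukaryotic cells?",
     ["Ribosome", "Mitochondrion", "Golgi apparatus", "Lysosome"], "B"),
    ("Which base pairs with adenine in DNA?",
     ["Cytosine", "Guanine", "Thymine", "Uracil"], "C"),
]
prompt = build_prompt(
    "college biology",
    shots,
    "Which molecule carries amino acids to the ribosome during translation?",
    ["mRNA", "tRNA", "rRNA", "DNA polymerase"],
)
print(prompt)
```

Because every solved example ends with "Answer: <letter>", the model sees the pattern several times before reaching the open "Answer:" at the end, which nudges it to reply with a single letter.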

The 57 subjects are grouped into four broad areas: STEM, humanities, social sciences, and a catch-all other (business, health, and more). Reporting per-group accuracy is common because a model can be strong in one area and weak in another. The dataset also ships two small companion splits: a dev split of five questions per subject, which supplies the few-shot examples, and a validation split for tuning. The headline number almost always refers to the test split, averaged across all 57 subjects.

Why MMLU saturated

A benchmark is saturated when the best models score so high, and so close to each other, that it can no longer tell them apart. Around the GPT-4 era (roughly 2023 onward), frontier models pushed their MMLU scores up near the practical ceiling. Once several models all sit within a couple of points of one another, near the top, the benchmark stops doing its job: ranking them. A test where everyone gets an A+ doesn't measure who's best — it measures that the test got easy.

Several forces pushed MMLU toward saturation. Knowing them helps you read any benchmark critically.

  • Sheer capability. Models simply got much better at broad factual recall, which is exactly what MMLU rewards. The test didn't get easier; the models got stronger.
  • A noisy ceiling. MMLU contains some questions with debatable, mislabelled, or ambiguous answers. Once models are near the top, those flawed items cap the achievable score, so the difference between a 'good' and 'great' model gets lost in the noise.
  • Contamination risk. MMLU is public and widely copied across the web, so its questions can leak into training data. A model that has effectively seen the answers during training scores higher without being smarter — see benchmark contamination.
  • Overfitting to the format. When one number matters to marketing, there's pressure to optimize for it specifically. Tuning a model to do well on MMLU-style multiple choice is benchmark overfitting, and it inflates the score faster than real-world ability.
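The "noisy ceiling" point can be put into numbers with a toy model. Everything here is an assumption for illustration: the 5% label-error rate is made up, not a measured property of MMLU, and the function `observed_score` assumes that a model which misses a mislabelled item picks uniformly among the other options, one of which is the (wrong) answer key.

```python
def observed_score(true_accuracy, label_error_rate, num_choices=4):
    """Expected measured accuracy when some answer keys are wrong.

    On correctly labelled items the model earns its true accuracy.
    On mislabelled items a truly correct answer is marked wrong, and a
    truly wrong answer matches the bad key by chance 1/(num_choices - 1)
    of the time.
    """
    clean = (1.0 - label_error_rate) * true_accuracy
    lucky = label_error_rate * (1.0 - true_accuracy) / (num_choices - 1)
    return clean + lucky

# Hypothetical 5% of answer keys wrong.
perfect = observed_score(1.0, 0.05)   # a flawless model is capped at 0.95
strong = observed_score(0.97, 0.05)
print(f"perfect model: {perfect:.3f}, strong model: {strong:.3f}")
```

Under these assumptions even a model that never errs cannot score above 95%, so the visible gap between a strong model and a perfect one is squeezed into a few points.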

MMLU vs MMLU-Pro and successors

To restore headroom, researchers built MMLU-Pro, a harder redesign of the original. It keeps MMLU's spirit — broad, multiple-choice, multi-subject — but raises the difficulty in a few deliberate ways.

Two changes do most of the work. First, more answer options (up to ten instead of four) lower the random-guess floor and reduce the chance of stumbling onto the right letter. Second, the questions lean more on multi-step reasoning than pure recall, which is harder to ace just by memorizing facts. The result is a benchmark with room for models to improve again — at least until it, too, saturates.
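The effect of extra options on the random-guess floor is simple arithmetic, sketched below. The helper `guess_floor` is a hypothetical name, and the option counts are illustrative, not taken from the actual datasets; the mixed case reflects the "up to ten" wording, where some questions may have fewer options.

```python
from fractions import Fraction

def guess_floor(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts holds the number of answer options for each question.
    """
    return sum(Fraction(1, k) for k in option_counts) / len(option_counts)

four_way = guess_floor([4] * 100)       # every question has 4 options
ten_way = guess_floor([10] * 100)       # every question has 10 options
mixed = guess_floor([10, 10, 10, 5])    # three 10-option items, one 5-option

print(float(four_way), float(ten_way), float(mixed))
```

Moving from four to ten options drops the floor from 25% to 10%, so a given score above chance carries more evidence of real knowledge.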

MMLU-Pro is one of several reactions to saturation. The broader pattern is a treadmill: each benchmark becomes the standard, gets saturated, and is replaced by a harder one. Knowledge-focused successors include GPQA and Humanity's Last Exam; the field also moved toward coding benchmarks, agent benchmarks, and human-preference rankings like LMArena that are far harder to saturate.

How to read an MMLU score

If you see an MMLU number in a model announcement, read it with a few caveats in mind. The score is real, but it carries less signal than it used to.

  • Check the setup. A score depends on how the test was run: 5-shot vs 0-shot, with or without chain-of-thought, plain MMLU vs MMLU-Pro. Two numbers are only comparable if the setup matches. A 0-shot score and a 5-shot score for the same model can differ noticeably.
  • Tiny gaps near the top are noise. Once models are saturated, a one- or two-point difference can come from prompt formatting or a handful of ambiguous questions, not real capability. Don't choose one model over another on the strength of a near-identical MMLU score.
  • It measures recall, not usefulness. MMLU rewards broad factual knowledge in a multiple-choice format. It says little about whether a model writes clean code, follows your instructions, uses tools, or stays grounded in a RAG pipeline.
  • Cross-check with other tests. Treat MMLU as one data point among many. Pair it with reasoning, coding, and agentic benchmarks, plus your own task-specific evals, before trusting a model for real work.

Going deeper

Once the basics click, a few finer points separate a casual reader of leaderboards from someone who can interpret them well.

The format gives free hints. Multiple choice constrains the answer to four (or ten) options the model can see. That's easier than open-ended generation, where the model must produce the right answer with no menu to choose from. A strong MMLU score doesn't prove a model can produce that knowledge unprompted — only that it can recognize the right option. This is one reason the field shifted toward open-ended, test-verified tasks like fixing real code or completing agent workflows.

Scoring isn't always one method. Some evaluations read the literal letter the model types; others compare the model's internal probability for each option and pick the most likely one. These two methods can yield different scores for the same model, which is another reason setups must match before numbers are compared. The widely used lm-evaluation-harness exists partly to standardize these choices so results are reproducible.
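A small sketch shows how the two scoring methods can disagree on the same item. The function names, the generated sentence, and the log-probability values are all invented for illustration; real harnesses obtain the log-probabilities from the model and may parse generated text differently.

```python
import re

def score_by_generation(generated_text):
    """Return the first standalone A-D letter in the model's text, or None."""
    match = re.search(r"\b([ABCD])\b", generated_text.strip())
    return match.group(1) if match else None

def score_by_logprob(option_logprobs):
    """Return the option letter with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)

# Hypothetical outputs for one question from the same model.
generated = "The answer is C because rRNA forms the ribosome core."
logprobs = {"A": -2.3, "B": -0.9, "C": -1.1, "D": -4.0}

print(score_by_generation(generated))  # letter read from the typed text
print(score_by_logprob(logprobs))      # letter with the highest log-probability
```

In this made-up case the typed answer is C while the most probable letter is B, so the same model would be graded differently depending on the method. The simple regular expression is also fragile (a sentence that begins with the word "A" would be read as option A), which is part of why standardized harnesses matter.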

Public benchmarks have a built-in shelf life. Any test posted openly online eventually leaks into training data and gets optimized against. This is the core tension of public vs private benchmarks: public ones are transparent and reproducible but contamination-prone; private, held-out ones resist gaming but you have to trust whoever holds them. MMLU's journey from canonical to saturated is this lifecycle playing out in full view.

Where to go next. To put MMLU in context, read what LLM benchmarks are and how to read a benchmark score. To understand the failure modes that retired MMLU, study benchmark contamination and benchmark overfitting. The durable lesson: every benchmark is a snapshot of one moment in a moving field. MMLU was the right test for its era, taught the whole industry how to compare models, and earned its retirement by being beaten — which is exactly what a good benchmark is supposed to provoke.

FAQ

What does MMLU stand for?

MMLU stands for Massive Multitask Language Understanding. It is a multiple-choice benchmark that tests a language model's knowledge across 57 subjects, from elementary math to professional-level law and medicine. It's also sometimes called the Hendrycks test after its lead author.

How many subjects and questions are in MMLU?

MMLU covers 57 subjects, grouped into four broad areas: STEM, humanities, social sciences, and an 'other' category (business, health, and more). Its test split contains roughly 14,000 multiple-choice questions, and a model's score is usually reported both per-subject and as one overall average accuracy.

Why is MMLU considered saturated?

Top models now score so high on MMLU — and so close to one another — that the benchmark can no longer separate the leaders. A test where every frontier model gets an A+ stops ranking them. Saturation set in around the GPT-4 era, driven by stronger models, ambiguous questions capping the top score, and possible data contamination.

What is the difference between MMLU and MMLU-Pro?

MMLU-Pro is a harder redesign built to restore headroom after the original saturated. It expands each question from four answer choices to up to ten (lowering the random-guess floor) and includes more reasoning-heavy questions instead of pure factual recall, so models can't ace it as easily.

Is a high MMLU score enough to choose a model?

No. MMLU measures broad multiple-choice knowledge recall, not whether a model writes good code, follows instructions, uses tools, or stays grounded in your data. Use it as one data point, check that the test setup matches, and always validate a model against your own task-specific evals before relying on it.

Why is MMLU run with few-shot examples?

MMLU is traditionally run 5-shot: five solved example questions are placed in the prompt before the real one. This shows the model the expected answer format so it cleanly outputs a single letter (A, B, C, or D) instead of rambling, which makes automatic grading reliable. With four choices, random guessing scores about 25%.

Further reading