What Are LLM Benchmarks? MMLU, GPQA Explained

In plain English

An LLM benchmark is a standardized test for an AI model. It's a fixed set of questions with known correct answers, run the same way on every model, so every model sits the exact same exam and the numbers can be compared. When a new model launches and the announcement says "88.7% on MMLU," that 88.7% came from a benchmark.

Think of it like the SAT, but for language models instead of high-schoolers. Every test-taker gets the same questions, the same time limit, and the same scoring rubric. The whole point of a standardized test is that a 1400 from one student means the same thing as a 1400 from another. Benchmarks do that for models: they turn "this one feels smarter" into a number two labs can actually argue about.

Different benchmarks test different skills. MMLU is a 57-subject general-knowledge exam (think trivia across history, law, medicine, math). GPQA is a small set of brutally hard graduate-science questions that you can't just Google your way through. SWE-bench checks whether a model can fix real bugs in real codebases. Each one is a separate exam measuring a separate ability, and a model can ace one while flunking another.

Why it matters

Before benchmarks, comparing two models was hopeless. Every lab demoed its model on questions it happened to do well on, and "ours is better" was pure marketing. There was no shared yardstick, so you couldn't tell a real improvement from a cherry-picked screenshot. Benchmarks replaced the screenshots with a common exam everyone has to sit.

That shared yardstick does three jobs. It lets researchers measure progress over time (is this year's model actually smarter, or just bigger?). It lets buyers compare options before committing — is the cheaper open model good enough for your task, or do you need the frontier one? And it gives the whole field a scoreboard, which is why every model launch leads with a wall of benchmark numbers.

Who should care

Anyone choosing a model — benchmarks are the first filter when you're deciding between providers or open models. They narrow the field before you run your own tests.
Developers reading model cards — every release ships a table of scores. Knowing what each benchmark measures tells you whether the number is relevant to your use case.
Teams doing local or open models — leaderboards are how you find a small model that punches above its size for your task.
Anyone following AI news — "new model tops the leaderboard" headlines are meaningless until you know what the leaderboard tests.

How it works

Under the hood every benchmark is the same pipeline. A fixed dataset of questions (with known answers) goes through the model one by one, a scorer checks each answer against the key, and the results roll up into a single percentage — the score you see quoted. The genius and the danger both live in the details of each step.

// How a benchmark produces a score

Fixed dataset (questions + answer key) → Run the model (one question at a time) → Score answers (match the key) → Average (one % per model)
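The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not any real benchmark harness: `DATASET`, `toy_model`, and `run_benchmark` are hypothetical names, and the "model" is a stand-in that always answers B.

```python
from typing import Callable

# Hypothetical four-question dataset: each item is a question plus its answer key.
DATASET = [
    {"question": "2 + 2 = ?  A) 3  B) 4  C) 5  D) 6", "answer": "B"},
    {"question": "Capital of France?  A) Rome  B) Oslo  C) Paris  D) Bern", "answer": "C"},
    {"question": "H2O is?  A) Salt  B) Water  C) Gold  D) Air", "answer": "B"},
    {"question": "First letter of the alphabet?  A) A  B) B  C) C  D) D", "answer": "A"},
]


def toy_model(question: str) -> str:
    """Stand-in for a real model call; it always answers 'B'."""
    return "B"


def run_benchmark(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Run every question through the model and return the percent correct."""
    correct = 0
    for item in dataset:
        prediction = model(item["question"]).strip().upper()
        if prediction == item["answer"]:  # scorer: exact match against the key
            correct += 1
    return 100.0 * correct / len(dataset)


score = run_benchmark(toy_model, DATASET)
print(f"{score:.1f}%")  # prints 50.0%
```

Everything interesting about a real benchmark hides inside the two pieces this sketch fakes: what goes into the dataset, and how the scorer decides an answer is correct.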

Most classic benchmarks are multiple-choice, which makes scoring trivial: the model picks A, B, C, or D, and you check it against the key. MMLU and GPQA both work this way. The catch is that on a 4-choice question, random guessing already scores 25% — so a number near 25% means the model is basically guessing, and the interesting range is well above it.
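To make the guessing floor concrete, here is a small sketch that computes the chance level for a multiple-choice test and rescales a raw score so that random guessing maps to 0 and a perfect run maps to 100. The rescaling is an illustration only; published MMLU and GPQA numbers are raw accuracy, and the function names are hypothetical.

```python
def chance_level(num_options: int) -> float:
    """Expected percent correct from uniform random guessing."""
    return 100.0 / num_options


def chance_adjusted(score: float, num_options: int) -> float:
    """Rescale a raw percent so guessing maps to 0 and a perfect score to 100."""
    floor = chance_level(num_options)
    return 100.0 * (score - floor) / (100.0 - floor)


print(chance_level(4))           # 25.0
print(chance_adjusted(25.0, 4))  # 0.0, no better than guessing
print(chance_adjusted(62.5, 4))  # 50.0, halfway between guessing and perfect
```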

Other benchmarks score differently. Coding benchmarks like SWE-bench or HumanEval can't use multiple choice — they run the model's code against a test suite and mark it correct only if the tests pass. Open-ended benchmarks need a model-graded judge or human raters because there's no single answer key. The scoring method matters as much as the questions.
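Test-based scoring can be sketched roughly as follows. This is a toy version under strong assumptions: the function name `passes_tests` and the test-case format are hypothetical, and real harnesses run model output in an isolated sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(candidate_source: str,
                 func_name: str,
                 test_cases: list[tuple[tuple, object]]) -> bool:
    """Return True only if the candidate defines func_name and every test passes."""
    namespace: dict = {}
    try:
        # WARNING: exec on untrusted model output is unsafe outside a sandbox.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all count as failures.
        return False


TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

print(passes_tests(good, "add", TESTS))    # True
print(passes_tests(wrong, "add", TESTS))   # False
print(passes_tests(broken, "add", TESTS))  # False
```

Note the all-or-nothing rule: code that looks almost right but fails one test scores exactly the same as code that does not parse.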

One detail that quietly changes scores: how you prompt the model during the test. The same benchmark can be run zero-shot (just the question), few-shot (a handful of solved examples first), or with chain-of-thought (let the model reason step by step). These setups can swing a score by several points, which is why a fair comparison requires the same setup for every model.
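The three setups differ only in what text surrounds the question. The sketch below shows one plausible way to build each prompt; the template wording and the `build_prompt` helper are assumptions for illustration, and real evaluation harnesses each use their own formats.

```python
from typing import Optional


def build_prompt(question: str,
                 examples: Optional[list[tuple[str, str]]] = None,
                 chain_of_thought: bool = False) -> str:
    """Build a zero-shot, few-shot, or chain-of-thought prompt for one question."""
    parts = []
    # Few-shot: solved examples come first, in the same format as the real question.
    for example_question, example_answer in examples or []:
        parts.append(f"Question: {example_question}\nAnswer: {example_answer}")
    final = f"Question: {question}\nAnswer:"
    if chain_of_thought:
        # Chain-of-thought: nudge the model to reason before answering.
        final += " Let's think step by step."
    parts.append(final)
    return "\n\n".join(parts)


SOLVED = [("What is 1 + 1?", "2"), ("What is 3 + 3?", "6")]

zero_shot = build_prompt("What is 2 + 2?")
few_shot = build_prompt("What is 2 + 2?", examples=SOLVED)
cot = build_prompt("What is 2 + 2?", chain_of_thought=True)

print(few_shot)
```

The question is identical in all three prompts, which is why a score is only meaningful alongside the setup that produced it.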

// Two ways to score a benchmark

Multiple-choice

The model picks A / B / C / D
The pick is checked against the answer key
Instant, deterministic
Examples: MMLU, GPQA, ARC

Task-based

The model produces real output
Tests are run or a judge grades it
Slower, messier
Examples: SWE-bench, HumanEval

The benchmarks you'll actually see

There are hundreds of benchmarks, but a handful show up on nearly every model announcement. Here are the ones worth recognizing on sight, with what each measures, how it is scored, and where it stands today.

Benchmark	What it tests	Format	Notes
MMLU	Broad knowledge across 57 subjects	Multiple choice	The classic generalist exam; top models are near the ceiling
MMLU-Pro	Harder, cleaner MMLU with more options	Multiple choice	Made because plain MMLU got too easy
GPQA	Graduate-level science (bio, physics, chem)	Multiple choice	"Google-proof" — experts struggle, non-experts fail
SWE-bench	Fixing real GitHub issues in real repos	Run tests	The headline agentic-coding benchmark
HumanEval	Writing small Python functions from a spec	Run tests	Older, largely saturated coding test
MATH / AIME	Hard competition mathematics	Exact answer	Where reasoning models show their edge
HellaSwag	Commonsense "what happens next"	Multiple choice	Older; mostly saturated now

Two terms decode most of the table. Saturated means top models score so high (say 90%+) that the benchmark can no longer tell them apart — it's been "solved" and is no longer useful for ranking frontier models. That's exactly why the field keeps inventing harder tests: MMLU got saturated, so MMLU-Pro and GPQA appeared to create headroom again.

The other big shift is from knowledge tests to agentic tests. Old benchmarks asked "do you know the answer?" Newer ones like SWE-bench ask "can you do the task — read a codebase, run tools, iterate?" As models got good enough to act as agents, the benchmarks that matter most moved from quizzes toward real work.

How to read a model card score

Here's a realistic snippet of the kind of benchmark table you'll find in a model announcement or model card. Numbers are illustrative, not a real model:

model_card_excerpt.json

{
  "model": "example-model-v2",
  "benchmarks": {
    "MMLU":      { "score": 88.7, "shots": "5-shot" },
    "GPQA":      { "score": 59.4, "shots": "0-shot, CoT" },
    "SWE-bench": { "score": 49.0, "variant": "Verified" },
    "MATH":      { "score": 76.2, "shots": "0-shot, CoT" }
  }
}

Read it like this. MMLU 88.7% is strong, but everyone is bunched up near the top here, so it barely separates frontier models anymore. GPQA 59.4% is the more telling number: the test is hard, so a score around 60 is genuinely good, while anything near 25% sits at the random-guess floor for its four-option questions. SWE-bench 49% means the model fixed about half of a set of real bugs, which is a lot harder than it sounds. MATH 76% signals solid step-by-step reasoning.

Notice the fine print: 5-shot, 0-shot, CoT, Verified. That fine print is not decoration — it's the difference between a fair comparison and a misleading one. Two models are only comparable on a benchmark if they ran it the same way.

Check the test setup matches. A 0-shot score and a 5-shot score on the same benchmark aren't comparable. Neither is one model's CoT run against another's plain run.
Check the variant. "SWE-bench" alone is ambiguous — there's Full, Lite, and Verified, and they're different difficulties. Compare like with like.
Weight the benchmark by your use case. Building a coding agent? SWE-bench dwarfs MMLU in relevance. Building a research assistant? GPQA and MATH matter more than trivia.
Distrust self-reported numbers near the ceiling. A 0.5-point lead on a saturated benchmark is noise, not a reason to switch models.
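Two of these checks are mechanical enough to sketch in code: whether the setups match, and whether a small lead is bigger than sampling noise. The sketch below is a rough rule of thumb, not a rigorous significance test. It treats each question as an independent coin flip and the two runs as independent, and the question count of 1000 is a hypothetical figure rather than the size of any real benchmark.

```python
import math


def same_setup(a: dict, b: dict) -> bool:
    """Two entries are comparable only if shots and variant both match."""
    return a.get("shots") == b.get("shots") and a.get("variant") == b.get("variant")


def standard_error_points(score: float, num_questions: int) -> float:
    """Binomial standard error of a percent score, in percentage points."""
    p = score / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_questions)


def lead_is_meaningful(score_a: float, score_b: float, num_questions: int) -> bool:
    """Rule of thumb: the gap should exceed about two standard errors of the difference."""
    se_diff = math.hypot(
        standard_error_points(score_a, num_questions),
        standard_error_points(score_b, num_questions),
    )
    return abs(score_a - score_b) > 2.0 * se_diff


ours = {"score": 88.7, "shots": "5-shot"}
theirs = {"score": 88.2, "shots": "5-shot"}
mismatched = {"score": 86.0, "shots": "0-shot"}

print(same_setup(ours, theirs))              # True
print(same_setup(ours, mismatched))          # False
print(lead_is_meaningful(88.7, 88.2, 1000))  # False: a 0.5-point lead is within noise
print(lead_is_meaningful(59.4, 49.0, 1000))  # True: a 10-point gap is not
```

With about 1000 questions, a score near 89% carries roughly one point of standard error on its own, which is why half-point leads on saturated benchmarks carry so little information.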

When benchmarks lie

Benchmarks are useful, but they fail in predictable ways, and beginners get burned by trusting a single high number. The big failure modes:

Contamination

If a benchmark's questions and answers ended up in a model's training data, the model isn't reasoning — it memorized the answer key. Because benchmarks are public and the web gets scraped, this leaks constantly. A suspiciously high score on a hard benchmark, especially an older one, can mean contamination rather than ability. It's the single most important reason to never take one number at face value.
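One crude way to look for contamination, assuming you have access to a sample of the training text, is to check how many of a benchmark question's word n-grams appear verbatim in that text. The sketch below is a simplified heuristic with hypothetical helper names; it misses paraphrased or translated leaks, and the n-gram length of 8 is an arbitrary choice for illustration.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercase whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear verbatim in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


question = "Which organelle is known as the powerhouse of the cell in eukaryotes"
leaked_page = ("forum post: which organelle is known as the powerhouse "
               "of the cell in eukaryotes answer mitochondria")
clean_page = "the mitochondrion produces most of the chemical energy a cell needs"

print(overlap_fraction(question, leaked_page))  # 1.0
print(overlap_fraction(question, clean_page))   # 0.0
```

A high overlap is a red flag worth investigating rather than proof of memorization, and a low overlap does not rule contamination out.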

Teaching to the test

When a benchmark becomes the scoreboard everyone optimizes for, labs tune models to do well on it specifically — sometimes at the expense of real-world ability. A model can climb MMLU while feeling no smarter in your actual app. This is Goodhart's law: when a measure becomes a target, it stops being a good measure.

The benchmark doesn't match your job

MMLU is academic trivia. If your app summarizes support tickets in a specific tone, MMLU tells you almost nothing about whether the model will do that well. Benchmarks measure general capabilities; your task is specific. The gap between "good on the benchmark" and "good on my data" is exactly why you still need your own evals.

Going deeper

Once you're comfortable reading single scores, a deeper set of issues separates people who use benchmarks well from people who get fooled by them.

Pass@k and sampling

Coding benchmarks often report pass@k: the chance at least one of k sampled attempts passes the tests. pass@1 (one shot, must be right first try) is far harder than pass@10 (ten tries, any one counts). A model card boasting pass@10 looks better than its real-world pass@1 behavior, so always check which k is quoted. Sampling settings like temperature also nudge these numbers run to run.
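The gap between pass@1 and pass@10 is easy to see with the estimator commonly used for this metric: draw n samples per problem, count the c that pass, and compute the probability that a random subset of k samples contains at least one pass. The sketch below covers a single problem; the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed (requires k <= n).

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that all k
    samples drawn without replacement come from the failing ones.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 of them pass the tests.
print(round(pass_at_k(10, 3, 1), 3))   # 0.3
print(round(pass_at_k(10, 3, 5), 3))   # 0.917
print(round(pass_at_k(10, 3, 10), 3))  # 1.0
```

The same model on the same problem goes from 30% to 100% purely by changing k, which is why a quoted pass@k means little without the k.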

The saturation treadmill and private benchmarks

Public benchmarks have a short shelf life: they get saturated, contaminated, or gamed, then the field builds harder ones (MMLU → MMLU-Pro, and "frontier exam" style benchmarks designed to resist contamination). A growing answer is private, held-out test sets — the questions are never published, so models can't have memorized them, and re-running them on each new release gives a cleaner signal. The tradeoff is you have to trust whoever holds the secret set.

Evaluating agents, not just answers

Single-answer benchmarks can't capture an agent that plans, calls tools, and loops. SWE-bench was an early move toward scoring whole tasks, and the frontier now includes long-horizon, computer-use, and multi-step tool benchmarks. These are messier: the final result can be right while the path was wasteful, or wrong because of one bad early step. There are far fewer settled best practices here than for old multiple-choice exams.

Build a sense for the whole picture

The mature habit is to never trust one benchmark. Look at a spread — a knowledge test, a reasoning test, a coding test, and a human-preference leaderboard — and notice where a model is strong and weak. Then, before you commit, build a small private eval on your own data, because a model's rank on someone else's exam is only ever a prediction about how it'll do on yours. Benchmarks shortlist; your evals decide.

FAQ

What are LLM benchmarks in simple terms?

They're standardized tests for AI models — a fixed set of questions with known answers, run identically on every model so you can compare scores. When a launch says "88% on MMLU," that percentage came from a benchmark. They let you compare models on the same exam instead of trusting marketing.

What does MMLU measure?

MMLU (Massive Multitask Language Understanding) is a multiple-choice exam covering 57 subjects — history, law, medicine, math, and more. It tests broad general knowledge. Top models now score near the ceiling, so it's increasingly "saturated" and harder variants like MMLU-Pro were created to tell frontier models apart again.

What is the GPQA benchmark?

GPQA is a small set of graduate-level science questions in biology, physics, and chemistry, designed to be "Google-proof": skilled non-experts with full web access still score only around a third. Because it's so hard, a GPQA score in the 50–70% range is a much stronger signal of real reasoning than a near-ceiling MMLU score.

How do I read benchmark scores on a model card?

Read the fine print, not just the number. Check the test setup (0-shot vs 5-shot, chain-of-thought or not) and the variant (e.g. SWE-bench Verified vs Lite) match across models. Weight each benchmark by your use case, and ignore tiny leads on saturated benchmarks — they're noise.

Why shouldn't I trust benchmark scores completely?

Three reasons: contamination (the model may have memorized a public answer key), teaching to the test (labs optimize for the benchmark, not real use), and mismatch (a trivia benchmark says little about your specific task). Use benchmarks to shortlist, then run your own evals on your own data to decide.

What's the difference between a benchmark and an eval?

A benchmark is a public, standardized test that ranks raw models (MMLU, GPQA, SWE-bench). An eval usually means testing your own application on your own data and success criteria. Same basic loop — inputs, outputs, a score — at different scopes. See What Are LLM Evals?
