AI/TLDR

What Is GPQA? Google-Proof Science Q&A

You will understand what makes GPQA 'Google-proof' and why its Diamond subset became a key reasoning benchmark.

INTERMEDIATE | 10 MIN READ | UPDATED 2026-06-14

In plain English

GPQA stands for Graduate-Level Google-Proof Q&A. It is a benchmark — a fixed exam — made of a few hundred multiple-choice science questions in biology, physics, and chemistry. What makes it unusual is the design goal baked into the name: the questions are written so that you can't just look up the answer. A smart person with a search engine and plenty of time still mostly fails. That is the whole point.


Think of the difference between a pub-quiz question and a take-home final from a PhD program. "What is the chemical symbol for gold?" is a pub-quiz question — type it into a search box and you have the answer in five seconds. A GPQA question is the take-home final: it describes a specific experimental setup or a subtle theoretical situation, and the only way to the right answer is to actually understand the field and reason through it. Searching the web just buries you in pages that don't quite match.

Each question is written by a domain expert (someone with or pursuing a PhD in that field) and then validated by other experts to confirm it is both correct and genuinely hard. So GPQA is not testing whether a model memorized a textbook fact. It is testing whether the model can do expert-level scientific reasoning on a problem it has very likely never seen before.

Why it matters

By the time GPQA appeared, the older knowledge benchmarks were running out of road. MMLU, the classic 57-subject exam, had become saturated: the best models scored so high that the test could no longer tell them apart. When everyone gets an A, the exam stops being useful for ranking. The field needed harder questions that left real headroom at the top.

GPQA was built to be that harder exam, and it solves two specific problems at once.

  • It restores difficulty. Graduate-level questions in narrow scientific subfields are hard enough that even strong models had clear room to improve when GPQA launched. That gives the benchmark discrimination — the ability to separate a good model from a great one instead of bunching them all at the ceiling.
  • It resists contamination. A huge problem with public benchmarks is contamination: if a question and its answer were in the model's training data, the model can recite the answer without reasoning at all. Because GPQA questions are obscure, expert-authored, and not the kind of thing that sits indexed in a hundred study guides online, they are harder to have accidentally memorized.

Why does "Google-proof" matter for measuring an AI specifically? A large language model trained on much of the public web is, in a sense, the ultimate search-and-recall machine. If a question can be answered by retrieving a fact, a good model will often just retrieve it — and you've measured memory, not intelligence. By forcing questions that can't be looked up, GPQA tries to isolate the thing people actually care about: genuine reasoning. That's why it became a headline benchmark for the new wave of reasoning models and shows up on most frontier model cards today.

How it works

GPQA is a multiple-choice benchmark, which makes the mechanics simple. Each item is a hard science question with four answer options, exactly one correct. The model reads the question, picks an option, and a scorer checks it against the answer key. Run that over the whole set and average the results into a single percentage — the score you see quoted.
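The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the official GPQA harness: the function name `accuracy` and the toy answer lists are hypothetical, and real harnesses also have to extract the chosen letter from the model's free-text output.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of items where the predicted letter matches the answer key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("cannot score an empty question set")
    # Compare letters case-insensitively and ignore stray whitespace.
    correct = sum(
        p.strip().upper() == k.strip().upper()
        for p, k in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical toy data: four items, the model gets three right.
key = ["A", "C", "B", "D"]
preds = ["A", "C", "D", "D"]
score = accuracy(preds, key)
print(f"{score:.1%}")  # 75.0%
```

The single percentage on a model card is this number computed over the whole subset, sometimes averaged over several runs.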

The cleverness isn't in the scoring; it's in how the questions are created and filtered. GPQA uses a multi-stage human pipeline designed so that every surviving question is correct, answerable by experts, and resistant to non-experts.

The pipeline runs in three stages. First, a domain expert writes a question and its answer. Second, independent experts attempt it; if they can't agree on the right answer, the question is too ambiguous and gets revised or dropped. Separately, skilled non-experts (capable people from outside that subfield) attempt the same question with full internet access. The questions where experts succeed but web-armed non-experts fail are the ones that earn the "Google-proof" label. That expert-minus-non-expert gap is the benchmark's core measurement of difficulty.
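To make the expert-minus-non-expert gap concrete, here is a small sketch that computes it from validation records. The `ValidationRecord` type, its field names, and the toy data are all hypothetical; the real GPQA release uses its own schema, and this only illustrates the arithmetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationRecord:
    """Hypothetical record of who answered one question correctly."""
    question_id: str
    expert_correct: List[bool]      # one flag per expert validator
    nonexpert_correct: List[bool]   # one flag per web-armed non-expert


def mean(flags: List[bool]) -> float:
    """Share of True values in a non-empty list of flags."""
    return sum(flags) / len(flags)


def expertise_gap(records: List[ValidationRecord]) -> float:
    """Expert accuracy minus non-expert accuracy, pooled over all attempts."""
    expert = [f for r in records for f in r.expert_correct]
    nonexpert = [f for r in records for f in r.nonexpert_correct]
    return mean(expert) - mean(nonexpert)


# Invented data: experts get 3 of 4 attempts right, non-experts 1 of 6.
records = [
    ValidationRecord("q1", [True, True], [False, False, True]),
    ValidationRecord("q2", [True, False], [False, False, False]),
]
gap = expertise_gap(records)
print(f"gap = {gap:.2f}")  # gap = 0.58
```

A large positive gap is what "Google-proof" means operationally: expertise helps a lot, and web access alone does not.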

The Diamond subset

GPQA ships in a few sizes, and the one you'll almost always see quoted is GPQA Diamond. Diamond is the highest-quality slice: the 198 questions that both validating experts answered correctly and that most of the web-armed non-experts got wrong. In other words, it keeps only the items with the strongest evidence that they are correct and genuinely Google-proof, and throws out anything ambiguous or accidentally easy. It is smaller than the 448-question main set but much cleaner, which is why model cards lead with "GPQA Diamond": it is the sharpest version of the test.
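The Diamond selection rule can be read as a simple filter. The sketch below assumes a hypothetical `ValidatedQuestion` record and takes "non-experts got it wrong" to mean a strict majority of non-experts; both are assumptions for illustration, not the dataset's actual tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidatedQuestion:
    """Hypothetical validation outcome for one candidate question."""
    question_id: str
    expert_correct: List[bool]
    nonexpert_correct: List[bool]


def is_diamond_like(q: ValidatedQuestion) -> bool:
    """Keep a question only if every expert was right and most non-experts were wrong."""
    experts_all_right = len(q.expert_correct) > 0 and all(q.expert_correct)
    nonexperts_wrong = sum(not c for c in q.nonexpert_correct)
    # Strict majority of non-experts answered incorrectly (assumed threshold).
    most_nonexperts_wrong = nonexperts_wrong * 2 > len(q.nonexpert_correct)
    return experts_all_right and most_nonexperts_wrong


pool = [
    ValidatedQuestion("q1", [True, True], [False, False, True]),    # kept
    ValidatedQuestion("q2", [True, False], [False, False, False]),  # experts disagree
    ValidatedQuestion("q3", [True, True], [True, True, False]),     # too easy to look up
]
diamond_like = [q.question_id for q in pool if is_diamond_like(q)]
print(diamond_like)  # ['q1']
```

The filter is strict in both directions: a question is dropped if experts disagree on it, and also if outsiders with a search engine can get it right.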

What a GPQA-style question looks like

The GPQA authors ask that real questions not be republished in plain text online, to limit contamination, so here is an illustrative example in the same spirit: multiple choice, narrow, and answerable only if you understand the underlying chemistry. It is invented for explanation, not a real GPQA item.

Illustrative GPQA-style item (text)
Q: A weak monoprotic acid is titrated with a strong base.
   At the half-equivalence point, which statement is true?

   A) The solution pH equals the pKa of the acid.
   B) All of the acid has been neutralized.
   C) The solution contains only the conjugate base.
   D) The pH equals 7 regardless of the acid.

Answer: A

This item is far easier than a real GPQA question, but it shows the shape of one. Searching the web gives you a pile of general chemistry pages about titration, and you still have to pick among four plausible-sounding options. To do that you need to know that at the half-equivalence point the acid and its conjugate base are present in equal concentrations, so the Henderson-Hasselbalch equation gives pH = pKa + log(1) = pKa. That is the Google-proof property in miniature: the facts are out there, but assembling them into the right answer requires understanding, not retrieval.

GPQA vs other knowledge benchmarks

GPQA is easiest to understand by contrast with the benchmarks it sits next to on a model card. They look similar — all multiple choice — but they measure different things and have aged differently.

| Benchmark | What it tests | Difficulty | Status today |
|---|---|---|---|
| MMLU | Broad knowledge across 57 subjects | Moderate, lookup-friendly | Largely saturated |
| MMLU-Pro | Harder, cleaner MMLU with more options | Higher than MMLU | Still useful headroom |
| GPQA / Diamond | Graduate-level science reasoning | Very high, Google-proof | Strong but approaching saturation |
| ARC-AGI | Abstract pattern reasoning, no knowledge | Very high, different axis | Far from solved |

The key distinction is lookup vs reasoning. MMLU is full of questions a determined person could research; that's part of why models climbed it so fast and why it saturated. GPQA deliberately removes the lookup path, so a high GPQA score is a stronger signal that the model is reasoning rather than recalling. A model can be near the ceiling on MMLU and still have meaningful room to grow on GPQA Diamond — which is exactly the headroom the harder benchmark was built to provide.

GPQA is also narrower than it sometimes sounds. It covers science — biology, physics, chemistry — not coding, not math competitions, not commonsense. A great GPQA score tells you a model reasons well over hard science problems. It does not tell you how the model will do at fixing a bug (that's SWE-bench) or at abstract puzzles with no domain knowledge at all. As always, weight a benchmark by how close it sits to your actual job.

Going deeper

Once you can read a GPQA number on a model card, a few subtleties separate a careful reading from a naive one.

It is approaching saturation too

GPQA was hard enough to leave plenty of headroom when it launched, but the strongest reasoning models now score very high on Diamond. That is the saturation treadmill every public benchmark rides: a test that cleanly separated models a couple of years ago starts to bunch them near the top, and tiny leads become noise rather than signal. GPQA isn't dead — it's still a meaningful filter — but the days when a few points of difference clearly meant a better model are fading, and the field is already building harder successors.
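A rough way to see why small leads near the ceiling are noise is to estimate the sampling error of an accuracy score on a small question set. The sketch below uses the binomial standard error and treats the two models as independent samples, which is a simplification (both models answer the same questions, so a paired test would be more precise). The set size of 200 and the scores are hypothetical round numbers.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def difference_is_clear(acc_a: float, acc_b: float, n_questions: int,
                        z: float = 1.96) -> bool:
    """True if the gap exceeds roughly a 95% margin under the independence assumption."""
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n_questions) ** 2
        + accuracy_standard_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


n = 200  # hypothetical subset size, in the range of a few hundred questions
print(difference_is_clear(0.90, 0.92, n))  # False: a 2-point lead is within noise
print(difference_is_clear(0.60, 0.80, n))  # True: a 20-point lead is not
```

With only a couple of hundred questions, each item is worth about half a percentage point, so a one or two point lead can come from a handful of answers.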

Google-proof is not contamination-proof

These are two different claims, and it's easy to conflate them. Google-proof means a non-expert can't search their way to the answer at test time. Contamination-proof would mean the questions could not have ended up in the model's training data. GPQA is strong on the first by design, and better than older benchmarks on the second because its questions are obscure, but no public benchmark is fully safe. Once questions are published and the web is scraped again, some can leak into future training runs. A score from a model trained after a benchmark went public should always be read with a little extra skepticism. To learn the mechanics of this, see benchmark contamination and benchmark overfitting.

The guessing floor and prompt setup

GPQA is four-choice multiple choice, so pure random guessing already scores around 25%. That means the interesting range starts well above 25 — a score near the floor signals the model is essentially guessing, not that the questions are impossible. And as with any benchmark, the test setup moves the number: whether the model is run zero-shot or with chain-of-thought, and how the prompt is phrased, can swing results by several points. Two GPQA scores are only comparable if both models ran the same subset with the same setup — see how to read a benchmark score.
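One way to keep the 25% floor in mind is to rescale a raw score so that random guessing maps to 0 and a perfect score maps to 1. This chance-corrected score is a common generic rescaling, not something GPQA itself reports, and the function name is hypothetical.

```python
def chance_corrected(accuracy: float, n_options: int = 4) -> float:
    """Rescale accuracy so random guessing scores 0.0 and perfect scores 1.0."""
    if n_options < 2:
        raise ValueError("need at least two answer options")
    floor = 1.0 / n_options  # expected accuracy of uniform random guessing
    return (accuracy - floor) / (1.0 - floor)


for raw in (0.25, 0.40, 0.70, 1.00):
    print(f"raw {raw:.0%} -> chance-corrected {chance_corrected(raw):.2f}")
```

On this scale a raw 40% is only a fifth of the way from guessing to perfect, which is a useful correction to the instinct that 40% means "knows almost half the material".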

Where to go next

GPQA is one data point, not a verdict. The mature habit is to read a spread of benchmarks — a knowledge-and-reasoning test like GPQA, a coding test, an abstract-reasoning test, and a human-preference leaderboard — and notice where a model is strong and weak. As Google-proof science questions saturate, the frontier is moving toward harder expert exams, private held-out test sets that can't be memorized, and agentic benchmarks that score whole tasks rather than single answers. GPQA's durable lesson is the one worth keeping: the most informative benchmark is the one a model can't shortcut.

FAQ

What is the GPQA benchmark?

GPQA (Graduate-Level Google-Proof Q&A) is a multiple-choice benchmark of a few hundred hard biology, physics, and chemistry questions written and validated by domain experts. It is designed so that even skilled non-experts with full web access can't simply look up the answers, which makes it a test of real scientific reasoning rather than memory.

What does "Google-proof" mean in GPQA?

It means the questions are written so that searching the web doesn't hand you the answer. The benchmark's authors measured this: experts in the relevant field reached about 65% accuracy, while skilled non-experts given unrestricted internet access reached only about 34%, not far above the 25% guessing floor. That gap is the evidence the answers require understanding, not retrieval.

What is GPQA Diamond?

GPQA Diamond is the highest-quality subset of the benchmark: the questions that both validating experts answered correctly and that most of the web-armed non-experts got wrong. It is smaller but cleaner than the full set, with the strongest evidence each item is correct and genuinely Google-proof. It's the subset most often quoted on model cards, so check that two models are compared on the same subset.

What subjects does GPQA cover?

GPQA focuses on graduate-level natural science: biology, physics, and chemistry, often in narrow subfields. It does not test coding, competition math, or commonsense reasoning, so a strong GPQA score tells you a model reasons well over hard science — not how it will do on those other tasks.

Is GPQA still a good benchmark?

It is still useful, but the strongest reasoning models now score very high on GPQA Diamond, so it is approaching saturation. That means small differences between top models are increasingly noise rather than signal. It remains a meaningful filter, but it's best read alongside other benchmarks rather than on its own.

How is GPQA different from MMLU?

MMLU is a broad 57-subject knowledge exam with many lookup-friendly questions, and it has largely saturated. GPQA is narrower (science only) and deliberately Google-proof, so a high GPQA score is a stronger signal of genuine reasoning. A model can sit near the ceiling on MMLU while still having room to improve on GPQA Diamond.

Further reading