AI/TLDR

Public vs Private Benchmarks: Why Internal Evals Matter

You'll understand why a model topping public leaderboards may still fail your task, and why teams keep private, held-out benchmarks.

BEGINNER9 MIN READUPDATED 2026-06-13

In plain English

A public benchmark is a shared exam that every AI lab takes, with the questions and answers published for all to see. Think of MMLU, GPQA, or SWE-bench: a fixed set of problems, a public scoreboard, and a single number that lets you compare model A against model B. These are the scores you see in launch blog posts and on leaderboards.

Public vs Private Benchmarks — illustration
Public vs Private Benchmarks — shutterstock.com

A private benchmark is your own exam, written from your own task, and kept secret. It's the set of real questions your support bot must answer, the actual contracts your tool must summarize, the genuine bug reports your agent must fix. Nobody outside your team sees it, so no model was trained on it and no vendor can tune to it.

Here's the everyday analogy. A public benchmark is like a candidate's SAT score: standardized, comparable, useful for a first cut. A private benchmark is the take-home assignment you write for the exact job. A perfect SAT tells you someone is generally smart. It does not tell you they can do your job, with your tools, on your messy data. To learn that, you have to give them your own test — and you must keep it off the public internet, or the next candidate will simply memorize the answers.

Why it matters

The model topping the public leaderboard is not automatically the best model for your product. A builder who picks a model purely from public scores is optimizing the wrong number. Three gaps explain why.

  • Public benchmarks measure general ability, not your task. MMLU tests broad knowledge; HumanEval tests short, self-contained coding puzzles. Your product might be summarizing insurance policies in German, or routing customer tickets, or calling three internal tools in the right order. A high general score is weakly correlated with success on that — sometimes a smaller, cheaper model wins on your specific task while losing every public chart.
  • Public benchmarks leak. Once questions and answers are on the internet, they drift into training data. A model can then score high by having effectively seen the answers, not by reasoning — this is benchmark contamination. Your private set, by definition, cannot be contaminated, because it was never published.
  • Public benchmarks can be gamed. When a number drives marketing and funding, there's pressure to teach to the test — to tune a model so it shines on the known exam without getting genuinely better. A held-out private set is the standard defense: you can't optimize for questions you've never seen.

The payoff is simple but large: your private benchmark is the score that actually predicts production quality. It is the closest cheap proxy you have for "will users be happy." When you upgrade a model, change a prompt, or swap a retrieval step, the private number tells you — on your traffic — whether you got better or worse. Public scores can't do that, because they don't know what your users actually ask.

How it works

The two kinds of benchmark sit at different points in your decision. Public ones live outside your company and answer "which models are generally strong?" Private ones live inside and answer "which choice wins on our task, today?" The diagram below shows how a sensible team uses both in sequence.

What makes a benchmark public vs private

The difference isn't the format — both can be multiple-choice, both can be graded by code or by an LLM judge. The difference is who can see the questions and whether the score generalizes to you. A public benchmark is comparable across the whole world but generic and leak-prone. A private benchmark is specific and leak-proof but only meaningful to you.

How a private benchmark runs

Mechanically, a private benchmark is just an eval suite you keep to yourself. You collect real inputs from your domain, decide what a good output looks like for each, run every candidate model through them, and grade. The grading can be exact-match code, a metric, or a model-graded judge — see code vs model-graded evals. The score is the percentage that pass.

one row of a private benchmarktext
input:    "Customer: my order #4417 never arrived, I want a refund."
expected: should call lookup_order(4417), confirm not delivered,
          offer refund per policy, stay polite, no invented order data
grader:   model-graded — does the reply call the tool and follow policy?
result:   model A pass · model B fail (skipped the tool call)

Run that over 50–200 such rows and you get a single faithfulness number per model — on your work, not on someone's published quiz. That is the whole mechanism: same machinery as any eval, but the dataset is yours and stays hidden.

Building a small private benchmark

You do not need thousands of examples or a research budget. A focused set of 20 to 50 real cases already beats eyeballing a few demos, and it's enough to catch a regression before your users do. Start small and grow it as production throws new failures at you.

  1. Collect real inputs. Pull actual prompts, tickets, or documents from your domain — production logs, support transcripts, sample contracts. Synthetic data is fine to start, but real inputs expose the messy edge cases that matter.
  2. Cover the spread. Include easy cases, hard cases, weird formatting, and the failures that already bit you. Deliberately add a few tricky ones — the questions where models tend to slip.
  3. Define "good" per case. For each input, write down what a correct answer must contain or do. This is your golden dataset — the answer key.
  4. Pick a grader. Exact match for structured outputs, a metric for retrieval, or an LLM-as-judge for open-ended text. Keep the grader consistent across all models you compare.
  5. Hold it out. Store it privately, never paste it into a public tool or training run, and never tune your prompts against the whole set — keep a slice your changes never touch, so it stays an honest test.

For a step-by-step walkthrough of authoring and grading these, see how to build an eval suite and write your first LLM eval.

When to trust which

Neither type is "better" in the abstract — they answer different questions. Use this table to decide which number to look at for a given job.

Your questionTrust the public numberTrust your private number
Which models are generally worth trying?Yes — this is what they're forNot yet — you haven't tested them
Which model is best for MY task?No — too genericYes — this is the whole point
Did my prompt change help or hurt?Can't tellYes — re-run and compare
Is a high score real reasoning or memorized?Maybe contaminatedClean — never published
Can a vendor have gamed this?PossiblyNo — they've never seen it

A healthy mental model: public benchmarks tell you about the model, private benchmarks tell you about your system. Your system is a model plus your prompt, your retrieval, your tools, and your data — and that whole bundle is what users experience. Two teams using the same top-ranked model can ship wildly different quality, and only a private benchmark sees that difference.

Going deeper

Once you have a private benchmark running, a few subtler points separate a toy eval from one you can bet a release on.

Keep a held-out slice you never touch. It's tempting to iterate on your prompt until every case passes — but if you tune against the whole private set, you've quietly turned it into a public-style test you can teach to. The fix is the classic train/validation/test split: develop against most of your set, but reserve a slice you only look at right before shipping. If you keep tweaking until the test slice passes too, find fresh examples.

Watch your set drift. Your users' questions change over time — new products, new slang, new failure modes. A private benchmark built last year can quietly stop representing today's traffic. Refresh it from recent production logs on a schedule, and retire cases that no longer reflect reality.

Mind the size–noise tradeoff. Twenty cases catch obvious regressions but give a jumpy score; a model that passes 18/20 vs 17/20 may just be luck. The larger your set, the more you can trust a small change in the number. For high-stakes decisions, grow toward a few hundred cases and look at confidence, not just the headline percentage — see LLM eval metrics.

Public benchmarks aren't all equally leak-prone. Live, human-voted ones like Chatbot Arena / LMArena use fresh prompts and pairwise Elo ratings, which resists the static-answer leakage that hurts fixed quiz sets. They're a useful middle ground for shortlisting — but they still measure broad preference, not your task, so they never replace a private set.

Agentic systems make private evals non-negotiable. Once a model is calling tools and acting over many steps — see agent benchmarks — "correct" depends on your tools and your environment. There's no public quiz for "did it call our refund API correctly," so the only honest score is the one you build. The durable lesson: leaderboards start the conversation, but the benchmark that decides what you ship is the one only you can see.

FAQ

What is the difference between a public and a private benchmark?

A public benchmark uses shared, published questions that let everyone compare models on the same scale (like MMLU or SWE-bench). A private benchmark uses your own task data, kept secret. Public benchmarks measure general ability and can leak into training data; private ones measure quality on your specific use case and can't be contaminated because nobody else has seen them.

Why do public benchmarks mislead?

Three reasons. They measure broad, generic ability rather than your specific task, so a top-ranked model can still fail your job. Their questions leak onto the internet and into training data, so high scores can reflect memorization instead of reasoning. And because the scores drive marketing, there's pressure to tune models to the known test. A private, held-out set sidesteps all three.

How big does a private benchmark need to be?

Smaller than you'd think. A focused set of 20–50 real cases already beats judging from a handful of demos and is enough to catch regressions. Grow it toward a few hundred cases for high-stakes decisions, since larger sets give a less noisy score you can trust on small changes.

What does it mean to hold out an eval set?

Holding out a set means you set those examples aside and never publish them, never send them to a service that might log them, and never tune your model or prompts against that slice. Because the model has never seen them, the score stays an honest measure of real ability rather than memorization.

Should I pick a model based on leaderboard rank?

Use leaderboards to narrow the field to two or three strong candidates, not to make the final call. Then run those candidates through your own private benchmark on your real task. The model that wins on your data is the one to ship — it may not be the one at the top of the public chart.

Can a private benchmark get contaminated?

Only if you leak it. As long as the questions stay secret — never published, never pasted into a public tool, never sent to a service that logs prompts — no model can have trained on them, so the benchmark stays clean. The moment you share the cases publicly, it starts to lose that protection. Share scores, not cases.

Further reading