In plain English
You run an eval on 50 test cases. Your old prompt scores 80%. Your new prompt scores 84%. You ship the new prompt and tell the team it's a 4-point improvement. But here's the uncomfortable question: is that 4 points real, or did you just get lucky on a couple of cases?

This article is about sample size (how many test cases you need before a score difference means anything) and statistical significance (the tool that tells you whether a "win" is a genuine improvement or just noise). The math is gentle; the intuition is what matters.
Think of it like flipping a coin. Flip a fair coin 10 times and you might get 7 heads — that doesn't make it a biased coin, you just had a small sample. Flip it 10,000 times and you'll land very close to 50%. An eval score is the same kind of measurement: a pass rate estimated from a limited number of trials. With few trials, the estimate jiggles a lot. With many, it settles down. A 4-point gap on 50 cases is two coin flips landing your way; the same gap on 5,000 cases is a real signal.
Why it matters
Every team building with LLMs runs the same loop: change a prompt, swap a model, tweak retrieval, re-run the eval, compare the number. If you can't tell a real gain from random wobble, that whole loop becomes superstition. You'll ship changes that did nothing, revert changes that actually helped, and slowly convince yourself of "improvements" that are pure chance.
What goes wrong with small eval sets
- False wins. You declare victory on a +3% delta that's well inside the noise. Next week it 'regresses' for no reason — because there was never a real change to begin with.
- False alarms. A genuinely better change shows a tiny drop by chance, so you throw it away. The improvement was real; your sample was too small to see it.
- Whack-a-mole tuning. Each run reshuffles which cases pass, so you keep 'fixing' different things and never converge. The leaderboard moves, but only because the dice keep rolling.
- Overfitting to the eval. With 30 cases, you can hand-tune a prompt until all 30 pass and learn nothing about the millions of inputs you didn't test.
The cost cuts both ways, which is why this is a judgment call, not a slogan. Each test case costs money and time to run — model calls, maybe a human or an LLM judge grading it. You can't make every eval enormous. The goal isn't 'more is always better'; it's enough cases that the comparison you actually care about is trustworthy. Knowing the math tells you where 'enough' is, so you neither fool yourself with 30 cases nor waste a fortune running 50,000 when 800 would settle it.
How it works
Here's the core idea in one sentence: a pass rate measured on N cases has a built-in margin of error, and that margin shrinks as N grows — roughly with the square root of N. Once you can put a margin of error (an error bar) on each score, comparing two systems becomes obvious: if their error bars barely overlap, the difference is probably real; if they overlap heavily, it's probably noise.
From a pass rate to a confidence interval
Say your system passes a fraction p of N cases. The standard error of that pass rate (the typical amount it would wobble if you reran on a fresh random sample) is approximately the square root of p × (1 − p) / N. A rough 95% confidence interval is your score plus-or-minus two standard errors. That interval is the honest version of your score: not '84%', but '84%, give or take this much.'
The p × (1 − p) part says noise is largest near 50% (a coin flip is maximally uncertain) and smallest near 0% or 100%. The / N part is the lever you control: to halve your error bar, you need four times the cases. That square-root relationship is the single most important fact in this whole article — it's why going from 50 to 100 cases helps a lot, but going from 5,000 to 5,050 helps essentially nothing.
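The formula above fits in a few lines of Python. This is a minimal sketch using only the standard library; the function name `margin_of_error` is made up for illustration, and `z=2.0` is the rough "two standard errors" rule rather than the exact 1.96.

```python
import math

def margin_of_error(p: float, n: int, z: float = 2.0) -> float:
    """Approximate margin of error for a pass rate p measured on n cases.

    z=2.0 mirrors the rough 'plus-or-minus two standard errors' rule.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    return z * standard_error

# An 80% pass rate at increasing sample sizes.
for n in (50, 100, 400, 1600):
    m = margin_of_error(0.80, n)
    print(f"N={n:>5}: 80% +/- {100 * m:.1f} pts")
```

Each 4x step in N (100 to 400 to 1,600) halves the margin: 8.0, then 4.0, then 2.0 points.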
How big is the wobble, really?
It helps to see concrete numbers. For a system scoring around 80%, here's the approximate 95% margin of error — the plus-or-minus on the score — at different sample sizes:
| Cases (N) | Approx. 95% margin | Honest reading of an 80% score |
|---|---|---|
| 20 | ± 18 pts | Could be anywhere from ~62% to ~98% |
| 50 | ± 11 pts | Roughly 69% to 91% |
| 100 | ± 8 pts | Roughly 72% to 88% |
| 500 | ± 3.5 pts | Roughly 76.5% to 83.5% |
| 1,000 | ± 2.5 pts | Roughly 77.5% to 82.5% |
| 5,000 | ± 1.1 pts | Roughly 79% to 81% |
Now the punchline. On 50 cases, an 80% score has a margin of about ±11 points. That means a result of 80% and a result of 84% have massively overlapping error bars: the gap between them is smaller than the noise in either one. You genuinely cannot tell them apart. The margin on a single score only drops to ±4 points at around 400 cases, and the margin on the difference between two scores is wider still, so resolving a 4-point difference with confidence takes on the order of a thousand cases or more per side.
A worked example: is +4% real?
Let's run the exact scenario from the top. Old prompt: 40/50 = 80%. New prompt: 42/50 = 84%. Did the new prompt actually win?
The honest way to compare two pass rates is to look at the difference and its own error bar. The standard error of the difference is roughly the square root of the sum of each side's variance. Plug in the numbers and you get a margin on the gap of around ±15 points. Your observed gap is +4 points. Since the margin (±15) is far bigger than the effect (+4), the confidence interval for the difference comfortably includes zero — meaning 'no real difference' is a totally plausible explanation. Verdict: not significant. You learned nothing.
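Here is the "plug in the numbers" step done by hand, as a rough sketch. It assumes the two runs behave like independent samples and uses the same two-standard-error rule as before; `diff_margin` is a hypothetical helper name, not a library function.

```python
import math

def diff_margin(pass_a: int, n_a: int, pass_b: int, n_b: int, z: float = 2.0):
    """Return (observed gap, approximate margin on the gap) for two pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    # Variances of two independent pass rates add.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_a - p_b, z * se_diff

gap, margin = diff_margin(42, 50, 40, 50)  # new prompt vs old prompt
print(f"gap: {100 * gap:+.0f} pts, margin: +/- {100 * margin:.0f} pts")
print("interval includes zero:", gap - margin < 0 < gap + margin)
```

The gap comes out at +4 points with a margin of about ±15, so the interval runs from roughly −11 to +19 and zero sits comfortably inside it.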
The same check with statsmodels, plus a power calculation for planning:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import (
    confint_proportions_2indep,
    proportion_effectsize,
    proportions_ztest,
)

# Order is [new, old]: new prompt passed 42 of 50; old prompt passed 40 of 50.
passes = [42, 40]
totals = [50, 50]

# Two-proportion z-test: could a gap this size be chance?
stat, p_value = proportions_ztest(passes, totals)
print(f"p-value: {p_value:.2f}")  # ~0.60 -> NOT significant

# 95% confidence interval on the DIFFERENCE in pass rates (new minus old)
low, high = confint_proportions_2indep(
    passes[0], totals[0], passes[1], totals[1]
)
print(f"diff 95% CI: [{low:.2f}, {high:.2f}]")  # spans 0 -> could be no change

# How many cases would we need to detect a true 4-point gap?
effect = proportion_effectsize(0.84, 0.80)
n = NormalIndPower().solve_power(effect, alpha=0.05, power=0.8, ratio=1)
print(f"need ~{n:.0f} cases per side")  # ~1,440 -> small evals can't see +4%
```

The last block answers the planning question: to reliably catch a true 4-point improvement (80% power at the usual 5% significance level), you'd need on the order of 1,400 cases per side. That's the gap between feeling like you measured something and actually measuring it. The smaller the improvement you care about, the more cases it takes to see it: halving the effect you want to detect roughly quadruples the sample you need.
Practical rules of thumb
You won't run a power calculation before every eval, and you don't need to. Internalize a few rough numbers and you'll make good calls on instinct.
- Under ~30 cases: treat the score as a vibe check, not a measurement. Useful for catching obvious breakage; useless for comparing two decent options.
- ~100 cases: your margin is roughly ±10 points. Good enough to see big differences (a model swap that moves things 15+ points), hopeless for small tuning gains.
- ~400 cases: margin drops to about ±5 points. This is a reasonable default for everyday prompt iteration where you care about ~5-point deltas.
- 1,000+ cases: margin of about ±3 points or less. Needed when you're chasing small, real gains or making a high-stakes ship/no-ship call.
- Big picture: to detect a delta of size D, you roughly need N on the order of 1 / D² cases. Want to see 1-point differences? That's tens of thousands of cases. Want to see 10-point differences? A few hundred will do.
Small eval set
- Smoke test / catch crashes
- Spot huge regressions
- Fast, cheap, daily
- Cannot trust small deltas
- Wide error bars
Large eval set
- Compare two real candidates
- Detect a few-point gain
- Ship/no-ship decisions
- Tight, trustworthy deltas
- Narrow error bars
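To put rough numbers on the 1 / D² rule of thumb, here is a small planning sketch using only the standard library. It uses the textbook normal-approximation sample-size formula for comparing two independent pass rates; the function name `cases_per_side`, the 80% baseline, and the 80% power target are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def cases_per_side(p_old: float, p_new: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate cases needed per side to detect p_old -> p_new (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_old * (1 - p_old) + p_new * (1 - p_new)
    delta = p_new - p_old
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / delta ** 2)

for delta_pts in (10, 5, 2, 1):
    n = cases_per_side(0.80, 0.80 + delta_pts / 100)
    print(f"{delta_pts:>2}-point gain from 80%: ~{n:,} cases per side")
```

From an 80% baseline this lands at roughly 200, 900, 6,000, and 25,000 cases per side. The constant in front of 1 / D² depends on the baseline pass rate and how much power you ask for, but the shape holds: each halving of D costs about four times the cases.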
Common pitfalls
- Reporting a bare score. '84%' with no sample size and no error bar is half a number. Always report N alongside it, ideally with the confidence interval.
- Peeking and stopping early. Re-running the eval after every tweak and stopping the moment the new number looks higher inflates false wins. The number will cross your favorite threshold by chance if you keep looking — decide your N up front.
- Tuning on your test set. If you optimize the prompt against the same cases you score on, the score lies. Keep a held-out set you don't peek at, like a proper golden dataset.
- Ignoring per-case noise. If your grader is an LLM judge that itself flips on the same input, that adds extra wobble on top of sampling noise — run judged evals a couple of times or pin a low temperature.
- One average hiding a split. An overall 80% can be 99% on easy cases and 40% on the hard slice you care about. Slice your eval by category so a flat average doesn't mask a real regression.
- Treating significance as importance. With 100,000 cases, a 0.2-point difference can be 'statistically significant' and still completely irrelevant to users. Significance tells you a gap is real; you still decide if it's big enough to matter.
Going deeper
The square-root-of-N intuition takes you a long way, but a few nuances matter once you're making serious decisions.
Paired comparisons are far more powerful. When you test old vs new on the exact same cases, you can compare them case-by-case instead of comparing two independent averages. Many cases pass or fail under both systems and cancel out — what matters is only the cases that changed. A paired test (McNemar's test for pass/fail, a paired t-test for scores) often detects a real difference with several times fewer cases than two separate runs. If you can, always reuse the same test set across candidates.
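As a sketch of the paired idea, here is an exact McNemar test written with only the standard library. The function name `mcnemar_exact` and the per-case results are made up for illustration; in practice you would feed in your own pass/fail lists, aligned so index i is the same test case under both systems.

```python
from math import comb

def mcnemar_exact(old_pass: list[bool], new_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    # Only cases whose outcome changed carry information about a difference.
    fixed = sum(1 for o, n in zip(old_pass, new_pass) if not o and n)   # fail -> pass
    broken = sum(1 for o, n in zip(old_pass, new_pass) if o and not n)  # pass -> fail
    discordant = fixed + broken
    if discordant == 0:
        return 1.0
    # Under "no real difference", each changed case is a fair coin flip.
    k = min(fixed, broken)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)

# Hypothetical 100 cases: 75 pass under both, 13 fail under both,
# 10 fixed by the new prompt, 2 broken by it (77% -> 85% overall).
old_pass = [True] * 75 + [False] * 13 + [False] * 10 + [True] * 2
new_pass = [True] * 75 + [False] * 13 + [True] * 10 + [False] * 2
print(f"p-value: {mcnemar_exact(old_pass, new_pass):.3f}")  # 0.039
```

On this made-up data the paired test clears p < 0.05 with 100 cases, whereas treating 77% and 85% as two independent 100-case runs would leave a margin on the gap of about ±11 points, wider than the 8-point gap itself.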
Bootstrapping when the math gets awkward. Simple formulas assume independent cases and a single pass/fail per case. Real evals have messy metrics — averaged scores, weird distributions, grouped data. The bootstrap sidesteps the algebra: resample your results with replacement thousands of times, recompute the metric each time, and read the error bar straight off the spread of those results. It's a few lines of code and works for almost any metric.
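A minimal percentile-bootstrap sketch, assuming you have one numeric score per test case (1.0 and 0.0 for pass/fail here, but any per-case metric works). The function name `bootstrap_interval`, the 2,000 resamples, and the fixed seed are illustrative choices, not requirements.

```python
import random

def bootstrap_interval(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-case scores."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(scores)
    # Resample cases with replacement and recompute the metric each time.
    resampled_means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    low = resampled_means[int(tail * n_resamples)]
    high = resampled_means[int((1 - tail) * n_resamples) - 1]
    return low, high

scores = [1.0] * 80 + [0.0] * 20  # hypothetical: 80 of 100 cases passed
low, high = bootstrap_interval(scores)
print(f"80% pass rate, bootstrap 95% interval: [{low:.2f}, {high:.2f}]")
```

For plain pass/fail data the result should land close to the formula-based ±8 points; the payoff comes when you swap the mean for a metric with no tidy formula. If your cases come in groups (several turns of one conversation, say), resample whole groups rather than individual rows.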
Sampling bias beats sample size. All of this assumes your test cases are a fair, representative sample of real traffic. A perfectly significant result on 10,000 unrepresentative cases tells you nothing about production. No amount of N fixes a biased sample — getting the right cases matters more than getting many. See how to build an eval suite for sourcing representative cases.
Multiple comparisons inflate luck. Test 20 prompt variants against a baseline and, by chance alone, about one will look 'significant' at p < 0.05 even if all 20 are identical. If you're screening many candidates, correct for it (a Bonferroni adjustment is the blunt, safe option) or treat the screen as a filter and confirm the winner on a fresh, larger run.
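The arithmetic behind that warning is short enough to check yourself. This sketch assumes the tests are independent, which is only approximately true when every variant is compared against the same baseline; both function names are made up for illustration.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value cutoff that keeps the overall false-positive rate <= alpha."""
    return alpha / n_tests

n_variants = 20
print(f"expected false positives: {n_variants * 0.05:.1f}")                  # 1.0
print(f"chance of at least one: {familywise_error_rate(n_variants):.0%}")    # 64%
print(f"Bonferroni cutoff: p < {bonferroni_threshold(n_variants):.4f}")      # 0.0025
```

So with 20 identical variants you should expect about one spurious winner, and the odds of seeing at least one are close to two in three. Bonferroni restores the overall 5% rate by demanding p < 0.0025 from each individual comparison.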
Where to go next: pair this with how eval metrics are defined so you know what you're measuring, and with LLM benchmarks to see why public leaderboard gaps of a point or two are often within noise too. The durable lesson: a number without an error bar is a guess wearing a lab coat. Always ask 'out of how many?' before you believe a score — including your own.
FAQ
How many examples do I need for an LLM eval?
It depends on how small a difference you want to detect. For a quick smoke test, 30–100 cases is fine. To trust a ~5-point improvement, aim for around 400–500 cases; to trust a ~2–4 point gain, you'll want roughly 1,000+. The rule of thumb: to detect a delta of size D, you need on the order of 1/D² cases.
Is a 3% improvement on my eval real or just noise?
On a small set it's almost certainly noise. On 50 cases the margin of error around an 80% score is about ±11 points, so a 3-point gap is well inside the wobble. Run a two-proportion z-test: if the p-value is above 0.05 (or the confidence interval on the difference includes zero), you can't claim a real improvement yet.
What is a confidence interval for an eval score?
It's the honest 'give or take' on your score. For a pass rate p over N cases, a rough 95% interval is p ± 2 × sqrt(p(1−p)/N). So an 80% score on 100 cases isn't really '80%' — it's '80%, plus or minus about 8 points.' Report the interval, not just the bare number.
Why does my eval score change every time I run it?
Two reasons. If you sample different test cases each run, sampling noise alone shifts the number — especially on small sets. And if an LLM judge grades your outputs, the judge itself can flip on identical inputs, adding extra variance. Pin your test set, fix a low judge temperature, and use enough cases to shrink the wobble.
Does doubling my test set halve the noise?
No — noise shrinks with the square root of N, so to halve your error bar you need four times the cases, not twice. That's why going from 50 to 100 cases helps a lot, but going from 5,000 to 5,100 barely moves the needle.
What's the difference between statistical significance and a meaningful improvement?
Significance only tells you a difference is probably real rather than chance. It says nothing about size. With a huge sample, a 0.2-point gap can be 'significant' yet irrelevant to users. Always check both: is the gap real (significance) and is it big enough to care about (effect size)?