Coding Benchmarks Explained: From HumanEval to SWE-bench

Understand how coding ability is benchmarked, from single-function tests to fixing real GitHub issues.

Beginner · 11 min read · Updated 2026-06-12

In plain English

A coding benchmark is a standardized exam for AI models that write code. Instead of asking trivia questions and checking a letter answer, it hands the model a programming task and then runs the code to see if it actually works. You can't bluff your way through a unit test — either the tests pass or they don't.

Think of it like a trade-skills certification instead of a written exam. A plumber certification doesn't just ask "what is a pipe wrench?" — it has you install a fitting while an inspector watches. Coding benchmarks do the same thing: they observe the model doing real work and score the result, not the explanation.

Two benchmarks dominate model announcements: HumanEval and SWE-bench. They both test coding, but they test very different things. HumanEval gives the model a short docstring and asks it to fill in a single Python function — think of it as a one-question coding interview. SWE-bench hands the model a real GitHub repository plus a real bug report and asks it to submit a patch that fixes the bug and passes every test — think of it as a full sprint task on someone else's codebase.

Why it matters

Before coding benchmarks existed, claims about AI coding ability were basically vibes. Labs would demo their model writing a web server or solving a LeetCode problem and call it groundbreaking. There was no shared exam, so there was no way to know if the next model was genuinely better or just cherry-picked a more impressive demo.

Standardized coding benchmarks solved this in the same way standardized tests solve it for humans: every model sits the exact same exam. When a model card says "72% on SWE-bench Verified," you know exactly what was tested, how it was scored, and what 72% means in context — you can compare it directly to every other model that ran the same benchmark.

Who should care

Developers choosing a coding assistant — SWE-bench score is the single most informative number for real agentic coding tasks like bug fixing, refactoring, and feature work.
Teams evaluating open models — coding benchmarks appear on every model card; knowing what they measure lets you weight them correctly against your actual work.
Researchers tracking progress — HumanEval created the first shared scoreboard for code generation; watching it go from ~30% to ~90% over three years is one of the clearest views of how fast the field moved.
Anyone following AI news — "Model X tops SWE-bench" headlines are meaningless without understanding what SWE-bench actually tests and whether the variant being quoted is the hard one or the easy one.

How it works

Both benchmarks follow the same basic pipeline: give the model a task, collect its output, execute it in a sandbox, and record pass or fail. The crucial detail that sets coding benchmarks apart from multiple-choice benchmarks is that the judge is the test suite, not a human or another model. There's no subjectivity in the verdict: either the tests pass or they don't. The verdict is only as strong as the tests behind it, though, which is why variants with stricter test suites exist.

// How a coding benchmark scores a model

1. Task given to model: docstring, issue, or spec
2. Model generates code: function, patch, or diff
3. Code runs in sandbox: isolated Docker environment
4. Tests execute: pass = correct, fail = wrong
5. Score tallied: % of tasks passed
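
The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not any benchmark's actual harness: the `Task` tuple, `run_task`, and `score` names are hypothetical, and plain `exec` stands in for the isolated sandbox a real harness would use (untrusted model output should never be run this way).

```python
from typing import NamedTuple


class Task(NamedTuple):
    """One benchmark item (hypothetical structure for illustration)."""
    candidate_src: str  # code produced by the model
    test_src: str       # assertions that act as the judge


def run_task(task: Task) -> bool:
    """Return True if the candidate code runs and every assertion holds."""
    namespace: dict = {}
    try:
        exec(task.candidate_src, namespace)  # stand-in for sandboxed execution
        exec(task.test_src, namespace)       # a failing assert raises here
    except Exception:
        return False
    return True


def score(tasks: list[Task]) -> float:
    """Percentage of tasks whose tests all pass."""
    passed = sum(run_task(t) for t in tasks)
    return 100.0 * passed / len(tasks)


tasks = [
    Task("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"),
    Task("def add(a, b):\n    return a - b", "assert add(2, 3) == 5"),
]
print(score(tasks))  # 50.0
```

One of the two made-up candidates is correct, so the score is 50%.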

HumanEval: 164 self-contained functions

HumanEval consists of 164 hand-crafted Python problems. Each problem has a function signature, a docstring describing what the function should do, and an average of 7.7 unit tests. The model must complete the function body. That's it — no imports of outside libraries, no multi-file projects, no state to manage. Each problem is deliberately self-contained.

The scoring metric is pass@k: the probability that at least one of k generated attempts passes all unit tests. Pass@1 (one shot, must be right) is the hardest; pass@10 (ten attempts, any one counts) is more forgiving. Most model cards quote pass@1 because it best reflects real-world single-shot usage. When Codex was released in 2021, it achieved 28.8% pass@1. Frontier models today score above 90%.
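
To see why pass@10 is more forgiving than pass@1, suppose (as a simplifying assumption, not something the benchmark guarantees) that each sample independently solves a given problem with probability p. Then at least one of k samples succeeds with probability 1 - (1 - p)^k. The function name below is hypothetical.

```python
def idealized_pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples passes,
    assuming each sample passes with probability p."""
    return 1.0 - (1.0 - p) ** k


for k in (1, 10):
    print(k, round(idealized_pass_at_k(0.3, k), 3))
```

Under this assumption, a model that solves a problem 30% of the time on a single try gets there about 97% of the time with ten tries.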

SWE-bench: real GitHub issues in real repositories

SWE-bench is a completely different beast. It was constructed from 90,000+ merged pull requests across 12 popular Python GitHub repositories (including pytest, scikit-learn, and Django). After extensive filtering, 2,294 task instances were selected. Each instance is a real issue report plus the actual commit that fixed it — the ground truth is the human-written patch and the repository's own test suite.

Scoring works by having the model generate a patch file, applying that patch to the repo, and running the full test suite. The model passes if the previously failing tests now pass and no previously passing tests are broken. The evaluation harness runs each instance inside a Docker container to guarantee a reproducible environment — a full SWE-bench run spins up one container per task instance.
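
The pass rule can be written down directly. The sketch below assumes per-test outcomes are already available as a dictionary; `is_resolved` and the test names are hypothetical, while the two lists mirror the FAIL_TO_PASS and PASS_TO_PASS fields that SWE-bench task instances carry.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """An instance counts as resolved only if both conditions hold.

    results      maps test name -> True if it passed after applying the patch
    fail_to_pass tests that failed before the fix and must now pass
    pass_to_pass tests that passed before the fix and must keep passing
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    unbroken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


results = {"test_bug": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(results, ["test_bug"], ["test_existing_a", "test_existing_b"]))  # False
```

In this made-up case the patch fixes the bug but breaks an existing test, so the instance does not count as resolved.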

// HumanEval vs SWE-bench at a glance

HumanEval	SWE-bench
164 problems	2,294 tasks (full)
Single Python function	Entire repository
Docstring as spec	GitHub issue as spec
7.7 tests per problem (average)	Full regression suite
No imports / multi-file	Edits can span multiple files
Score: pass@k %	Score: % resolved

Variants you will see quoted

Both benchmarks have evolved since their original releases, and the variant being quoted matters a lot. Comparing a score on SWE-bench Lite to one on SWE-bench Verified is like comparing an open-book quiz to a closed-book exam — the numbers are not directly comparable.

HumanEval variants

Variant	What changed	Why it exists
HumanEval (original)	164 Python problems, 2021	The original baseline
HumanEval+	More edge-case tests per problem	Many models pass original tests but fail edge cases; this exposes them
HumanEval-X	Same problems ported to C++, Java, JS, Go, Rust	Tests multilingual coding, not just Python
HumanEval-T	Template-based variants to fight contamination	When training data leaks, scores inflate; T-variants generate unseen variants
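
To make the HumanEval+ row concrete, here is a made-up example (not an actual benchmark problem): a buggy function that passes a small base test set but fails once an edge case is added. All names are hypothetical.

```python
def median(values: list[float]) -> float:
    """Intended: return the median. Bug: mishandles even-length inputs."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def passes(tests: list[tuple[list[float], float]]) -> bool:
    """True if the function returns the expected value on every test."""
    return all(median(inp) == expected for inp, expected in tests)


base_tests = [([3, 1, 2], 2), ([5], 5)]
extra_edge_tests = [([1, 2, 3, 4], 2.5)]

print(passes(base_tests))                     # True
print(passes(base_tests + extra_edge_tests))  # False
```

The same solution flips from pass to fail purely because the test suite got stricter, which is the effect the extended variant is designed to expose.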

SWE-bench variants

Variant	Size	What it is
SWE-bench (full)	2,294 tasks	The complete benchmark; expensive to run
SWE-bench Lite	300 tasks	A curated subset for faster, cheaper evaluation across 11 of 12 repos
SWE-bench Verified	500 tasks	Manually validated by humans to ensure tasks are solvable and tests are fair; introduced by OpenAI in 2024
SWE-bench Multimodal	~500 tasks	Adds visual context like screenshots and UI specifications; tests vision-coding ability

SWE-bench Verified is now the standard for leaderboard comparisons. OpenAI introduced it after finding that the original benchmark had tasks with ambiguous issue descriptions and tests that could penalize correct solutions. The Verified subset was manually reviewed so that a patch that genuinely solves the issue should pass the tests. An 80% on Lite is not comparable to a 70% on Verified; treat them as different exams.

Why scores keep climbing — and what that means

Coding benchmark scores have risen so fast it can feel like the field is solving coding entirely. That is partly real progress and partly an artifact of how benchmarks work. Understanding the difference is important for anyone reading model launch posts.

Contamination

Because HumanEval's problems and solutions are public, they are almost certainly present in the training data of most frontier models. Research has shown that models score 5 to 14 percentage points lower on carefully constructed contamination-resistant variants like HumanEval-T compared to the original. When a model's training corpus includes the answer key, the score measures memory more than reasoning.

Saturation

A benchmark is saturated when top models score so high there is no longer meaningful separation between them. HumanEval is effectively saturated — leading models exceed 90% pass@1, making it nearly useless for comparing frontier models. That's exactly why SWE-bench emerged: it created headroom again by choosing tasks that were genuinely hard even for frontier models at the time of its release.
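
One way to see the loss of separation is sampling noise. Treating each of the 164 problems as an independent pass/fail trial (a simplifying assumption), the standard error of a score near 90% is a couple of percentage points, so small gaps between top models fall within noise. The function name is hypothetical.

```python
from math import sqrt


def score_standard_error(score: float, num_problems: int) -> float:
    """Binomial standard error of a pass rate, expressed as a fraction."""
    return sqrt(score * (1.0 - score) / num_problems)


se = score_standard_error(0.90, 164)
print(f"{100 * se:.1f} percentage points")  # 2.3 percentage points
```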

Teaching to the test

When a benchmark becomes the scoreboard, labs tune training pipelines to do well on it specifically. This is Goodhart's law: when a measure becomes a target, it stops being a good measure. A model can climb the SWE-bench leaderboard by learning patterns common in the benchmark's 12 specific repositories without necessarily becoming better at your codebase. The newest trend is contamination-resistant rolling benchmarks — pipelines like SWE-rebench that only pull issues created after the model's training cutoff, so the model cannot have seen the answers.

Going deeper

Once you understand HumanEval and SWE-bench, you can situate the broader landscape of coding benchmarks — and understand which benchmarks to look for when the current generation gets saturated too.

The benchmark landscape beyond HumanEval and SWE-bench

Benchmark	What it tests	Key distinction
MBPP	974 entry-level Python problems from crowdsourced tasks (500 in the commonly reported test split)	Broader task variety than HumanEval; often paired with it
LiveCodeBench	Contest problems from LeetCode / Codeforces / AtCoder, updated continuously	Rolling problems post-model-cutoff; contamination-resistant by design
BigCodeBench	Practical tasks calling real library APIs across 139 libraries	Tests real-world API usage, not just algorithm puzzles
SWE-bench Multimodal	SWE-bench tasks with visual inputs like UI screenshots	Evaluates vision-coding agents on front-end and design-related issues

The pass@k math

The pass@k formula uses an unbiased estimator: generate n samples per problem (where n >= k), count how many pass (c), and compute the probability that a random draw of k of those n samples includes at least one correct solution. The benchmark score is the average of that value over all problems. Generating more than k samples matters because running exactly k samples once per problem gives a noisy estimate, and plugging the observed pass rate c/n into 1 - (1 - c/n)^k gives a biased one. In practice, most benchmarks today quote pass@1, often from a single attempt per problem, because that maps most directly to real-world single-inference usage.

pass@k unbiased estimator (simplified), Python

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n = total samples generated per problem
    c = number of samples that passed all tests
    k = the k in pass@k
    """
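    # Fewer than k failing samples: every draw of k must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the chance that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)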
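
As a usage sketch, the same quantity can be computed as a running product, which avoids very large binomial coefficients, and then averaged over problems to get a benchmark-level score. The function names and the sample counts below are made up for illustration.

```python
from math import prod


def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Same estimator as 1 - C(n-c, k) / C(n, k), written as a product."""
    if n - c < k:
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def benchmark_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; counts holds (n, c) pairs."""
    return sum(pass_at_k_product(n, c, k) for n, c in counts) / len(counts)


# Hypothetical results: 20 samples per problem, with 3, 0, and 20 passing.
counts = [(20, 3), (20, 0), (20, 20)]
print(round(benchmark_pass_at_k(counts, 1), 3))   # 0.383
print(round(benchmark_pass_at_k(counts, 10), 3))  # 0.632
```

For the first problem alone, 3 passing samples out of 20 gives pass@1 = 0.15 and pass@10 = 17/19, roughly 0.895.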

Agentic evaluation: the next frontier

SWE-bench was the first widely adopted agentic coding benchmark: its tasks require the model to read a codebase, reason across files, and iterate rather than fill in a single function. This is harder to evaluate cleanly, because the path to a correct answer matters as well as the result. A model that takes 40 tool calls to do what should take 5 technically passes the test while being impractical in production.

Emerging metrics for agentic coding benchmarks track not just pass rate but also cost-efficiency (tokens and tool calls consumed per solved issue), edit precision (did the patch touch only what needed changing?), and regression safety (did the patch break anything that was passing before?). As models become more capable agents, expect these richer metrics to appear alongside raw pass rates on leaderboards.
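
As a rough sketch of how such metrics could be tallied, the snippet below summarizes a handful of agent runs. The `Run` record, its fields, and the numbers are all hypothetical; real leaderboards define these quantities in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent attempt at one task (hypothetical record)."""
    resolved: bool
    tokens: int
    tool_calls: int


def summarize(runs: list[Run]) -> dict[str, float]:
    """Pass rate plus total cost divided by the number of solved tasks."""
    solved = sum(r.resolved for r in runs)
    total_tokens = sum(r.tokens for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "pass_rate": solved / len(runs),
        # Failed attempts still cost money, so they count toward the totals.
        "tokens_per_solved": total_tokens / solved if solved else float("inf"),
        "tool_calls_per_solved": total_calls / solved if solved else float("inf"),
    }


runs = [Run(True, 40_000, 5), Run(True, 310_000, 40), Run(False, 150_000, 25)]
print(summarize(runs))
```

Two agents with the same pass rate can differ widely on the per-solved figures, which is the gap these richer metrics are meant to surface.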

Private and rolling test sets

The long-term answer to contamination is private or rolling benchmarks. Private benchmarks keep the test questions secret so models cannot have memorized them; rolling benchmarks continuously add new problems from issues filed after each model's training cutoff. Projects like SWE-rebench automate the pipeline: scrape new GitHub issues, verify they have a clean ground-truth fix, run the model, retire old problems. This is the direction serious evaluation is heading, even though the community still quotes public benchmarks for comparability.
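
The core filtering step of a rolling benchmark can be sketched as follows. This illustrates the idea only and is not SWE-rebench's actual pipeline; the `Issue` record, its fields, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Issue:
    """A candidate task scraped from a repository (hypothetical record)."""
    issue_id: str
    created: date
    has_verified_fix: bool  # a merged fix whose tests confirm the repair


def eligible_tasks(issues: list[Issue], training_cutoff: date) -> list[Issue]:
    """Keep issues filed after the model's cutoff that have a clean fix."""
    return [
        issue for issue in issues
        if issue.created > training_cutoff and issue.has_verified_fix
    ]


issues = [
    Issue("a", date(2024, 1, 10), True),   # before cutoff: may be memorized
    Issue("b", date(2025, 3, 2), True),    # after cutoff, usable
    Issue("c", date(2025, 4, 15), False),  # after cutoff, but no clean fix
]
print([i.issue_id for i in eligible_tasks(issues, date(2024, 6, 1))])  # ['b']
```

Because the cutoff is a parameter, the same pool of issues yields a different eligible set for each model being evaluated.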

FAQ

What is the HumanEval benchmark?

HumanEval is a set of 164 hand-crafted Python programming problems created by OpenAI in 2021. Each problem gives the model a function signature and docstring; the model must complete the function body. Correctness is judged by running unit tests against the generated code. The benchmark introduced the pass@k metric, and it is now largely saturated with frontier models scoring above 90%.

What does SWE-bench measure?

SWE-bench measures whether a model can fix real GitHub issues in real Python repositories. Each task gives the model a full codebase plus a natural-language issue report; the model must submit a patch that passes the repository's test suite. It tests multi-file editing, context retrieval, and agentic iteration — much closer to real engineering work than single-function benchmarks.

What is the difference between SWE-bench Lite, Verified, and full?

The full benchmark has 2,294 tasks and is expensive to run. Lite is a curated 300-task subset for faster evaluation. Verified is a 500-task subset that OpenAI manually reviewed to ensure tasks are solvable and tests are fair. Verified is now the standard for leaderboard comparisons. Scores across variants are not directly comparable — always check which variant is being quoted.

What is pass@k and how does it differ from accuracy?

Pass@k is the probability that at least one of k generated code samples passes all unit tests. Pass@1 (one attempt) is the hardest; pass@10 (ten attempts) is more forgiving. It differs from simple accuracy in that it models the real-world scenario where a developer might ask a model to try multiple times. Most modern benchmarks report pass@1 as the primary metric.

Why are HumanEval scores so high now — is coding solved?

Frontier models exceed 90% on original HumanEval primarily because the benchmark is both saturated (the problems are too easy for top models) and contaminated (its questions appear in training data). Research shows scores drop 5-14 points on contamination-resistant variants. HumanEval measures a narrow slice of coding; SWE-bench Verified, which is much harder and harder to memorize, is a better signal of real-world coding ability.

Which coding benchmark should I look at when choosing a model?

For agentic coding tasks like bug fixing or feature development, SWE-bench Verified is the most informative single number available today. For function-level code generation or autocomplete, HumanEval+ (the extended variant) and MBPP are still useful. For contamination-resistant, up-to-date signal, LiveCodeBench uses contest problems posted after most models' training cutoffs.
