Meta AI Research · 2026-05-05 · notable
ProgramBench — Meta + Princeton Benchmark Where the Best Model Fully Solves 0 of 200 Programs
From the SWE-bench team. Agents get a binary plus docs and must rebuild the program from scratch; behaviour is checked with fuzz-generated tests. Across 9 frontier models, none fully resolve any of the 200 tasks.
A new benchmark from the SWE-bench authors gives agents a compiled binary plus its documentation and asks them to rebuild the program; current frontier models score zero full solves.
Key specs
| Metric | Value |
|---|---|
| GitHub stars | 295 |
| Tasks | 200 |
| Models evaluated | 9 |
| Best full task solve rate | 0% |
| Tasks where the best model passes at least 95% of tests | 3% |
What is it?
ProgramBench is a 200-task benchmark spanning small CLI tools up to FFmpeg, SQLite, and the PHP interpreter. Each task gives the agent only the reference executable and its documentation; the agent must produce a codebase whose behaviour matches under fuzz-generated tests.
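The behaviour-matching idea can be illustrated with a small differential check: feed the same inputs to the reference executable and to the rebuilt program, then compare what each produces. This is a minimal sketch, not the benchmark's actual harness. The function names, the choice to compare exit code and stdout, and the two stand-in "programs" (tiny Python one-liners) are all assumptions made for illustration.

```python
import subprocess
import sys

# Hypothetical stand-ins for a reference binary and an agent-built candidate.
# Both are tiny Python one-liners so the sketch runs anywhere.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
CANDIDATE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper().replace('Z', '?'), end='')"]


def observe(cmd, stdin_text, timeout=10):
    """Run a command on one input and return its observable behaviour.

    Assumption: behaviour here means (exit code, stdout). A real harness
    might also compare stderr, output files, or timing.
    """
    proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                          text=True, timeout=timeout)
    return proc.returncode, proc.stdout


def behaviour_matches(reference_cmd, candidate_cmd, inputs):
    """Return one boolean per input: True if both programs behave the same."""
    return [observe(reference_cmd, i) == observe(candidate_cmd, i)
            for i in inputs]


def pass_fraction(matches):
    """Fraction of inputs on which the candidate matched the reference."""
    return sum(matches) / len(matches) if matches else 0.0
```

With these stand-ins, `behaviour_matches(REFERENCE, CANDIDATE, ["abc", "xyz"])` returns `[True, False]`, because the candidate diverges only when the input contains a `z`.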
How does it work?
Tasks are evaluated with end-to-end behavioural tests synthesized via agent-driven fuzzing, with no prescribed file layout and no scaffolding hints. Models are graded on per-task test pass rate, not architecture similarity, so monolithic single-file implementations are allowed but rarely score well.
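To make the headline metrics concrete, here is a rough sketch of how per-task pass rates could roll up into a full-solve rate and a "near-solve" rate (tasks with at least 95% of tests passing). The `TaskResult` type, the `summarize` function, and the example numbers are hypothetical and not taken from the ProgramBench codebase.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: how many behavioural tests passed."""
    task_id: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def summarize(results, near_threshold=0.95):
    """Aggregate per-task results into benchmark-level rates.

    A task counts as fully solved only if every test passes; it counts as
    near-solved if its pass rate is at least `near_threshold`.
    """
    n = len(results)
    if n == 0:
        return {"full_solve_rate": 0.0, "near_solve_rate": 0.0,
                "mean_pass_rate": 0.0}
    full = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    near = sum(1 for r in results if r.pass_rate >= near_threshold)
    mean = sum(r.pass_rate for r in results) / n
    return {"full_solve_rate": full / n, "near_solve_rate": near / n,
            "mean_pass_rate": mean}


# Made-up example: one near-solved task, no fully solved tasks.
example = [
    TaskResult("tool-a", 96, 100),
    TaskResult("tool-b", 50, 100),
    TaskResult("tool-c", 0, 100),
    TaskResult("tool-d", 10, 100),
]
summary = summarize(example)
```

Under this scheme a model can have a non-trivial mean pass rate and still report a 0% full-solve rate, which is the pattern the benchmark reports for current frontier models.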
Why does it matter?
Most coding benchmarks measure patches to an existing repo. ProgramBench measures whether a model can architect a full program from spec, a capability that has to land before agents can autonomously build new software. The 0% full-solve rate shows how much headroom remains for current frontier models.
Who is it for?
Agent and benchmark researchers, and frontier-lab eval teams.
Try it
pip install programbench