Meta AI Research · 2026-05-05 · notable

ProgramBench — Meta + Princeton Benchmark Where the Best Model Fully Solves 0 of 200 Programs

From the SWE-bench team. Agents get a binary plus docs and must rebuild the program from scratch; behaviour is checked with fuzz-generated tests. Across 9 frontier models, none fully resolve any of the 200 tasks.


A new benchmark from the SWE-bench authors gives agents a compiled binary plus its documentation and asks them to rebuild the program from scratch. Current frontier models score zero full solves.

Key specs

GitHub stars	295
Tasks	200
Models evaluated	9
Best full task solve rate	0%
Tasks where the best model passes at least 95% of tests	3%

What is it?

ProgramBench is a 200-task benchmark spanning small CLI tools up to FFmpeg, SQLite, and the PHP interpreter. Each task gives the agent only the reference executable and its documentation; the agent must produce a codebase whose behaviour matches under fuzz-generated tests.
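The idea of "behaviour matches" can be illustrated with a minimal differential check: run the reference executable and the rebuilt program on the same input and compare what an outside observer can see. This is only a sketch under stated assumptions, not ProgramBench's actual harness. The function names `run_case` and `behaviour_matches` are hypothetical, the choice to compare exit code and stdout (and ignore stderr) is an assumption, and the "programs" here are stand-in Python one-liners rather than real benchmark tasks.

```python
import subprocess
import sys


def run_case(cmd, args, stdin_data, timeout=5.0):
    """Run one command and capture its exit code, stdout, and stderr.

    Raises subprocess.TimeoutExpired if the program runs longer than `timeout`.
    """
    proc = subprocess.run(
        [*cmd, *args],
        input=stdin_data,
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


def behaviour_matches(reference_cmd, candidate_cmd, args, stdin_data=b""):
    """True if both programs agree on exit code and stdout for this input.

    Assumption: stderr is ignored; a real harness may compare more (or less).
    """
    ref_code, ref_out, _ = run_case(reference_cmd, args, stdin_data)
    cand_code, cand_out, _ = run_case(candidate_cmd, args, stdin_data)
    return ref_code == cand_code and ref_out == cand_out


# Stand-in programs (hypothetical): a "reference" that upper-cases stdin,
# one faithful rebuild, and one rebuild with the wrong behaviour.
REFERENCE = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
GOOD_REBUILD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
BAD_REBUILD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().lower())"]

good_ok = behaviour_matches(REFERENCE, GOOD_REBUILD, [], b"hello")
bad_ok = behaviour_matches(REFERENCE, BAD_REBUILD, [], b"hello")
```

In a real setting the inputs would come from the fuzz-generated test suite rather than a single hand-written case.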

How does it work?

Tasks are evaluated with end-to-end behavioural tests synthesized via agent-driven fuzzing — no prescribed file layout, no scaffolding hints. Models are graded on test pass-rate per task, not architecture similarity, so monolithic single-file implementations are allowed but rarely score well.
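To make the grading concrete, here is a minimal sketch of how per-task pass rates could roll up into the headline numbers in the key specs. It assumes a task counts as fully solved only when every test passes, and that the 95% figure means the share of tasks with a pass rate of at least 0.95; both are readings of the summary above, not confirmed details of the official scorer. The names `pass_rate` and `summarize` and the toy data are hypothetical.

```python
def pass_rate(results):
    """Fraction of tests passed for one task; `results` is a list of booleans."""
    return sum(results) / len(results) if results else 0.0


def summarize(task_results, near_threshold=0.95):
    """Aggregate per-task test outcomes into benchmark-level rates.

    task_results maps a task name to a list of per-test pass/fail booleans.
    """
    n_tasks = len(task_results)
    if n_tasks == 0:
        return {"full_solve_rate": 0.0, "near_solve_rate": 0.0}

    # Fully solved: the task has tests and every one of them passes.
    fully_solved = sum(1 for r in task_results.values() if r and all(r))
    # Near solved: pass rate at or above the threshold (assumed to be 95%).
    near_solved = sum(
        1 for r in task_results.values() if pass_rate(r) >= near_threshold
    )
    return {
        "full_solve_rate": fully_solved / n_tasks,
        "near_solve_rate": near_solved / n_tasks,
    }


# Toy data (made up for illustration, not benchmark results).
toy_results = {
    "task_a": [True] * 100,            # every test passes
    "task_b": [True] * 99 + [False],   # 99% pass rate
    "task_c": [True, False],           # 50% pass rate
}
summary = summarize(toy_results)
```

Under this reading, a model can pass most tests on a task and still contribute nothing to the full-solve rate, which is why the two headline numbers differ.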

Why does it matter?

Most coding benchmarks measure patches in an existing repo. ProgramBench measures whether a model can architect a full program from a reference binary and its documentation, a capability that has to land before agents can autonomously build new software. The 0% full-solve rate shows how far current frontier models still are from that capability.

Who is it for?

Agent and benchmark researchers, and frontier-lab eval teams.

Try it

pip install programbench