OpenAI · 2026-02-23 · major

OpenAI Retires SWE-bench Verified: 59% of Failed Tests Were Flawed

OpenAI retired SWE-bench Verified after an audit of 138 problems its models had failed found that 59.4% of those failures traced to flawed tests, and that GPT-5.2, Claude Opus 4.5, and Gemini 3 could reproduce exact solutions from memory. The story resurfaced on HN today at 152 points, prompting broad discussion of benchmark contamination.

OpenAI found that nearly 60% of the SWE-bench Verified failures it audited came from broken tests, and that frontier models had memorized the solutions during training.

What is it?

SWE-bench Verified was the dominant benchmark for measuring AI coding ability: how well models fix real GitHub issues. OpenAI audited 138 problems that its models failed and found that 59.4% of those failures stemmed from flawed test cases, not model limitations. The flawed tests either enforced specific implementation details, rejecting valid alternative solutions, or checked for features never mentioned in the problem description.
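To make the first kind of flaw concrete, here is a minimal sketch of a test that pins an implementation detail and so rejects a valid alternative fix. The task (deduplicating tags case-insensitively), the function names, and both tests are hypothetical illustrations, not drawn from SWE-bench Verified or from OpenAI's audit.

```python
def dedupe_reference(tags):
    """Hypothetical reference fix: unique lowercase tags in first-seen order."""
    seen, out = set(), []
    for tag in tags:
        key = tag.lower()
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


def dedupe_alternative(tags):
    """Equally valid hypothetical fix: same unique tags, returned sorted."""
    return sorted({tag.lower() for tag in tags})


def overspecified_test(fn):
    # Pins the exact output order, which the (hypothetical) issue never asked for.
    return fn(["b", "A", "a", "B"]) == ["b", "a"]


def behavioural_test(fn):
    # Checks only the described behaviour: unique, case-insensitive tags.
    result = fn(["b", "A", "a", "B"])
    return len(result) == len(set(result)) and set(result) == {"a", "b"}


if __name__ == "__main__":
    for fn in (dedupe_reference, dedupe_alternative):
        print(fn.__name__, overspecified_test(fn), behavioural_test(fn))
```

Both fixes satisfy the behavioural test, but only the reference fix passes the over-specified one, so a model that produced the alternative would be scored as a failure.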

How does it work?

Two problems compounded. First, test design flaws made failures caused by the tests look like model failures. Second, training contamination: frontier models (GPT-5.2, Claude Opus 4.5, Gemini 3) could reproduce exact code fixes from memory given only a task ID and a hint, so apparent "improvements" reflected memorization rather than better reasoning. On SWE-bench Pro, a newer benchmark with diverse codebases, performance drops from ~70% to ~23%.
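One rough way to picture a memorization check is to compare what a model produces from a bare task ID and hint against the reference fix, and flag near-verbatim matches. The sketch below is an assumption-laden illustration, not OpenAI's method: the helper names and the 0.95 threshold are hypothetical, and it uses only Python's standard difflib for text similarity.

```python
import difflib


def normalize(patch: str) -> str:
    """Strip indentation and blank lines so formatting noise is ignored."""
    lines = (line.strip() for line in patch.splitlines())
    return "\n".join(line for line in lines if line)


def similarity(candidate: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between two normalized patches."""
    matcher = difflib.SequenceMatcher(None, normalize(candidate), normalize(reference))
    return matcher.ratio()


def looks_memorized(candidate: str, reference: str, threshold: float = 0.95) -> bool:
    # The threshold is an arbitrary illustrative cutoff, not a published value.
    return similarity(candidate, reference) >= threshold


if __name__ == "__main__":
    reference = "def f(x):\n    return x + 1\n"
    same_fix = "def f(x):\n\treturn x + 1\n\n"
    other_fix = "class Foo:\n    pass\n"
    print(looks_memorized(same_fix, reference))
    print(looks_memorized(other_fix, reference))
```

A high score alone does not prove contamination, since short fixes can coincide by chance; the stronger signal described above is an exact reproduction when the model has not been shown the code or the full problem.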

Why does it matter?

Every team comparing coding agents on SWE-bench Verified was comparing contaminated numbers. The 70% → 23% gap when switching to SWE-bench Pro shows how misleading leaderboard rankings had become. If you used those scores to make vendor or deployment decisions, the signal was unreliable.
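The reported figures can be turned into concrete numbers with a few lines of arithmetic. The inputs below are the values quoted in this piece; the scores are approximate (~70% and ~23%), so the results are approximate too.

```python
# Figures quoted in this piece.
audited_problems = 138
flawed_share = 0.594
verified_score = 70  # approximate score on SWE-bench Verified, in percent
pro_score = 23       # approximate score on SWE-bench Pro, in percent

# Roughly how many audited failures traced to flawed tests.
flawed_problems = round(audited_problems * flawed_share)

# Absolute drop in percentage points, and the drop relative to the old score.
point_drop = verified_score - pro_score
relative_drop = point_drop / verified_score

if __name__ == "__main__":
    print(f"flawed problems: about {flawed_problems} of {audited_problems}")
    print(f"drop: {point_drop} points ({relative_drop:.0%} relative)")
```

Under these inputs, about 82 of the 138 audited failures were attributable to the tests, and the move to SWE-bench Pro removes roughly two thirds of the headline score.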

Who is it for?

Teams evaluating AI coding tools, ML researchers, and anyone reading model leaderboards.

