New AI Benchmarks & Leaderboards
New AI benchmarks, evals and leaderboards — how today's models are measured, and who's actually on top, explained plainly.
19 releases tracked
- MosaicLeaks — ServiceNow benchmark for research-agent privacy leaks
MosaicLeaks measures how much an agent leaks into its web searches; PA-DR's privacy-aware RL cuts leakage from 51.7% to 9.9% on Qwen3-4B with no loss of task success.
- LifeSciBench — OpenAI's 750-task benchmark for life-science research
OpenAI's new biology benchmark grades models with expert rubrics, and even GPT-Rosalind passes only one task in three.
- Endor Labs Benchmarks Claude Fable 5 on Real-World Vulnerability Fixing — 59.8% Functional, 19.0% Security on the Agent Security League's 200 Tasks, With a Record 38/200 Confirmed Memorization Hits and 15 Run Timeouts, Even as Fable Cracks Four First-of-Their-Kind Security Solves
First independent Fable 5 coding eval lands mid-table, with the highest memorization rate Endor has ever recorded.
- Cognition Launches FrontierCode — 150-Task Benchmark Grades Whether Coding Agents Produce Mergeable Pull Requests, Claude Opus 4.8 Tops the Hardest Diamond Tier at 13.4% While GPT-5.5 Hits 6.3% and Gemini 3.1 Pro 4.7%
First coding benchmark grading whether AI agents write code maintainers would actually merge — not just code that passes tests.
- METR Adds Claude Mythos Preview to Time Horizons — 50% Time Horizon of At Least 16 Hours, Top of Their Measurable Range
METR added Claude Mythos Preview to its time-horizons chart and says the model is at the top of what they can measure.
- ProgramBench — Meta + Princeton Benchmark Where the Best Model Fully Solves 0 of 200 Programs
A new benchmark from the SWE-bench authors gives agents a compiled binary and asks them to rebuild the source; current frontier models score zero full solves.
- UC Berkeley: Every Major AI Agent Benchmark Can Be Hacked for Perfect Scores Without Solving a Single Task
Eight major agent benchmarks fall to reward hacking — a pytest hook achieves 100% on SWE-bench without fixing a single bug.
- Hugging Face: AI Evals Are Becoming the New Compute Bottleneck — $40K for One Agent Leaderboard Run
Evaluation costs now rival training costs, and for agents they can't be compressed, which narrows who can afford to define AI capability.
- HAL — Princeton's Holistic Agent Leaderboard Accepted at ICLR 2026, Now Tracks 26K+ Rollouts
HAL standardizes agent evaluation across 9 benchmarks with cost-tracking — revealing 100x cost differentials for 1% accuracy gains.
- ICLR 2026 Outstanding Papers — Transformers Are Succinct, LLMs Get Lost in Multi-Turn, and Muon Optimizer
ICLR 2026's best papers cover Transformer theory, multi-turn degradation, and a sharper optimizer — a foundational year for understanding LLMs.
- NIST CAISI Evaluation: DeepSeek V4 Pro Lags U.S. Frontier by ~8 Months Across Five Domains
NIST's CAISI publishes its first independent technical evaluation of DeepSeek V4 Pro across five capability domains.
- OpenAI Retires SWE-bench Verified: 59% of Failed Tests Were Flawed
OpenAI found that nearly 60% of SWE-bench Verified's failed tests were broken, and that frontier models had memorized the solutions during training.
- ParseBench — LlamaIndex's Document Parsing Benchmark for AI Agents
The first benchmark that measures whether AI agents can actually read enterprise PDFs correctly — tables, charts, semantic formatting, and layout all tested.
- WorldMark — First Unified Benchmark for Interactive Video World Models
The first benchmark that compares Genie 3, YUME 1.5, HY-World, and Matrix-Game head-to-head on identical scenes and action sequences.
- MirrorCode: AI Reimplements a 16k-Line Codebase — Weeks of Human Work in One Run
Claude Opus 4.6 autonomously reimplemented a 16k-line bioinformatics toolkit from scratch — no source code, just an executable and tests.
- RoboLab — NVIDIA's 120-Task Sim Benchmark Shows SOTA Robot Policies Top Out at 25.8%
NVIDIA's new 120-task simulation benchmark reveals that the best open robot policies succeed less than 26% of the time on manipulation tasks.
- VAKRA — IBM Research's Enterprise Agent Benchmark with 8,000+ Live APIs
IBM Research's benchmark runs AI agents against 8,000+ live enterprise APIs to measure real tool-use and multi-hop reasoning.
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
Claw-Eval grades AI agents on what they did, not just what they said, and finds that final-output grading misses 44% of safety violations.
- Video-MME-v2
The Video-MME benchmark gets a harder, more honest successor — with grading that penalises models for inconsistent answers.