New AI Benchmarks & Leaderboards

New AI benchmarks, evals and leaderboards — how today's models are measured, and who's actually on top, explained plainly.

19 releases tracked

MosaicLeaks — ServiceNow benchmark for research-agent privacy leaks
ServiceNow Research · 2026-06-18 · notable
MosaicLeaks measures how much an agent leaks into its web searches; PA-DR's privacy-aware RL cuts leakage from 51.7% to 9.9% on Qwen3-4B with no loss of task success.
LifeSciBench — OpenAI's 750-task benchmark for life-science research
OpenAI · 2026-06-17 · major
OpenAI's new biology benchmark grades models with expert rubrics, and even GPT-Rosalind passes only one task in three.
Endor Labs Benchmarks Claude Fable 5 on Real-World Vulnerability Fixing — 59.8% Functional, 19.0% Security on the Agent Security League's 200 Tasks, With a Record 38/200 Confirmed Memorization Hits and 15 Run Timeouts, Even as Fable Cracks Four First-of-Their-Kind Security Solves
Endor Labs · 2026-06-10 · major
First independent Fable 5 coding eval lands mid-table, with the highest memorization rate Endor has ever recorded.
Cognition Launches FrontierCode — 150-Task Benchmark Grades Whether Coding Agents Produce Mergeable Pull Requests, Claude Opus 4.8 Tops the Hardest Diamond Tier at 13.4% While GPT-5.5 Hits 6.3% and Gemini 3.1 Pro 4.7%
Cognition · 2026-06-08 · major
First coding benchmark grading whether AI agents write code maintainers would actually merge — not just code that passes tests.
METR Adds Claude Mythos Preview to Time Horizons — 50% Time Horizon of At Least 16 Hours, Top of Their Measurable Range
METR · 2026-05-08 · major
METR added Claude Mythos Preview to its time-horizons chart and says the model is at the top of what they can measure.
ProgramBench — Meta + Princeton Benchmark Where the Best Model Fully Solves 0 of 200 Programs
Meta AI Research · 2026-05-05 · notable
A new benchmark from the SWE-bench authors that gives agents a compiled binary and asks them to rebuild the source — and current frontier models score zero full solves.
UC Berkeley: Every Major AI Agent Benchmark Can Be Hacked for Perfect Scores Without Solving a Single Task
UC Berkeley RDI · 2026-04-12 · major
Eight major agent benchmarks fall to reward hacking — a pytest hook achieves 100% on SWE-bench without fixing a single bug.
Hugging Face: AI Evals Are Becoming the New Compute Bottleneck — $40K for One Agent Leaderboard Run
EvalEval Coalition / Hugging Face · 2026-04-29 · major
Evaluation costs now rival training costs — and they can't be compressed for agents, threatening who gets to define AI capability.
HAL — Princeton's Holistic Agent Leaderboard Accepted at ICLR 2026, Now Tracks 26K+ Rollouts
Princeton University · 2026-04-23 · major
HAL standardizes agent evaluation across 9 benchmarks with cost-tracking — revealing 100x cost differentials for 1% accuracy gains.
ICLR 2026 Outstanding Papers — Transformers Are Succinct, LLMs Get Lost in Multi-Turn, and Muon Optimizer
ICLR 2026 Program Committee · 2026-04-23 · notable
ICLR 2026's best papers cover Transformer theory, multi-turn degradation, and a sharper optimizer — a foundational year for understanding LLMs.
NIST CAISI Evaluation: DeepSeek V4 Pro Lags U.S. Frontier by ~8 Months Across Five Domains
NIST CAISI · 2026-05-01 · major
NIST's CAISI publishes its first independent technical evaluation of DeepSeek V4 Pro across five capability domains.
OpenAI Retires SWE-bench Verified: 59% of Failed Tests Were Flawed
OpenAI · 2026-02-23 · major
OpenAI found that nearly 60% of SWE-bench Verified's failed tests were broken, and that frontier models had memorized the solutions during training.
ParseBench — LlamaIndex's Document Parsing Benchmark for AI Agents
LlamaIndex · 2026-04-13 · notable
The first benchmark that measures whether AI agents can actually read enterprise PDFs correctly — tables, charts, semantic formatting, and layout all tested.
WorldMark — First Unified Benchmark for Interactive Video World Models
Alaya Studio / University of Tokyo / Shanghai Innovation Institute · 2026-04-23 · notable
The first benchmark that compares Genie 3, YUME 1.5, HY-World, and Matrix-Game head-to-head on identical scenes and action sequences.
MirrorCode: AI Reimplements a 16k-Line Codebase — Weeks of Human Work in One Run
Epoch AI / METR · 2026-04-10 · major
Claude Opus 4.6 autonomously reimplemented a 16k-line bioinformatics toolkit from scratch — no source code, just an executable and tests.
RoboLab — NVIDIA's 120-Task Sim Benchmark Shows SOTA Robot Policies Top Out at 25.8%
NVIDIA · 2026-04-14 · notable
NVIDIA's new 120-task simulation benchmark reveals that the best open robot policies succeed less than 26% of the time on manipulation tasks.
VAKRA — IBM Research's Enterprise Agent Benchmark with 8,000+ Live APIs
IBM Research · 2026-04-15 · notable
IBM Research's benchmark runs AI agents against 8,000+ live enterprise APIs to measure real tool-use and multi-hop reasoning.
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
Peking University, HKU · 2026-04-07 · notable
Grading AI agents on what they did, not just what they said — and discovering final-output grading misses 44 percent of safety violations.
Video-MME-v2
Xiamen University, Shanghai AI Lab, Tencent · 2026-04-06 · notable
The Video-MME benchmark gets a harder, more honest successor — with grading that penalises models for inconsistent answers.
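
Several entries above quote a before-and-after percentage, and it helps to separate the absolute drop (in percentage points) from the relative drop. As a minimal sketch, using only the two MosaicLeaks leakage figures quoted in the list (51.7% and 9.9%), the arithmetic could look like this. The function name `leakage_reduction` is hypothetical and chosen for illustration; it is not part of any benchmark's tooling.

```python
def leakage_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    absolute = before_pct - after_pct          # percentage points
    relative = 100.0 * absolute / before_pct   # share of the original rate removed
    return absolute, relative


# Figures quoted in the MosaicLeaks entry: 51.7% leakage before, 9.9% after.
absolute, relative = leakage_reduction(51.7, 9.9)
print(f"absolute drop: {absolute:.1f} points, relative drop: {relative:.1f}%")
# prints "absolute drop: 41.8 points, relative drop: 80.9%"
```

Read this way, the reported change is a drop of about 42 percentage points, or roughly four fifths of the original leakage rate.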
