METR · 2026-05-08 · major

METR Adds Claude Mythos Preview to Time Horizons — 50% Time Horizon of At Least 16 Hours, Top of Their Measurable Range

METR's frontier-model time-horizon table now lists Claude Mythos Preview at a 50% horizon of ≥16h (95% CI 8.5–55h) — the top end of what their 228-task suite can reliably resolve, since only 5 tasks exceed 16h.

METR added Claude Mythos Preview to its time-horizons chart and says the model is at the top of what they can measure.

Key specs

50% time horizon	≥16h
95% CI	8.5h to 55h
Tasks in suite	228
Tasks above 16h	5

What is it?

An update to METR's public Task-Completion Time Horizons page. A model's 50% time horizon is the task length, measured by how long the task takes a human expert, at which the model still succeeds 50% of the time. Mythos Preview was evaluated during a limited March 2026 window for Anthropic's pre-deployment risk assessment.

How does it work?

METR runs frontier models against a 228-task suite weighted toward software engineering, ML, and cybersecurity. They fit a logistic curve mapping human-expert task duration to model success rate, then read off the duration at which predicted success is 50%. Mythos Preview lands at ≥16h with a 95% confidence interval of 8.5h to 55h. Only 5 tasks in the suite are estimated to take longer than 16h, so there is too little data above that point to pin the estimate down, and METR treats anything above 16h as unreliable.
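The curve-fitting step described above can be sketched in a few lines. This is a minimal illustration, not METR's actual pipeline: the per-task results are invented, the use of log2(hours) as the predictor is an assumption, and the names `fit_logistic` and `horizon_hours` are hypothetical. It also leaves out any task weighting and the resampling needed to produce a confidence interval.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=25):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method (maximum likelihood)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        g_a = g_b = 0.0              # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0     # entries of the negated Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:              # degenerate data, stop rather than divide by ~0
            break
        # Newton step: solve the 2x2 system H * delta = g
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def horizon_hours(a, b, p=0.5):
    """Human-expert task duration (hours) at which predicted success equals p."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - a) / b)

# Invented example data: 10 runs at each task length, with success falling as tasks get longer.
durations_h = [1, 4, 16, 64, 256]    # human-expert time per task, in hours
passes = [9, 7, 5, 3, 1]             # successful runs out of 10
runs = 10

xs, ys = [], []
for d, k in zip(durations_h, passes):
    xs += [math.log2(d)] * runs          # assumed predictor: log2 of task duration
    ys += [1] * k + [0] * (runs - k)     # 1 = success, 0 = failure

a, b = fit_logistic(xs, ys)
h50 = horizon_hours(a, b)
print(f"50% time horizon: {h50:.1f}h")
```

The toy data are symmetric around the 16h task, so the fitted 50% point falls at 16h. With only a handful of tasks above that length, as in the real suite, a few extra passes or failures would move the estimate considerably, which is consistent with the wide interval METR reports.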

Why does it matter?

Frontier-model agentic capability is moving past the resolution of the standard public eval. METR is openly saying its benchmark can no longer separate the top labs' newest models from each other. That is a flag for both safety teams (your eval is saturating) and product teams (capability claims about long-horizon agent work need new, longer tasks to verify).

Who is it for?

AI safety researchers, eval builders, agent developers tracking long-horizon capability.

Try it

https://metr.org/time-horizons/