We’re updating the way we measure model time horizons on software tasks (TH 1.0→1.1). The updated methodology incorporates more of the tasks from HCAST, expanding our total from 170 to 288. This produces tighter estimates, especially at longer horizons.

Our new time horizon estimates are a bit lower for GPT-4-era models and a bit higher for recent models. This doesn’t change the long-run trend (2019-2025), but it does make the growth since 2023 appear significantly steeper.
We’re also replacing Vivaria, our original evaluation infrastructure. Our tasks now run on Inspect, an open-source evaluation framework developed & maintained by @AISecurityInst.
We are exploring additional ways to raise the ceiling for our measurements. Even this updated suite has relatively few long tasks (ones that take humans 8+ hours to complete), while model capabilities continue to improve rapidly.
We’ve updated our interactive graphs and data to include estimates from TH 1.1 in addition to TH 1.0.
For more details on the TH 1.0→1.1 update, check out our blog:
metr.org/blog/2026-1-29…
