@METR_Evals: When will AI systems be able t...
@METR_Evals
Mar 20, 2025
3
We measure human and AI performance on a variety of software tasks, some sourced from existing METR benchmarks like HCAST and some brand new.
Human completion times on these tasks range from 1 second to 16 hours.
8
We are fairly confident in the rough trend of 1-4 doublings in horizon length per year. That is fast! Measures like these help make the notion of “degrees of autonomy” more concrete and let us quantify when AI abilities may rise above specific useful (or dangerous) thresholds.
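To make the arithmetic behind "1-4 doublings per year" concrete, here is a minimal sketch that assumes steady exponential growth. The function name and the 1-hour starting horizon are hypothetical choices for illustration, not figures from the thread; the 16-hour target is borrowed from the longest human completion time mentioned above.

```python
import math


def years_to_reach(current_horizon_hours, target_horizon_hours, doublings_per_year):
    """Years until the horizon reaches the target, assuming a constant doubling rate."""
    if min(current_horizon_hours, target_horizon_hours, doublings_per_year) <= 0:
        raise ValueError("all arguments must be positive")
    # Number of doublings needed to get from the current horizon to the target.
    doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
    # A target already reached needs zero further years.
    return max(0.0, doublings_needed / doublings_per_year)


# Hypothetical starting point: a 1-hour horizon today, with a 16-hour target.
for rate in (1, 2, 4):
    print(f"{rate} doubling(s)/year -> {years_to_reach(1.0, 16.0, rate)} years")
```

Under these assumptions the gap is four doublings, so the answer ranges from 4 years at the slow end of the trend to 1 year at the fast end. This only extrapolates a fixed rate and says nothing about whether the trend continues.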
9
We give more high-level information about these results and what they might imply on the METR blog: metr.org/blog/2025-03-1…
10
For the details, read “Measuring AI Ability to Complete Long Tasks,” now available on arXiv: arxiv.org/abs/2503.14499
11
If you are interested in contributing to more research like this, on quantitative evaluation of frontier AI capabilities: METR is hiring!
hiring.metr.org