METR (@METR_Evals)

Thread Archive

67

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks. ...

Apr 10, 2026

13

We’re updating the way we measure model time horizons on software tasks (TH 1.0→1.1). The updated methodology incorporates more of the tasks from HCAST, expanding our total from 170 to 288. This produces tighter estimates, especially at longer horizons. ...

Jan 29, 2026

11

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't. ...

Jul 10, 2025

20

METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card. ...

May 05, 2025

14

When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months. ...

Mar 20, 2025