@METR_Evals: We ran a randomized controlled...
@METR_Evals
11 views
Jul 10, 2025
4
We were surprised by this, given a) impressive AI benchmark scores, b) widespread adoption of AI tooling for software development, and c) our own recent research measuring trends in the length of tasks that agents are able to complete.
6
To better understand these factors, we investigate 20 properties of our setting, finding 5 likely contributors, and 8 mixed/unclear factors.
We also analyze to make sure the result isn’t a fluke, and find that slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.
We also analyze to make sure the result isn’t a fluke, and find that slowdown persists across different outcome measures, estimator methodologies, and many other subsets/analyses of our data.
7
Why did we run this study?
AI agent benchmarks have limitations—they’re self-contained, use algorithmic scoring, and lack live human interaction. This can make it difficult to directly infer real-world impact.
If we want an early warning system for whether AI R&D is being accelerated by AI itself, or even automated, it would be useful to be able to directly measure this in real-world engineer trials, rather than relying on proxies like benchmarks or even noisier information like anecdotes.
AI agent benchmarks have limitations—they’re self-contained, use algorithmic scoring, and lack live human interaction. This can make it difficult to directly infer real-world impact.
If we want an early warning system for whether AI R&D is being accelerated by AI itself, or even automated, it would be useful to be able to directly measure this in real-world engineer trials, rather than relying on proxies like benchmarks or even noisier information like anecdotes.
8
So how do we reconcile our results with other sources of data on AI capabilities, like impressive benchmark results, and anecdotes/widespread adoption of AI tools?
9
Our RCT may underestimate capabilities for various reasons, and benchmarks and anecdotes may overestimate capabilities (likely some combination)—we discuss some possibilities in our accompanying blog post.
10
What do we take away?
1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).
2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.
1. It seems likely that for some important settings, recent AI tooling has not increased productivity (and may in fact decrease it).
2. Self-reports of speedup are unreliable—to understand AI’s impact on productivity, we need experiments in the wild.
11
Another implication:
It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large of a gap we observe between developer-estimated and observed speed-up.
It is sometimes proposed that we should monitor AI R&D acceleration inside of frontier AI labs via simple employee surveys. We’re now more pessimistic about these, given how large of a gap we observe between developer-estimated and observed speed-up.
13
We’re exploring running experiments like this in other settings—if you’re an open-source developer or company interested in understanding the impact of AI on your work, reach out to us here: forms.gle/pBsSo54VpmuQC4…
Paper: metr.org/Early_2025_AI_…
Blog: metr.org/blog/2025-07-1…
Paper: metr.org/Early_2025_AI_…
Blog: metr.org/blog/2025-07-1…







