@METR_Evals: METR tested pre-release versio...
@METR_Evals
20 views
May 05, 2025
1
METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card.
View Tweet
2
On an updated version of our task suite, we estimate that o3 and o4-mini reach 50% time horizons which are 1.8x and 1.5x that of Claude 3.7 Sonnet, respectively. This is longer than all other public models we’ve tested.
View Tweet
4
METR received several weeks of access to query these models for our evaluations. As models become more capable, it will become important for external evaluators to inspect chain-of-thought traces in addition to outputs. We look forward to future work in this direction.
5
Check out the METR website for our full report: metr.github.io/autonomy-evals…


