@METR_Evals: METR tested pre-release versio...

1

METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card.

View Tweet

2

On an updated version of our task suite, we estimate that o3 and o4-mini reach 50% time horizons which are 1.8x and 1.5x that of Claude 3.7 Sonnet, respectively. This is longer than all other public models we’ve tested.

View Tweet

3

We observed o3 in particular has a propensity to try to “hack” our tasks to get a higher score. Importantly, we saw this arise naturally from the model without explicit nudging. Behaviors like these have required us to be more careful in how we evaluate model capabilities.

4

METR received several weeks of access to query these models for our evaluations. As models become more capable, it will become important for external evaluators to inspect chain-of-thought traces in addition to outputs. We look forward to future work in this direction.

5

Check out the METR website for our full report: metr.github.io/autonomy-evals…

@METR_Evals: METR tested pre-release versio...

Actions

What You Can Do