Visualize Thread by @METR_Evals

✨ Visual Editor

palette Canvas & Background

Presets

Custom Colors

Gradient:arrow_forward

Text Color:

Gradient Angle135°

Background Pattern

Grain Texture

Aspect Ratio

style Card Style

Preset

Padding40px

Card Radius16px

Enable Card Shadow

Glassmorphism Effect

Show Watermark AGENCY

Show Timestamps

Show X Logo

text_fields Typography

Font Family

Font Size16px

METR

@METR_Evals

METR tested pre-release versions of o3 + o4-mini on tasks involving autonomy and AI R&D. For each model, we examined how capable it is on our tasks & how often it tries to “hack” them. We detail our findings in a new report, a summary of which is included in OpenAI's system card.

View Tweet

METR

@METR_Evals

On an updated version of our task suite, we estimate that o3 and o4-mini reach 50% time horizons which are 1.8x and 1.5x that of Claude 3.7 Sonnet, respectively. This is longer than all other public models we’ve tested.

View Tweet

METR

@METR_Evals

We observed o3 in particular has a propensity to try to “hack” our tasks to get a higher score. Importantly, we saw this arise naturally from the model without explicit nudging. Behaviors like these have required us to be more careful in how we evaluate model capabilities.

METR

@METR_Evals

METR received several weeks of access to query these models for our evaluations. As models become more capable, it will become important for external evaluators to inspect chain-of-thought traces in addition to outputs. We look forward to future work in this direction.

METR

@METR_Evals

Check out the METR website for our full report: metr.github.io/autonomy-evals…

Generated by Thread Navigator

100%

view_carousel Carousel Studio NEW

Press ⌘ + S to quick-export