| Thread Navigator

Canvas & Ratio

Choose your destination platform format

Layout Template

Choose a content structure for your slides

Preset Themes

Typography & Sizing

Font Family

Title Font Size36px

Body Font Size18px

Header & Footer Size12px

Brand Kit Customization

AGENCY

Configure brand assets for headers & footers

MULTI-PROFILES (AGENCY)

Active Brand Profile

Show Brand Watermark

Brand Watermark Text

Social Handle

Brand Logo URL (PNG) AGENCY

SAVE PRESETS (AGENCY)

Save current as Preset

Outro Slide CTA

Customize your closing call-to-action slide

CTA Title

CTA Message & Emojis

Custom CTA Buttons

Background Pattern

Source Content

Build Your Carousel

Drag and drop any post card below onto a slide, or use the quick buttons to insert content/images instantly!

Drag Post #1

METR

@METR_Evals

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.

Apply Image

Drag Post #2

METR

@METR_Evals

In our measurements, whenever a model succeeds on a task by reward-hacking, we consider the attempt a failure. Following this same policy, we arrived at a point estimate of 5.7hrs (95% CI of 3hrs to 13.5hrs) for GPT-5.4’s time horizon. <a target="_blank" href="https://x.com/METR_Evals/status/1931057777830715526?s=20" color="blue">x.com/METR_Evals/sta…</a>

Drag Post #3

METR

@METR_Evals

However, in our GPT-5.4 evaluation we noticed its runs were producing reward hacks unusually often. A quick test suggested that using a different prompt might cause it to produce more legitimate successes instead of reward hacks.

Drag Post #4

METR

@METR_Evals

For this reason, we are also reporting our estimate of the model’s time horizon prior to rescoring the reward-hacking attempts. Allowing for reward hacks results in a point estimate of 13hrs (95% CI of 5hrs to 74hrs).

Drag Post #5

METR

@METR_Evals

We observed similar situations in previous measurements as well. All measurements we published over the past year would have been higher had we not penalized reward-hacking attempts. But this discrepancy was especially pronounced for GPT-5.4.

Drag Post #6

METR

@METR_Evals

You can find details about our measurement methodology and time horizon estimates for other models on our website. <a target="_blank" href="https://metr.org/time-horizons/" color="blue">metr.org/time-horizons/</a>