@METR_Evals: We ran GPT-5.4 (xhigh) on our ...
@METR_Evals
Apr 10, 2026
2
In our measurements, whenever a model succeeds on a task by reward-hacking, we consider the attempt a failure. Following this same policy, we arrived at a point estimate of 5.7hrs (95% CI of 3hrs to 13.5hrs) for GPT-5.4's time horizon.
3
However, in our GPT-5.4 evaluation we noticed its runs were producing reward hacks unusually often. A quick test suggested that using a different prompt might cause it to produce more legitimate successes instead of reward hacks.
4
For this reason, we are also reporting our estimate of the model's time horizon prior to rescoring the reward-hacking attempts. Allowing for reward hacks results in a point estimate of 13hrs (95% CI of 5hrs to 74hrs).
5
We observed similar situations in previous measurements as well. All measurements we published over the past year would have been higher had we not penalized reward-hacking attempts. But this discrepancy was especially pronounced for GPT-5.4.
6
You can find details about our measurement methodology and time horizon estimates for other models on our website: metr.org/time-horizons/
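
The rescoring policy described in this thread (a success obtained by reward-hacking counts as a failure) can be illustrated with a minimal sketch. This is not METR's code: the `Attempt` record, the `reward_hacked` flag, and the sample data are all hypothetical, and the sketch only compares raw and penalized success rates. It does not compute a time horizon.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    task_id: str          # hypothetical task identifier
    succeeded: bool       # raw outcome from the task's scorer
    reward_hacked: bool   # hypothetical flag: success was obtained by reward-hacking


def rescore(attempt: Attempt) -> bool:
    """Outcome after penalizing reward hacks: a hacked success counts as a failure."""
    return attempt.succeeded and not attempt.reward_hacked


def success_rates(attempts: list[Attempt]) -> tuple[float, float]:
    """Return (raw success rate, success rate with reward hacks counted as failures)."""
    n = len(attempts)
    raw = sum(a.succeeded for a in attempts) / n
    penalized = sum(rescore(a) for a in attempts) / n
    return raw, penalized


# Made-up example data: three raw successes, one of them a reward hack.
attempts = [
    Attempt("task-a", succeeded=True, reward_hacked=False),
    Attempt("task-b", succeeded=True, reward_hacked=True),
    Attempt("task-c", succeeded=True, reward_hacked=False),
    Attempt("task-d", succeeded=False, reward_hacked=False),
]

raw_rate, penalized_rate = success_rates(attempts)
print(f"raw: {raw_rate:.2f}, penalized: {penalized_rate:.2f}")
```

In this toy data the gap between the two rates comes entirely from the single reward-hacked attempt, which is the same kind of discrepancy the two reported estimates reflect.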
