Visualize Thread by @METR_Evals

METR

@METR_Evals

We’re updating the way we measure model time horizons on software tasks (TH 1.0→1.1). The updated methodology incorporates more of the tasks from HCAST, expanding our total from 170 to 288. This produces tighter estimates, especially at longer horizons.

Our new time horizon estimates are a bit lower for GPT-4-era models and a bit higher for recent models. This doesn’t change the long-run trend (2019-2025), but it does make the growth since 2023 appear significantly steeper.

We’re also replacing Vivaria, our original evaluation infrastructure. Our tasks now run on Inspect, an open-source evaluation framework developed & maintained by @AISecurityInst.

We are exploring additional ways to raise the ceiling for our measurements. Even this updated suite has relatively few long tasks (ones that take humans 8+ hours to complete), while model capabilities continue to improve rapidly.

We've updated our interactive graphs and data to include estimates from time horizon 1.1 in addition to 1.0.

For more details on the TH 1.0→1.1 update, check out our blog:
metr.org/blog/2026-1-29…
