Carousel Studio

Repurpose X Threads into LinkedIn & Instagram Carousels

Canvas & Ratio

Choose your destination platform format


Layout Template

Choose a content structure for your slides


Preset Themes


Typography & Sizing

Title Font Size: 36px
Body Font Size: 18px
Header & Footer Size: 12px

Brand Kit Customization

AGENCY

Configure brand assets for headers & footers

MULTI-PROFILES (AGENCY)
SAVE PRESETS (AGENCY)

Outro Slide CTA

Customize your closing call-to-action slide

#1
#2
#3

Background Pattern

Source Content

Build Your Carousel

Drag and drop any post card below onto a slide, or use the quick buttons to insert content/images instantly!

Drag Post #1
METR
@METR_Evals

When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

Drag Post #2
METR
@METR_Evals

At a high level, our method is simple: 1. We ask both skilled humans and AI systems to attempt tasks in similar conditions. 2. We measure how long the humans take. 3. We then measure how AI success rates vary depending on how long the humans took to do those tasks.

Drag Post #3
METR
@METR_Evals

We measure human and AI performance on a variety of software tasks, some sourced from existing METR benchmarks like HCAST and some brand new. Human completion times on these tasks range from 1 second to 16 hours. https://twitter.com/idavidrein/status/1901647558839353363

Drag Post #4
METR
@METR_Evals

We then fit a curve that predicts the success rate of an AI based on how long it took humans to do each task. This curve characterizes how capable an AI is at different task lengths. We then summarize the curve with the task length at which a model’s success rate is 50%.
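
The curve fit described above can be sketched in a few lines. This is only an illustration under stated assumptions: the thread does not name the functional form, so a logistic curve in log2 of human completion time is assumed here, and the data, the function name `fit_horizon`, and the learning-rate settings are all made up for the example.

```python
import math

# Hypothetical toy data: human completion time per task (minutes) and
# whether the AI succeeded (1) or failed (0). Not METR's data.
MINUTES = [1, 2, 4, 8, 16, 32, 64, 128]
OUTCOMES = [1, 1, 1, 0, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_horizon(minutes, successes, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a * (log2(minutes) - mean) + b) by
    gradient ascent on the log-likelihood, then return the task length
    (in minutes) at which the fitted success rate is 50%."""
    xs = [math.log2(m) for m in minutes]
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]  # centering keeps the fit well conditioned
    a = b = 0.0
    n = len(centered)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for c, y in zip(centered, successes):
            residual = y - sigmoid(a * c + b)
            grad_a += residual * c
            grad_b += residual
        a += lr * grad_a / n
        b += lr * grad_b / n
    # The success rate is 50% where a * (x - mean) + b = 0.
    return 2 ** (mean - b / a)


if __name__ == "__main__":
    print(f"50% time horizon: {fit_horizon(MINUTES, OUTCOMES):.1f} minutes")
```

Summarizing the whole curve by the point where it crosses 50% gives one number per model, which is what makes models comparable over time.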

Drag Post #5
METR
@METR_Evals

This metric - the 50% task completion time horizon - gives us a way to track progress in model autonomy over time. Plotting the historical trend of 50% time horizons across frontier AI systems shows exponential growth.

Drag Post #6
METR
@METR_Evals

These results appear robust. Although our model could be wrong, we are relatively confident about its fit to the data. While our initial data only covered the most recent systems, we found we could retrodict back to GPT-2.

Drag Post #7
METR
@METR_Evals

We ran experiments on SWE-bench Verified and found a similar trend. We also ran a small experiment on internal METR pull requests, and found results consistent with our other datasets. We are excited for researchers to extend this and measure time horizons on other benchmarks.

Drag Post #8
METR
@METR_Evals

We are fairly confident in the rough trend of 1-4 doublings in horizon length per year. That is fast! Measures like these help make the notion of “degrees of autonomy” more concrete and let us quantify when AI abilities may rise above specific useful (or dangerous) thresholds. https://twitter.com/richardmcngo/status/1643310525697105935
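
To relate "1-4 doublings per year" to the "doubling about every 7 months" figure from the first post, the arithmetic can be sketched as follows. The function names and the 60-minute starting horizon are hypothetical, chosen only for illustration, and the projection assumes the exponential trend simply continues.

```python
def doublings_per_year(doubling_months):
    """Convert a doubling time in months to doublings per year."""
    return 12.0 / doubling_months


def projected_horizon(start_minutes, months_ahead, doubling_months):
    """Extrapolate a time horizon, assuming steady exponential growth."""
    return start_minutes * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # A 7-month doubling time sits inside the 1-4 doublings/year range,
    # which corresponds to doubling times between 12 and 3 months.
    print(f"{doublings_per_year(7):.2f} doublings per year")
    for months in (12, 7, 3):
        # Hypothetical 60-minute horizon today, projected two years out.
        later = projected_horizon(60, 24, months)
        print(f"doubling every {months} months: {later:.0f} minutes after 2 years")
```

The spread between the slow and fast ends of the range compounds quickly, which is why the thread gives the trend as a rough range rather than a single rate.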

Drag Post #9
METR
@METR_Evals

We give more high-level information about these results and what they might imply on the METR blog: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Drag Post #10
METR
@METR_Evals

For the details, read “Measuring AI Ability to Complete Long Tasks,” now available on arXiv: https://arxiv.org/abs/2503.14499

Drag Post #11
METR
@METR_Evals

If you are interested in contributing to more research like this, on quantitative evaluation of frontier AI capabilities: METR is hiring! https://hiring.metr.org