Visualize Thread by @Research_FRI | Thread Navigator

Forecasting Research Institute

@Research_FRI

Is it possible to spot a good forecast by its rationale?

We used LLMs to score the reasoning behind 55,000+ forecasts and test the link between forecast accuracy and written rationales.

We found that:

• Causal reasoning is much more prevalent than statistical argumentation
• It's easier to identify poor forecasters than excellent ones
• Human ratings of rationale quality can be unreliable.

🧵A thread on the results:

03:04 PM · Jun 04, 2026

✍️ Good forecasts often come with rationales—written explanations of the reasoning behind a number.

In our studies, we've collected millions of words of rationales, where forecasters explain their logic, cite evidence, and weigh competing considerations.

But there are crucial things we don't know about rationales. For example—which features of a rationale are good predictors of forecasting accuracy?

To figure this out, we took the following approach:

1) Defined 60 Explanation Quality Markers (EQMs): features like statistical or fact-based reasoning, guessing, confirmation bias, or extreme confidence.

2) Used an LLM to score rationales against each of the 60 EQMs.

3) Collapsed those 60 scores into a single composite number per rationale.

4) Correlated that composite score with forecasting accuracy.
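
The four steps above can be sketched in Python. This is a toy illustration, not the study's pipeline: the three marker names, the 0/1 scores, the signed equal-weight average used as the composite, and the choice of Brier score as the accuracy measure are all assumptions, and the data is made up.

```python
from math import sqrt

# Hypothetical polarity per marker: +1 for markers assumed to signal good
# reasoning, -1 for assumed red flags. The study itself uses 60 EQMs.
POLARITY = {"fact_based": 1, "concrete_reasoning": 1, "extreme_confidence": -1}


def composite_score(eqm_scores):
    """Collapse per-marker scores into one number (signed equal-weight mean)."""
    return sum(POLARITY[m] * s for m, s in eqm_scores.items()) / len(eqm_scores)


def brier(prob, outcome):
    """Squared error of a probability forecast for a binary outcome (lower is better)."""
    return (prob - outcome) ** 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up rationales: (marker scores, forecast probability, resolved outcome).
RATIONALES = [
    ({"fact_based": 1, "concrete_reasoning": 1, "extreme_confidence": 0}, 0.80, 1),
    ({"fact_based": 1, "concrete_reasoning": 0, "extreme_confidence": 0}, 0.60, 1),
    ({"fact_based": 0, "concrete_reasoning": 0, "extreme_confidence": 1}, 0.99, 0),
    ({"fact_based": 0, "concrete_reasoning": 1, "extreme_confidence": 1}, 0.90, 0),
]

composites = [composite_score(scores) for scores, _, _ in RATIONALES]
briers = [brier(p, o) for _, p, o in RATIONALES]
r = pearson(composites, briers)
print(f"correlation between composite and Brier score: {r:.2f}")
```

Because lower Brier scores are better, a negative correlation here means that higher composite scores go with more accurate forecasts.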

💻 We ran this pipeline on 55,000 forecast-rationale pairs from the ACE geopolitical forecasting tournament—the IARPA-funded competition that led to @PTetlock's original work on superforecasters.

Our key findings were...

1️⃣ Statistical reasoning is rare

In the ACE tournament, forecasters typically expressed their reasoning in causal, not statistical, terms.

Share of rationales that featured each EQM: 77% of rationales featured causal reasoning, but only 19% contained statistical reasoning, a 4x difference.
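
The prevalence figures are simple shares: count the rationales flagged for a marker and divide by the total. A minimal sketch, assuming each rationale has been reduced to a set of flagged marker names. The sets below are made up; only the 77% and 19% figures come from the thread.

```python
def prevalence(flagged_sets, marker):
    """Share of rationales whose flagged-marker set contains `marker`."""
    return sum(marker in flags for flags in flagged_sets) / len(flagged_sets)


# Made-up example: four rationales and the markers flagged in each.
toy = [{"causal"}, {"causal", "statistical"}, {"causal"}, set()]
causal_share = prevalence(toy, "causal")            # 3 of 4 rationales
statistical_share = prevalence(toy, "statistical")  # 1 of 4 rationales

# Ratio of the two shares reported in the thread.
reported_ratio = 0.77 / 0.19
print(round(reported_ratio, 2))
```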

Thread image

2️⃣ EQMs predict accuracy (with some caveats)

We tested whether a forecast's EQM score was predictive of actual forecast accuracy, comparing our new approach with earlier work on scoring rationale quality.

We found that the EQM composite score correlated more strongly with forecasting accuracy than a pre-LLM benchmark did.

👍 The following EQMs were positive indicators of forecast accuracy (upper-right quadrant):

• Forecast and rationale align
• Fact based
• Concrete reasoning

👎 These EQMs were negative indicators of accuracy (lower-left quadrant):

• Forecast and rationale misalign
• Confirmation bias
• Extreme confidence

Thread image

Important caveat: EQMs are more reliable for flagging weak forecasts and forecasters than picking out excellent ones.

In other words, EQMs are mostly a screen, not a talent detector.

The graph below sorts rationales into nine bins by EQM score. You can see that the biggest jump in accuracy happens across the bottom third of rationales (far left), with minimal gains towards the top.

Thread image
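
The binning behind a chart like this can be sketched as follows. It assumes equal-count bins and Brier score as the accuracy measure, neither of which the thread confirms, and the data is synthetic, shaped only to mimic the pattern described (large gains at the bottom, a plateau above).

```python
def mean_by_bin(pairs, n_bins=9):
    """Sort (eqm_score, brier) pairs by EQM score, split them into n_bins
    near-equal groups, and return each group's mean Brier score,
    lowest-scoring group first."""
    if len(pairs) < n_bins:
        raise ValueError("need at least one pair per bin")
    ordered = sorted(pairs, key=lambda p: p[0])
    n = len(ordered)
    means = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(sum(brier for _, brier in chunk) / len(chunk))
    return means


# Synthetic pairs: Brier score falls as the EQM score rises, then flattens at 0.2.
toy_pairs = [(i, max(0.2, 0.5 - 0.05 * i)) for i in range(18)]
bin_means = mean_by_bin(toy_pairs)
for i, m in enumerate(bin_means, start=1):
    print(f"bin {i}: mean Brier {m:.3f}")
```

In this toy data all of the improvement happens in the first three bins, which is the "screen, not a talent detector" shape: low scores separate weak forecasts, while high scores barely distinguish good from excellent.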

3️⃣ What looks good to humans isn't always what's accurate

We compared human ratings of rationales from the ACE tournament with our EQM scores for the same rationales to find out which ratings were a better predictor of forecast accuracy.

We found that EQM scores had a stronger correlation with forecast accuracy than human ratings.

Why was this?

As you can see below, human ratings correlate strongly with rationale length, but length is essentially uncorrelated with forecast-level accuracy.

Human raters weren’t wrong directionally, but they appeared to misweight some features, for example by underweighting "red flags" like extreme confidence.

Thread image

Thank you to the study's authors: Chris Karvetski, @sicong_huang, @simas_kucinskas, Nadja Flechner, Jingyu Hu, @PTetlock, and @EzraKarger

Read more on our Substack: forecastingresearch.substack.com/p/can-you-judg…

Read the full working paper on SSRN: papers.ssrn.com/sol3/papers.cf…
