Visualize Thread by @rohanpaul_ai

Rohan Paul

@rohanpaul_ai

Beautiful research from @Apple

More thoughts stop helping once tasks cross a critical depth.

Thinking tokens rise, then crash, revealing compute inefficiency.

So standard LLMs unexpectedly beat Large Reasoning Models (LRMs) on easy puzzles.

Researchers stress-test them on puzzles whose difficulty can be dialed up step by step.

Thinking models pull ahead at medium difficulty, but every model collapses once the puzzle grows past a critical depth. Even stranger, near that point the thinking model writes fewer thoughts despite having plenty of tokens left, hinting at a built-in ceiling on current inference-time reasoning. Key findings below.

🧩 Controlled puzzles

Four simulators (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) raise complexity smoothly while the rules stay fixed. Each move is graded exactly by the simulator, and the controlled setup sidesteps the data contamination that affects standard benchmarks.
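
To make "exact grading of each move" concrete, here is a minimal sketch of a Tower of Hanoi checker. It is not the paper's simulator: the function name `validate_hanoi`, the peg numbering (0, 1, 2) and the `(source, destination)` move format are all assumptions chosen for illustration.

```python
def validate_hanoi(n_disks, moves):
    """Return True if `moves` legally transfers every disk from peg 0 to peg 2.

    Hypothetical move format: a list of (source_peg, destination_peg) pairs.
    Disks are numbered 1 (smallest) to n_disks (largest).
    """
    # Each peg is a stack; the last element is the top disk.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        # Reject unknown pegs and moves from an empty peg.
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False
        disk = pegs[src][-1]
        # Reject placing a larger disk on top of a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    # Solved only if the full stack ends up on peg 2 in the right order.
    return pegs[2] == list(range(n_disks, 0, -1))


print(validate_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # legal solution
print(validate_hanoi(2, [(0, 2), (0, 2)]))          # larger disk on smaller
```

Because a checker like this fails on the first illegal move, a model's answer is either fully correct or wrong, with no partial credit for a plausible-looking plan.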

📈 Three regimes

Low depth: non-thinking LLMs solve faster and spend fewer tokens. Medium depth: thinking variants win by searching longer. High depth: both hit zero accuracy. The boundary shifts with model size but exists for all.

🤖 Token scaling limit

As puzzles harden, thinking models initially emit more tokens. Near collapse their token output drops, even though usage is still far below the 64k-token cap. Reasoning effort fails to scale with problem depth.

🔍 Thought patterns

On easy tasks the correct plan appears early, but the model keeps exploring and sometimes changes its mind, wasting compute. At medium depth the right plan surfaces late. After the threshold no correct plan appears at all.

⚠️ Exact step limits

Supplying the Tower of Hanoi algorithm in the prompt should turn reasoning into straight execution. Accuracy still collapses. Large Reasoning Models struggle with straightforward symbolic sequences, hinting at fundamental gaps beyond search.
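
For reference, the textbook recursive solution is only a few lines. The thread does not reproduce the exact algorithm text given to the models, so the sketch below is an assumption about what "the Tower of Hanoi algorithm" looks like; `hanoi_moves` and the peg numbering are illustrative names.

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Return the optimal move list for n disks as (source, destination) pairs.

    Standard recursion: move n-1 disks out of the way, move the largest
    disk, then move the n-1 disks back on top of it.
    """
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park n-1 disks on the spare peg
        + [(src, dst)]                       # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)  # bring the n-1 disks back on top
    )


print(hanoi_moves(2))        # [(0, 1), (0, 2), (1, 2)]
print(len(hanoi_moves(3)))   # 7
```

Executing this procedure needs no search at all, which is why a collapse in accuracy with the algorithm in hand points to a limit in carrying out long exact sequences, not in finding a plan.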

Puzzles expose a hidden ceiling in thinking models.

More disks make the Tower of Hanoi puzzle harder. At 1-3 disks the plain model is both accurate and brief; at 4-7 disks the thinking version gains accuracy by spending many extra tokens; past about 8 disks both crash to 0% and the thinking model even writes fewer thoughts.

This shows current chain-of-thought scaling breaks beyond a small depth.
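
For a sense of scale, the shortest Tower of Hanoi solution for n disks takes 2^n - 1 moves, so every added disk roughly doubles the sequence a model must write out without a single error. A quick sketch (the helper name `min_moves` is illustrative, and the disk counts are picked to match the ranges mentioned above):

```python
def min_moves(n_disks):
    """Minimum number of moves to solve Tower of Hanoi with n_disks disks."""
    return 2 ** n_disks - 1


for n in (3, 7, 8, 10):
    print(f"{n} disks -> {min_moves(n)} moves")
# 3 disks -> 7 moves
# 7 disks -> 127 moves
# 8 disks -> 255 moves
# 10 disks -> 1023 moves
```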

PAPER - "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity"

machinelearning.apple.com/research/illus…
