@rohanpaul_ai: Beautiful research from @Apple...

@rohanpaul_ai
53 views Jun 12, 2025
1
Beautiful research from @Apple

More thinking stops helping once tasks cross a critical depth.

Thinking tokens rise, then crash, revealing compute inefficiency.

Standard LLMs unexpectedly beat large reasoning models (LRMs) on easy puzzles.

Researchers stress-test them on puzzles whose difficulty can be dialed up step by step.

Thinking models pull ahead mid-way, but every model collapses once the puzzle grows past a critical depth. Even stranger, near that point the thinking model writes fewer thoughts despite plenty of allowed tokens, hinting at a built-in ceiling on current inference-time reasoning. Key findings below.

🧩 Controlled puzzles

Four simulators (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) raise complexity smoothly while rules stay fixed. Exact grading of each move stops data leakage.
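
To make "exact grading of each move" concrete, here is a minimal sketch of a move checker for Tower of Hanoi. It is not the paper's simulator: the function name `check_hanoi_solution`, the 0-indexed pegs, and the `(source, destination)` move format are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # hypothetical format: (source peg, destination peg), pegs 0..2


def check_hanoi_solution(num_disks: int, moves: List[Move]) -> Tuple[bool, str]:
    """Replay moves on a 3-peg board and report the first rule violation."""
    # Peg 0 starts with every disk; the largest disk (num_disks) is at the bottom.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for step, (src, dst) in enumerate(moves, start=1):
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False, f"step {step}: invalid peg pair ({src}, {dst})"
        if not pegs[src]:
            return False, f"step {step}: source peg {src} is empty"
        disk = pegs[src][-1]
        # A disk may only rest on an empty peg or on a larger disk.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"step {step}: disk {disk} placed on a smaller disk"
        pegs[dst].append(pegs[src].pop())
    if len(pegs[2]) == num_disks:
        return True, "solved"
    return False, "all moves legal, but goal state not reached"
```

Because every move is replayed against the rules, a model's answer can be graded step by step rather than only by its final state.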

📈 Three regimes

Low depth: non-thinking LLMs solve faster and spend fewer tokens. Medium depth: thinking variants win by searching longer. High depth: both hit zero accuracy. The boundary shifts with model size but exists for all.

🤖 Token scaling limit

As puzzles harden, thinking models initially emit more tokens. Near collapse their token output drops, even though they remain far below the 64k-token cap. Reasoning effort fails to scale with problem depth.

🔍 Thought patterns

On easy tasks the correct plan appears early, but the model keeps exploring and sometimes changes its mind, wasting compute. At medium depth the right plan surfaces late. After the threshold no correct plan appears at all.

⚠️ Exact step limits

Supplying the Tower of Hanoi algorithm in the prompt should turn reasoning into straight execution. Accuracy still collapses. Large Reasoning Models struggle with straightforward symbolic sequences, hinting at fundamental gaps beyond search.
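
For reference, the textbook recursive procedure for Tower of Hanoi is only a few lines. The sketch below is the standard algorithm, not necessarily the exact text the researchers placed in the prompt; the function name `solve_hanoi` and the peg numbering are assumptions.

```python
from typing import List, Tuple


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Tuple[int, int]]:
    """Return the move list that transfers n disks from peg src to peg dst."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, aux, dst)    # park the n-1 smaller disks on the spare peg
        + [(src, dst)]                       # move the largest disk to the target
        + solve_hanoi(n - 1, aux, dst, src)  # bring the smaller disks back on top
    )


if __name__ == "__main__":
    print(solve_hanoi(2))
```

Executing this requires no search at all, only faithful bookkeeping, which is why a collapse with the algorithm in hand is notable.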
2
Puzzles expose a hidden ceiling in thinking models.
3
More disks make the Tower of Hanoi puzzle harder. At 1-3 disks the plain model is both accurate and brief; at 4-7 disks the thinking version gains accuracy by spending many extra tokens; past about 8 disks both crash to 0% and the thinking model even writes fewer thoughts.

This shows current chain-of-thought scaling breaks beyond a small depth.
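
One way to see why a few extra disks matter: the shortest Tower of Hanoi solution for n disks has 2^n - 1 moves, so the required output roughly doubles with each added disk. The sketch below prints that count for a few sizes. Treating solution length as a proxy for "depth" is an assumption here, and `min_moves` is a name chosen for illustration.

```python
def min_moves(num_disks: int) -> int:
    """Length of the shortest Tower of Hanoi solution for num_disks disks."""
    return 2 ** num_disks - 1


# Disk counts picked to span the three regimes described above.
for n in (3, 7, 8, 10):
    print(f"{n} disks: {min_moves(n)} moves")
```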
4
PAPER - "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity"

machinelearning.apple.com/research/illus…