Beautiful research from @Apple
More thinking stops helping once tasks cross a critical depth.
Thinking tokens rise, then crash, revealing compute inefficiency.
Unexpectedly, standard LLMs beat Large Reasoning Models (LRMs) on easy puzzles.
The researchers stress-test both kinds of model on puzzles whose difficulty can be dialed up step by step.
Thinking models pull ahead at medium difficulty, but every model collapses once the puzzle grows past a critical depth. Even stranger, near that point the thinking model writes fewer thoughts despite having plenty of token budget left, hinting at a built-in ceiling on current inference-time reasoning. Key findings below.
🧩 Controlled puzzles
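Four simulators (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) raise complexity smoothly while the rules stay fixed. Every move is checked exactly by the simulator, and the controlled setup is meant to limit the data contamination that affects standard benchmarks.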
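To make "exact grading of each move" concrete, here is a minimal sketch of a move checker for Tower of Hanoi. It is an illustration only, not the paper's simulator: the function name `validate_hanoi`, the peg numbering (0, 1, 2) and the (source, destination) move format are all assumptions made for this example.

```python
def validate_hanoi(num_disks, moves):
    """Replay a list of (src, dst) moves and grade them exactly.

    Hypothetical helper for illustration. Pegs are numbered 0, 1, 2;
    all disks start on peg 0 and the goal is to stack them on peg 2.
    Returns (solved, first_bad_index), where first_bad_index is the
    position of the first illegal move, or None if every move is legal.
    """
    # Each peg is a stack; the last element is the top disk.
    pegs = [list(range(num_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        # Reject unknown pegs, no-op moves, and moves from an empty peg.
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False, i
        if not pegs[src]:
            return False, i
        disk = pegs[src][-1]
        # A larger disk may never sit on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(num_disks, 0, -1))
    return solved, None


print(validate_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # legal and complete
print(validate_hanoi(2, [(0, 1), (0, 1)]))          # second move is illegal
```

Because the checker replays every move against the rules, a model's answer is either fully valid or fails at a specific, identifiable step.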
📈 Three regimes
Low depth: non-thinking LLMs solve faster and spend fewer tokens. Medium depth: thinking variants win by searching longer. High depth: both hit zero accuracy. The boundary shifts with model size but exists for all.
🤖 Token scaling limit
As puzzles harden, thinkers initially emit more tokens. Near collapse their token output drops, even though the budget is far from the 64k cap. Reasoning effort fails to scale with problem depth.
🔍 Thought patterns
On easy tasks the correct plan appears early, but the model keeps exploring and sometimes changes its mind, wasting compute. At medium depth the right plan surfaces late. After the threshold no correct plan appears at all.
⚠️ Exact step limits
Supplying the Tower of Hanoi algorithm in the prompt should turn reasoning into straight execution. Accuracy still collapses. Large Reasoning Models struggle with straightforward symbolic sequences, hinting at fundamental gaps beyond search.
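For reference, the textbook recursive solution to Tower of Hanoi fits in a few lines. This is a sketch of the standard algorithm, not necessarily the exact wording given to the models in the paper's prompt; the name `hanoi_moves` and the 0/1/2 peg numbering are assumptions for this example.

```python
def hanoi_moves(n, src=0, dst=2, aux=1):
    """Return the standard move list for n disks as (src, dst) pairs.

    Hypothetical helper for illustration. Pegs are numbered 0, 1, 2.
    """
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk,
    # then move the n-1 disks back on top of it.
    return (
        hanoi_moves(n - 1, src, aux, dst)
        + [(src, dst)]
        + hanoi_moves(n - 1, aux, dst, src)
    )


print(hanoi_moves(2))
print(len(hanoi_moves(8)))
```

Executing this procedure needs no search at all, only faithful bookkeeping over a long sequence, which is what makes the reported collapse notable.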

Puzzles expose a hidden ceiling in thinking models.

More disks make the Tower of Hanoi puzzle harder. At 1-3 disks the plain model is both accurate and brief; at 4-7 disks the thinking version gains accuracy by spending many extra tokens; past about 8 disks both crash to 0% and the thinking model even writes fewer thoughts.
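One reason each extra disk hurts so much: the shortest Tower of Hanoi solution for n disks takes 2^n - 1 moves, so the required output roughly doubles with every disk. The short sketch below prints that count for a few disk counts from the ranges above; `min_moves` is a name chosen for this example.

```python
def min_moves(num_disks):
    """Minimum number of moves to solve Tower of Hanoi with num_disks disks."""
    return 2 ** num_disks - 1


for n in (3, 7, 8, 10):
    print(f"{n} disks: {min_moves(n)} moves")
# 3 disks: 7 moves
# 7 disks: 127 moves
# 8 disks: 255 moves
# 10 disks: 1023 moves
```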
This shows current chain-of-thought scaling breaks beyond a small depth.

PAPER - "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity"
machinelearning.apple.com/research/illus…
