Rohan Paul
@rohanpaul_ai
This is really BAD news for LLMs' coding skills. ☹️

The best frontier LLMs achieve 0% on hard real-life programming contest problems, a domain where expert humans still excel.

LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and IOI (“International Olympiad in Informatics”) that are continuously updated to reduce the likelihood of data contamination.
📌 The Gap Targeted

Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot proves otherwise.

Even the best model, o4-mini-high, sits near 2 100 Elo once tool calls are blocked, far from the 2 700 legend line that marks real grandmasters.
🗂️ Building the Benchmark

A team of medal winners collects each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, to keep training leakage to a minimum.
They store 584 tasks and tag each one as knowledge-heavy, logic-heavy, or observation-heavy, producing a balanced skill matrix.
📊 Rating Models Fairly

Every submission is treated as a chess game against the task’s official difficulty.

A Bayesian MAP Elo fit then yields a single rating that lines up with human percentiles and strips out typing-speed bias.
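To make the "chess game against the task" idea concrete, here is a minimal sketch of a MAP Elo fit. It assumes the standard Elo logistic curve and a Gaussian prior on the rating. The prior mean and width, the search bounds, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math

LN10 = math.log(10.0)

def win_probability(rating, difficulty):
    """Standard Elo expected score of `rating` against a task rated `difficulty`."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))

def log_posterior(rating, results, prior_mean, prior_std):
    """Gaussian log-prior (an assumption) plus the log-likelihood of every outcome."""
    total = -0.5 * ((rating - prior_mean) / prior_std) ** 2
    for difficulty, solved in results:
        x = (difficulty - rating) / 400.0
        log_p_solve = -math.log1p(10.0 ** x)   # log of the win probability
        log_p_fail = x * LN10 + log_p_solve    # log of (1 - win probability)
        total += log_p_solve if solved else log_p_fail
    return total

def map_elo(results, prior_mean=1500.0, prior_std=350.0,
            lo=0.0, hi=4000.0, iterations=100):
    """Return the rating that maximizes the posterior.

    The log-posterior is concave in the rating, so a ternary search
    over [lo, hi] converges to the single maximum.
    """
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_posterior(m1, results, prior_mean, prior_std) < \
           log_posterior(m2, results, prior_mean, prior_std):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

# Hypothetical outcomes: (task difficulty, solved?)
toy_results = [(1200, True), (1400, True), (1600, False), (1800, False)]
estimated_rating = map_elo(toy_results)
```

Because each task enters only through its difficulty and a pass/fail outcome, submission time never appears in the fit, which is one way to read the "strips out typing-speed bias" claim.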
🎯 Where Models Shine and Fail

Figure 2 shows that models sail through template-heavy areas like segment trees or dynamic programming, yet plunge below 1 500 Elo on game theory, greedy tricks, and messy case work.
Zero hard-tier solves confirm the cliff.
🔍 Why Submissions Fail

A treemap comparison finds that o3-mini's failures are dominated by wrong algorithms and missed insights, while humans mainly slip on implementation details.
Models also fail on sample tests they never run locally, something human coders catch instantly.
🔁 More Tries, Better Outcomes

Letting o4-mini fire ten attempts lifts its rating by about 540 points and doubles its medium-tier pass rate, but hard problems remain untouched at 0%.
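As a rough illustration of how a multi-attempt pass rate can be counted, the sketch below treats a problem as solved if any of its first k attempts is accepted. The attempt data and the function name are made up for the example, and the paper's exact scoring procedure may differ.

```python
def pass_rate_at_k(attempts_per_problem, k):
    """Fraction of problems with at least one accepted attempt among the first k."""
    solved = sum(1 for attempts in attempts_per_problem if any(attempts[:k]))
    return solved / len(attempts_per_problem)

# Hypothetical verdicts: one list per problem, True means the attempt was accepted.
toy_attempts = [
    [False, True, False],
    [False, False, False],   # never solved, like the hard tier in the thread
    [True, False, False],
    [False, False, True],
]

single_try = pass_rate_at_k(toy_attempts, 1)
three_tries = pass_rate_at_k(toy_attempts, 3)
```

In this toy data, extra attempts rescue problems the model solves only occasionally, while the problem it never solves stays unsolved no matter how large k gets.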
🧠 Does Reasoning Help?

Adding explicit chain-of-thought boosts combinatorics by up to 1 400 Elo and lifts knowledge tags, yet barely moves observation tags such as greedy or ad-hoc, hinting that current reasoning traces miss the aha moment.
💰 Terminal Power Matters

The authors estimate that around 400 Elo of the published 2 700 score comes from terminal access, which lets a model compile, unit-test, and brute-force patterns during inference.