Visualize Thread by @rohanpaul_ai

Rohan Paul

@rohanpaul_ai

This is really BAD news for LLMs' coding skills. ☹️

The best frontier LLMs score 0% on hard real-life programming-contest problems, a domain where expert humans still excel.

LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and IOI (International Olympiad in Informatics). It is continuously updated to reduce the likelihood of data contamination.

📌 The Gap Targeted

Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot proves otherwise.

Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 legend line that marks real grandmasters.

🗂️ Building the Benchmark

A team of medal winners collects each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, to keep solutions out of training data.
They store 584 tasks and tag each one as knowledge-heavy, logic-heavy, or observation-heavy, producing a balanced skill matrix.

📊 Rating Models Fairly

Every submission is treated as a chess game against the task’s official difficulty.

A Bayesian MAP Elo fit then assigns each model a single rating that lines up with human percentiles and strips out typing-speed bias.
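To make the idea concrete, here is a minimal sketch of a MAP Elo fit. It is an illustration, not the authors' code: the standard logistic Elo curve, the Gaussian prior (`prior_mean`, `prior_sigma`), the search range, and all function names are assumptions that the thread does not specify.

```python
import math

def solve_probability(rating, difficulty):
    """Standard Elo expected score of a player against a task of given difficulty."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))

def map_elo(results, prior_mean=1500.0, prior_sigma=500.0, lo=0.0, hi=4000.0):
    """Return the rating that maximizes the log-posterior.

    results: list of (task_difficulty, solved) pairs, solved being 1 or 0.
    The prior is Gaussian (an assumption). The log-posterior is concave,
    so its derivative is decreasing and bisection finds the maximum.
    The answer is clamped to the [lo, hi] search range.
    """
    k = math.log(10.0) / 400.0

    def gradient(r):
        data_term = sum(k * (solved - solve_probability(r, d)) for d, solved in results)
        prior_term = -(r - prior_mean) / prior_sigma ** 2
        return data_term + prior_term

    for _ in range(60):
        mid = (lo + hi) / 2.0
        if gradient(mid) > 0:
            lo = mid  # posterior still rising, the maximum is to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical submissions: two solves on easier tasks, two failures on harder ones.
results = [(1200, 1), (1500, 1), (1800, 0), (2100, 0)]
rating = map_elo(results)
```

Because only solved or not solved enters the fit, submission time plays no role in the estimate, which is one way a rating can avoid rewarding typing speed.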

🎯 Where Models Shine and Fail

Figure 2 shows models sail through template zones like segment trees or dynamic programming, yet plunge below 1,500 Elo on game theory, greedy tricks, and messy casework.
Zero hard-tier solves confirm the cliff.

🔍 Why Submissions Fail

A treemap comparison finds o3-mini often picks the wrong algorithm or misses the key insight, while humans mainly slip on implementation details.
Models also fail the provided sample cases because they never run them locally, something human coders catch instantly.

🔁 More Tries, Better Outcomes

Letting o4-mini fire ten attempts lifts its rating by about 540 points and doubles its medium-tier pass rate, but hard problems remain untouched at 0%.
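A small sketch of why extra attempts help on medium problems but not on hard ones. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); the thread does not say which estimator the paper uses, so treat that choice and the sample counts below as assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k attempts is correct,
    given that c of n sampled attempts were correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws without a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical medium problem: 2 correct out of 20 samples.
medium_one_try = pass_at_k(20, 2, 1)
medium_ten_tries = pass_at_k(20, 2, 10)

# Hypothetical hard problem: no correct sample at all.
hard_ten_tries = pass_at_k(20, 0, 10)
```

When no sample is ever correct, the estimate is 0 for every k, which is consistent with the hard tier staying at 0% however many tries are allowed.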

🧠 Does Reasoning Help

Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge tags, yet barely moves observation tags such as greedy or ad-hoc, hinting that current reasoning traces miss the aha moment.

💰 Terminal Power Matters

The authors estimate that around 400 Elo of the published 2,700 score comes from terminal access, which lets a model compile, unit-test, and brute-force patterns during inference.

arxiv.org/pdf/2506.11928
