
Rohan Paul

@rohanpaul_ai

This is really BAD news for LLMs' coding skills. ☹️ The best frontier LLMs score 0% on hard real-life programming contest problems, a domain where expert humans still excel. LiveCodeBench Pro is a benchmark composed of problems from Codeforces, ICPC, and IOI (“International Olympiad in Informatics”) that is continuously updated to reduce the likelihood of data contamination.


📌 The Gap Targeted
Earlier reports claimed frontier LLMs now top human grandmasters, but a cost-versus-rating plot proves otherwise. Even the best model, o4-mini-high, sits near 2,100 Elo once tool calls are blocked, far from the 2,700 legend line that marks real grandmasters.


🗂️ Building the Benchmark
A team of medal winners harvests each Codeforces, ICPC, and IOI problem as soon as a contest ends, before editorials appear, wiping out training leakage. They store 584 tasks and tag each one as knowledge-, logic-, or observation-heavy, producing a balanced skill matrix.


📊 Rating Models Fairly
Every submission is treated as a chess game against the task’s official difficulty. A Bayesian MAP Elo fit assigns a rating that is directly comparable to human percentiles and strips out typing-speed bias.
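The rating idea above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the Gaussian prior (mean 1500, standard deviation 350), and the search bounds are all assumptions. It uses the standard Elo expected-score formula as the probability of solving a task and picks the rating that maximizes the posterior.

```python
import math

def solve_probability(rating, difficulty):
    """Standard Elo expected score: chance that a solver rated `rating`
    'beats' a task rated `difficulty`."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))

def map_elo(results, prior_mean=1500.0, prior_sd=350.0, lo=0.0, hi=5000.0):
    """MAP rating from (difficulty, solved) pairs.

    The prior parameters and search bounds are illustrative assumptions.
    The log-posterior is concave in the rating, so its derivative is
    strictly decreasing and bisection finds the maximum.
    """
    scale = math.log(10.0) / 400.0

    def log_posterior_slope(r):
        # Derivative of the Bernoulli log-likelihood plus Gaussian log-prior.
        likelihood = sum(
            (1.0 if solved else 0.0) - solve_probability(r, difficulty)
            for difficulty, solved in results
        )
        return scale * likelihood - (r - prior_mean) / prior_sd ** 2

    for _ in range(100):
        mid = (lo + hi) / 2.0
        if log_posterior_slope(mid) > 0.0:
            lo = mid  # posterior still rising: the maximum is to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical submissions: two solved easier tasks, two failed harder ones.
example = [(1200, True), (1500, True), (1800, False), (2400, False)]
rating = map_elo(example)
```

Because only solved or unsolved outcomes enter the fit, time-to-solve plays no role in this sketch, which mirrors the point about removing typing-speed bias.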


🎯 Where Models Shine and Fail
Figure 2 shows that models sail through template zones like segment trees or dynamic programming yet plunge below 1,500 Elo on game theory, greedy tricks, and messy case work. Zero hard-tier solves confirm the cliff.


🔍 Why Submissions Fail
A treemap comparison finds that o3-mini commits many wrong algorithms and missed insights, while humans mainly slip on implementation details. Models also trip on samples they never run locally, something human coders catch instantly.


🔁 More Tries, Better Outcomes
Letting o4-mini make ten attempts lifts its rating by about 540 points and doubles its medium-tier pass rate, but hard problems remain untouched at 0%.
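To make the multi-attempt effect concrete, here is a small sketch of the widely used unbiased pass@k estimator. Whether the paper computes its multi-attempt numbers exactly this way is an assumption, and the sample counts below are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k attempts is correct,
    given that c of n sampled attempts were correct (requires k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 1 correct attempt out of 10 samples on a medium task.
single_try = pass_at_k(10, 1, 1)
ten_tries = pass_at_k(10, 1, 10)

# A task with zero correct samples stays at zero no matter how many tries.
hard_task = pass_at_k(10, 0, 10)
```

The last line shows why extra attempts cannot rescue the hard tier: with no correct sample at all, the estimate is 0 for every k.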


🧠 Does Reasoning Help?
Adding explicit chain-of-thought boosts combinatorics by up to 1,400 Elo and lifts knowledge tags, yet barely moves observation tags such as greedy or ad-hoc, hinting that current reasoning traces miss the aha moment.


💰 Terminal Power Matters
The authors estimate that around 400 Elo of the published 2,700 score comes from terminal access, which lets a model compile, unit-test, and brute-force patterns during inference.


https://arxiv.org/pdf/2506.11928