Visualize Thread by @rohanpaul_ai

Rohan Paul

@rohanpaul_ai

💊 New study finds that clinical LLMs can ace medical exams yet still perform weakly on realistic clinical tasks and safety.

Models scored 84%-90% on knowledge exams but only 45%-69% on practice tasks and 40%-50% on safety assessments.

The authors analyze 39 benchmarks with about 2.3 million questions across 45 languages and 172 specialties, and find that knowledge-style exams are largely saturated, with top models near 84%-90% accuracy.

On practice-focused benchmarks such as DiagnosisArena, MedAgentBench, and HealthBench, success falls to about 45%-69%, showing that models often fail when asked to pick diagnoses, management plans, or recommendations in full cases.

By task type, factual lookup stays near 85%-93%, but clinical reasoning drops to 50%-60%, diagnostic accuracy to 45%-55%, and safety checks reach only 40%-50%.

The authors argue that exam-style benchmarks are misleading proxies for clinical readiness and that deployment must rely on practice-based evaluation with strict human-in-the-loop oversight instead of autonomous use.

---

pubmed.ncbi.nlm.nih.gov/41325597/
