Thread by @ryaneshea

Ryan Shea

@ryaneshea

Introducing AI IQ Bio: the most comprehensive set of biotech benchmarks in the world

...& Bio IQ: the most comprehensive "biotech capabilities index" ever produced

Benchmark sources include benchmarksdotbio, FutureHouse, SecureBio, Anthropic, OpenAI & more

03:36 PM · Jun 24, 2026

Ryan Shea

@ryaneshea

You can find the full set of benchmarks as well as the composite Bio IQ score here: aiiq.org/bio/

03:36 PM · Jun 24, 2026

Ryan Shea

@ryaneshea

Thanks to @jperla for his assistance in thinking through the curation of Bio IQ and for his TrustedRouter product which came in handy when putting it together.

03:40 PM · Jun 24, 2026

Ryan Shea

@ryaneshea

One very interesting result is that GPT-5.5, Opus-4.8 and Mythos 5 are about equally capable in biotech if refusals are not counted as wrong answers.

However, in the real world, scoring refusals as wrong answers is a more accurate and useful measure of capabilities as it reflects your actual experience when trying to use the model to accomplish a given task.

And when you count refusals as wrong answers, GPT-5.5 is the best model by a pretty wide margin, while Opus-4.8 and Mythos 5 drop way down in performance.

Refusals are particularly hard to handle. On the one hand, you want an accurate reflection of the model's true capabilities when a trusted partner is using it, so you don't want to penalize refusals; otherwise you'll undersell the model's capabilities. On the other hand, you don't want to give models a free pass for not answering a question, because then they can easily be trained to skip the hardest questions and get higher scores, which is something you don't want to incentivize. Benchmarks should not be easily gameable.

All in all, this shows that refusals matter quite a bit within sensitive domains like biotechnology. Neither of the two ways that benchmarks handle refusals in scoring is ideal or without issues, and we actually need both to get a complete picture of model capabilities and limits.
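To make the two scoring conventions concrete, here is a minimal Python sketch. The `score` function, the outcome labels, and the per-question results for the two models are all hypothetical and made up for illustration; none of it comes from the actual Bio IQ data or methodology.

```python
from collections import Counter

def score(outcomes):
    """Score a list of per-question outcomes: 'correct', 'wrong', or 'refused'.

    Returns accuracy under both refusal conventions plus the refusal rate.
    (Hypothetical helper, not the Bio IQ scoring code.)
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    answered = counts["correct"] + counts["wrong"]
    return {
        # Refusals stay in the denominator, so each one counts as a miss.
        "refusals_as_wrong": counts["correct"] / total if total else 0.0,
        # Refusals are dropped, so only attempted questions are graded.
        "refusals_excluded": counts["correct"] / answered if answered else 0.0,
        "refusal_rate": counts["refused"] / total if total else 0.0,
    }

# Made-up outcomes for two hypothetical models on the same 10 questions.
model_a = ["correct"] * 8 + ["wrong"] * 2
model_b = ["correct"] * 4 + ["wrong"] * 1 + ["refused"] * 5

for name, outcomes in [("model_a", model_a), ("model_b", model_b)]:
    print(name, score(outcomes))
```

In this toy data both models score 0.8 when refusals are excluded, but the second drops to 0.4 when refusals count as wrong. Reporting both numbers alongside the refusal rate shows how much of a gap comes from refusing rather than from answering incorrectly.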

04:58 PM · Jun 24, 2026

Ryan Shea

@ryaneshea

Shoutout to @kenbwork and @LatchBio for the fantastic work on benchmarksdotbio (excellent benchmarks and beautiful site).

Shoutout to @SGRodriques and the @FutureHouseSF team for producing an exquisite set of benchmarks with a very wide range.

And shoutout to the entire @SecureBio team for their incredible work on benchmarks measuring and advancing biosecurity and biosafety.

09:43 PM · Jun 24, 2026
