Ryan Shea
@ryaneshea
Introducing AI IQ Bio: the most comprehensive set of biotech benchmarks in the world

...& Bio IQ: the most comprehensive "biotech capabilities index" ever produced

Benchmark sources include benchmarksdotbio, FutureHouse, SecureBio, Anthropic, OpenAI & more
03:36 PM · Jun 24, 2026
Ryan Shea
@ryaneshea
You can find the full set of benchmarks as well as the composite Bio IQ score here: aiiq.org/bio/
03:36 PM · Jun 24, 2026
Ryan Shea
@ryaneshea
Thanks to @jperla for his assistance in thinking through the curation of Bio IQ and for his TrustedRouter product which came in handy when putting it together.
03:40 PM · Jun 24, 2026
Ryan Shea
@ryaneshea
One very interesting result is that GPT-5.5, Opus-4.8 and Mythos 5 are roughly equally capable in biotech if refusals are not counted as wrong answers.

However, in the real world, scoring refusals as wrong answers is a more accurate and useful measure of capabilities as it reflects your actual experience when trying to use the model to accomplish a given task.

And when you count refusals as wrong answers, GPT-5.5 is the best model by a pretty wide margin, while Opus-4.8 and Mythos 5 drop way down in performance.

Refusals are particularly hard to handle. On the one hand, you want an accurate reflection of the model's true capabilities when a trusted partner is using it, so you don't want to penalize refusals; otherwise you'll be underselling the model. On the other hand, you don't want to give models a free pass for not answering a question, because then they can easily be trained to skip the hardest questions and get higher scores, which is something you don't want to incentivize. Benchmarks should not be easily gameable.

All in all, this shows that refusals matter quite a bit within sensitive domains like biotechnology. Neither of the two ways that benchmarks handle refusals in scoring is ideal or without issues. And we actually need both to get a complete picture of model capabilities and limits.
04:58 PM · Jun 24, 2026
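To make the two scoring rules concrete, here is a minimal sketch in Python. It assumes each benchmark item is labeled "correct", "wrong", or "refused"; the function name, the labels, and the sample data are all hypothetical illustrations, not the actual Bio IQ scoring code or real results.

```python
from collections import Counter

def refusal_scores(outcomes):
    """Score one model's outcomes under both refusal policies.

    `outcomes` is a list of labels: "correct", "wrong", or "refused".
    (Hypothetical label scheme, chosen for illustration only.)
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    answered = counts["correct"] + counts["wrong"]
    return {
        # Refusals count as wrong answers: divide by every question asked.
        "refusals_wrong": counts["correct"] / total if total else 0.0,
        # Refusals are not counted: divide only by questions answered.
        "refusals_excluded": counts["correct"] / answered if answered else 0.0,
        # Reported alongside so a high "excluded" score cannot hide refusals.
        "refusal_rate": counts["refused"] / total if total else 0.0,
    }

# Made-up data: two models with the same accuracy on the questions they answer.
model_a = ["correct"] * 8 + ["wrong"] * 2
model_b = ["correct"] * 4 + ["wrong"] * 1 + ["refused"] * 5

scores_a = refusal_scores(model_a)
scores_b = refusal_scores(model_b)
print(scores_a)
print(scores_b)
```

In this made-up example both models look identical when refusals are excluded, but the second one scores half as well once refusals count as wrong answers. Reporting the refusal rate next to both numbers is one way to see the full picture.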
Ryan Shea
@ryaneshea
Shoutout to @kenbwork and @LatchBio for the fantastic work on benchmarksdotbio (excellent benchmarks and beautiful site).

Shoutout to @SGRodriques and the @FutureHouseSF team for producing an exquisite set of benchmarks with a very wide range.

And shoutout to the entire @SecureBio team for their incredible work on benchmarks measuring and advancing biosecurity and biosafety.
09:43 PM · Jun 24, 2026