Anthropic
@AnthropicAI
New Anthropic Fellows Research: a new method for surfacing behavioral differences between AI models.

We apply the “diff” principle from software development to compare open-weight AI models and identify features unique to each.

Read more: anthropic.com/research/diff-…
If a new model shares a feature with a trusted model, that area probably doesn't need scrutiny.

Model diffing isolates the features unique to the new model—where new risks are most likely to be located.
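
To make the "diff" idea concrete, here is a toy sketch. It is not the method from the paper. It assumes, purely for illustration, that each model's features are already available as vectors in one shared space, which real models do not give you for free. The feature names, the vectors, the `unique_features` helper, and the 0.8 similarity threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unique_features(new_feats, trusted_feats, threshold=0.8):
    """Names of features in new_feats with no close match in trusted_feats.

    A feature counts as "shared" when its best cosine similarity to any
    trusted feature reaches the threshold (an arbitrary, illustrative cutoff).
    """
    unique = []
    for name, vec in new_feats.items():
        best = max((cosine(vec, t) for t in trusted_feats.values()), default=0.0)
        if best < threshold:
            unique.append(name)
    return unique


# Hypothetical toy features, assumed to live in a shared 3-d space.
trusted = {
    "greeting": [1.0, 0.0, 0.0],
    "arithmetic": [0.0, 1.0, 0.0],
}
new = {
    "greeting": [0.9, 0.1, 0.0],
    "arithmetic": [0.0, 1.0, 0.1],
    "new_behavior": [0.0, 0.0, 1.0],
}

print(unique_features(new, trusted))  # ['new_behavior']
```

Only the feature with no counterpart in the trusted set is left for review. In this toy, raising the threshold to 0.999 flags all three features, including the two near-duplicates, so the cutoff controls how often analogous features are reported as distinct.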
For example, when we compared Alibaba's Qwen to Meta's Llama, we found a "CCP alignment" feature unique to Qwen and an "American exceptionalism" feature unique to Llama.
This technique isn't perfect—it can be oversensitive, sometimes flagging analogous features as distinct. But by focusing only on differences, it allows us to audit AI models more efficiently.
This research is a product of our Anthropic Fellows program, led by @tomjiralerspong and supervised by @TrentonBricken.

See the full paper here: arxiv.org/abs/2602.11729