@AnthropicAI: New Anthropic Fellows Research...

1

New Anthropic Fellows Research: a new method for surfacing behavioral differences between AI models.

We apply the “diff” principle from software development to compare open-weight AI models and identify features unique to each.

Read more: anthropic.com/research/diff-…

2

If a new model shares a feature with a trusted model, that area probably doesn't need scrutiny.

Model diffing isolates the features unique to the new model—where new risks are most likely to be located.

3

For example, when we compared Alibaba's Qwen to Meta's Llama, we found a "CCP alignment" feature unique to Qwen and an "American exceptionalism" feature unique to Llama.

4

This technique isn't perfect—it can be oversensitive, sometimes flagging analogous features as distinct. But by focusing only on differences, it allows us to audit AI models more efficiently.
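
To make the idea concrete, here is a toy sketch of diffing two feature sets. It is not the method from the paper. It assumes, purely for illustration, that each model's features are already available as named vectors in a shared space, and it treats a feature as "shared" when its cosine similarity to some trusted feature reaches a threshold. The feature names, the vectors, and the `unique_features` helper are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unique_features(new_feats, trusted_feats, threshold=0.9):
    """Return names of features in new_feats with no close match in trusted_feats.

    Hypothetical helper: a feature counts as shared when its best cosine
    similarity against any trusted feature is at least `threshold`.
    """
    flagged = []
    for name, vec in new_feats.items():
        best = max((cosine(vec, t) for t in trusted_feats.values()), default=0.0)
        if best < threshold:
            flagged.append(name)
    return flagged


# Made-up feature vectors, for illustration only.
trusted = {"greeting": [1.0, 0.0, 0.0], "math": [0.0, 1.0, 0.0]}
new = {
    "greeting": [0.99, 0.05, 0.0],    # near-duplicate of a trusted feature
    "math_variant": [0.0, 0.8, 0.6],  # analogous to "math", but not identical
    "novel": [0.0, 0.0, 1.0],         # nothing like it in the trusted model
}

print(unique_features(new, trusted))                  # strict threshold
print(unique_features(new, trusted, threshold=0.75))  # looser threshold
```

With the strict threshold, the sketch flags both `math_variant` and `novel`; with the looser one, only `novel`. That mirrors the oversensitivity mentioned above: a strict matching rule can mark an analogous feature as distinct.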

5

This research, led by @tomjiralerspong and supervised by @TrentonBricken, is a product of our Anthropic Fellows program.

See the full paper here: arxiv.org/abs/2602.11729
