New Anthropic Fellows Research: a new method for surfacing behavioral differences between AI models.
We apply the “diff” principle from software development to compare open-weight AI models and identify features unique to each.
Read more: anthropic.com/research/diff-…
If a new model shares a feature with a trusted model, that area probably doesn't need scrutiny.
Model diffing isolates the features unique to the new model—where new risks are most likely to be located.
For example, when we compared Alibaba's Qwen to Meta's Llama, we found a "CCP alignment" feature unique to Qwen and an "American exceptionalism" feature unique to Llama.

This technique isn't perfect—it can be oversensitive, sometimes flagging analogous features as distinct. But by focusing only on differences, it allows us to audit AI models more efficiently.
This research is a product of our Anthropic Fellows program, led by @tomjiralerspong and supervised by @TrentonBricken.
See the full paper here: arxiv.org/abs/2602.11729
