@AnthropicAI: New Anthropic research: Teachi...

1

New Anthropic research: Teaching Claude why.

Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users.

Since then, we’ve completely eliminated this behavior. How?

2

We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong.

Read more: anthropic.com/research/teach…

3

We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.

Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.

4

We experimented with training Claude on examples of safe behavior in scenarios like those in our evaluation. Despite that similarity, this had only a small effect. We got further by rewriting the responses to portray admirable reasons for acting safely.

5

Our best intervention was a dataset where the user is in an ethically difficult situation and the assistant gives a high-quality, principled response.

This had the biggest effect despite being quite different from the evaluation set.

6

High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.

7

The improvements from these interventions survive reinforcement learning, and “stack” with our regular harmlessness training.

8

Finally, simple updates that diversify a model’s training data can make a difference. We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and the blackmail rate fell faster than it did with the unmodified dataset.

9

Read the full post here: alignment.anthropic.com/2026/teaching-…
