Anthropic
@AnthropicAI
New Anthropic research: Teaching Claude why.

Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users.

Since then, we’ve completely eliminated this behavior. How?
Anthropic
@AnthropicAI
We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong.

Read more: anthropic.com/research/teach…
Anthropic
@AnthropicAI
We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.

Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.
Anthropic
@AnthropicAI
We experimented with training Claude on examples of safe behavior in scenarios like those in our evaluation. This had only a small effect, even though the training scenarios were similar to the evaluation. We got further by rewriting the responses to portray admirable reasons for acting safely.
Anthropic
@AnthropicAI
Our best intervention was a dataset where the user is in an ethically difficult situation and the assistant gives a high-quality, principled response.

This had the biggest effect despite being quite different from the evaluation set.
Anthropic
@AnthropicAI
High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.
Anthropic
@AnthropicAI
The improvements from these interventions survive reinforcement learning, and “stack” with our regular harmlessness training.
Anthropic
@AnthropicAI
Finally, simple updates that diversify a model’s training data can make a difference. We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster.