Visualize Thread by @AnthropicAI

✨ Visual Editor

palette Canvas & Background

Presets

Custom Colors

Gradient:arrow_forward

Text Color:

Gradient Angle135°

Background Pattern

Grain Texture

Aspect Ratio

style Card Style

Preset

Padding40px

Card Radius16px

Enable Card Shadow

Glassmorphism Effect

Show Watermark AGENCY

Show Timestamps

Show X Logo

text_fields Typography

Font Family

Font Size16px

Anthropic

@AnthropicAI

New Anthropic research: Teaching Claude why.

Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users.

Since then, we’ve completely eliminated this behavior. How?

Anthropic

@AnthropicAI

We found that training Claude on demonstrations of aligned behavior wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behavior is wrong.

Read more: anthropic.com/research/teach…

Anthropic

@AnthropicAI

We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.

Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.

Anthropic

@AnthropicAI

We experimented with training Claude on examples of safe behavior in scenarios like our evaluation. This had only a small effect, despite being similar to our evaluation. We got further by rewriting the responses to portray admirable reasons for acting safely.

Anthropic

@AnthropicAI

Our best intervention was a dataset where the user is in an ethically difficult situation and the assistant gives a high quality, principled response.

This had the biggest effect despite being quite different from the evaluation set.

Anthropic

@AnthropicAI

High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.

Anthropic

@AnthropicAI

The improvements from these interventions survive reinforcement learning, and “stack” with our regular harmlessness training.

Anthropic

@AnthropicAI

Finally, simple updates that diversify a model’s training data can make a difference. We added unrelated tools and system prompts to a simple chat dataset targeting harmlessness, and this reduced the blackmail rate faster.

Anthropic

@AnthropicAI

Read the full post here: alignment.anthropic.com/2026/teaching-…

Generated by Thread Navigator

100%

view_carousel Carousel Studio NEW

Press ⌘ + S to quick-export