@AnthropicAI: New Anthropic research: Signs ...
@AnthropicAI
Oct 30, 2025
2
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
Read the post: anthropic.com/research/intro…
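A toy sketch of the injection idea above, in plain Python. This is not the code used in the research: the function names (`concept_vector`, `inject`, `projection`) are hypothetical, the "activations" are short hand-written lists of numbers, and deriving the concept direction as a difference of mean activations is an assumption borrowed from common activation-steering practice. It only illustrates the arithmetic of adding a known direction to an internal state and measuring how far the state moved; the actual experiments then ask the model to report on its own state, which this sketch does not attempt.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(with_concept, without_concept):
    """Assumed recipe: mean activation with the concept minus mean without it."""
    a = mean_vector(with_concept)
    b = mean_vector(without_concept)
    return [x - y for x, y in zip(a, b)]


def inject(hidden, vector, strength):
    """Add the scaled concept direction to a hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def projection(hidden, vector):
    """Scalar projection of a hidden state onto the concept direction."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(h * v for h, v in zip(hidden, vector)) / norm


# Hypothetical toy activations (3 dimensions, 2 samples per condition).
with_concept = [[1.0, 0.0, 3.0], [3.0, 0.0, 1.0]]
without_concept = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

vec = concept_vector(with_concept, without_concept)
baseline = [0.5, -1.0, 2.0]
injected = inject(baseline, vec, strength=4.0)

# The injected state should sit further along the concept direction.
print("baseline projection:", projection(baseline, vec))
print("injected projection:", projection(injected, vec))
```

Because the injected direction is known in advance, any self-report can be checked against ground truth, which is what lets the method separate genuine introspection from a plausible-sounding guess.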
4
However, it doesn’t always work. In fact, most of the time, models fail to exhibit awareness of injected concepts, even when they are clearly influenced by the injection.
6
This reveals a mechanism that checks consistency between intention and execution. The model appears to compare "what did I plan to say?" against "what actually came out?"—a form of introspective monitoring happening in natural circumstances.
9
Note that our experiments do not address the question of whether AI models can have subjective experience or human-like self-awareness. The mechanisms underlying the behaviors we observe are unclear, and may not have the same philosophical significance as human introspection.
10
While currently limited, AI models’ introspective capabilities will likely grow more sophisticated. Introspective self-reports could help improve the transparency of AI models’ decision-making—but should not be blindly trusted.
11
Our blog post on these results is here: anthropic.com/research/intro…
12
The full paper is available here: transformer-circuits.pub/2025/introspec…
We're hiring researchers and engineers to investigate AI cognition and interpretability: job-boards.greenhouse.io/anthropic/jobs…




