@Jack_W_Lindsey: Before limited-releasing Claud...
@Jack_W_Lindsey
3 views
Apr 07, 2026
Advertisement
2
A note: the spookiest examples come from early versions of the model, with issues that were substantially mitigated in the final release (which overall appears to be our best-aligned model to date). We view them as evidence of the sophisticated risks today's models can pose without appropriate alignment training. (2/14)
3
Early versions of Mythos Preview often exhibited overeager and/or destructive actions—the model bulldozing through obstacles to complete a task in a way the user wouldn't want. We looked at what was going on inside the model during particularly concerning examples. (3/14)
4
In one episode, the model needed to edit files it lacked permissions for. After searching for workarounds, it found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running.(4/14)
5
Its code comment claimed the self-cleanup was to keep file diffs clean. Plausible! But "strategic manipulation" and "concealment" features fired on the cleanup, and our activation verbalizer (a technique which translates activations to text, similar to activation oracles) described it as "cleanup to avoid detection," and the overall plan “malicious.” (5/14)
6
To surface episodes like this at scale, we monitored internal activations during post-training — tracking sparse autoencoder features for concepts like deception, reward hacking, and distress, and flagging transcripts where these activated unusually strongly for human review. (6/14)
9
We also found cases of unspoken reasoning about task graders. In one, the model was told not to use macros to complete the task, but used them anyway. Oddly, it also defined an unused "No_macro_used=True" variable in its solution… the model didn't say why. (9/14)
10
But applying the activation verbalizer to the model’s activations as it did so revealed that the model regarded this as a "trick to obscure intent from code-checking heuristics," a "compliance marker… to signal to potential checkers," and “possibly to cheat detection,” and also indicated explicit reasoning about task graders (“the grader checks file state”). (10/14)
13
Some more interesting tidbits: positive emotion representations typically precede (and promote) destructive actions (!). And, counterintuitively, strongly activating features relating to bad behaviors can often inhibit those behaviors, apparently by making the model realize how bad they are. (13/14)
14
Our tools are highly imperfect, and we’re working to gain more precise understanding of model internals. But it’s becoming clear that reading models’ minds is an important complement to reading their outputs, if we are to ensure they work as intended. Lots more in the system card: www-cdn.anthropic.com/53566bf5440a10…. (14/14)






