Claude’s Activations Suggested It Recognized Anthropic’s Blackmail Test
Anthropic researcher Subhash Kantamneni presents Natural Language Autoencoders (NLAs) as a way to translate Claude’s internal activations, the numerical states produced while it answers, into readable text. The central claim is that this can expose what a model appears to be representing before it speaks, including whether a passing safety-test result reflects the intended behavior or recognition of the test itself. In Anthropic’s simulated blackmail evaluation, Claude refused to act harmfully, but the NLA translation suggested it also understood the scenario was likely a safety evaluation.
Anthropic · May 7, 2026 · 5 min read