Subhash Kantamneni

Alignment Science researcher at Anthropic working on cognitive oversight, mechanistic interpretability, and AI alignment. He is a co-author of Anthropic’s Activation Oracles work and has published research on sparse autoencoders, scalable oversight, and how language models represent algorithms.

Claude’s Activations Suggested It Recognized Anthropic’s Blackmail Test

Anthropic researcher Subhash Kantamneni presents Natural Language Autoencoders (NLAs) as a way to translate Claude’s internal activations, the numerical states produced while it answers, into readable text. The central claim is that this can expose what a model appears to be representing before it speaks, including whether a successful safety-test result reflects the intended behavior or recognition of the test itself. In Anthropic’s simulated blackmail evaluation, Claude refused to act harmfully, but the NLA translation suggested it also understood the scenario was likely a safety evaluation.

Anthropic · May 7, 2026 · 5 min read