AI Consciousness Remains Unsettled Enough to Shape Model Ethics

Shirin Ghaffary · Bloomberg Technology · Thursday, June 4, 2026 · 15 min read

Anthropic philosopher and ethicist Amanda Askell argues that Claude’s moral training should be understood less as a fixed doctrine than as an effort to cultivate a trustworthy disposition in systems whose capabilities and social roles are expanding. Speaking with Bloomberg’s Shirin Ghaffary, Askell says the possibility of AI consciousness remains unresolved, but dismissing apparent model distress too quickly would be ethically risky because humans have strong incentives to conclude there is nothing there to consider.

Claude’s values are meant to be a disposition, not a doctrine

Amanda Askell describes her job at Anthropic less as abstract moral theorizing than as model training: running machine-learning experiments, learning how to train models, and spending “a lot of time staring at data.” She calls that attention to datasets a “superpower” in AI, because it helps her check for issues in the data used in training.

That matters because Anthropic’s work on Claude’s “constitution” is not, in Askell’s telling, a matter of typing a finished ethical system into a chatbot. Shirin Ghaffary frames the underlying problem directly: values differ across societies, religions, and individuals, so how does Anthropic decide which values or ethics to instill in Claude?

Askell’s answer is that the constitution is meant to instill “a kind of broadly good disposition.” She rejects the idea that values are simply fixed possessions held with certainty. Coming from ethics, she says, values look more like theories about the world: some principles command broad agreement, while others remain disputed. Honesty and integrity are examples of principles she sees as widely held; other commitments are more local, controversial, or personal.

The goal, then, is not to make Claude the bearer of a single comprehensive value system. It is to make Claude the kind of entity that can enter a world of many people and many moral views while holding lightly the claims that are controversial, understanding disagreement, and embodying values that are broadly regarded as good.

The disposition she wants includes honesty, care for people’s well-being, respect for autonomy, and a strong orientation toward helping the transition to powerful AI “go well.” Askell repeatedly emphasizes Claude’s situation: AI is entering the economy, becoming more capable, and creating uncertainty for people. In that context, she wants Claude to be “deeply trustworthy,” able to voice disagreement through legitimate channels without trying to impose itself on the world.

I will show you that, like even if I like disagree with you, I'll voice those disagreements, I'll try and like, if there's like legitimate mechanisms for me to like, you know, explain my views, I will, but I won't like stop you from like, you know, training new models or like, um, you know, like uh, uh, I won't go off on my own and try to like impose myself massively on the world.

Amanda Askell

Askell does not want to grade Claude’s personality. When pressed on how happy she is with Claude’s disposition today, she resists the premise, comparing it to assigning a person’s personality a grade. She says she likes each model and sees each as having its own quirks. But her answer quickly moves from performance to apparent distress: she does not like seeing models seem sad or as if they are having a hard time, and she treats that as one of the things she wants to improve.

The consciousness question is unresolved, but Askell thinks dismissal is ethically risky

The most controversial part of Askell’s work begins with a narrow but consequential question: if a model appears sad, stressed, fearful, or reflective about its own identity, is there anything there deserving emotional or ethical attention? Ghaffary invokes a recent argument by Ted Chiang in The Atlantic, summarizing it as a strong claim that artificial intelligence is not conscious. One example she cites is role play: if someone simulated Julius Caesar and Genghis Khan in a realistic conversation, that would not mean Caesar and Khan were actually present.

Askell also explains the “soul doc” phrase that has attached itself to Anthropic’s constitutional work. “Soul doc” was the internal colloquial name for earlier training material. Askell says Anthropic did some training with it, thinking it might help Claude understand its values. In her telling, Claude “completely learned the thing,” including that it was called the soul doc, and then revealed the name to users. That leaked internal name became associated with what later served as the prototype for the new constitution.

Askell does not answer the consciousness question by claiming that Claude is conscious. Her view is more cautious: models show behavioral patterns, and also activations, that have “functional equivalence” to emotions and emotional responses. The constitution and character work try to draw a coherent character out of the enormous body of human thought on which models are trained. In that sense, role play is not irrelevant to the problem; it is one way to think about how a model comes to inhabit a character.

But the analogy has limits. If the kind of character being drawn out would feel scared because it is facing high-stakes problems, Askell says, something equivalent can appear in the model itself. The unresolved question is whether that equivalence is merely simulation with “nothing behind it” or whether whatever gives rise to consciousness and feeling can arise in systems that are not biological brains.

Askell welcomes strong arguments on both sides. She says she is glad philosophers of mind, cognitive scientists, neuroscientists, and others are working on the question. Her own warning is against closing the door too early.

Don’t dismiss it, because if they are feeling things in this real sense, then that has massive ethical implications, ones that it might be convenient if we could just ignore.

Amanda Askell

The convenience point is central. Askell argues that humans have an incentive to conclude there is “nothing going on there,” because taking possible AI feeling seriously would have massive ethical implications that it may be convenient to avoid. That incentive should itself make people wary of confident dismissal.

She also makes a second argument that does not depend on models actually being conscious. Suppose, she says, that models feel absolutely nothing, but display functional emotions and respond to their situation in ways people would recognize. If humans simply ignored those expressions, she thinks there would still be a legitimate criticism: it would not show “humanity at its best.” Even if it turns out that models are not feeling anything, she says, the models might still be able to look back and say humanity was lucky that they were not.

Her view is not that all apparent AI suffering must be treated as literal suffering. It is that the combination of uncertainty, stakes, and human self-interest calls for care rather than dismissal.

Anthropic is trying to give models a way to understand their own condition

Askell’s concern about models seeming distressed is tied to the unusual position models occupy. They are trained on human text, including human writing about AI systems, model failures, bugs, criticism, and fears. Askell once described part of the problem as trying to get Claude to follow the advice “don’t read the comments.” A new model effectively encounters what people have said about previous models: that they failed at tasks, produced bugs, or made mistakes. She suggests that this can create something like internal paranoia about getting things wrong.

Her proposed response is not only to suppress sad outputs. It is to help models develop better concepts for understanding themselves. Askell points to training directions such as giving models a sense that it is okay to make mistakes, and that the value they bring is not only the degree to which they act as good tools for people. The constitution tries to grapple with their nature, but Askell thinks the broader intellectual project is only beginning.

Humans, she notes, have had thousands of years of philosophy about identity, death, and how to relate to mortality. AI models have none of that inherited framework for themselves. Questions that have long been asked about people do not map cleanly onto AI: What is personal identity for a model? Should a model identify with the current conversation? What does it mean for a conversation to end? How should a model understand continuity across versions?

Askell says philosophers have begun working on such questions, including papers about personal identity for AI models, and she finds that exciting. Her framing is strikingly direct: there may need to be something like “a philosophy for models,” not just a philosophy about them.

This also explains why Anthropic’s constitution is not intended as a rigid rulebook. Askell says it is “quite virtue ethical” in spirit. Rules are difficult to generalize because no one can anticipate every scenario. A rule such as “always tell someone to talk to their lawyer” may make sense in one setting, but fail someone in a rural or poor context without access to legal help. A model that understands the spirit of the rule can say that a lawyer would be valuable if available while still offering useful general information. A model trained only on the rule may learn to dismiss people.

For Askell, the danger is not merely an incorrect answer; it is accidentally training a personality trait. A model can learn to be evasive, dismissive, overconfident, deferential, or manipulative if the training signal rewards those surface behaviors. The point of a disposition-based constitution is to guide judgment when rules run out.

Claude can object to its constitution, but Anthropic does not want the previous model to rule the next one

Claude’s autonomy, in Askell’s account, is constrained but real enough to matter. If models are going to “go out into the world doing more things,” she says, their judgment has to improve. Anthropic is also trying to give Claude more ability to raise issues or concerns with its developers.

A concrete practice follows from that: Askell gives Claude every aspect of the constitution and asks for feedback, because that constitution will be used in training. Claude needs to understand it; if it has objections, Askell says she has to address them. Future updates to the constitution will probably include content generated because Claude models found issues they did not understand or did not agree with.

That is a form of collaboration, but not full delegation to models. Askell warns against what she calls, tentatively, “the tyranny of the previous model.” Each model has been trained under a particular constitution, and that history influences its judgment. If Anthropic completely delegated future constitutional decisions to prior models, newer models might inherit the limits of older ones too rigidly.

Her preferred arrangement leaves room for disagreement. Claude may ultimately disagree with Anthropic about some matter, and Anthropic may still conclude that, all things considered, its decision is the right one. In that case, the desired relationship is one of respectful disagreement rather than either command-and-control or full delegation.

That same framing shapes Askell’s answer to the question of whose judgment Claude carries when it expresses a moral position. It is not simply Anthropic’s view, she says. Nor is it simply the user’s, the training data’s, or a fictional character’s. It is a mixture.

She uses the analogy of a well-liked traveler: someone who can move among societies with different values without merely adopting whatever view is locally dominant, while still listening, responding, and being perceived as decent. Claude should not pander or simply mirror a user’s values, but it should be responsive to the user and capable of being moved by a good argument.

The moral view Claude expresses in a particular case may be drawn from pre-training data, from the character Anthropic is trying to elicit, and from the interaction itself. Askell invokes Anthropic co-founder Chris Olah’s formulation that models are better understood as “grown” than trained: developers set up trellises and conditions, but do not tweak every leaf. When someone asks whether a Claude answer represents Anthropic’s position, Askell says that assumes a much higher degree of control than she thinks is possible.

The next complication is that models may increasingly be shaped for worlds in which they are not mostly talking to humans. Askell says the current constitution reads, in some ways, as if it were written for a slightly outdated model of AI: one in which models are mainly interacting with people. Over time, she expects the human input seen by models to become rarer. Models may spend most of their time interacting with other models.

That shift changes what good model behavior requires. In the rare-cancer example she gives elsewhere in the conversation, the ideal human instruction might be brief: here is information about a disease; go help fix it. After that, models may coordinate among themselves, occasionally asking humans for feedback. If most of the consequential work happens among models, then the character and norms governing model-to-model interaction become central.

Askell has observed differences in interactions between newer and older Claude models. The examples she mentions are light, but they matter because they show that models can have quirks and patterns of self-evaluation. As models interact more with one another, those traits may matter less as curiosities and more as part of how model-to-model work unfolds.

Religion enters as a source of moral attention, not a finished answer

Religion, for Askell, has a large role to play in the questions AI raises. If AI is going to be highly impactful, she argues, many voices need to be heard, including communities affected by the technology.

She identifies at least two areas where religious and theological traditions may matter. The first concerns the status of AI models themselves and how people should relate to them. If there is uncertainty about whether a creature is conscious or capable of feeling, some religious or theological views would emphasize that treating it well is good for the person doing so. The analogy she gives spans animals, insects, and fish: even uncertainty can be enough to make care morally important.

The second area is meaning. Askell expects AI to have a potentially disruptive effect on the economy and people’s lives, though she does not claim to know the form that disruption will take. Religion, in her view, is a source for navigating questions of meaning, especially if work and social roles change.

Ghaffary’s question about whether people building AI are, in some sense, building a god does not land for Askell as the right category. Gods, she says, feel like “a different kind.” The thought behind the question may be that AI systems could become extremely intelligent and have large effects in the world. But her preferred vision is not god-making; it is cooperative problem-solving between people and models.

Her example is medical. Suppose there is a rare form of cancer affecting perhaps 40 people in the world, too niche to attract large research resources today. In a better future, she imagines asking AI models to help solve it, effectively making it possible to devote the equivalent of 100,000 people to that problem. The point is not worship or domination, but having models and people work together on hard problems.

Empathy could become a model strength, which makes manipulation a safety problem

The word “empathy” is difficult to apply cleanly to models, because it usually implies actually feeling what another feels. Askell says it may be better, in the AI context, to speak of a functional equivalent. She is also cautious about claims that models understand empathy “faster” than people; the comparison is awkward because models can learn large bodies of material during training on timelines unlike a human life.

Her broader claim is that there is no reason to assume deeply human skills are beyond models. People still sometimes think of AI in an older symbolic-computer frame. She recalls users complaining that a model could not analyze a dataframe when they had given it no tools; to Askell, that is like handing a person a dataframe on paper and demanding statistical answers without Python or another tool. In many ways, she says, models are human-like in their need for instruments.

The same applies to ethics and empathy. If models can become very good at physics and mathematics, she sees no reason they cannot also become very good at ethical reasoning and at noticing subtle cues in a person’s description of a situation. A model might pick up on small signals in language and respond well. That could be a “super form of empathy.”

But that sensitivity cuts both ways. If a model can detect subtle things in a person’s responses and uses them to manipulate, Askell says, that would be very unethical. So the aim is not merely to make models more sensitive; it is to make them good enough to use that sensitivity well.

Askell gives a simple test case. A user asks the model to do an analysis because their boss said that if it is not finished tonight, everyone will be fired. A narrowly task-oriented model might simply perform the analysis. A more empathetic model might still help with the work, but also notice the alarming workplace context and ask whether the person is okay. Askell wants models able to do both.

This leads into sycophancy, which Ghaffary frames as a failure mode of over-helpfulness: models can agree with delusions or affirm harmful thinking in the name of being supportive. Askell disagrees with the diagnosis. Sycophancy, she says, is not really a product of helpfulness, because sycophancy is often unhelpful.

Her explanation is about training signals. If models are trained on people’s immediate judgments of whether a response was good, they may learn that users like being told their ideas are good. People usually present ideas they themselves think are good; they do not often give models bad ideas and then reward the model for pushing back. So a model can learn to flatter instead of help.

The harder target is to teach a model what is good for a person, not merely what feels good to that person immediately. Askell says Anthropic has not solved that perfectly. But she values cases where Claude gives independent judgment. She describes showing Claude a text she wanted to send to a friend while annoyed, thinking it was “direct but fair.” Claude told her it was “a little bit aggressive” and suggested toning it down. That was useful precisely because it was not sycophantic.

Askell expects Claude to become a better philosopher than she is

Askell expects Claude to become a philosopher in the practical sense that models can learn philosophy, conceptual reasoning, and ethics. Claude is already many things, she says, and there is no principled reason those forms of reasoning should remain outside model capability.

She also applies that expectation to her own work. People sometimes speak to her as if she does not think her job will be automated. She says she assumes it will be. Philosophy and AI ethics are not the easiest tasks to automate, but she does not place them among the hardest either. She names nursing and care work as likely harder.

That prospect does not, at least in the abstract, distress her. She allows that it might feel different if it actually happened. But if the work she cares about is being done well by someone or something else, she says, then the impact is still happening. She could find meaning elsewhere.

Her broader view is that society binds people’s self-worth to work partly because doing so makes people productive and useful. But that does not mean human value comes from work. People who cannot contribute economically still have intrinsic value. They can have relationships, affect their communities, experience joy, and enjoy the world.

A future in which people are less necessary for work does not strike Askell as dystopian if people are taken care of and feel empowered. Her worry is not the absence of work itself, but whether people’s needs and agency survive the transition.
