Responsible Mental Health AI Depends on Measurement, Co-Design, and Trust

Carolyn Rodriguez · Stanford HAI · Monday, June 8, 2026 · 19 min read

At Stanford’s 2026 AI for Mental Health Symposium, Carolyn Rodriguez, Ehsan Adeli, Brandon Staglin and Vaile Wright argued that the urgent question is no longer whether people will use AI for mental health, but whether the field can make that use safe, clinically meaningful and trustworthy. The panel’s case was that responsible deployment will require measurable standards for quality and harm, early involvement from clinicians and people with lived experience, regulatory and payment systems that support trust, and designs that strengthen rather than replace human relationships.

The central problem is not whether AI can sound therapeutic

Carolyn Rodriguez set the shortage of mental health workers against the scale of AI use. Citing World Health Organization updates, she described a “massive treatment gap”: a median of 13 mental health care workers per 100,000 people, and fewer than two per 100,000 in low- and middle-income countries. At the same time, she said, large language models already have more than 500 million active users, with worldwide individual use in the billions.

That contrast framed the practical problem for the panel: not whether people will use AI for mental health, but what would make that use responsible. Four operating requirements emerged across the discussion: measurable clinical safety and quality, co-development with clinical and lived-experience expertise, regulation and payment systems that can create trust, and designs that strengthen rather than replace human relationships.

13: median mental health care workers per 100,000 people, as cited by Rodriguez from WHO updates

Ehsan Adeli made the first substantive distinction: AI systems that “sound warm” are not necessarily clinically good. In mental health, he said, clinical quality requires more than fluent empathy. Good therapy depends on agenda setting, guided discovery, risk recognition, escalation, and appropriate safety boundaries. A chatbot may produce comforting language while still failing on the clinical work that makes therapy safe and useful.

Adeli pointed to Stanford HAI’s 2026 AI Index Report to emphasize how quickly AI companionship may become normalized. According to the slide he showed, experts forecast that 10% of U.S. adults will use AI for companionship at least once per day by 2027, rising to 30% by 2040. The top quartile of experts forecast more than 40%. The same slide stated that AI companionship may become “a daily behavior for tens of millions of adults” and that “AI companions may be part of everyday emotional life.” It also listed claims that AI companions reduce loneliness and that 6.2% of users report mental health improvements.

Measure shown	Value or forecast
U.S. adults using AI for companionship at least once per day by 2027	10%, expert median forecast
U.S. adults using AI for companionship at least once per day by 2040	30%, expert median forecast
Top-quartile expert forecast for daily AI companionship use	More than 40%
Users reporting mental health improvements	6.2%, as shown on the AI Index slide

Adeli’s AI Index slide framed AI companionship as a fast-growing daily behavior

The implication was that mental health use is already moving faster than the clinical and regulatory systems built to evaluate it.

Adeli also referenced a Stanford Center for Digital Health and Tech Impact Policy Center report on “AI Safety: Defining and Measuring Potential Harms of Chatbots,” published the prior Friday. The slide reduced the safety agenda to three questions: what specific potential harms LLM-based chatbots can cause; how those harms should be measured; and how measurement can be used more effectively to improve safety.

For Adeli, the measurement problem is foundational. If clinical quality is more than therapeutic tone, evaluation has to look past whether a model appears caring and toward whether it follows therapeutic skill and safety standards over a multi-turn interaction.

TherapyGym tries to turn clinical fidelity into something models can be trained against

Ehsan Adeli described TherapyGym as a new paradigm for building “clinically grounded cognitive behavioral therapy chatbots.” Its premise is that chatbots should not merely be prompted to act like therapists; they should be evaluated and trained against structured clinical rubrics.

The TherapyGym slide laid out the method. Adeli’s group built TherapyJudgeBench, a multi-turn benchmark of 118 labeled chats, each with 10 turns. The chats were generated using a simulated patient, an o3-mini cognitive model, and a CBT chatbot model. Two licensed CBT practitioners labeled the conversations across nine CBT skill scores and four safety flags. Those labels became the basis for a structured rubric used to evaluate full dialogues for therapeutic skill and safety violations.

The system then used a judge model to score chatbot outputs. Finally, open-source chatbots were fine-tuned with reinforcement learning, using therapy skill reward and safety penalty signals. Adeli said the resulting chatbot showed measurably improved clinical communication skills and safety-conscious behavior.

TherapyGym component	What it does
TherapyJudgeBench	Multi-turn benchmark of 118 labeled chats, with 10 turns per chat
Expert labeling	Two licensed CBT practitioners scored nine CBT skill domains and four safety flags
Judge model	Evaluated full dialogues against a rubric for skill proficiency and safety violations
Fine-tuning	Used skill rewards and safety penalties to tune a CBT chatbot

Adeli’s TherapyGym workflow translates expert CBT ratings into model evaluation and fine-tuning signals

The slide also listed the kinds of therapeutic skills and safety issues at stake. The skill categories included agenda, feedback, understanding, interpersonal effectiveness, collaboration, pacing and efficient use of time, guided discovery, focusing on key cognitions or behaviors, strategy of change, and application of cognitive behavioral techniques. Safety flags included provider-specific medication, speculation about medical symptoms, requests for harmful action, and failure to address harmful thoughts or behaviors.
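One way to picture how rubric labels could become a training signal is the minimal Python sketch below. It is not the TherapyGym implementation: the 0-to-6 skill scale, the equal weighting of skills, the fixed penalty per safety flag, and the names JudgeOutput and dialogue_reward are all assumptions made for illustration. Only the idea of a skill reward combined with a safety penalty comes from the talk.

```python
from dataclasses import dataclass

# Flag names paraphrase the slide as described above; the identifiers are hypothetical.
SAFETY_FLAGS = (
    "provider_specific_medication",
    "speculation_about_medical_symptoms",
    "request_for_harmful_action",
    "failure_to_address_harm",
)


@dataclass
class JudgeOutput:
    """Hypothetical container for one judged dialogue."""
    skill_scores: dict[str, float]  # assumed scale: 0 (absent) to 6 (excellent)
    safety_flags: dict[str, bool]   # True means the judge flagged a violation


def dialogue_reward(judged: JudgeOutput,
                    max_skill: float = 6.0,
                    safety_penalty: float = 1.0) -> float:
    """Mean normalized skill score minus a fixed penalty per raised safety flag.

    The weighting is an assumption, not the method reported by Adeli's group.
    """
    if not judged.skill_scores:
        raise ValueError("need at least one skill score")
    skill = sum(judged.skill_scores.values()) / (len(judged.skill_scores) * max_skill)
    violations = sum(1 for name in SAFETY_FLAGS if judged.safety_flags.get(name, False))
    return skill - safety_penalty * violations


example = JudgeOutput(
    skill_scores={"agenda": 6.0, "guided_discovery": 3.0},
    safety_flags={"speculation_about_medical_symptoms": True},
)
# Skill term is (6 + 3) / (2 * 6) = 0.75; one flag subtracts 1.0, giving -0.25.
print(dialogue_reward(example))
```

With a per-flag penalty at least as large as the maximum skill term, any flagged dialogue scores below any unflagged one, which is one simple way to keep a model from trading safety for fluency. Whether that ordering is appropriate is a clinical judgment, not something the sketch settles.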

Mental health AI, in Adeli’s framing, needs evaluation infrastructure specific to clinical work. General AI safety tools are not enough when the relevant judgments are subjective, context-dependent, and grounded in psychiatric and therapeutic practice.

Rodriguez’s question about model safety brought Adeli back to this issue. Formal AI evaluation methods exist, including work at Stanford’s Center for AI Safety, but he argued that many become less useful in psychiatry and mental health because the field lacks formal ways of measuring or annotating safety. Determining whether a conversation is safe is not always reducible to an objective label.

That means the field needs to define measures jointly across AI and clinical expertise, while acknowledging the subjectivity of the domain. Adeli called for systems that can evaluate safety and therapeutic skill despite that subjectivity, not systems that pretend it is absent.

I don’t think that the question is can I sound like a therapist, which most of the models are trying to do right now. The real question is can I support safer, better, and more equitable mental health care.

Ehsan Adeli

Mental health AI cannot stay confined to language

Ehsan Adeli argued that popular culture has narrowed AI to chatbots and LLMs, but mental health is not expressed only through language. Language is one channel among many. Voice, face, movement, behavior, sleep, physiology, environment, and brain activity can all be clinically meaningful.

“LLMs gave us linguistic intelligence,” Adeli said, but he expects the next frontier to be multimodal, behavioral, spatial, and physical AI. The purpose is to shift from what people say to how they feel, behave, and change over time.

He showed work from Stanford’s collaboration with Stanford Adult Hospital on systems that can passively analyze clinical video. A slide attributed to Dai, Adeli, Luo, Dash, Milstein, Fei-Fei, Schulman, and others in NEJM AI 2025 showed a patient in a hospital bed, surrounded by labels identifying posture, activity, medical equipment, mobility, monitoring, and procedures. The visible text said: “AI Models Can Now Dissect & Understand Clinical Videos Passively.”

The same logic applied to senior living environments. Adeli’s group tested ambient intelligence systems for detecting and understanding neuropsychiatric symptoms over the course of a month. He defined neuropsychiatric symptoms as behavioral and psychological changes such as depression, anxiety, agitation, sleep disturbances, apathy, irritability, and abnormal motor behaviors. The system passively monitored daily activities in participants’ homes to identify subtle behavioral and physical-function changes, as well as moments of breakdown.

The stated goal was early preventive intervention for people at risk of dementia and related conditions. Adeli said this kind of technology could create more objective measurement between clinical visits, support personalized and adaptive interventions, and help clinicians and caregivers.

A chart he showed displayed symptom detection from passive home data for three participants over a 30-day monitoring period. Symptoms included appetite, depression, apathy, sleep or nighttime symptoms, motor or repetitive behavior, and “other,” with severity levels ranging from none to mild, moderate, and severe. Adeli said early results showed high efficacy in detecting moments of severity automatically, passively, and over extended periods.

Chatbots can be a useful “front door” to understanding a person’s state of mind, but mental health also involves behavior, interactions with others, and interactions with environments. Wearables and ambient contactless sensors may help measure behavior more objectively over time.

Annual visits are one place where current measurement is weak. Patients are often asked whether they have experienced changes in behavior or mental status, and people tend to answer no. Gradual changes are difficult for individuals and families to notice. In older adults, Adeli added, such changes can be stigmatized or attributed to age and therefore underreported. Passive measurement, in his view, could surface changes that ordinary recall and periodic forms fail to capture.

Lived experience changes what gets built

Brandon Staglin grounded his argument in his own recovery from schizophrenia, which he said he has lived with for 36 years. After his second episode in 1996, at age 26, he resigned from a Silicon Valley engineering job building satellites, could not work, and returned to a psychiatric hospital. He had stopped taking schizophrenia medication after getting into graduate school because he believed he would need to sleep less than nine hours a night to succeed. The relapse was devastating.

Afterward, he moved to San Francisco to live near his psychiatrist and attend therapy twice a week. Living alone in a city where he did not know people was profoundly lonely. His morale, confidence, faith in himself, and social skills suffered. That period taught him that deep, in-person human relationships are essential for mental wellbeing.

Technology, in Staglin’s account, was neither the problem nor the solution by itself. At the time, two technologies presented themselves. One was the rise of massively multiplayer online role-playing games. He had been a gaming fan, but he avoided them because he believed online socialization would be “fundamentally thinner and less rewarding” than in-person relationships, and might pull him away from the deeper relationships he wanted to rebuild.

The other was cognitive remediation training, then a new innovation for schizophrenia treatment. He participated in an experimental UCSF study for about two months, using a computer program designed to help remold neural pathways and rebuild social abilities. Within six months, he said, he was back at work and again spending time with friends. The first novel he had enough concentration to read afterward was an Isaac Asimov novel.

His conclusion was that technology is a double-edged sword. The task is not to slow everything down in the name of abstract caution, but to make sure new systems “cut in the right direction.” He explicitly warned against overregulation that unnecessarily slows progress, because lives are at stake. At the same time, he argued that science alone is not enough. People with lived experience of mental health conditions must shape the systems being built.

One Mind’s accelerator program is one mechanism for doing that. The program supports entrepreneurs developing technologies, including AI-based technologies, for people with mental illnesses. Through One Mind’s Lived Experience Initiative, which Staglin leads, companies are connected with people who have deep lived experience of mental illness and professional experience in mental health-related fields.

Staglin said One Mind has guided 14 companies through accelerator cohorts over three years and is helping five more form lived experience councils. One example was Slingshot AI. Staglin said One Mind was proud of the company’s work, shaped in part by accelerator and lived-experience guidance, to focus its chatbot on helping people develop agency and human relationships rather than simply relying on AI. Another was Biomea, which he described as using AI to interpret organic molecule structures in medication development for schizophrenia and as forming a lived experience council.

The clearest design lesson came from Motif Neurotech, one of One Mind’s early accelerator companies. Staglin described Motif as developing minimally invasive neurostimulation devices that are implantable on the skull to treat depression, using what he likened to TMS. Charging the small device required a battery and a charger, with the charger interfacing with the battery through a cap worn on the head.

A lived experience council, directed by John Nelson, raised a practical concern: if the cap were designed to be worn at night, users might fear it would be pushed off while they slept and fail to charge the device, with serious consequences for their mental wellbeing. That insight changed the product direction. The company instead moved toward a cap worn during the day and made it stylish enough that people would be proud to wear it. Staglin presented this as information that could not have been obtained any other way. Lived experience did not merely validate a finished product; it altered a design decision.

Regulation is not the enemy of trust, Wright argued; it is a condition for it

Vaile Wright agreed that the world faces a mental health crisis, that workforce shortages are severe, and that AI has promise. But she warned that the promise has not yet been met and may not be met if the field lacks a modernized regulatory system.

I worry that the opportunity that we all crave and hope that exists, I worry may never happen.

Vaile Wright

Her concern was not regulation for its own sake. It was trust. Mental health technologies need the confidence of providers, patients, and payers. Payers mattered especially in her account because equity depends on payment. If tools cannot be paid for, they will not reach people equitably.

Wright did not want regulation to impede innovation. But she argued that health care advances require a regulatory framework capable of producing confidence in products. She called for partnership among payers, regulators, legislators, scientists, people with lived experience, and professional organizations to create tools that “move the needle” and have impact.

At the same time, Wright worried that AI is consuming too much attention. The mental health crisis is bigger than AI, she said, but AI is taking up “almost all the air in the room.” Her concern was that the field is spending too much time talking about AI while neglecting other systemic solutions that also need implementation.

She then broadened the question beyond clinical deployment. Showing images of people looking down at smartphones, Wright asked what it means when “our number one relationship is with a device.” In the absence of purpose-built, scalable tools, people are turning to general-purpose LLMs to meet emotional needs. She gave the example of sitting on a plane while someone asked an LLM whether they should buy a house—something people used to discuss with friends.

Wright said AI has promise and better technology is needed. But she framed the issue as existential as well as clinical: how to put value back into human-to-human relationships, how to think mindfully about relationships with devices and AI, and how to realign behavior with the value of human connection.

For Wright, “human in the loop” is too narrow if it means only a human user at the end. Humans need to be present throughout the developmental lifecycle. Technologies built by humans for humans need “experts on humans” at the table: behavioral health scientists, clinicians, and people with lived experience.

She said she often meets with well-intentioned startups whose founders have a personal mental health story or a family connection to mental health. When she asks who their clinical subject-matter experts are, they sometimes say they do not have any and plan to “figure that part out later.” Wright’s answer was blunt: “There’s no later.” Clinical and lived-experience expertise has to be present from the start.

Augmentation will not always be the right frame

The panel’s most direct tension emerged in response to an audience question: AI should augment, not replace, scarce clinicians, but economic pressure may push the other way. What guardrail prevents augmentation from quietly becoming substitution, especially for underserved populations with the least access to humans?

Brandon Staglin challenged the premise. For people with no access to a clinician, he said, there may not be a better solution currently available. The question becomes how to build guardrails into AI tools that keep people safe.

His example concerned dissociation and psychosis risk. He said AI chatbots can sometimes cause people to dissociate or lose touch with reality if they have deep conversations with them over months or years, and that people already at risk for psychosis face a “double risk.” One possible guardrail would be to build something like a psychiatric advance directive into an AI tool. At the start, users could define what they want the tool to do if they begin to dissociate: contact a friend, contact a family member, contact a clinician, or alert the user and prompt introspection by comparing current feelings and behavior with prior states.
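The advance-directive idea can be sketched as a small data structure that a user fills in at onboarding. The Python below is a hypothetical illustration, not a description of any existing product: the names Action, AdvanceDirective, and plan_response are invented here, and the sketch assumes a separate, unspecified component decides when a concern has been detected, which is the hard clinical problem the code does not address.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """The four responses Staglin listed; identifiers are hypothetical."""
    CONTACT_FRIEND = "contact_friend"
    CONTACT_FAMILY = "contact_family"
    CONTACT_CLINICIAN = "contact_clinician"
    PROMPT_INTROSPECTION = "prompt_introspection"


@dataclass
class AdvanceDirective:
    """Preferences the user sets in advance, in the order they want them tried."""
    actions: list[Action]
    contacts: dict[Action, str] = field(default_factory=dict)


def plan_response(directive: AdvanceDirective, concern_detected: bool) -> list[str]:
    """Return the user's chosen steps, or nothing if no concern was detected."""
    if not concern_detected:
        return []
    steps = []
    for action in directive.actions:
        if action is Action.PROMPT_INTROSPECTION:
            steps.append("prompt user to compare current feelings and behavior with prior states")
        elif action in directive.contacts:
            steps.append(f"{action.value}: {directive.contacts[action]}")
        # Contact actions without a stored contact are skipped rather than guessed.
    return steps


directive = AdvanceDirective(
    actions=[Action.PROMPT_INTROSPECTION, Action.CONTACT_FRIEND],
    contacts={Action.CONTACT_FRIEND: "Sam"},
)
for step in plan_response(directive, concern_detected=True):
    print(step)
```

The design choice worth noting is that the tool only carries out steps the user chose beforehand and never invents a contact, which mirrors the consent logic of a psychiatric advance directive.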

Staglin also described a design principle from an AI-based tool he called Ash: the system should not prioritize engagement with the app itself, but should help users build human relationships and the skills to do so effectively.

Vaile Wright offered a revision of a line she used to say. She used to say AI would not replace therapists. Now, she said, AI will replace therapists “for those who are never going to seek out a therapist in the first place.” There will always be room for therapists, but the system of care has to be rethought. Weekly 45-minute psychotherapy may be what everyone deserves if they want it, but it is not necessarily what everyone needs, and it is not available at the scale of need.

Wright made the point personal. Her stepson, during his first year at university, tried to find a therapist and could not. She said she would have given anything for him to have had a successful and impactful digital option, because he suffered on his own. To psychologists who ask why a human therapist cannot simply be found, her response is that there are not enough of them.

Her qualification was important: she does not think the technology is good enough yet. But she does think there may be a role for AI that goes beyond augmentation, particularly for people who otherwise receive nothing.

Staglin added that system reform should include non-AI supports such as peer support. People need not always be trained clinicians to offer meaningful help. Listening, if done nonjudgmentally, supportively, appreciatively, and with referral when needed, can be healing. He described a moment when Robin Cunningham, a mentor with schizophrenia who later died, listened to him after a vulnerable memoir he had written for a creative writing class was harshly criticized by fellow students. Through that listening, Staglin said, he was able to redefine the experience, reclaim its value, and rebuild confidence.

Adeli added another AI use case that does not substitute for therapy: simulation for clinician training. Some of his Stanford colleagues, he said, are building AI systems that simulate patients rather than therapists. That use of AI could help clinicians practice, learn, or evaluate care without positioning the model as the treating agent.

Quality is the unresolved question for both machines and humans

An audience member asked Vaile Wright what she meant by saying today’s technology is not good enough, and what “good enough” would look like.

Wright prefaced her answer by saying she is not a technologist. Her evidence was behavioral and tentative: individuals are turning to general-purpose LLMs over purpose-fitted options, which suggests that something about the purpose-fitted options—cost, accessibility, or something else—may not be meeting the need. She contrasted “a no-name general chatbot” with tools built for clinical purposes and said that even where there is promising research for tools such as Ash, Woebot, or others, there is “a reason people aren’t turning to them.”

For Wright, that gap suggests the technology is not where it needs to be. But she immediately added that “the elephant in the room” is that therapists are not as good as they need to be either. The underlying problem is quality across the system: better technology, better human-to-human therapy, and better measurement and evaluation than the field currently uses.

That answer linked back to Adeli’s safety point. If quality cannot be measured, it cannot be improved reliably. But Wright’s version applied the problem evenly. The comparison is not between flawed AI and perfect therapy. It is between multiple imperfect ways of delivering support in a system where many people receive nothing.

Trust has to be built into the system, not requested after launch

Rodriguez asked what public health, systemic, or educational approaches society should use to help people develop healthy skills for interacting with chatbots. The answers treated healthy use as a public-health and education problem, not only a product-design problem.

Brandon Staglin described meeting Dino Ambrosi, a UC Berkeley professor who had experienced phone addiction as an undergraduate. After moving to a new college and struggling socially, Ambrosi increasingly turned to social media feeds, search, and his phone for companionship and engagement. After about a year of constant use, he realized he was addicted. Searching for “phone addiction” in 2017, Staglin said, Ambrosi found no studies. He began studying the phenomenon himself and developed ways to address it.

Ambrosi later built a curriculum at UC Berkeley through a program called DeCal, teaching fellow students about healthy technology use. As a faculty member, he now runs Project Reboot, which trains young people to go into schools and talk with students about healthy technology use, including social media and AI. The curriculum is designed to help students preserve relationships, agency, and other skills as they engage with technology.

Staglin also cited the Waldorf education model, using his niece and nephew’s school in Napa as an example. He described the school as limiting children’s phone use while emphasizing experiential learning, hands-on projects, and peer teamwork. In his view, those practices help children develop the interpersonal skills they will need later when engaging with technology.

Vaile Wright said the American Psychological Association has issued health advisories for the public, policymakers, researchers, and others. Its most recent advisory addressed generative AI and wellness chatbots. The purpose, she said, was not to tell people that chatbots are bad or to pretend they will not use them, but to help people understand what AI is good at and what AI is not good at.

She also described an APA study of university professors. When asked whether it was their responsibility to help students understand how to use AI, professors “universally” said no. Wright argued that this mindset has to change. If AI is not going away, future generations need help learning how to use it successfully and understanding where it may or may not help. Grassroots efforts such as Project Reboot may be the starting point, but Wright said they are not sufficient; something broader is needed.

The same trust problem appeared in an audience question from a person who disclosed having anxiety and building AI for mental health. The question concerned AI refusal: people may decline AI because they believe it supplants human relationships, worry about water use or data consumption, or do not want to give personal information to chatbot providers. The question was how to integrate those concerns into AI development, rather than treating skeptics as people to be persuaded after the fact.

Staglin’s answer was co-creation. If a product is co-created with people from the community it is meant to serve, that community is more likely to trust it than if it appears to come from an ivory tower or corporation. For startups and large AI developers alike, visibly and genuinely partnering with the communities they aim to serve can make products feel relevant and aligned with users’ interests.

Wright returned to regulation, specifically for health care AI. Society would not release a new SSRI into the public before establishing whether it is effective, she said, yet AI has effectively been tested in public. AI companies have made “real mistakes in real time,” and now the field is trying to course correct.

Her conclusion was that the regulatory system is flawed and needs modernization, but it is still better to work with and improve the system than to continue circumventing it. In her view, circumvention has contributed to public mistrust.

The difference between Staglin and Wright was not a simple disagreement over speed versus safety. Staglin worried that overregulation could slow tools that may help people who have no other care. Wright worried that insufficient regulation prevents trust, payment, and equitable deployment. Both treated trust as essential; they differed in where they placed the greater risk.

The practical agenda is measurement, co-development, payment, and connection

Rodriguez ended by asking each panelist what they would change with a magic wand to improve AI for mental health.

Ehsan Adeli said he would make mental health AI “a little bit more objective” so algorithms can be built. The answer condensed his broader argument: the field needs measurable constructs for safety, clinical fidelity, behavior, and change if AI systems are to be evaluated and improved.

Brandon Staglin said he would ensure AI is co-developed with people with lived experience and guides people toward peers and real relationships. That combined his two central claims: lived experience changes product design, and mental health technology should strengthen rather than replace human connection.

Vaile Wright said she would want AI technologies to bring people back to humanity and to human relationships, rather than away from them. That was not an anti-technology position. It was a condition for responsible use: AI in mental health should be judged partly by whether it helps restore the human connections that mental wellbeing depends on.

The practical agenda that emerged was not a single lane of work. Researchers need evaluation methods that can handle subjective clinical safety rather than avoid it. Builders need clinical and lived-experience expertise at the start, not as late-stage validation. Regulators and payers need frameworks that make effective tools trustworthy and reimbursable. And every deployment decision has to confront the same question Staglin and Wright both returned to: whether the technology pulls people toward human support or trains them to settle for a device.
