AI Defaults Can Become Clinical Decisions in Digital Health

Peter Washington · Stanford Online · Wednesday, May 20, 2026 · 16 min read

UCSF clinical informatics professor Peter Washington argues in a Stanford HCI seminar that AI-enabled digital health systems fail or succeed on decisions that often look like engineering defaults: metrics, thresholds, prompts, labels and workflow placement. Using examples from wearables, substance-use interventions, sepsis alerts, Apple Watch hypertension detection and Parkinson’s assessment, he makes the case that human-centered design is not a layer added after modeling, but part of how the model is trained, evaluated and made usable.

The hard part is not only predicting the event; it is making the prediction usable

Consumer health data streams from wearables and mobile devices can feed machine-learning models that predict intervenable health events. Peter Washington distinguishes that use case from the more familiar healthcare-AI pattern of diagnosis or screening. His focus is repeated health events within the same person, where prediction can trigger a timely digital intervention.

The example he returns to is substance use. A behavior scientist may already have an evidence-based intervention for young adults who use e-cigarettes for nicotine, cannabis, or both. The missing technical question is not whether the intervention can work in principle, but whether AI can deliver it at a moment when the person is able and willing to use it. As Washington puts it, an efficacious intervention delivered when “no one wants to use it or can use it” has little practical efficacy because adherence collapses.

That changes what the model is for. Wearable biosignals such as heart rate, heart-rate variability, electrodermal activity, skin temperature, SpO2, and accelerometry are treated as continuous inputs. The labels are events such as vaping cravings, meth use, stress, or blood-pressure spikes. In one hypertension-related project, Washington says his group is predicting not the exact blood-pressure number but whether a spike occurs. From a clinical-action perspective, he argues, a reading of 155/102 and 148/92 may lead to broadly similar action; the exact value is less important than detecting an event that can be acted on.

The core modeling premise is personalization. Instead of training one general model and applying it to every patient, Washington describes training a separate model per person. The training data are earlier data from that person; the test set, or real-world deployment data, comes later from the same person. This is unusual relative to many healthcare datasets, and it immediately creates a label problem: each person has far fewer labeled events than a general pooled dataset would provide.
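The per-person split described above can be sketched as follows. This is a minimal illustration, not the lab’s pipeline: the record layout `(person_id, timestamp, features, label)`, the function name `split_per_person`, and the split fraction are all assumptions made for the example.

```python
from collections import defaultdict


def split_per_person(records, train_fraction=0.7):
    """Split each person's records chronologically into train and test parts.

    `records` is a list of (person_id, timestamp, features, label) tuples.
    The tuple layout and the default fraction are illustrative assumptions.
    """
    by_person = defaultdict(list)
    for record in records:
        by_person[record[0]].append(record)

    splits = {}
    for person_id, rows in by_person.items():
        rows.sort(key=lambda r: r[1])  # oldest first
        cut = int(len(rows) * train_fraction)
        # Earlier data trains the personal model; later data evaluates it.
        splits[person_id] = {"train": rows[:cut], "test": rows[cut:]}
    return splits


# Hypothetical toy records for two people.
records = [
    ("p1", 3, [0.2], 0), ("p1", 1, [0.1], 0), ("p1", 2, [0.9], 1), ("p1", 4, [0.8], 1),
    ("p2", 10, [0.5], 1), ("p2", 5, [0.4], 0),
]
splits = split_per_person(records, train_fraction=0.5)
for person_id, parts in splits.items():
    print(person_id, len(parts["train"]), "train,", len(parts["test"]), "test")
```

Splitting by time within each person, rather than shuffling, keeps the evaluation closer to deployment, where the model only ever sees that person’s future data.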

His proposed answer is personalized self-supervised learning. He uses language models as the analogy: GPT-style foundation models learn by predicting the next word from prior words, making labels effectively free because the next word is already present in the text. His lab applies the same idea to biosignals. Rather than predicting the next word, the model predicts a missing portion of a wearable signal from surrounding portions or from other modalities.

A person wearing a device can generate a large unlabeled dataset with no explicit burden beyond wearing it. Those unlabeled signals can train a personalized foundation model of that person’s physiology: how their heart rate changes when they move, how skin temperature relates to accelerometry and heart-rate variability, and how those relationships differ from someone else’s. That model can then be fine-tuned with relatively few labels to predict harder outcomes such as stress, blood-pressure spikes, or craving.
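To make the “labels are free” idea concrete, here is a small sketch of how masked-prediction training pairs could be cut from an unlabeled signal. It shows only the data construction step, under assumed window and mask sizes; the function name `make_masked_examples` and the toy heart-rate values are hypothetical, and nothing here reflects the lab’s actual architecture.

```python
def make_masked_examples(signal, window=8, mask_len=2):
    """Cut a 1-D signal into windows and hide a middle span of each window.

    Returns (context, target) pairs. The context is the window with the
    hidden span replaced by None; the target is the hidden values. The
    targets come from the signal itself, so no human annotation is needed.
    """
    examples = []
    start_mask = (window - mask_len) // 2
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        target = chunk[start_mask:start_mask + mask_len]
        context = (
            chunk[:start_mask]
            + [None] * mask_len
            + chunk[start_mask + mask_len:]
        )
        examples.append((context, target))
    return examples


# Hypothetical heart-rate samples from one wearer.
heart_rate = [62, 63, 65, 70, 78, 85, 83, 80, 76, 72, 70, 68, 66, 65, 64, 63]
pairs = make_masked_examples(heart_rate)
for context, target in pairs:
    print(context, "->", target)
```

A model trained to fill in the hidden span would then be fine-tuned on the much smaller set of labeled events.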

He says this approach performed better than generalized models in public stress datasets and converged with relatively few labeled examples. But the talk’s real emphasis is not the success case. It is what happens when the human data-generation process breaks down.

In a dataset of nurses wearing sensors and annotating stress during the beginning of the COVID-19 pandemic, Washington says the model encountered low adherence, inconsistent labeling across nurses, inconsistent labeling within the same nurse across days or even within a day, and noisy data. Personalization can help with between-person differences. It cannot rescue a task if the same person’s labels are unstable in ways the model cannot interpret.

Just AI innovation by itself is not going to solve these fundamental issues with human behavior when we're trying to build models that serve people.

Peter Washington

That failure point is where he locates the HCI agenda. The model’s inputs are not just biosignals; they are also the results of human availability, attention, self-report, burden, context, and changing interpretation. In digital therapeutics, those factors are not peripheral implementation details. They shape the training data, the deployed predictions, and whether the intervention is used at all.

Reducing burden cannot mean ignoring context

Personalization needs calibration data, but asking users for too many annotations or assessments can drive them away. That is the first human-centered challenge Peter Washington names: reducing the burden of personalizing AI models while preserving both model performance and user engagement.

One technical strategy is active learning. In a binary classifier, the model does not simply output 0 or 1; it outputs a probability. If the model predicts 98% stress or 5% stress, it is highly confident. If it predicts 56%, the result is near the decision boundary and therefore a useful moment to ask the person for an EMA label. The algorithm would query the user only when the answer is likely to improve calibration.
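A minimal sketch of that querying rule, assuming a 0.5 decision boundary and a hypothetical uncertainty margin of 0.1 (the talk does not specify either value):

```python
def should_request_label(probability, threshold=0.5, margin=0.1):
    """Return True when the prediction sits within `margin` of the threshold.

    Both `threshold` and `margin` are illustrative assumptions.
    """
    return abs(probability - threshold) <= margin


for p in (0.98, 0.56, 0.05):
    print(p, should_request_label(p))
```

With these settings only the 56% prediction triggers a request; the 98% and 5% predictions are treated as confident and the user is left alone.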

That is not enough. The model’s uncertainty is only one layer. The person may be driving, picking up a child from school, working, with family, or otherwise unavailable. A good system has to combine model confidence with patient preferences, schedules, and passively sensed context such as location or inferred driving. The design problem is to obtain information when it is useful without interrupting when the person cannot respond.

Washington presented this as a layered decision process: first AI confidence, then patient rules about being busy or available, with the resulting system sometimes suppressing an EMA even when model confidence is low. The right time for the model to ask is not necessarily the right time for the person to answer.
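The layering can be sketched as a simple gate in which patient rules and sensed context can veto a prompt the model would otherwise send. The `Context` fields, the quiet-window rule, and the margin are assumptions for illustration; a real system would need richer rules, including windows that cross midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Context:
    """Hypothetical snapshot of what the system knows about the person."""
    now: time
    is_driving: bool
    quiet_start: time  # start of a user-defined do-not-prompt window
    quiet_end: time    # end of that window (assumed not to cross midnight)


def should_send_ema(probability, ctx, margin=0.1):
    # Layer 1: only consider asking when the model is uncertain.
    if abs(probability - 0.5) > margin:
        return False
    # Layer 2: patient rules and sensed context can suppress the prompt.
    if ctx.is_driving:
        return False
    if ctx.quiet_start <= ctx.now < ctx.quiet_end:
        return False
    return True


midday = Context(time(12, 0), False, time(17, 0), time(19, 0))
evening = Context(time(17, 30), False, time(17, 0), time(19, 0))
driving = Context(time(12, 0), True, time(17, 0), time(19, 0))

print(should_send_ema(0.56, midday))   # uncertain and available
print(should_send_ema(0.56, evening))  # uncertain but inside the quiet window
print(should_send_ema(0.56, driving))  # uncertain but driving
```

The ordering is a design choice: checking availability after uncertainty means the system never interrupts for a label it does not need, and never interrupts at a bad moment for one it does.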

He describes a causal picture, developed with Santosh Kumar at the MD2K Center, in which the timing and content of an intervention affect user burden; burden affects receptivity; receptivity affects adherence and engagement. Washington distinguishes those last two terms. Engagement means opening or using the app. Adherence means actually following the behavior-change recommendation.

This distinction matters because many digital systems can log engagement but struggle to measure adherence. In the question period, Washington says engagement is “kind of easy” to measure through app logs: did the person open the app, did they do things in it? Adherence is harder. In current work, one simple approach is asking participants after the fact whether they did what the intervention requested and how burdensome it was. He also says his group has been exploring automatic adherence measurement, without giving details because the work is not yet public.

The burden problem also has a cultural and situational dimension that cannot be inferred from uncertainty scores alone. Asked whether community engagement had produced insights that clinicians would not have surfaced, Washington cites a substance-use study in Hawaii using Fitbits and real-time prompts. The study emphasized Native Hawaiian, Pacific Islander, and Filipino populations. Some participants objected to being asked to disclose substance use around 5:30 p.m., because that overlapped with pau hana, after-work family time. They did not want to discuss substance use when family members were nearby.

That example sharpens the HCI issue. A prompt can be statistically informative and still be socially wrong. A system that asks at the “best” machine-learning moment may generate worse data, lower trust, and lower adherence if it ignores the person’s lived context.

Default metrics can optimize the wrong experience

A model’s reported performance depends on which question the metric asks. Peter Washington spends substantial time on a modeling choice that looks technical but becomes experiential: which evaluation metric the developer optimizes and reports. His target is the common habit of using whatever scikit-learn makes easy — precision, recall, F1-score, and a classification report — and treating those numbers as the model’s performance.

He does not argue that those metrics are useless. He argues that they answer different questions, and some are highly sensitive to the population in which the model is evaluated.

Precision is true positives divided by true positives plus false positives. In clinical language, it is also positive predictive value: among the people the model marks as positive, what fraction actually have the condition? Washington notes that true positives are bounded by the number of actual positives, while false positives are driven by the number of actual negatives. Therefore precision depends on prevalence. The same model can have different precision in a low-prevalence screening population and a high-prevalence specialty clinic, even if the underlying model behavior has not changed.

Sensitivity, also called recall, is true positives divided by true positives plus false negatives: among all actual positives, what fraction does the model catch? Specificity is true negatives divided by true negatives plus false positives: among all actual negatives, what fraction does the model correctly mark negative? Washington says these are more robust to prevalence because their denominators are the actual positives and actual negatives respectively.
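The prevalence effect is easy to check with counts. The sketch below uses two hypothetical populations of 1,000 people in which the model catches 90% of positives and clears 90% of negatives; only the prevalence differs. The counts are invented for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Compute the three metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# 1% prevalence: 10 actual positives, 990 actual negatives.
screening = metrics(tp=9, fp=99, fn=1, tn=891)
# 50% prevalence: 500 actual positives, 500 actual negatives.
clinic = metrics(tp=450, fp=50, fn=50, tn=450)

for name, m in (("screening", screening), ("clinic", clinic)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Sensitivity and specificity stay at 0.9 in both populations, while precision falls from 0.9 in the high-prevalence clinic to about 0.08 in the screening population.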

The practical lesson is not “never use precision.” It is that the metric must match the clinical and human use case. Washington says that when he explains the prevalence issue to clinicians, they often still prioritize precision. The reason becomes clear in his examples: clinicians often experience low precision as alert fatigue.

A question from the audience suggests simply reporting the whole confusion matrix so others can compute any metric they need. Washington agrees more confusion matrices would help, but says people like metrics because they are interpretable and map to meaningful concepts. A confusion matrix gives a gestalt view; a metric such as sensitivity names a particular property. His objection is not to summarization. It is to unreflective summarization.

Don't just use the default metrics that scikit-learn gives you unless you know why.

Peter Washington

He extends that warning to AI-assisted coding. If a developer uses Codex or another model to write machine-learning code, the generated defaults may silently encode choices about evaluation, thresholds, and reporting. In health systems, those choices are not neutral. They can determine whether patients are missed, whether clinicians are flooded, and whether the model earns or loses trust.

A decision threshold is a human-centered hyperparameter

The same trained model can behave like a different product when its decision threshold changes. Peter Washington uses a toy cancer-classification example to show how much changes when the learned model stays the same but the threshold moves. A model outputs a probability from 0 to 1. The common default is 0.5: a score of 0.52 becomes cancer, 0.48 becomes no cancer. But the threshold can be lowered to 0.3, catching more true cancer cases while increasing false positives, or raised toward a high-specificity setting, reducing false positives while missing more true cases.

The learned model weights have not changed. The deployed behavior has. Washington’s visual kept the same red and green data points along a 0-to-1 probability line while only the threshold marker moved. The apparent “model” changed from a balanced classifier to a high-sensitivity classifier to a high-specificity classifier, even though the underlying scores were unchanged.
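The same effect can be reproduced with a handful of fixed scores. The scores and labels below are made up for illustration; the point is that only `threshold` changes between rows.

```python
# Hypothetical model outputs and true labels (1 = cancer, 0 = no cancer).
scores = [0.05, 0.20, 0.35, 0.48, 0.52, 0.65, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


for threshold in (0.3, 0.5, 0.7):
    sensitivity, specificity = sens_spec(scores, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sensitivity}, specificity={specificity}")
```

On this toy data the 0.3 threshold catches every positive at the cost of two false alarms, the 0.7 threshold produces no false alarms but misses half the positives, and 0.5 sits in between.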

For cancer screening, Washington says students often prefer the high-sensitivity threshold: catch everyone with cancer, even if some people without cancer are referred for follow-up. He says he agrees in that case. But the broader question is whether high sensitivity at the cost of specificity is always preferable. His sepsis example is meant to show that it is not.

He describes an Epic sepsis model, an AI model for predicting sepsis risk that has been used in real hospital settings, including UCSF and elsewhere. A later validation study reported sensitivity of 86.0%, specificity of 80.8%, positive predictive value of 33.8%, and negative predictive value of 98.11% in 11,512 inpatient encounters, with 10.2% coded as sepsis.

Metric	Reported value
Sensitivity	86.0%
Specificity	80.8%
Positive predictive value / precision	33.8%
Negative predictive value	98.11%

Washington highlighted these reported validation metrics for the Epic sepsis model.

On sensitivity and specificity alone, the model appears broadly strong. But a positive predictive value of 33.8% means that roughly two out of every three alerts are not true sepsis cases. Washington says this produced alert fatigue: clinicians were overwhelmed by sepsis alerts and eventually ignored them. He reports that UCSF clinician-scientist colleagues describe ignoring the model because they believe it “sucks,” even though the sensitivity and specificity look respectable.

That is why clinicians may prioritize precision despite its prevalence sensitivity. In an alerting system, the clinician’s lived experience is not an ROC curve; it is interruption after interruption. A model with low precision can damage trust even if it catches many true cases.
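As a back-of-the-envelope check, the predictive values can be approximately recovered from the reported sensitivity, specificity, and sepsis rate using Bayes’ rule. This is a reconstruction from rounded figures, not the validation study’s own calculation, so small differences from the reported 33.8% are expected.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Rounded values quoted in the text above.
ppv, npv = predictive_values(sensitivity=0.860, specificity=0.808, prevalence=0.102)
false_alerts_per_true_alert = (1 - ppv) / ppv

print(f"PPV ~ {ppv:.2f}, NPV ~ {npv:.2f}")
print(f"False alerts per true alert ~ {false_alerts_per_true_alert:.1f}")
```

Under these inputs the PPV comes out near one third, which corresponds to about two false alerts for every true one, even though sensitivity and specificity are both above 80%.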

Washington then connects the threshold issue to Apple Watch’s hypertension feature. He says Apple publicly reported low overall sensitivity, about 41.2%, and high overall specificity, about 92.3%. He interprets this as Apple having opted for specificity. In Washington’s account, Apple consulted clinicians while developing the feature, clinicians warned about alert fatigue, and Apple made specificity high from the outset.

Apple Watch hypertension metric	Reported value
Sensitivity for Stage 1	29.6%
Sensitivity for Stage 2	53.7%
Overall sensitivity	41.2%
Specificity for normal	95.3%
Specificity for elevated	82.8%
Overall specificity	92.3%

The Apple Watch hypertension slide showed a specificity-oriented operating point.

Washington’s broader claim is that “easy to implement” AI decisions can dramatically affect end users. A threshold can be a one-line change. In a health product, it can decide who is reassured, who is warned, who is missed, and how often clinicians are interrupted.

This comes up again in the legal and regulatory question period. An audience member asks whether optimizing for specificity over sensitivity could create malpractice exposure if patients are missed. Washington answers that for regulated products, FDA approval may require locking the model and its hyperparameters: the developer reports the chosen settings and performance, and changing them can require re-approval. For models deployed inside an institution without FDA approval, he says there are ongoing discussions about liability — whether responsibility falls on the AI model, the clinician, or someone else — and he does not claim to know the answer.

Fairness metrics create the same optimization problem under another name

The metric-choice problem reappears in algorithmic fairness. Peter Washington makes that point through his lab’s Parkinson’s assessment work. The group built a web-based digital assessment for tremor-related symptoms using mouse-tracing tasks, keyboard-pressing tasks, and cognitive assessments. The project began when Washington was a computer-science professor at the University of Hawaii, and the data collection was shaped by community engagement with the Hawaii Parkinson’s Association.

The team engaged people with Parkinson’s, spouses, and community members, and partnered with Jerry Boster, then president of the Hawaii Parkinson’s Association, who Washington says was deeply involved in development before later dying of Parkinson’s. Washington describes the collaboration plainly: Boster told them what to build, and the team built it.

The models performed well in aggregate. But subgroup analysis showed fragility. Washington says performance was better on Mac devices than on Windows devices, even though the dataset contained less MacBook data. He suspects MacBooks are more homogeneous, whereas Windows devices vary more. That difference could matter for social determinants of health, depending on where MacBooks and Windows devices are accessible globally. The model also performed better for right-handed than left-handed individuals, a concerning issue for motor assessments.

The lab’s future robustness analysis expands the factors: screen size, operating system, mouse input modality, device form factor, webcam resolution, network bandwidth, handedness, technological proficiency, and age. Washington describes these as factors that can act as proxies for social determinants of health or are directly measurable in these assessments.

Washington also describes algorithmic mitigation through adversarial debiasing and a multi-adversarial debiasing approach presented at AMIA 2025. The method details are not his focus. What he emphasizes is the result: the model can be engineered to be less biased, depending on how bias is defined.

That caveat is the point. Washington says the algorithmic-fairness field has many quantitative ways to measure bias, including metrics people may know as disparate impact, equalized odds, and equal opportunity. He also describes examples such as differences in sensitivity between groups, differences in precision between groups, and differences in positive prediction rates between groups. These measures can have different implications and can conflict. Optimizing one fairness target may worsen another, just as optimizing sensitivity can reduce specificity.
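A small sketch shows how two of these measures can disagree. It uses common textbook definitions (the gap in positive prediction rates for demographic parity, the gap in true-positive rates for equal opportunity), which may differ from the exact definitions in Washington’s slides. The groups “A” and “B” and all labels are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def fairness_gaps(y_true, y_pred, groups):
    """Compute two gap metrics across groups.

    Assumes binary labels and at least one actual positive in every group.
    """
    positive_rate, true_positive_rate = {}, {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        positive_rate[g] = _mean([y_pred[i] for i in idx])
        true_positive_rate[g] = _mean([y_pred[i] for i in idx if y_true[i] == 1])
    return {
        "demographic_parity_gap": max(positive_rate.values()) - min(positive_rate.values()),
        "equal_opportunity_gap": max(true_positive_rate.values()) - min(true_positive_rate.values()),
    }


# A classifier that is perfectly accurate in both groups,
# but the groups have different base rates of the condition.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

gaps = fairness_gaps(y_true, y_pred, groups)
print(gaps)
```

Here the equal-opportunity gap is zero because every actual positive is caught in both groups, yet the demographic-parity gap is 0.5 because the groups differ in how often the condition occurs. Closing the second gap would require changing predictions that are currently correct.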

In the Parkinson’s dataset slide, the baseline model had higher accuracy and F1-score than debiased versions, while adversarial configurations reduced some fairness gaps at the cost of overall performance and sometimes other fairness metrics. Washington summarizes the tradeoff with the slide title “You Get What You Pay For.”

Model	Accuracy	F1-score	DP gap	DM/EO gap
BASE	0.8927 ± 0.0419	0.8711 ± 0.0465	0.4985 ± 0.0469	0.3395 ± 0.0391
ADVDM	0.8540 ± 0.0387	0.7997 ± 0.0494	0.4047 ± 0.0740	0.2790 ± 0.1189
ADVDP	0.8064 ± 0.0348	0.7382 ± 0.0768	0.2167 ± 0.0755	0.4025 ± 0.1285
BOTH	0.7850 ± 0.1214	0.7171 ± 0.1523	0.2702 ± 0.1286	0.3156 ± 0.0636

The Parkinson’s debiasing slide showed performance and fairness tradeoffs across adversarial conditions.

The HCI question is therefore not only whether a model can be made “less biased.” Washington’s claim is that models can often be optimized toward many targets if engineers specify them. The harder question is which target should be optimized in which clinical and social context. That is an empirical, normative, and design question, not merely a modeling question.

Patient engagement did not substitute for clinician engagement

Talking with patients and community members was necessary in the Parkinson’s work, but it was not sufficient. Peter Washington uses that project to make a different point about stakeholder engagement. After moving to UCSF, where many colleagues are clinicians, he showed the Parkinson’s assessment to them and says he “learned a lot” after the study had already run.

The first lesson was medication timing. Many people diagnosed with Parkinson’s take levodopa to control motor symptoms. Symptoms improve after medication, then gradually return as the medication wears off, creating cycles of “on” and “off” periods. Washington’s study did not account for medication timing. That created a confounder: the intended use case was early screening, and undiagnosed patients are not yet taking levodopa, whereas the study data came from diagnosed participants who were being treated.

Clinicians also shifted the outcome question. Washington’s team had focused on diagnosis, but clinicians told them that predicting Parkinson’s symptoms would be more actionable. A common clinical workflow involves patients returning every six months for an MDS-UPDRS assessment, which takes clinician time and contributes to appointment burden. If a digital assessment could help predict symptom measures, it might better fit clinical needs. Targeting symptoms rather than diagnosis could also avoid some of the medication-timing problem.

Other clinical lessons included asymmetry and disease stage. Parkinson’s often begins asymmetrically, affecting one side of the body before the other. Washington’s team had analyzed right-handed versus left-handed performance, but had not accounted for asymmetric onset as a clinical phenomenon. Disease stage also matters, from one-sided symptoms to bilateral symptoms, balance impairment, more severe symptoms, and eventual need for assistance in daily activities.

The final clinician-centered lesson is workflow fit. A model that “works well” still has to appear somewhere clinicians can use it. Washington returns to Epic and electronic health record extensions, which he says are easy to code relative to many apps. Clinicians can specify how they want an AI output displayed in the workflow, but they are already overloaded. The design challenge is to provide more granular patient information without increasing clinician burden or stress.

He is careful in the Q&A not to overclaim what the Parkinson’s paper showed. Asked how later-discovered limitations get attached to published work, Washington says many limitations are already added during peer review and can make limitations sections long. He also says the point of the Parkinson’s paper was not to claim a clinically usable diagnostic assessment; it was to show that models were not robust to device type and handedness, and that those factors are understudied in AI robustness. He describes the study as a formal, large-sample “lo-fi prototype” that generated preliminary data and lessons for future NIH proposals and follow-up assessments.

Asked what patient engagement revealed that clinician-only engagement might have missed in the Parkinson’s study, Washington says the Parkinson’s community work mostly surfaced accessibility concerns: older users and less technologically proficient users needed large fonts and usable web design. The deeper example, he says, came from the Hawaii substance-use study and the pau hana timing issue. The implication is not that one stakeholder group is more important. It is that different stakeholders reveal different failure modes.

The deployment environment is part of the model

Deployment details are not afterthoughts in Washington’s account; they are part of whether the AI system works. Peter Washington repeatedly points to the user’s family context, the clinician’s alert stream, the electronic health record workflow, the operating system, the mouse or trackpad, the screen size, the network connection, and the timing of a medication cycle. Each can change what the model learns, how it performs, or whether anyone uses its output.

He also marks several boundaries. He does not answer whether private companies such as Apple are subject to standardized model-specification requirements; he says he does not know because he does not work at those companies. He does not detail his group’s automatic adherence-measurement work because it is not yet public. Asked about interventions beyond prediction-driven nudging, such as cognitive or behavioral approaches that do not rely on prediction, he says his group is also working on LLM and chatbot-related safety questions, but that was outside the scope of the talk.

The through-line is concrete: in AI-enabled digital health, small modeling defaults can become large clinical and user-experience decisions. The decision threshold determines alert burden. The metric determines what “good” means. The label prompt determines whether a person can safely or comfortably disclose. The device distribution determines who receives reliable predictions. The clinical workflow determines whether a useful signal becomes actionable information or another source of burnout.
