Is AI in Hospitals Really Helping Patients?

[Illustration: watercolor of a clinician and an older patient in a calm face-to-face conversation in a hospital setting, with subtle digital signals in the background.]

The Uncomfortable Question

On April 21, 2026, Nature Medicine published a short commentary by Anna Goldenberg and Jenna Wiens. Short, but sharp. Its title says exactly what many people avoid asking out loud: “Is AI actually improving healthcare?”

Their answer is equally plain — and unsettling: In many cases, we do not know.

That lands harder than it should. AI is already in routine care: risk scores on wards, ambient scribes in clinics, vision models in imaging workflows. Health systems are investing heavily. Vendors promise efficiency. Teams are under staffing pressure and need help. But one question is still too often left hanging: are patients measurably better off?

The Evaluation Problem in Two Sentences

Goldenberg and Wiens describe two problems that feed into each other:

Problem 1: We measure what is easy to measure. AUROC, F1, and accuracy matter, but they are not the endpoint in clinical care. In plain terms: AUROC tells you how well a model separates sick from not-sick patients, accuracy is the overall share of correct predictions, and F1 balances precision and recall for positive cases. A model can score well on all three in retrospective validation and still change almost nothing at the bedside.
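To make the three metrics concrete, here is a minimal sketch in plain Python. The ten patients, their risk scores, and the 0.5 alert cutoff are invented for illustration and do not come from any study cited here.

```python
# Toy data (hypothetical): 1 = sepsis, 0 = no sepsis, plus the model's risk score.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
THRESHOLD = 0.5  # assumed alert cutoff, chosen only for illustration


def confusion(labels, scores, threshold):
    """Return (tp, fp, tn, fn) when scores >= threshold count as positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        alert = s >= threshold
        if alert and y == 1:
            tp += 1
        elif alert:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def auroc(labels, scores):
    """Chance that a random sick patient outscores a random not-sick one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


tp, fp, tn, fn = confusion(labels, scores, THRESHOLD)
accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"AUROC={auroc(labels, scores):.2f} accuracy={accuracy:.2f} F1={f1:.2f}")
```

Nothing in this calculation records whether anyone saw the alert, acted on it, or whether the patient did better. That gap is exactly what Problem 1 describes.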

Problem 2: Even when outcomes improve, attribution is often weak. If sepsis mortality falls after rollout, what actually caused it: the model, a Hawthorne effect, concurrent workflow changes, staffing shifts, better training? Clean causal attribution is hard and still uncommon in real deployments.
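One common way to loosen the attribution knot is a difference-in-differences comparison: subtract the change seen at a comparable site that did not deploy the model from the change seen at the rollout site. The sketch below uses invented mortality figures. The design is a standard quasi-experimental option, not something the commentary prescribes, and it rests on the assumption that both sites would have followed parallel trends without the model.

```python
from dataclasses import dataclass


@dataclass
class Period:
    before: float  # sepsis mortality in percent before rollout (hypothetical)
    after: float   # sepsis mortality in percent after rollout (hypothetical)

    def change(self) -> float:
        """Raw before/after change in percentage points."""
        return self.after - self.before


def difference_in_differences(treated: Period, comparison: Period) -> float:
    """Change at the rollout site minus the change at the comparison site."""
    return treated.change() - comparison.change()


rollout_site = Period(before=20.0, after=16.0)     # deployed the model
comparison_site = Period(before=21.0, after=18.0)  # did not deploy it

naive = rollout_site.change()
did = difference_in_differences(rollout_site, comparison_site)

print(f"Naive before/after change: {naive:+.1f} percentage points")
print(f"Difference-in-differences: {did:+.1f} percentage points")
```

In this made-up example, most of the apparent four-point drop also shows up at the site without the model, leaving about one point that could plausibly be credited to the rollout. A randomized design would still be stronger.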

This is not a side issue. It is the methodological center of the field.

The Epic Sepsis Model as a Case Study

If you want a concrete example, look at the Epic Sepsis Model (ESM): integrated into Epic EHRs, deployed across many US hospitals, and widely presented as a flagship clinical AI tool.

In 2021, a University of Michigan team published an external validation in JAMA Internal Medicine with more than 27,000 patients. Results were sobering: the model identified only 7% of sepsis patients who had not already received timely antibiotics, it missed 67% of sepsis patients altogether, and it fired alerts in 18% of all hospitalizations, a recipe for alert fatigue.
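A back-of-the-envelope calculation shows what those percentages mean on a ward. The 67% miss rate and the 18% alert rate come from the paragraph above. The cohort size and the 7% sepsis prevalence are assumptions added for illustration, and the sketch treats patients and hospitalizations as the same unit, which is a simplification.

```python
HOSPITALIZATIONS = 10_000   # hypothetical cohort size
SEPSIS_PREVALENCE = 0.07    # ASSUMPTION for illustration, not stated in the text above
MISSED_FRACTION = 0.67      # from the text: 67% of sepsis patients were missed
ALERT_RATE = 0.18           # from the text: alerts in 18% of hospitalizations

sepsis_cases = HOSPITALIZATIONS * SEPSIS_PREVALENCE
sensitivity = 1 - MISSED_FRACTION          # share of sepsis patients the model flags
true_alerts = sepsis_cases * sensitivity   # alerts that point at a real sepsis case
total_alerts = HOSPITALIZATIONS * ALERT_RATE

ppv = true_alerts / total_alerts           # share of alerts that are correct
alerts_per_true_case = total_alerts / true_alerts

print(f"Sensitivity: {sensitivity:.0%}")
print(f"Share of alerts that are true sepsis: {ppv:.1%}")
print(f"Alerts reviewed per true case: {alerts_per_true_case:.1f}")
```

Under these assumptions, roughly one alert in eight points at a real sepsis case. That is the arithmetic behind alert fatigue.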

A follow-up analysis from the same research group in NEJM AI (Kamran et al., 2024) pointed to a deeper issue: some apparent performance seemed to come from variables that already reflected clinician suspicion, such as orders, medication patterns, and related signals. Put plainly, part of the model signal looked like “the team already suspects sepsis.”

Epic has updated the model since then. The broader lesson still stands: a tool can be widely deployed for years without robust prospective evidence of patient benefit.

The Ambient Scribe Wave: Better Evidence, Still Incomplete

Evidence is more encouraging for ambient AI scribes. These tools capture encounters and draft notes, and this area now includes randomized studies.

In December 2025, Lukac et al. published a randomized trial in NEJM AI with 238 physicians, comparing DAX, Nabla, and usual care (DOI: 10.1056/AIoa2501000). Nabla significantly reduced documentation time. Both platforms showed favorable movement in secondary burden scores.

A parallel multicenter QI study by Olson et al. in JAMA Network Open reported a drop in burnout prevalence from 51.9% to 38.8% among 263 clinicians across six health systems after 30 days of use (DOI: 10.1001/jamanetworkopen.2025.34976).

These are genuinely positive signals. But the central question is still open: what is the effect on hard patient outcomes? Burnout reduction matters a lot, but it is not the same as lower mortality, fewer readmissions, or better diagnostic performance. Parts of this evidence base rely on short follow-up and non-randomized designs, so for now most gains are still surrogate or process outcomes.

What Goldenberg and Wiens Are Actually Asking For

This is not anti-AI criticism. Both authors build clinical AI themselves. Their argument is methodological:

Move beyond an accuracy obsession toward clinical relevance. A model that shines offline but changes nothing in workflow has little clinical value.
Improve outcome attribution. If outcomes shift after implementation, study design must support causal inference.
Use prospective, ideally randomized evaluation. Retrospective validation alone is not enough.
Treat post-deployment monitoring as mandatory. Models drift, populations change, and performance degrades.
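The last point, mandatory post-deployment monitoring, can be as simple as a recurring audit that compares current alert quality with the go-live baseline. The sketch below is one possible shape for such a check. The baseline, the tolerance, and the monthly counts are all hypothetical.

```python
from typing import NamedTuple


class MonthlyAudit(NamedTuple):
    month: str
    alerts: int           # alerts fired that month
    confirmed_cases: int  # alerts confirmed as true cases on chart review


BASELINE_PPV = 0.25  # hypothetical share of correct alerts measured at go-live
TOLERANCE = 0.05     # hypothetical acceptable absolute drop before escalation


def flag_degraded(audits, baseline, tolerance):
    """Return (month, ppv) for months whose alert precision fell below baseline - tolerance."""
    flagged = []
    for audit in audits:
        if audit.alerts == 0:
            continue  # no alerts fired, so precision is undefined for that month
        ppv = audit.confirmed_cases / audit.alerts
        if ppv < baseline - tolerance:
            flagged.append((audit.month, round(ppv, 3)))
    return flagged


audits = [
    MonthlyAudit("M1", alerts=200, confirmed_cases=52),
    MonthlyAudit("M2", alerts=210, confirmed_cases=46),
    MonthlyAudit("M3", alerts=250, confirmed_cases=45),
    MonthlyAudit("M4", alerts=260, confirmed_cases=39),
]

for month, ppv in flag_degraded(audits, BASELINE_PPV, TOLERANCE):
    print(f"{month}: alert precision {ppv:.0%} is below the agreed floor, review the model")
```

This only watches the alerts that fired. Catching missed cases would additionally require sampling patients who never triggered an alert.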

What This Means in Emergency Care

As an emergency physician, I read this with one optimistic eye and one skeptical eye.

Optimistic, because AI can clearly help with practical tasks: reducing documentation burden, supporting triage consistency, and speeding up pattern recognition.

Skeptical, because it is too easy to install a polished tool and assume impact. We already know this trap from clinical scores: well-published does not automatically mean outcome-improving. AI can follow the same script (deploy, present, move on) without ever testing whether patients benefit.

Three practical checkpoints:

Before implementation: Which concrete clinical endpoint should improve, and how will we measure it beyond model metrics?

During use: How do we avoid alert fatigue (when too many pop-up warnings make teams start ignoring them)? That includes consistent threshold tuning: setting and regularly adjusting trigger cutoffs so important warnings are caught without flooding staff with false alarms.

After implementation: Audit continuously. Does performance still hold in current workflows and current populations?
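The threshold tuning mentioned under "During use" can be sketched as a sweep over candidate cutoffs: for each one, compute how many true cases are caught and how many patients trigger an alert, then pick the cutoff with the fewest alerts that still meets a sensitivity target. The data, the candidate cutoffs, and the 90% target are invented for illustration.

```python
# Hypothetical validation sample: 1 = sepsis, 0 = no sepsis, plus model risk scores.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]


def sweep(labels, scores, thresholds):
    """Return (threshold, sensitivity, alert_rate) for each candidate cutoff."""
    n_pos = sum(labels)
    rows = []
    for t in thresholds:
        alerted = [y for y, s in zip(labels, scores) if s >= t]
        sensitivity = sum(alerted) / n_pos       # share of true cases that fire an alert
        alert_rate = len(alerted) / len(labels)  # share of all patients that fire an alert
        rows.append((t, sensitivity, alert_rate))
    return rows


def pick_threshold(rows, min_sensitivity):
    """Highest cutoff (and so the fewest alerts) that still meets the sensitivity target."""
    ok = [row for row in rows if row[1] >= min_sensitivity]
    return max(ok, key=lambda row: row[0]) if ok else None


rows = sweep(labels, scores, thresholds=[0.2, 0.4, 0.5, 0.7])
for t, sens, rate in rows:
    print(f"cutoff={t:.1f} sensitivity={sens:.0%} alert_rate={rate:.0%}")

print("chosen:", pick_threshold(rows, min_sensitivity=0.9))
```

Because the right cutoff depends on the local population, the sweep should be rerun on fresh data at regular intervals and not fixed once at go-live.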

Conclusion

The Goldenberg-Wiens commentary is a useful reality check in a field that swings between excitement and overclaiming. It is neither anti-technology nor cynical. It is methodologically disciplined.

Its core point is simple: we have become comfortable evaluating AI with benchmarks that do not answer the one question that matters most — does this help patients?

That answer requires evidence, not just AUROC.

Sources and further reading

Goldenberg A, Wiens J. Is AI actually improving healthcare? Nat Med 32, 1182–1183 (2026). DOI: 10.1038/s41591-026-04329-2
Wong A et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med (2021). DOI: 10.1001/jamainternmed.2021.2626
Kamran F et al. Evaluation of Sepsis Prediction Models before Onset of Treatment. NEJM AI (2024). DOI: 10.1056/AIoa2300032
Lukac PJ et al. Ambient AI scribes in clinical practice: A randomized trial. NEJM AI 2(12) (2025). DOI: 10.1056/AIoa2501000
Olson KD et al. Use of Ambient AI Scribes to Reduce Administrative Burden and Professional Burnout. JAMA Netw Open 8(10) (2025). DOI: 10.1001/jamanetworkopen.2025.34976
Joshi S et al. AI as an intervention: improving clinical outcomes relies on a causal approach to AI development and validation. JAMIA 32, 589–594 (2025). DOI: 10.1093/jamia/ocae301