How Do We Ensure that Healthcare AI is Useful?

Date

June 13, 2022

In healthcare, predictive models need to be more than good predictors. Stanford scholars suggest a framework for determining a model’s worth.

In the grand scheme of healthcare operations, predictive models play a role not unlike that of blood tests, X-rays, or MRIs: They influence decisions about whether an intervention is appropriate.

“Broadly speaking, models do math and yield probability estimates that help you decide whether to act,” says Nigam Shah, Chief Data Scientist for Stanford Health Care and a Stanford HAI faculty member. But those probability estimates are only useful to healthcare providers if they trigger decisions that are more beneficial than not.

“As a community, I think we’re hung up on the performance of the model and not asking the question, Is the model useful?” Shah says. “We need to think outside the model.”

Shah’s team is one of the few healthcare research groups evaluating whether hospitals have the capacity to intervene based on a model, and whether on balance the interventions will be beneficial to patients and healthcare institutions.

“There’s increasing concern that AI researchers are building models left and right and nothing’s getting deployed,” Shah says. One reason is that modelers fail to perform a usefulness analysis showing how the intervention triggered by a model will fit into hospital operations cost-effectively while doing more good than harm. “If model developers would spend the time to perform this additional analysis, hospitals will pay attention,” he says.

The tools for doing a usefulness analysis already exist in the fields of operations research, healthcare policy, and econometrics, yet model developers in healthcare have been slow to use them, Shah says. His own team has tried to shift that mindset by publishing a handful of papers urging more people to evaluate their models’ usefulness. These include a JAMA paper setting forth the need for modelers to think about usefulness, and a research paper that lays out a framework for analyzing the usefulness of predictive models in healthcare and demonstrates how it would work using a real-world example.

“Like any new thing that a hospital might add to its operations, deploying a new model must be worthwhile,” Shah says. “There are mature frameworks for determining a model’s worth. It’s time for modelers to use them.”

Figure: Flow chart showing “useful, reliable and fair AI-guided care” in the center, with three spokes: the model and its output, policy and capacity to act, and properties of the intervention.

Understanding the Interplay Between the Model, the Intervention, and the Intervention’s Benefits and Harms

As depicted in the graphic above, model usefulness depends on the interplay between the model, the intervention it triggers, and the intervention’s benefits and harms, Shah says.

First, the model – which often gets the lion’s share of the attention – should be good at predicting whatever it’s supposed to predict, whether that’s a patient’s risk of hospital readmission or of developing diabetes. In addition, Shah says, it must be fair, meaning that it generates predictions that apply equally to all people, regardless of such things as race, ethnicity, nationality or gender; it must either be generalizable from one hospital site to another or at least make reliable predictions for the local hospital population; and often (but not always), it should be interpretable. (These are aspects of models that Shah previously discussed in the linked HAI stories.)

Second, healthcare institutions must have a policy about when and how to intervene based on the test or model as well as a decision regarding who is responsible for that intervention. They must also have the capacity (sufficient staffing, materials, or other resources) to intervene.

Setting a policy regarding whether or how to intervene in response to a model can impact health equity. “When it comes to fairness,” Shah says, “researchers spend way too much time focused on whether a model is equally accurate for all people and not enough on whether the intervention will benefit all people equally – although most inequities we seek to fix arise from the latter.” For example, a model that predicts which patients will be “no shows” for scheduled appointments may not itself be unfair if its predictions are equally accurate for all racial and ethnic groups, but the choice of how to intervene – whether to double-book the slot or provide transportation support to help people get to their appointments – might affect different groups of people differently.

Third, the benefits of the intervention must outweigh its harms. Any intervention can have both positive and negative consequences, Shah says. The usefulness of a model’s predictions will therefore depend on the benefits and harms of the intervention it triggers.

To understand this interplay, consider one commonly used predictive model: the atherosclerotic cardiovascular disease (ASCVD) risk equation, which relies on nine primary data points (including age, gender, race, total cholesterol, LDL/HDL-cholesterol, blood pressure, smoking history, diabetic status, and use of antihypertensive medications) to calculate a patient’s 10-year risk of having a heart attack or stroke. A fleshed-out usefulness analysis of the ASCVD risk equation would consider the three parts of the graphic above and find that it is useful, Shah says. First, the model is widely considered highly predictive of heart disease and is also fair, generalizable, and interpretable. Second, most healthcare institutions intervene by following a standard policy about the level of risk at which to prescribe statins, and there’s plenty of capacity to intervene because statins are widely available. And finally, a harm/benefit analysis of statin use would show that most people benefit from statins, though some patients don’t tolerate their side effects.

An Example of Model Usefulness Analysis: Advance Care Planning

The ASCVD risk equation, while illustrative, is likely one of the simplest predictive models out there. But predictive models have the potential to trigger interventions that could disrupt healthcare workflows in more complex ways, and the benefits and harms of some interventions may be less clear.

To address that problem, Shah and his colleagues developed a framework for testing whether predictive models are useful in practice. They demonstrated that framework using a model that triggered an intervention called advance care planning (ACP).

ACP, which is typically offered to patients who are approaching the end of their lives, involves an open and honest discussion of what the future may hold and what the patient’s wishes are should he or she become incapacitated. These conversations not only give patients a sense of control over their lives, but also reduce healthcare costs, improve physician morale, and sometimes even improve patient survival.

Shah’s team at Stanford developed a model that predicts which inpatients are likely to die in the next 12 months. The goal: to identify which patients might benefit from ACP. After making sure that the model was a good predictor of mortality and was also fair, explainable, and reliable, the team did two additional analyses to determine if the intervention triggered by the model would be useful. The first, a cost-benefit analysis, found that a successful intervention (providing ACP to a patient the model correctly identified as likely to benefit) would save about $8,400, while intervening with someone who didn’t need ACP (i.e., the model erred) would cost about $3,300. “In this case, very roughly speaking, even if we’re only correct one out of three times, we break even,” Shah says. The same analysis could also be done using metrics other than cost, Shah notes, such as patient satisfaction or reduction in unwanted treatments.
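The break-even arithmetic behind that remark can be sketched in a few lines of Python. This is a minimal illustration, not the team’s actual analysis. It assumes the two dollar figures quoted above are the only inputs and that the model’s precision (the share of flagged patients who truly benefit from ACP) is the only thing that varies. The function names are hypothetical.

```python
# Illustrative only: the dollar figures come from the article; everything
# else (function names, the simple linear payoff) is an assumption.

def expected_net_benefit(precision, benefit_tp=8400.0, cost_fp=3300.0):
    """Expected dollars saved per flagged patient.

    precision  -- fraction of flagged patients who truly benefit from ACP
    benefit_tp -- savings when ACP reaches a patient who benefits
    cost_fp    -- cost when ACP is offered to a patient who did not need it
    """
    return precision * benefit_tp - (1.0 - precision) * cost_fp


def break_even_precision(benefit_tp=8400.0, cost_fp=3300.0):
    """Precision at which expected savings exactly offset expected costs.

    Solves p * benefit_tp = (1 - p) * cost_fp for p.
    """
    return cost_fp / (benefit_tp + cost_fp)


print(round(break_even_precision(), 3))   # 0.282
print(expected_net_benefit(1 / 3))        # net savings per flagged patient at 1-in-3
```

Under these assumptions the break-even point sits a little below one in three, so being correct one time out of three leaves a modest positive margin, which is consistent with Shah’s “very roughly speaking.”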

But the analysis didn’t stop there. “To save those promised $8,400, we actually have to do a workflow that involves, let’s say, 21 steps, three human beings, and seven handoffs over the course of 48 hours,” Shah says. “So, in real life, can we do it?”

To answer that question, the team simulated 500 hospital days of intervention to evaluate how healthcare delivery factors such as limited staffing or insufficient time (due to patients being discharged) would affect the benefit of the intervention. They also quantified the relative benefit of increasing inpatient staffing compared with providing ACP on an outpatient basis. The upshot: Having the outpatient option ensures that more of the expected benefit is realized. “We only have to follow up with half of discharged patients to get 75% of the utility, which is pretty darn good,” Shah says.
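A toy version of this kind of capacity simulation might look like the sketch below. It is not the team’s simulation. Every parameter value (patients flagged per day, inpatient slots, discharge probability, precision) is a made-up placeholder, and the function name `simulate_acp` is hypothetical. The sketch assumes each flagged patient either receives ACP as an inpatient if they are not discharged early and a staff slot is free, or is otherwise reached by outpatient follow-up with some probability. It reports the share of the achievable benefit that is actually realized.

```python
import random


def simulate_acp(days=500, flagged_per_day=10, precision=0.33,
                 inpatient_capacity=4, discharge_prob=0.3,
                 followup_fraction=0.5, seed=0):
    """Return the fraction of achievable benefit realized (0.0 to 1.0).

    All defaults are hypothetical placeholders, not values from the study.
    """
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    possible = 0   # flagged patients who would truly benefit from ACP
    realized = 0   # those among them who actually received ACP

    for _ in range(days):
        slots = inpatient_capacity  # staff slots reset every hospital day
        for _ in range(flagged_per_day):
            # Draw all three random numbers for every patient so the
            # patient stream is identical across parameter settings.
            is_true_positive = rng.random() < precision
            discharged_early = rng.random() < discharge_prob
            followup_draw = rng.random()

            if not discharged_early and slots > 0:
                slots -= 1
                received = True          # ACP delivered as an inpatient
            else:
                # Missed in hospital; outpatient follow-up may reach them.
                received = followup_draw < followup_fraction

            if is_true_positive:
                possible += 1
                if received:
                    realized += 1

    return realized / possible if possible else 0.0


for fraction in (0.0, 0.5, 1.0):
    print(fraction, round(simulate_acp(followup_fraction=fraction), 2))
```

Varying `inpatient_capacity` and `followup_fraction` side by side gives a rough sense of the trade-off the team examined, namely whether adding inpatient staff or adding outpatient follow-up recovers more of the expected benefit. The numbers it prints depend entirely on the placeholder inputs.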

This work demonstrates that even when you have a perfectly good model and a perfectly good intervention, a model will only be useful if you also have the capacity to deliver the intervention, Shah says. And while 20-20 hindsight might make this result seem intuitive, Shah says, that wasn’t the case at the time. “Had we not done this study, Stanford Hospital might have just expanded inpatient capacity to provide ACP even though it isn’t very cost effective.”

Shah’s team’s framework for analyzing the interplay between a model, an intervention, and the intervention’s benefits and harms can help to identify predictive models that will be useful in practice, he says. “At a minimum, modelers should run some sort of analysis to determine if their models prompt useful interventions,” Shah says. “That would be a start.”

This is part of a healthcare AI series. Read more about:

Does every model need to be explainable?
Do healthcare models need to be generalizable?
What should healthcare executives know before they implement an AI tool?
Are medical AI tools delivering on what they promise?
Does deidentification of medical records protect our privacy?
How do we make sure healthcare algorithms are fair?

