Clinicians and even the general media seem to believe machine learning models in healthcare should always be generalizable from one hospital site to another, says Nigam Shah, Chief Data Scientist for Stanford Health Care and a Stanford HAI faculty member. “As I see it, that requirement is a bit misplaced.”
Models that are based on biology alone should certainly work the same way for all people, Shah says. But when a model predicts something that involves operational aspects of healthcare, such as hospital readmissions, there’s no reason to think it should work the exact same way in Palo Alto and Mumbai, he says. “In fact, if a readmissions model did generalize between these two locations, I would think something’s not quite right about that.”
Shah is pushing for a shift not only in people’s mindset, but in the ways models are developed and deployed. Instead of focusing so intently on creating generalizable models, Shah says we should instead develop the ability to “share recipes” – the steps people can take to train and evaluate models locally to provide the best predictions for their institutions.
Some types of generalizability are essential, Shah says. For example, for a machine learning model to be fair, it needs to be generalizable across populations. “Every model should work for me, you, and every other racial and ethnic group,” he says. There’s also temporal generalizability – the need for models to work not only in 2010 and 2020, but in the future as well. This type of generalizability is often called reliability.
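The two kinds of generalizability Shah does insist on can be checked mechanically. The sketch below, using scikit-learn on synthetic data, fits a model on older records and then reports discrimination (AUROC) separately for each demographic subgroup in newer records; the column names (`group`, `year`) and subgroup labels are illustrative assumptions, not drawn from any real hospital dataset.

```python
# Sketch: checking population and temporal generalizability of a model.
# All data here is synthetic; "group" and "year" are illustrative stand-ins
# for a demographic attribute and the era of the record.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
group = rng.choice(["A", "B", "C"], size=n)   # demographic subgroup
year = rng.choice([2010, 2020], size=n)       # era of the record

train = year == 2010                          # fit on the older era...
model = LogisticRegression().fit(X[train], y[train])

# ...then check that discrimination holds up per subgroup on the newer era.
for g in ["A", "B", "C"]:
    mask = (~train) & (group == g)
    auc = roc_auc_score(y[mask], model.predict_proba(X[mask])[:, 1])
    print(g, round(auc, 3))
```

If the per-group AUROCs diverge sharply, or the newer-era scores fall well below the training-era scores, the model fails exactly the fairness and reliability bars Shah describes.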
But when healthcare researchers talk about the generalizability of machine learning models, they are usually talking about geographic generalizability: The need for models to be valid from one hospital site to another. “Sometimes I call that ‘what works in Palo Alto works in Mumbai,’” Shah says.
When Should We Insist on Generalizability – and When Not?
Given that our genomes are 99% identical, Shah would agree that when models rely on physiological data to yield a physiological prediction, they should be valid everywhere. In fact, he says, “if a model is based on biology, then it needs to be generalizable to convince me it’s any good.” An oft-used model that fits this description is the pooled cohort equations for estimating 10-year absolute rates of atherosclerotic cardiovascular disease: The model inputs are all physiological, and they are used to generate a numerical risk of disease that then forms the basis for certain treatment decisions.
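The pooled cohort equations have the structure of a Cox-style risk score: a weighted sum of (mostly log-transformed) physiological inputs, converted to an absolute 10-year risk via a baseline survival term. A minimal structural sketch is below; the coefficients, baseline survival, and mean score are placeholders chosen for illustration only, not the published values.

```python
# Structural sketch of a pooled-cohort-style risk equation:
#   10-year risk = 1 - S10 ** exp(sum(beta_i * x_i) - population_mean_score)
# The betas, S10, and MEAN_SCORE below are PLACEHOLDERS for illustration,
# not the published coefficients.
import math

BETAS = {"ln_age": 3.0, "ln_total_chol": 1.0, "ln_hdl": -0.9,
         "ln_sbp": 1.8, "smoker": 0.6, "diabetic": 0.7}
BASELINE_SURVIVAL = 0.96   # placeholder S10
MEAN_SCORE = 22.0          # placeholder population mean of the linear term

def ten_year_risk(age, total_chol, hdl, sbp, smoker, diabetic):
    x = {"ln_age": math.log(age), "ln_total_chol": math.log(total_chol),
         "ln_hdl": math.log(hdl), "ln_sbp": math.log(sbp),
         "smoker": float(smoker), "diabetic": float(diabetic)}
    score = sum(BETAS[k] * x[k] for k in BETAS)
    return 1.0 - BASELINE_SURVIVAL ** math.exp(score - MEAN_SCORE)
```

Note what makes this kind of model a candidate for generalizing everywhere: every input is a measured physiological quantity, and nothing in the formula depends on how a particular hospital happens to operate.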
But insisting on generalizability is not appropriate when it comes to machine learning models that deal with the operational aspects of hospital care. Hospital operations around the globe are incredibly diverse: Individual hospitals use different procedures for everything from admissions protocols to lab testing to discharge planning to electronic medical records management and all the steps in between. In different cultures or communities, the frequency or nature of people’s interactions with their healthcare providers may vary as well. So, when machine learning models are trained to predict such things as the likelihood that a patient will be readmitted, the need for post-hospital follow-up, or a patient’s risk of sepsis or mortality, we shouldn’t really expect those models to work well at different sites, Shah says.
Changing Mindsets About Generalizability
Clinicians, researchers, journal publishers, and AI vendors each have their own reasons for thinking healthcare models should be generalizable, Shah says.
Clinicians tend to expect operational models to also have biological validity, Shah says. This leads them to reject models – even these operational ones – that don’t generalize well from site to site.
Journal reviewers take a similar stance: They insist that the models they review be generalizable, which in turn may lead researchers to abandon models that don’t generalize well – even if they have good predictive value in a local setting. Alternatively, researchers might push their models to behave the same way at different sites even though that might yield a model whose predictions are mediocre everywhere rather than well-tailored to an individual setting.
AI vendors have a different incentive: They can make more money if they don’t have to tailor their solutions to each hospital. “There’s an economic imperative to learn the model once and sell it 10 times,” Shah says. But this approach leads to the sale of models that might not actually work well in a new context. It also means companies have to list numerous caveats about the conditions under which their model will and will not perform as promised – an admission that the claimed generalizability is overstated.
The Solution: Share Model Recipes
The best solution to the generalizability conundrum is straightforward, Shah says. Hospitals should insist on shareable procedures and recipes so they can train and evaluate their own models. These recipes could take the form of playbooks, such as the University of Chicago’s playbook for evaluating algorithmic bias; or they could take the form of software platforms, toolkits, or shareable data science notebooks – so long as they enable people to train and evaluate models locally, he says.
Using these shared recipes, local institutions will be empowered to develop models that provide the best predictions for the local context.
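In code terms, a shared recipe means shipping the training-and-evaluation procedure rather than a fixed, pre-trained model. The sketch below, assuming scikit-learn, shows the shape such a recipe might take: a function any site can run on its own data extract, with a locally chosen performance bar. The feature matrix here is synthetic and the readmission framing is illustrative.

```python
# Sketch of a shareable "recipe": instead of shipping a fixed model,
# ship the procedure and let each site run it on its own data.
# The data below is synthetic; in practice X and y would come from a
# local EHR extract.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_local_model(X, y, min_auc=0.7, seed=0):
    """Train a readmission-style classifier on a site's own data and
    report whether it clears a locally chosen performance bar."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    model = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return model, auc, bool(auc >= min_auc)

# Stand-in for a local data extract (synthetic for this sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.7, size=2000) > 0).astype(int)
model, auc, deployable = train_local_model(X, y)
```

The key design choice is that the acceptance threshold (`min_auc`) and the evaluation data are both local: two hospitals running the identical recipe will, by design, end up with different models tuned to their own populations and operations.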
“As computer science evolves and the tooling for doing machine learning evolves,” Shah says, “it will become easier to democratize model training than to enforce generalizability in situations when it makes no logical sense to have a generalizable model.”
Moreover, he says, hospitals shouldn’t expect that a model will work straight out of the box. “Why would we even have that expectation?” he says. “Unless the model is relying solely on biological inputs, it is guaranteed to behave differently in a new context.”
It’s therefore essential that hospitals always test models they buy or build, he says. “I’m arguing that instead of finding a generalizable model, we should provide education so people can figure out, in their own context, what degree of generalizability is necessary. And if it is not, they can learn a local model and deploy it,” Shah says.
This is part of a healthcare AI series. Read more about:
- How can we make sure healthcare models are useful?
- Does every model need to be explainable?
- Are medical AI tools delivering on what they promise?
- Does deidentification of medical records protect our privacy?
- How do we make sure healthcare algorithms are fair?
- What should healthcare executives know before they implement an AI tool?
Stanford HAI's mission is to advance AI research, education, policy, and practice to improve the human condition. Learn more.