Earlier this year, hospitals around the United States learned shocking news: A widely used artificial intelligence model for spotting early signs of sepsis, a life-threatening complication of infection that often strikes hospitalized patients, was wrong more often than it was right.
Researchers at the University of Michigan estimated that the AI model missed about two-thirds of actual cases when they applied it to data from 30,000 patients at the university’s hospital. On top of that, it generated large numbers of false alarms.
Though startling in itself, the study pointed to a deeper problem: Artificial intelligence models often score well on statistical tests of predictive accuracy but perform surprisingly poorly in real-time medical settings.
Some models are more accurate for affluent white male patients, often because they were trained on data that came from that demographic, than they are for black, female, or low-income patients. Some models work well in one geographic region but not in others. Many AI models also have a tendency to become less accurate over time, sometimes generating increasing numbers of false alarms. Researchers call it “calibration drift.”
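One simple way to watch for the drift described above is to compare a model's average predicted risk with the observed event rate in each time period. The sketch below is only an illustration: the function name `calibration_gap_by_period`, the record layout, and the data are all hypothetical, and real monitoring would use far larger samples and more careful statistics.

```python
from statistics import mean

def calibration_gap_by_period(records):
    """Return {period: mean predicted risk minus observed event rate}.

    records: iterable of (period, predicted_risk, outcome) tuples, where
    outcome is 1 if the event occurred and 0 otherwise. (Hypothetical layout.)
    """
    by_period = {}
    for period, risk, outcome in records:
        by_period.setdefault(period, []).append((risk, outcome))

    gaps = {}
    for period, rows in sorted(by_period.items()):
        predicted = mean(risk for risk, _ in rows)
        observed = mean(outcome for _, outcome in rows)
        # A positive gap means the model predicts more risk than is observed.
        gaps[period] = predicted - observed
    return gaps

# Made-up example: the event rate stays at 25 percent in both periods,
# but the model's average predicted risk rises from 0.2 to 0.4.
records = [
    ("2020-Q1", 0.10, 0), ("2020-Q1", 0.30, 1),
    ("2020-Q1", 0.20, 0), ("2020-Q1", 0.20, 0),
    ("2021-Q1", 0.40, 0), ("2021-Q1", 0.50, 1),
    ("2021-Q1", 0.30, 0), ("2021-Q1", 0.40, 0),
]
gaps = calibration_gap_by_period(records)
print(gaps)
```

In this toy data the gap moves from roughly -0.05 to roughly 0.15, the kind of widening overestimate that would show up in practice as a growing number of false alarms.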
In a new study, researchers at Stanford document one likely reason for this “AI chasm” between algorithms’ promise and reality: Many models aren’t being documented with anywhere near the rigor or transparency that medical and AI professionals say are necessary. The study has not yet been peer-reviewed.
In particular, the study finds that most of the background documentation on widely used models reveals little about whether they were tested for fairness, unintended bias, longer-term reliability, or even genuine usefulness.
“The expert community has a lot to say on what should be reported, but there is precious little about how to report it,” says Nigam H. Shah, a co-author of the study who is a professor of medicine and a member of the Stanford Center for Biomedical Informatics Research and the Stanford Institute for Human-Centered AI. “No wonder we see useless models, such as the one about sepsis, getting deployed.”
Falling Short of Expectations
The Stanford team examined the documentation for a dozen AI models for clinical decision making, all of them in commercial use, and compared them with 15 different sets of guidelines that experts have recommended in recent years.
The models were all developed by Epic Systems, a major provider of electronic health record software that has become a leading developer of AI tools for health care providers. Epic developed the sepsis model that the Michigan researchers found to be flawed, although the company has disputed the findings.
The good news is that 90 percent of the models examined largely adhered to the dozen most common recommendations. Those pertained to basic information about the intended purpose of the tool, the data on which it was trained, and the statistical methodology for measuring its accuracy.
However, the models complied with barely 40 percent of the total 220 individual recommendations across all 15 guidelines. Typically, a model fulfilled about half the recommendations coming from any particular guideline.
Developers were especially weak on documenting evidence that their models were fair, reliable, and useful. In addition to raising red flags about such gaps, the researchers say the lack of transparency makes it difficult for health care providers to compare different tools or to independently reproduce and confirm a model’s purported benefit.
“If you look up all the COVID drugs in clinical trials, you can see the design of the study and the kind of trial it was,” says Jonathan H. Lu, a third-year medical student at Stanford who co-authored the study. “You can’t do that for machine-learning models. In some cases, health care systems and providers are literally flying in the dark.”
The researchers found a litany of shortfalls. Among them:
- Only one-third of the models were tested in a setting that was different from the one in which they were trained. That poses a serious risk that a model developed in Boston will be more error-prone in Cleveland or California. Indeed, another Stanford team recently documented exactly that problem with an AI-powered device that analyzes X-rays for signs of collapsed lungs.
- Much of the model documentation has limited information on the demographic makeup of the patients whose data was used to develop the models. If a model was trained only on data from people who had health insurance, for example, it may be less accurate for uninsured patients who avoid doctor visits whenever possible. The same goes for issues like blood pressure or even immigration status.
- Most of the model documentation had no information on whether the model was tested for potential biases tied to race, ethnicity, and sex. Health inequities between different ethnic groups and genders are well documented, and a lack of testing for those differences increases the likelihood that algorithms will perpetuate the same biases.
- Few of the models provided information about whether their performance changes over time. In addition to the “drift” that causes some AI tools to start over-flagging risks, models can also fall behind changes in the population that result from events like the COVID-19 pandemic.
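The bias testing mentioned in the list above can start with something as simple as computing the same accuracy measure separately for each patient group. The following is a minimal sketch under assumed inputs: the function name `sensitivity_by_group`, the row layout, and the group labels "A" and "B" are hypothetical, and the numbers are invented for illustration.

```python
def sensitivity_by_group(rows):
    """Return {group: share of true cases that the model flagged}.

    rows: iterable of (group, flagged, had_condition) tuples with boolean
    flags. Patients without the condition do not affect sensitivity.
    """
    hits, cases = {}, {}
    for group, flagged, had_condition in rows:
        if not had_condition:
            continue
        cases[group] = cases.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (1 if flagged else 0)
    return {group: hits[group] / cases[group] for group in cases}

# Made-up audit data: four true cases per group, plus one non-case each.
rows = [
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", False, True),
    ("A", True, False),
    ("B", True, True), ("B", False, True), ("B", False, True), ("B", False, True),
    ("B", False, False),
]
result = sensitivity_by_group(rows)
print(result)  # {'A': 0.75, 'B': 0.25}
```

A gap like the one in this toy example, where the model catches three of four cases in one group but only one of four in another, is the sort of finding that documentation could report. The same loop could be keyed by hospital site to check performance outside the training setting.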
On a more basic level, the Stanford team also found that most of the documentation came up short on concrete evidence of usefulness. Almost none, for example, offered an analysis of net benefits that balanced the benefits of accurate early warnings against the harms of false alarms.
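One common way to express that balance is the net benefit measure from decision-curve analysis, which weighs false alarms against true detections according to the risk threshold at which clinicians would act. The sketch below uses that standard formula as an assumption; the article does not say which formulation the Stanford team has in mind, and the function name and patient counts are hypothetical.

```python
def net_benefit(true_positives, false_positives, n_patients, threshold):
    """Net benefit per patient at a given risk threshold.

    Uses the decision-curve formula TP/N - (FP/N) * (pt / (1 - pt)),
    where pt is the threshold probability at which one would intervene.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    harm_weight = threshold / (1 - threshold)
    return (true_positives / n_patients
            - harm_weight * false_positives / n_patients)

# Made-up cohort: 1,000 patients, 90 of whom develop the condition.
# The model flags 210 patients, 30 of them correctly.
model_nb = net_benefit(true_positives=30, false_positives=180,
                       n_patients=1000, threshold=0.10)

# Baseline for comparison: treat every patient as if flagged.
treat_all_nb = net_benefit(true_positives=90, false_positives=910,
                           n_patients=1000, threshold=0.10)

print(round(model_nb, 4), round(treat_all_nb, 4))
```

With these invented numbers the model's net benefit comes out slightly positive (about 0.01) while flagging everyone comes out slightly negative (about -0.011). A model whose net benefit falls below the alternatives of flagging everyone or flagging no one (net benefit zero) would add no value at that threshold.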
A Case for Transparency
The Stanford team makes a case for greater transparency and for stronger incentives for developers to report thoroughly on their models. Lu suggests creating a public dashboard that summarizes the disclosures, or absence of documentation, for every health care AI tool on the market.
“We should be able to say whether a model developed at Duke University is better than a model developed by someone else because it meets more of the recommendations from guidelines,” says Lu. Professional associations could decide whether or not to endorse a product based on how thoroughly it has been documented.
Over the long haul, Lu suggests, the heightened transparency could spark a virtuous competition among AI developers to do the right thing, even though it takes additional effort at the outset.
In addition to Lu and Shah, the study’s other co-authors are Alison Callahan, a research scientist at the Center for Biomedical Informatics Research; Birju S. Patel, an internal medicine physician and a research fellow at Stanford; Dev Dash, an emergency medicine physician and a research fellow at Stanford; and Keith E. Morse, an assistant professor of pediatric medicine at the Stanford School of Medicine.