Medical AI tools promise to improve patient diagnoses, lighten physician workloads, and streamline hospital operations. But do these tools deliver on their promises? To answer that question, Stanford scholars have developed an open-source framework that lets hospital systems determine whether an AI technology would do more good than harm for their workflows and patient outcomes.
Often, health care providers that deploy off-the-shelf AI tools don’t have an effective process for monitoring their utility over time. That is where Stanford’s framework can step in: The Fair, Useful, and Reliable AI Models (FURM) guidance, already in use at Stanford Health Care, assesses the utility of technology ranging from an early-detection model for peripheral arterial disease to a risk prediction model for cancer patients to a chest scan analysis model that assesses whether someone may benefit from a statin prescription.
“One of the key insights we have on our campus is that the benefit we get from any AI solution or model is inextricably tied to the workflow in which it operates and whether we have the time and resources to actually use it in a busy health care setting,” said FURM framework co-creator Dr. Nigam Shah, Stanford professor of medicine and of biomedical data science, as well as Stanford Health Care’s chief data scientist.
Other scholars and developers are creating guidance to ensure AI is safe and equitable, Shah said, but a critical gap lies in assessing the technology’s usefulness and its realistic implementation, since what works for one health care system might not work for another.
How FURM Works
The FURM assessment has three steps:
- The what and why: Understanding what problems the AI model would solve, how its output would be used, and the impact on patients and the health care system. This part of the process also projects financial sustainability and assesses ethical considerations.
- The how: Determining whether it is realistic to deploy the model into the health care system’s workflows as envisioned.
- The impact: Planning how to verify the model’s benefits initially, monitor its output once it is live, and evaluate how well it is working.
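FURM itself is a governance process rather than a piece of software, but to make the three steps concrete, here is a minimal, hypothetical sketch of how an evaluation team might record an assessment as structured data. The field names, the go/no-go summary, and the example values are illustrative assumptions, not the published FURM checklist.

```python
from dataclasses import dataclass

# Hypothetical sketch only: FURM is a governance process, not a software package.
# All field names below are illustrative assumptions, not the published checklist.

@dataclass
class WhatAndWhy:
    problem_statement: str          # what problem the AI model would solve
    intended_use_of_output: str     # how clinicians would act on its output
    affected_population: str        # which patients and parts of the system are affected
    financially_sustainable: bool   # projected to be sustainable over time?
    ethical_review_complete: bool   # ethics assessment performed?

@dataclass
class How:
    fits_existing_workflow: bool    # can staff realistically act on the output?
    staffing_and_time_available: bool

@dataclass
class Impact:
    benefit_verification_plan: str  # how initial benefit will be verified
    monitoring_plan: str            # how live output will be monitored and evaluated

@dataclass
class FurmAssessment:
    model_name: str
    what_and_why: WhatAndWhy
    how: How
    impact: Impact

    def ready_to_deploy(self) -> bool:
        """Rough go/no-go summary: every yes/no item must be satisfied and
        verification and monitoring plans must exist before deployment."""
        w, h, i = self.what_and_why, self.how, self.impact
        return all([
            w.financially_sustainable,
            w.ethical_review_complete,
            h.fits_existing_workflow,
            h.staffing_and_time_available,
            bool(i.benefit_verification_plan),
            bool(i.monitoring_plan),
        ])

# Example with made-up values for one of the models mentioned above:
assessment = FurmAssessment(
    model_name="peripheral-arterial-disease-early-detection",
    what_and_why=WhatAndWhy(
        problem_statement="Identify patients with undiagnosed PAD earlier",
        intended_use_of_output="Flag patients for vascular follow-up",
        affected_population="Adults seen in primary care",
        financially_sustainable=True,
        ethical_review_complete=True,
    ),
    how=How(fits_existing_workflow=True, staffing_and_time_available=False),
    impact=Impact(
        benefit_verification_plan="Pre/post comparison of time to diagnosis",
        monitoring_plan="Quarterly review of alert volume and follow-up rates",
    ),
)
print(assessment.ready_to_deploy())  # False: workflow capacity is not yet in place
```

In practice, each step involves detailed interviews, financial projections, and ethics consultations rather than yes/no fields; the sketch only mirrors the shape of the three steps.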
Shah believes that, just as it has at Stanford, FURM could help health care systems focus their time on technologies worth pursuing, instead of experimenting with everything to see what sticks. “You could end up with what is nicknamed ‘pilotitis,’ a ‘disease’ that affects the organization, where you just end up doing pilot after pilot that goes nowhere,” Shah said.
Additionally, Shah says it is important to consider the scale of impact: A model might perform well but help only 50 patients.
Beyond ROI
AI also has ethical implications that must not be ignored, emphasizes Michelle Mello, Stanford professor of law and of health policy. Mello and Danton Char, Stanford associate professor of anesthesiology, perioperative and pain medicine, and empirical bioethics researcher, created the ethical assessment arm of the FURM framework with the aim of helping hospitals get ahead of potential ethical problems. For example, the ethics team recommends ways that implementers can develop stronger processes to monitor the safety of new tools, evaluates whether and how new tools should be disclosed to patients, and considers how the use of AI tools may widen or narrow health care disparities among patient subgroups.
Dr. Sneha Jain, Stanford clinical assistant professor in cardiovascular medicine and co-creator of FURM, has been involved in developing the methodology to prospectively evaluate AI tools once live, as well as designing ways to make the FURM framework more accessible for systems outside Stanford. She is currently building the Stanford GUIDE-AI lab, which stands for Guidance for the Use, Implementation, Development, and Evaluation of AI. The goal, Jain said, is twofold: to keep improving the team’s AI evaluation processes, and to ensure that not only highly resourced health systems but also hospitals with lower tech budgets can responsibly use AI tools. Mello and Char are pursuing similar work for the ethics assessment process, with funding from the Patient-Centered Outcomes Research Institute and Stanford Impact Labs.
“AI tools are quickly being deployed in health care systems with varying degrees of oversight and evaluation,” Jain explained. “Our hope is that we can democratize robust yet feasible evaluation processes for these tools and associated workflows to improve the kind of care that patients get across the United States – and hopefully one day across the world.”
Moving forward, this interdisciplinary group of Stanford researchers wants to continue adapting the FURM framework to meet the needs of evolving AI technologies, including generative AI, which is advancing by the day.
“If you develop standards or processes that are not workable for people, they’re just not going to do it,” Mello added. “A key part of our work is figuring out how to implement tools effectively, especially in a field where everyone is pushing to move quickly.”