When AI Reads Medical Images: Regulating to Get It Right

Date

December 03, 2020

China Daily/via REUTERS

Stanford researchers propose a framework for regulating diagnostic algorithms that will ensure world-class clinical performance and build trust among clinicians and patients.

When a woman gets a mammogram, the resulting image might well be “read” by an AI algorithm instead of a doctor. In fact, AI is now capable of performing many diagnostic tasks traditionally done by radiologists – from spotting tumors or blocked arteries to classifying them by type or severity.

But that doesn’t mean these algorithms have been fully vetted by the FDA to make sure they are safe, accurate, and reliable. The truth is that the FDA is much more familiar with regulating drugs and the technology associated with hardware such as pacemakers than it is with AI algorithms. And there’s an urgent need for the performance of these algorithms to be carefully evaluated because the number of diagnostic tasks they can carry out is exploding, says David Larson, Stanford University professor of radiology.

“This software needs to reliably do what it is purportedly designed to do and signal to users when it cannot do so – such as when image quality is poor,” he says. “That level of reliability hasn’t really been required to date.”

In a new paper, Larson and Curtis Langlotz, professor of radiology at Stanford and an affiliated faculty member of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), propose a framework for FDA regulation of AI-based diagnostic imaging software.

To make sure these algorithms can be meaningfully regulated and compared to one another, Larson and Langlotz say that definitions of the diagnostic tasks that the algorithms perform need to be standardized. And these clinical standards need to be developed by the medical community – not algorithm manufacturers.

Algorithm manufacturers should also be required to go through four phases of development and evaluation, the authors say. Larson compares this framework to the FDA’s multi-phased clinical trials approach to making sure pharmaceuticals are safe and reliable. Specifically, each algorithm should progress through feasibility testing (on a small test set); capability testing (in a controlled environment simulating real-world conditions); effectiveness testing (in a real clinic); and durability testing (including monitoring and improvement over time).

“We need algorithms that are rigorously designed, well built, and stress-tested before they are implemented in the clinical environment,” Larson says. “It’s much easier to make an algorithm that’s usually right than one that’s almost never wrong. We should strive for world-class clinical performance at every site.”

A Novel Regulatory Arena

The FDA has a long history of regulating drugs and devices, including the software that controls hardware such as pacemakers, insulin pumps, or pulse oximeters. In recent years, it has also begun reviewing proposals for regulating software that is not associated with a piece of medical hardware but is used for diagnosing, preventing, monitoring, or treating disease (so-called “software as a medical device,” or SaMD).

AI-based diagnostic algorithms for medical imaging differ from most SaMD in fundamental ways that merit a different regulatory framework, Larson says. For example, the head of the FDA acknowledged in 2019 that because AI algorithms continually learn from the medical images they review, the traditional approach to reviewing and approving SaMD upgrades may not be appropriate. Instead, the FDA may need to implement a protocol for regulating the entire lifecycle of AI-based algorithms.

So far, the FDA has issued a discussion paper on regulating AI-based SaMD. Larson and Langlotz say it makes a good start but leaves a number of important questions unanswered, which their proposed framework attempts to address.

Avoiding Hardwired Chaos

One problem with the FDA’s proposed regulatory scheme is its failure to separate the definition of the diagnostic task from the algorithm. The diagnostic task, Larson says, should be defined and standardized by the medical community before it is enshrined in software. Failing to do so will yield algorithms that rely on unstated proprietary diagnostic criteria and that are difficult to compare to one another.

As an example, Larson points to the early days of the coronavirus pandemic when radiologists worldwide proposed multiple different scoring systems to categorize what they were seeing in lung scans. They floated classifications of disease severity on scales from 0 to 4, 0 to 6, or even 0 to 25. And that’s typical for clinical scoring and classification systems: They are tried in clinical practice, published in the literature, modified and revised over time, and ultimately reconciled to become a standard. “That chaos is healthy early on, as long as it self-corrects over time,” Larson says.

But if an early standard for the severity of coronavirus disease had been incorporated into a proprietary AI-based algorithm, that would have been problematic, Larson says. Once manufacturers start to hard-code heterogeneous and proprietary diagnostic task definitions, it will be much more difficult to compare algorithms’ performance and reconcile different standards. “We’ll be at risk for hardwiring the chaos,” Larson says.

Instead, the medical community should be responsible for standardizing, maintaining, and updating the schema for classifying the severity of lung damage in coronavirus patients’ scans. Once that is done, diagnostic algorithm developers should be required to use those community standards and keep their algorithms up-to-date as standards evolve.

“A manufacturer should refine the algorithm, but they shouldn’t be changing the clinical scoring system. That should be done by medical professionals,” Larson says. “The manufacturer should just be held responsible for accurately and reliably applying the clinical scoring system.”

Requiring the diagnostic task definition to be independent of the algorithm will ensure that, from the get-go, regulators can compare how well various algorithms apply the scoring system to a set of images. “If we can’t even compare the algorithms to one another in a controlled environment, we will have no idea how they will compare in the real world,” Larson says. “It’s important that we get this right soon, so we don’t have to go back and clean it up later.”

Stress Testing Diagnostic Algorithms

Among other things, Larson and Langlotz propose that diagnostic algorithms should be well defined, capable of doing what they claim to do, proven effective in ideal settings and the real world, and durable over time.

“We should try to break these things before putting them out in the real world,” Larson says.

In their paper, Larson and Langlotz list 12 measures of performance that should be applied to diagnostic algorithms.

One key proposal: Algorithms should include fail-safes so that they know when they are wrong and can signal that to the clinician. “They have to have the ability to be self-aware,” Larson says, “and right now that is not required.”
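As a rough illustration of what such a fail-safe could look like in software, the sketch below has an algorithm decline to give a reading when image quality or its own confidence falls below a threshold. It is not taken from Larson and Langlotz's paper: the function name, the `Reading` type, the quality score, and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Reading:
    label: Optional[str]  # None means the algorithm abstained
    confidence: float     # probability of the top-ranked label
    reason: str           # why a label was or was not returned


def read_with_failsafe(probabilities: Dict[str, float],
                       image_quality: float,
                       min_quality: float = 0.5,       # hypothetical threshold
                       min_confidence: float = 0.9,    # hypothetical threshold
                       ) -> Reading:
    """Return a label only when image quality and confidence both pass."""
    top_label = max(probabilities, key=probabilities.get)
    top_prob = probabilities[top_label]

    # Signal to the clinician instead of guessing on a poor image.
    if image_quality < min_quality:
        return Reading(None, top_prob, "image quality too low; refer to radiologist")

    # Signal to the clinician when the model itself is unsure.
    if top_prob < min_confidence:
        return Reading(None, top_prob, "low confidence; refer to radiologist")

    return Reading(top_label, top_prob, "ok")
```

In this sketch an abstention is an explicit result with a stated reason, so the clinician sees why no reading was given rather than receiving a silent guess.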

Algorithms will also need to be compared to one another using a common dataset that includes images that are representative of different pathologies and patient demographics as well as images obtained from various types and brands of scanners. To accomplish this, algorithm manufacturers might be required to submit their AI algorithms to an independent entity, which would run them on the dataset and prepare reports on their performance.
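One way to picture such a report is a per-subgroup breakdown of accuracy on the shared dataset, so that weak performance on one scanner brand or patient group is not hidden by a good overall average. The sketch below is an assumption-laden illustration, not the evaluation procedure from the paper: the case fields (`truth`, `scanner`) and the function name are hypothetical, and plain accuracy stands in for whatever performance measures a real evaluator would use.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Case = Dict[str, str]  # hypothetical fields: "image_id", "truth", "scanner", ...


def accuracy_by_group(algorithm: Callable[[Case], str],
                      cases: List[Case],
                      group_key: str) -> Dict[str, float]:
    """Run one algorithm on a common dataset and report accuracy per subgroup."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        group = case[group_key]
        total[group] += 1
        # Every algorithm is scored against the same reference label.
        if algorithm(case) == case["truth"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}
```

Because every algorithm is run on the same cases and scored against the same labels, the resulting tables can be compared directly across manufacturers.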

Manufacturers will also need to deal with differences in how well algorithms perform at different clinical sites – a problem that has been widely observed, Larson says. “It should be up to the manufacturer to make sure that the algorithm’s clinical performance at a given site closely matches the clinical performance both in their test settings and in their early implementation sites,” he says.
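A simple version of that check compares each site's measured performance with the figure from the manufacturer's test setting and flags sites that fall short by more than an agreed margin. The sketch below assumes a single summary metric per site and a hypothetical tolerance of 0.05; neither the metric nor the margin comes from the paper.

```python
from typing import Dict, List


def sites_below_reference(reference: float,
                          site_performance: Dict[str, float],
                          tolerance: float = 0.05,  # hypothetical margin
                          ) -> List[str]:
    """List sites whose performance trails the reference by more than tolerance."""
    return sorted(
        site
        for site, score in site_performance.items()
        if reference - score > tolerance
    )
```

For example, with a reference of 0.95, a site scoring 0.94 would pass and a site scoring 0.80 would be flagged for follow-up.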

Ensuring Trust Is a Win-Win

This might seem like a lot, Larson says, but poor performance by an algorithm could cost lives. “If that’s my mother’s chest X-ray it’s reading, I would want it to pass these tests,” he says.

Larson likens the proposed regulatory framework to the safety requirements for automobiles or aircraft. “Products that can impact people’s lives have to work reliably in stressful conditions. And if they stop working, they need to fail in a way that’s not going to hurt people.”

Better regulation could also broaden the algorithms’ use. So far, the deployment of AI-based diagnostic imaging algorithms has been slow, in part because clinicians don’t know whether they can be trusted, Larson says. Greater transparency regarding their performance will help clinicians gain that trust, which will help manufacturers deploy their algorithms more widely.

“We think this is a win-win for pretty much everybody,” Larson says. “When you combine transparency and standards, it creates trust in the product.”

Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.
