
Peering into the Black Box of AI Medical Programs

To realize the benefits of AI in detecting diseases such as skin cancer, doctors need to trust in the decisions rendered by AI. That requires better understanding of its internal reasoning.


Researchers have developed a new way to reveal how artificial intelligence programs called classifiers arrive at decisions when making medical diagnoses. Although these AI classifiers offer immense promise in health care, their “black box” nature, where the reasoning behind their decision-making is opaque to humans, has hindered trustworthiness. 

To peel back the curtain on inscrutable classifiers, researchers at Stanford University and the University of Washington have leveraged human expertise alongside another kind of artificial intelligence, generative AI. First, the researchers tasked dermatology AI algorithms with characterizing images of skin lesions as either likely malignant — indicative of melanoma, the deadliest form of skin cancer — or likely benign. Then, the researchers trained a generative AI model paired with each dermatology AI algorithm to churn out thousands of modified lesion images that appeared either “more benign” or “more malignant” to the algorithm. Finally, two human dermatologists assessed the images to gauge what sorts of features the AI classifiers had factored into their decision-making. Assessing features that caused classifiers to flip from benign to malignant was especially informative.

In this way, the researchers created an auditing framework for making AI reasoning more understandable to humans. Such efforts at explainable AI, or XAI, in medicine could help developers discover when their models are relying on spurious correlations in training set data, thus providing an opportunity to fix those issues before deployment to doctors and patients.
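The audit pipeline described above — perturb an input until the classifier's verdict flips, then inspect which features moved — can be illustrated with a minimal sketch. Everything here is hypothetical: a toy logistic "classifier" over made-up lesion features stands in for a real dermatology model, and a simple gradient nudge stands in for the study's generative counterfactuals. Note that the toy weights deliberately give `hair_density` some influence, mimicking the kind of spurious correlation the framework is designed to surface.

```python
import numpy as np

# Hypothetical lesion features; names and weights are illustrative only.
FEATURES = ["pigment_irregularity", "blue_white_veil", "hair_density"]
WEIGHTS = np.array([2.0, 1.5, 0.8])  # hair_density should be clinically irrelevant
BIAS = -1.5

def malignancy_prob(x):
    """Toy classifier: probability that a lesion is malignant."""
    return 1.0 / (1.0 + np.exp(-(WEIGHTS @ x + BIAS)))

def counterfactual(x, target=0.5, step=0.05, max_iter=200):
    """Nudge features along the probability gradient until the prediction
    crosses the decision boundary -- a stand-in for generative counterfactuals."""
    x = x.copy()
    direction = 1.0 if malignancy_prob(x) < target else -1.0
    for _ in range(max_iter):
        p = malignancy_prob(x)
        if (direction > 0) == (p >= target):
            break  # prediction has flipped
        grad = p * (1 - p) * WEIGHTS  # d(prob)/dx for the logistic model
        x += direction * step * grad / np.linalg.norm(grad)
    return x

benign = np.array([0.2, 0.1, 0.9])  # benign-looking lesion on hairy skin
flipped = counterfactual(benign)
deltas = dict(zip(FEATURES, np.round(flipped - benign, 3)))
print(deltas)  # the largest shifts reveal which features drive the flip
```

In the real study, the "nudge" is performed by a trained generative model in image space and the resulting images are judged by dermatologists; the principle — read the classifier's reasoning off the features that change at the decision boundary — is the same.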

“Our motivation for this study was understanding the factors in an image that might be impacting the decision-making of an AI model,” says senior study co-author Roxana Daneshjou, an assistant professor of biomedical data science and of dermatology at Stanford University and a faculty affiliate of the Stanford Institute for Human-Centered Artificial Intelligence (HAI). “Thanks to our auditing framework, we can now see what’s going on under the hood in medical AI models.”

For the study, published in Nature Biomedical Engineering, Daneshjou and colleagues evaluated five dermatology AI classifiers used in academia and commercially by consumers. Although the U.S. Food and Drug Administration has not approved any image-based computer vision models for dermatology, some of these models have already received a regulatory green light in Europe. Furthermore, quite a number of AI dermatology tools are widely available in Apple and Android app stores. 

“These direct-to-consumer apps are concerning because consumers really don’t know what they’re getting at this point,” says Daneshjou, who is also the assistant director of the Center of Excellence for Precision Health & Pharmacogenomics and director of informatics for the Stanford Skin Innovation and Interventional Research Group (SIIRG). “Understanding AI algorithms’ decisions will be valuable for showing if these models are making decisions based on clinically important features.”

To get the proverbial peek under the hood, the research team used a training set of images for the five classifiers incorporating two common forms of visual dermatological data: dermoscopic images, which are taken through a magnifying medical device that visualizes deeper layers of the skin, and clinical images, snapped by ordinary digital cameras. 

Each class of image provides different information while also presenting unique artifacts that dermatologists (and well-developed AI algorithms) must account for. For instance, zoomed-in dermoscopic images better reveal the fine details of a lesion, but can also include ruler markings and other device displays. Wider-view clinical images, meanwhile, can offer additional context about lesions, such as what the surrounding skin looks like, but as a result can also more easily capture body hair and patients’ clothing. 

Ultimately, in reviewing the diagnostic decisions offered by the AI classifiers on both the real and the counterfactual generative-AI-tweaked images, the researchers were able to peer into the “black box,” as it were, of each classifier. Reassuringly, many medically meaningful features of lesions were considered by the AIs, consistent with human dermatologists. Examples for diagnosing melanoma include atypical darker pigmentation patterns and so-called blue-white veils, where blue pigmentation appears under a glazing white layer. Yet in other instances, the models used medically dubious or debatably relevant attributes, like the amount of hair on the background skin. 

“It could be that the training set for a particular dermatology AI classifier contained a very high number of images of true, biopsy-confirmed melanomas that happened to appear on hairy skin, so the classifier has made an incorrect association between melanoma likelihood and hairiness,” says Daneshjou. “Bringing this kind of issue to light through our auditing framework would give developers the chance to correct the problem.”

Helpfully, the auditing approach devised by the Stanford and University of Washington researchers can also readily apply to other computer-vision-based medical AI applications, for instance in radiology and pathology. Overall, these approaches to XAI should help medical AI developers boost the accuracy of their products and instill greater user confidence.

“It’s important that medical AI classifiers receive proper vetting by interrogating their reasoning processes and making them as understandable as possible to human users and developers,” says Daneshjou. “If fully realized, AI has the power to transform certain areas of medicine and improve patient outcomes.”

The study was led by Alex J. DeGrave in the Allen School of Computer Science & Engineering at the University of Washington. Co-authors include Joseph D. Janizek and Su-In Lee, also at the University of Washington, and Zhuo Ran Cai in the Department of Dermatology at the Stanford University School of Medicine. 

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.