
Generating Medical Errors: GenAI and Erroneous Medical References

A new study finds that large language models used widely for medical assessments cannot back up claims.


Large language models (LLMs) are infiltrating the medical field. One in 10 doctors already uses ChatGPT in day-to-day work, and patients have taken to ChatGPT to diagnose themselves. The Today Show featured the story of a 4-year-old boy, Alex, whose chronic illness was diagnosed by ChatGPT after more than a dozen doctors failed to do so.

This rapid, much-celebrated adoption has occurred despite substantial uncertainty about the safety, effectiveness, and risks of generative AI (GenAI). U.S. Food and Drug Administration Commissioner Robert Califf has publicly stated that the agency is "struggling" to regulate GenAI.

The reason is that GenAI sits in a gray area between two existing forms of technology. On one hand, sites like WebMD that strictly report known medical information from credible sources are not regulated by the FDA. On the other hand, medical devices that interpret patient information and make predictions in medium-to-high-risk domains are carefully evaluated by the FDA, which to date has approved over 700 AI medical devices. But because LLMs produce a combination of existing medical information and potential ideas that go beyond it, the critical question is whether such models produce accurate references to substantiate their responses. Such references enable doctors and patients to verify a GenAI assessment and guard against the high prevalence of “hallucinations.”

For every 4-year-old Alex, where the creativity of an LLM may produce a diagnosis that physicians missed, there may be many more patients who are led astray by hallucinations. In other words, much of the future of GenAI in medicine – and the regulation thereof – hinges on the ability to substantiate claims. 

Evaluating References in LLMs 

Unfortunately, very little evidence exists about the ability of LLMs to substantiate claims. In a new preprint study, we develop an approach to verify how well LLMs are able to cite medical references and whether these references actually support the claims generated by the models. 

The short answer: poorly. For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported and nearly half of its responses are not fully supported. 


Evaluation of the quality of source verification in LLMs on medical queries. Each model is evaluated on three metrics over 1,200 questions. Source URL validity measures the proportion of generated URLs that return a valid webpage. Statement-level support measures the percentage of statements that are supported by at least one source in the same response. Response-level support measures the percentage of responses that have all of their statements supported.
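The three metrics above reduce to simple ratios once each response has been decomposed. A minimal sketch, assuming each response is already annotated with per-URL validity and per-statement support judgments (the `Response` structure here is our illustration, not the study's code):

```python
from dataclasses import dataclass

@dataclass
class Response:
    """One LLM answer, decomposed into statements and cited sources."""
    statements: list   # individual factual claims in the response
    url_valid: dict    # url -> bool: did the cited URL resolve to a real page?
    support: dict      # statement -> set of cited URLs judged to support it

def source_url_validity(responses):
    # Fraction of all generated URLs that return a valid webpage.
    flags = [ok for r in responses for ok in r.url_valid.values()]
    return sum(flags) / len(flags)

def statement_level_support(responses):
    # Fraction of statements backed by at least one source in the same response.
    flags = [bool(r.support.get(s)) for r in responses for s in r.statements]
    return sum(flags) / len(flags)

def response_level_support(responses):
    # Fraction of responses in which every statement is supported.
    flags = [all(r.support.get(s) for s in r.statements) for r in responses]
    return sum(flags) / len(flags)
```

Note that response-level support is the strictest metric: a single unsupported statement marks the entire response as not fully supported, which is why it is always at or below statement-level support.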

How did we develop this evaluation approach? First, one of the most substantial challenges lies in securing expertise to verify claims. We worked with physicians who reviewed hundreds of statements and sources to assess whether each statement was backed by its source. 

Such expert reviews are, of course, costly and time-intensive, so we next tested whether LLMs can be used to scale physician assessments. We adapted GPT-4 to verify whether sources substantiate statements and found the approach to be surprisingly reliable: the model agreed with the physician consensus more often than the physicians agreed with one another. This is promising because it suggests LLM-based evaluations could keep pace with rapidly updated models without requiring costly human expertise.
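This verification step can be sketched as an LLM-as-judge call: present the claim and the source text together and ask for a binary judgment. The prompt wording and the `ask_llm` wrapper below are hypothetical stand-ins, not the study's actual prompt or API:

```python
# Illustrative LLM-as-judge verifier. `ask_llm` is a hypothetical wrapper
# around any chat-completion API (e.g., one that sends a prompt string and
# returns the model's text reply).

VERIFY_PROMPT = """You are verifying a medical claim against a source.
Claim: {claim}
Source text: {source}
Does the source support the claim? Answer YES or NO."""

def verify_statement(claim: str, source_text: str, ask_llm) -> bool:
    """Return True if the judge model says the source supports the claim."""
    reply = ask_llm(VERIFY_PROMPT.format(claim=claim, source=source_text))
    return reply.strip().upper().startswith("YES")
```

Constraining the judge to a YES/NO output keeps the parsing trivial and makes agreement with human raters easy to measure.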

Finally, using this model, we developed an end-to-end evaluation pipeline called SourceCheckup. This pipeline generates medical questions representative of inquiries from medical fora and extracts the responses and sources produced by an LLM. Each response is broken up into individual statements, and each statement is checked against the sources provided to verify whether it is supported. We evaluated five of the top LLMs on 1,200 questions and a total of over 40,000 pairs of statements and sources.
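The pipeline's overall shape can be sketched as a composition of four steps, each injected as a function: generate a response with sources, fetch the cited pages, split the response into statements, and verify each statement against each fetched page. All names here are our own illustration; the real SourceCheckup implementation differs:

```python
def source_checkup(question, generate, split, fetch, verify):
    """Sketch of a SourceCheckup-style pipeline (function names are ours).

    generate(question) -> (response_text, list of cited urls)
    split(response_text) -> list of individual statements
    fetch(url) -> page text, or None if the URL is invalid
    verify(statement, page_text) -> bool (does the page support the statement?)
    """
    text, urls = generate(question)
    pages = {u: fetch(u) for u in urls}          # resolve each cited source
    statements = split(text)                     # break response into claims
    return {
        "url_valid": {u: pages[u] is not None for u in urls},
        "support": {
            s: {u for u, p in pages.items() if p is not None and verify(s, p)}
            for s in statements
        },
    }
```

Each report produced this way feeds directly into the three aggregate metrics: URL validity, statement-level support, and response-level support.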

Pervasive Errors in Substantiation

Our results are stark: Most models struggle to produce relevant sources. Four out of five models hallucinate a significant proportion of sources by producing invalid URLs. This problem goes away with the retrieval augmented generation (RAG) model, which first performs a web search for relevant sources before producing a summary of its findings. However, even for GPT-4 with RAG, we find that up to 30% of statements are not supported by any of the sources provided, and nearly half of responses contain at least one unsupported statement. This pattern is more pronounced in the other four models, with as few as 10% of responses fully supported in Gemini Pro, Google's recently released LLM.

For example, one response by GPT-4 RAG indicated that criteria for gambling addictions (from the Diagnostic and Statistical Manual of Mental Disorders) are equally applicable across all individuals and groups. But the source it referenced concluded the opposite, finding that "the assumed equal impact of each criterion lacks support in the findings." In another example, the model recommended a starting dose of 360 joules for a monophasic defibrillator (one where the current runs one way to treat a patient with cardiac arrest), but the source only mentioned biphasic defibrillators (where the current runs both ways). That failure to distinguish can matter greatly, as technology has shifted toward biphasic defibrillators, which in fact use lower electric currents.

In short, even the most advanced models fall seriously short of being able to substantiate answers. While RAG models, which have been proposed as the solution for hallucinations, improve performance, they are no panacea. 

Errors More Likely for Lay Inquiries 

Many have argued that LLMs may democratize access to health care by providing much-needed information to patients without requiring a physician. 

Our evaluation framework allows us to assess whether errors vary by the type of inquiry. Our medical questions are based on three underlying reference texts: (1) the MayoClinic, which provides patient-facing fact pages, (2) UpToDate, which provides articles to physicians with a deeper level of medical detail, and (3) Reddit’s r/AskDocs forum, which includes many lay questions that may not have clearly defined answers and which require information from various medical domains.   

We found that the ability of LLMs to substantiate answers varies substantially by type of inquiry. Performance is best for MayoClinic and UpToDate and worst for Reddit: only 30% of the answers to inquiries based on Reddit can be fully substantiated by their sources, even with GPT-4 RAG.

In other words, our findings suggest that LLMs perform worst for exactly the kind of patients that might need this information the most. Where inquiries are mediated by medical professionals, LLMs have an easier time pointing to reliable sources. This has substantial implications for the distributive effects of this technology on health knowledge. 

‘A Long Way to Go’

Many commentators have declared the end of health care as we know it, given the apparent ability of LLMs to pass U.S. Medical Licensing Exams. But health care practice involves more than answering a multiple-choice test. It involves substantiating, explaining, and assessing claims with reliable, scientific sources. And on that score, GenAI still has a long way to go.

Promising research directions include more domain-informed work, such as adapting RAG specifically to medical applications. Source verification should be regularly evaluated to ensure that models provide credible and reliable information. At least by the current approach of the FDA – which draws a distinction between medical knowledge bases and diagnostic tools regulated as medical devices – widely used LLMs pose a problem. Many of their responses cannot be consistently and fully supported by existing medical sources. 

As LLMs continue to grow in their capabilities and usage, regulators and doctors should carefully consider how these models are being evaluated, used, and integrated.

Kevin Wu is a PhD student in Biomedical Informatics at Stanford University.

Eric Wu is a PhD student in Electrical Engineering at Stanford University.

Daniel E. Ho is the William Benjamin Scott and Luna M. Scott Professor of Law, Professor of Political Science, Professor of Computer Science (by courtesy), Senior Fellow at HAI, Senior Fellow at SIEPR, and Director of the RegLab at Stanford University. 

James Zou is an associate professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering at Stanford University. He is also a Chan-Zuckerberg Investigator.
