A new study from Stanford researchers has highlighted the untapped potential of large language models, a form of artificial intelligence, to improve the accuracy of medical diagnoses and clinical reasoning.
The researchers presented a series of cases based on actual patients to the popular chatbot ChatGPT (GPT-4) and to 50 physicians, asking each for a diagnosis. Half of the physicians used conventional diagnostic resources, such as medical manuals and internet search, while the other half had ChatGPT available as a diagnostic aid.
Overall, ChatGPT on its own performed very well, posting a median score of about 92, the equivalent of an “A” grade. Physicians in the non-AI and AI-assisted groups earned median scores of 74 and 76, respectively, meaning the doctors did not articulate as comprehensive a set of diagnosis-related reasoning steps.
Read the full study, “Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study.”
The counterintuitive finding suggests physicians have room to learn how to use these AI tools to their fullest, the scholars say. With effective training and clinical integration, they believe, large language models in health care settings could ultimately benefit patients.
“Our study shows that ChatGPT has potential as a powerful tool in medical diagnostics, so we were surprised to see its availability to physicians did not significantly improve clinical reasoning,” says study co-lead author Ethan Goh, a postdoctoral scholar in Stanford’s School of Medicine and research fellow at Stanford’s Clinical Excellence Research Center. “The findings suggest there are opportunities for further improvement in physician-AI collaboration in clinical practice and health care more broadly.”
“What is very possible is that once a human feels like they’ve got a diagnosis, they don’t ‘waste time or space’ on explaining more of the steps for why,” added Jonathan H. Chen, an assistant professor at Stanford’s School of Medicine and the paper’s senior author. “There’s also a real phenomenon that often human experts cannot themselves explain exactly why they made correct decisions.”
The study was recently published in JAMA Network Open and accepted for presentation at the American Medical Informatics Association’s 2024 symposium in November.
Delivering Diagnoses
Large language models, or LLMs, have exploded in prominence since the arrival of ChatGPT in November 2022 from San Francisco-based OpenAI. LLMs are programs trained on massive amounts of data containing natural human language, such as websites and books. Based on this training, LLMs can respond to natural-language queries with fluent, cogent answers.
Already, LLMs have made significant inroads in numerous fields, including finance and content generation, and health care is likewise expected to be a major adopter. One of the most promising applications, Goh says, is reducing the diagnostic errors that remain all too common and harmful in modern medicine. To date, many studies have demonstrated LLMs’ capable handling of multiple-choice and open-ended medical reasoning examination questions, but the tools’ use beyond education and in actual clinical practice has not been as well examined.
With their new multisite study, Goh and his colleagues sought to address this gap. The researchers recruited 50 physicians from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia. Most physicians specialized in internal medicine, though emergency medicine and family medicine were represented as well.
Over the course of an hour, the participating physicians reviewed up to six complex clinical vignettes, similar to those that appear on diagnostic reasoning tests and based on actual patient histories, physical exams, and lab results. In response to each case, the physicians offered the diagnoses they considered plausible, along with additional steps for evaluating the patient.
Just as in normal health care settings, the participants relied on their own medical knowledge and experience, as well as the reference materials made available to them. Of the participants randomly assigned to use ChatGPT in their clinical assessments, about a third reported frequent or occasional prior use of the tool. The gap between ChatGPT’s performance on its own and that of physicians with access to it suggests that many physicians in the ChatGPT-access group did not agree with, or factor in, the model’s diagnostic predictions.
Even though ChatGPT access did not improve physicians’ diagnostic accuracy, those with access completed their individual case assessments more than a minute faster, on average, than physicians working without the tool. These findings, which will need validation through additional research targeted at the time-saving question, suggest that ChatGPT and similar tools, even at this early stage of professional uptake, can at least improve diagnostic turnaround in time-constrained clinical environments.
“ChatGPT can help make doctors’ lives more efficient,” says Goh. “Those time savings alone could justify the use of large language models and could translate into less burnout for doctors in the long run.”
Enhancing Human-AI Teamwork
Through its results, the study also points to ways that physician-AI collaboration in clinical practice can be improved. Goh suggests that physician trust is a fundamental element: in practice, physicians would need to weigh the AI’s perspective as valid and potentially correct. This sort of earned trust could come in part from physicians understanding how an AI model was trained and on what materials. Accordingly, an LLM tailored to health care, rather than a generalized tool like ChatGPT, might instill more confidence. In addition, physicians, just like everyone else, will need to gain familiarity with and experience using LLMs. Professional development to learn best practices could also pay dividends.
Above all, patient safety must remain at the core of any AI clinical applications, Goh notes. Guardrails need to be in place on the physicians’ side to ensure AI responses are vetted and not treated as the final diagnostic verdict, he advises, and patients will continue to expect and want the intermediary of a trusted human professional. “AI is not replacing doctors,” Goh says. “Only your doctor will prescribe medications, perform operations, or administer any other interventions.”
Nevertheless, AI is here to help, Goh says.
“What patients care about more than their diagnosis is making sure that whatever the condition they have is getting treated right,” Goh says. “Human physicians handle the treatment side of things, and the hope is that AI tools can help them perform their jobs even better.”
Following up on this groundbreaking study, Stanford University, Beth Israel Deaconess Medical Center, the University of Virginia, and the University of Minnesota have also launched a bi-coastal AI evaluation network called ARiSE (AI Research and Science Evaluation) to further evaluate generative AI outputs in health care. Find more information at the ARiSE website.
Other Stanford-affiliated authors of the study include Jason Hom, Eric Strong, Yingjie Weng, and Neera Ahuja at the Stanford University School of Medicine; Eric Horvitz at Microsoft and the Stanford Institute for Human-Centered Artificial Intelligence (HAI); Arnold Milstein at the Stanford Clinical Excellence Research Center; and co-senior author Jonathan Chen at the Stanford Center for Biomedical Informatics Research and the Stanford Clinical Excellence Research Center.
Other authors of the study are Robert Gallo, co-lead author at the Center for Innovation to Implementation at the VA Palo Alto Health Care System; Hannah Kerman, Joséphine Cool, and Zahir Kanjee at Beth Israel Deaconess Medical Center and Harvard Medical School; Andrew S. Parsons at the University of Virginia School of Medicine; Daniel Yang at Kaiser Permanente; and co-senior authors Andrew P.J. Olson at the University of Minnesota Medical School and Adam Rodman at Beth Israel Deaconess Medical Center and Harvard Medical School.