Doctors Receptive to AI Collaboration in Simulated Clinical Cases, with No Added Bias
While many health care practitioners believe generative language models like ChatGPT will one day be commonplace in medical evaluations, it’s unclear how these tools will fit into the clinical environment. A new study points the way to a future where human physicians and generative AI collaborate to improve patient outcomes.
In a mock medical environment with role-playing patients reporting chest pain, doctors accepted the advice of a prototype ChatGPT-like medical agent and even willingly revised their diagnoses in response. The upshot was better outcomes for the patients.
In the trial, 50 licensed doctors reviewed videos of white male and Black female actors describing their chest pain symptoms, along with the patients’ electrocardiograms, and made triage, risk, and treatment assessments. The doctors were then presented with ChatGPT-based recommendations derived from the same conversations and asked to reevaluate their own assessments.
Unconventional Wisdom
The study found that the doctors were not just receptive to the AI’s advice but willing to reconsider their own analyses in light of it. Most important, this willingness led to a “significant improvement” in the accuracy of the doctors’ clinical decisions. Notable as well is that the racial and gender makeup of the patient pool was not happenstance but carefully structured into the study to test whether the AI introduced or intensified existing racial or gender biases. The study found that it did not.
The study’s findings go against the conventional wisdom that doctors may be resistant, or even antagonistic, to the introduction of AI in their workflows.
“This study shows that doctors who work with AI do so collaboratively. It’s not at all adversarial,” said Ethan Goh, a health care AI researcher in Stanford’s Clinical Excellence Research Center (CERC) and the first author of the study. “And, when the AI tool is good, the collaboration produces better outcomes.”
The study was published as a preprint on medRxiv and has been accepted to a peer-reviewed conference, the AMIA Informatics Summit, taking place in Boston this March.
Milestone Moment
Goh is quick to point out that the AI tools used in the study are only prototypes and not yet ready or approved for clinical application. The results are nonetheless encouraging about the prospects for future collaboration between doctors and AI, he said.
“The overall point is that when we do have those tools someday, they could prove useful in augmenting doctors and improving outcomes. And, far from being resistant to such tools, physicians seem open to, even welcoming of, such advances,” Goh said. In a survey following the trial, a majority of the doctors said they fully expect large language model (LLM)-based tools to play a significant role in clinical decision-making.
The authors write that this study marks “a critical milestone” in the progress of LLMs in medicine. With it, medicine moves beyond asking whether generative LLMs belong in the clinical environment to asking exactly how they will fit into that environment and how they will support human physicians in their work rather than replace them, Goh said.
“It’s no longer a question of whether LLMs will replace doctors in the clinic — they won’t — but how humans and machines will work together to make medicine better for everyone,” Goh said.