
AI Can Outperform Humans in Writing Medical Summaries

A new study adapts large language models to summarize clinical documents, showing a promising path for AI to improve clinical workflows and patient care.


Physicians meet with dozens of patients every day and make critical health-related recommendations based on notes, patient descriptions, test results, and diagnostic information gathered in those meetings. All this textual information is typically amassed in the patient’s electronic health record (EHR). The sheer volume of information in electronic health records has reached an inflection point in modern medicine: most doctors now rely on short summaries of long-form notes and medical records to manage patient care.

“The clinical burden of medical documentation is high, and it is time-consuming work. And this has consequences for patients,” says Dave Van Veen, a doctoral candidate in electrical engineering and first author of a new study in the journal Nature Medicine, who is exploring the possibilities of AI-assisted summarization. “Doctors have less time for patient care, and there is always the possibility of error anytime you are summarizing information from an EHR.”

In the study, Van Veen and colleagues at Stanford University adapted eight large language models (LLMs) to clinical text and tested their summarization skills against those of human medical experts. More often than not, the researchers say, physicians preferred summaries generated by AI to those written by humans.

Read the study: Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization


“AI often generates summaries that are comparable to or better than those written by medical experts. This demonstrates the potential of LLMs to integrate into the clinical workflow and reduce documentation burden,” says the study’s senior author, Akshay Chaudhari, an assistant professor of radiology and, by courtesy, of biomedical data science. “Such technology development and validation may allow clinicians to spend more time with patients rather than the EHR.” 

Apples to Apples

Summarizing medical records is difficult, highly consequential, and detail-oriented work even for experienced medical professionals. 

“We think AI has great potential to serve in an assistive capacity to expedite the physician’s caseload and also to reduce errors. More time with the patient and greater accuracy could lead to better patient care,” says Van Veen.

In their study, Chaudhari, Van Veen, and colleagues worked with eight established LLMs and adapted them to summarize a range of textual medical information—radiology reports, patient questions, progress notes, and doctor–patient dialogues. Then, in blind tests, a panel of 10 physicians compared summaries generated by the best performing LLMs to those created by human medical experts, rating the summaries on “completeness, correctness, and conciseness.” 
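One of the ways LLMs can be adapted to a new domain without retraining is in-context learning: showing the model a few example report–summary pairs directly in the prompt. The sketch below is illustrative only; the function name, instruction text, and example reports are invented, not taken from the study.

```python
# Minimal sketch of a few-shot summarization prompt, one plausible way to
# adapt a general-purpose LLM to clinical text. All names and example text
# here are hypothetical.

def build_few_shot_prompt(examples, new_report):
    """Assemble a few-shot prompt from (report, summary) example pairs."""
    parts = ["Summarize the findings of each radiology report in one sentence."]
    for report, summary in examples:
        parts.append(f"Report: {report}\nSummary: {summary}")
    # The final report is left without a summary for the model to complete.
    parts.append(f"Report: {new_report}\nSummary:")
    return "\n\n".join(parts)

examples = [
    ("Lungs are clear. No pleural effusion or pneumothorax.",
     "Normal chest radiograph."),
]
prompt = build_few_shot_prompt(
    examples, "Mild cardiomegaly. No acute osseous abnormality."
)
print(prompt)
```

The completed prompt would then be sent to the model, which continues the pattern by producing a summary for the final report.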

“In most cases, summaries from the best-adapted LLMs were rated as good or better than those created by humans,” Van Veen says. Almost half the time (45%), evaluators judged the AI-generated summaries comparable to the human-produced ones. More than a third of the time (36%), they judged them superior.
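The tallying behind such a reader study is straightforward: each blind comparison yields a preference for the AI summary, the human summary, or a tie, and the reported percentages are the shares of each outcome. The ratings below are invented to match the reported proportions; only the counting logic is shown.

```python
# Hypothetical tally for a blind preference study. Each comparison is rated
# "ai" (AI superior), "human" (human superior), or "tie" (comparable).
from collections import Counter

def preference_shares(ratings):
    """Return the fractions of comparisons rated comparable and AI-superior."""
    counts = Counter(ratings)
    n = len(ratings)
    return counts["tie"] / n, counts["ai"] / n

# Invented ratings chosen to reproduce the article's reported proportions.
ratings = ["tie"] * 45 + ["ai"] * 36 + ["human"] * 19
comparable, superior = preference_shares(ratings)
print(f"comparable: {comparable:.0%}, superior: {superior:.0%}")
```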

No Room for Error

Knowing that much has been made of AI’s tendency to “hallucinate”—to, in essence, make up information that is not true—the researchers were keen to explore whether AI would introduce fabricated information into its summaries. If so, it would be a huge strike against AI, given the high stakes of the medical setting.

“It turns out that humans also get things wrong sometimes and that the best model, while not perfect, produced fewer instances of fabricated information than even human medical experts,” Van Veen says. “Far from introducing inaccuracies, LLMs could actually end up reducing fabricated information in clinical practice.”

Van Veen and colleagues will now fine-tune their models and eventually work to bring AI-assistance to real-world clinical settings.

“Stay tuned. We’re close to testing LLMs in real-world environments and helping doctors spend less time on documentation so they can provide better care to patients,” Van Veen says.

Contributing study authors include: Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, Nidhi Rohatgi, Poonam Hosamani, William Collins, Neera Ahuja, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, and John Pauly.

Financial support for this study was provided by Microsoft Azure OpenAI credits, Accelerate Foundation Models Academic Research (AFMAR), One Medical, the National Institutes of Health (NIH), the Agency for Healthcare Research and Quality, the Gordon and Betty Moore Foundation, and the National Institute of Biomedical Imaging and Bioengineering.

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.
