Proving AI in the Clinic: An Algorithm That Accurately Evaluates Heart Failure
To gain FDA approval, drugs typically must be proven safe and effective in rigorous, randomized clinical trials. Often, these trials are blinded: Neither the patients nor the researchers know which patients received the standard treatment and which received the experimental one.
Meanwhile, researchers assessing the performance of a medical AI algorithm typically look at how well it performs on a benchmark data set, says James Zou, assistant professor of biomedical data science at Stanford University and member of the Stanford Institute for Human-Centered Artificial Intelligence. If it does better than other methods, it is deemed superior.
But there’s a major limitation to the benchmarking approach: “It doesn’t tell us how the algorithm will perform in a clinical workflow,” Zou says. Researchers know that humans’ interactions with AI recommendations can influence the decisions they make. For example, some doctors might defer to an AI while others might compete with it. And these kinds of reactions could have unexpected effects on patient outcomes, Zou says. “To truly assess a medical AI algorithm’s value, there needs to be a human in the loop.”
So, after David Ouyang, a former postdoc on Zou’s team who is now a cardiologist at Cedars-Sinai Hospital, and Bryan He, a computer science graduate student at Stanford, developed an AI system for reading echocardiograms, the team decided to set up a randomized clinical trial to compare the algorithm’s accuracy with that of human sonographers in a blinded fashion: Cardiologists reviewing the readings would not know if they were looking at the work of an AI or a human.
And the algorithm, known as EchoNet, aced the trial, as recently reported in Nature: It traced the size of the heart ventricle more accurately and more quickly than human sonographers. The algorithm’s accuracy will likely result in better outcomes for patients by yielding more reliable evaluations of heart failure progression, Zou says. And because the algorithm saves sonographers’ time, it will also increase their productivity.
“This is one of the very first human-in-the-loop, blinded, randomized clinical trials to test a medical AI algorithm in the cardiology space,” Zou says. “We hope it will help set the standard for medical AI testing in the future.”
Nearly 6.5 million Americans over age 20 have heart failure, meaning the heart muscle is losing its ability to pump enough blood to meet the body’s needs. To evaluate heart failure’s severity, physicians often order an echocardiogram – an ultrasound video that measures various features of the heart as it pumps blood. One such feature is a measure of heart health known as the left ventricular ejection fraction (LVEF): the percentage of blood pumped out of the left ventricle with each heartbeat, calculated from the chamber’s maximum and minimum volumes during the beat. An LVEF below 52% marks the cutoff between a healthy and unhealthy heart, and 30% signals severe disease – information that helps cardiologists diagnose and treat heart failure.
The sonographers who perform echocardiograms calculate the LVEF by hand, tracing the size of the ventricle in each of two individual video frames: the image where the ventricle is most expanded and the image where it’s most contracted. They then use a mathematical formula to calculate the LVEF and report it to a cardiologist, who re-evaluates the tracing to make sure the LVEF is as accurate as possible.
This approach has several limitations. First, there’s significant variability in how sonographers and cardiologists trace the ventricle. Second, the American Heart Association recommends measuring the LVEF over several heartbeats. “There can be beat-to-beat variation in patients that is not captured when we only trace one beat,” Zou says. In practice, however, sonographers trace only a single heartbeat because hand tracing the heart chamber is tedious and time-consuming, Zou says. “The sonographers we’ve spoken to and worked with would welcome the help of an AI algorithm.”
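As a rough illustration of the arithmetic involved (using hypothetical ventricle volumes, not data from the study), the single-beat LVEF formula and the multi-beat averaging the AHA recommends can be sketched as:

```python
def lvef(edv_ml, esv_ml):
    """Left ventricular ejection fraction (%) for one heartbeat.

    edv_ml: end-diastolic volume (ventricle at its most expanded), in mL.
    esv_ml: end-systolic volume (ventricle at its most contracted), in mL.
    Returns the percentage of chamber blood ejected during the beat.
    """
    return 100.0 * (edv_ml - esv_ml) / edv_ml

# Hypothetical (EDV, ESV) pairs from tracings of three consecutive beats.
beats = [(120.0, 50.0), (115.0, 52.0), (125.0, 48.0)]

per_beat = [lvef(edv, esv) for edv, esv in beats]

# Averaging across beats smooths out the beat-to-beat variation that a
# single-beat reading misses.
multi_beat_lvef = sum(per_beat) / len(per_beat)
```

A single-beat workflow reports just one entry of `per_beat`; an algorithm that traces every frame can report the average essentially for free.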
EchoNet addresses all these problems: It is very accurate, as Ouyang, Zou, and their colleagues reported in a 2020 Nature paper; it can evaluate the LVEF over multiple heartbeats; and it does so in just milliseconds, saving sonographers’ time.
Designing the EchoNet Clinical Trial
For EchoNet to be FDA approved – as well as accepted by the medical community – Zou and Ouyang knew they needed to go beyond their initial proof of EchoNet’s algorithm in silico. They needed to demonstrate its accuracy in a clinical setting. The randomized clinical trial they designed – one of the first for an AI algorithm in the cardiology setting – was blinded to ensure clinicians’ attitudes toward AI wouldn’t affect the results. “If someone actually knows that they're dealing with an AI, that knowledge itself can affect their behavior,” Zou notes.
In the trial, a group of experienced sonographers and EchoNet traced the size of the ventricle in a retrospective dataset of over 3,000 echocardiogram videos. To ensure cardiologists didn’t know if the tracings came from a human or an AI, EchoNet delivered its tracings and LVEF results through the same software reporting system used by sonographers. In addition to reducing any bias toward AI, this approach helped the team evaluate whether clinicians would resist or welcome the AI in actual practice. “Any change to the clinical workflow can be a huge barrier to actual deployment of these algorithms,” Zou says.
Finally, experienced cardiologists reviewed the tracings of the maximum and minimum ventricle sizes and decided whether to modify them or accept them as accurate.
The results: Cardiologists were unable to distinguish tracings drawn by sonographers from those drawn by EchoNet and were less likely to modify tracings made by the AI than those made by sonographers. The AI’s LVEF calculations were also more consistent with cardiologists’ assessments from the prior clinical report.
The team also looked at whether the EchoNet or sonographer tracings would have affected a cardiologist’s treatment recommendation – for example, whether a defibrillator would need to be implanted. The finding: Cardiologists corrected the echocardiogram readings in a way that crossed such a treatment-decision threshold only 1.3% of the time for AI tracings, compared with 3.1% for sonographer tracings. This suggests that EchoNet’s more accurate readings could also translate into better outcomes for patients.
Further studies of EchoNet in clinical practice will need to look at actual medical outcomes of patients and should be performed across multiple hospitals to involve practitioners with different backgrounds and levels of experience. Such research is on Zou’s agenda.
AI Algorithms and the FDA
The EchoNet algorithm is currently progressing through the FDA approval process. “Once we get the approval, we hope to deploy EchoNet broadly across many hospitals beyond Stanford and Cedars-Sinai,” Zou says.
He’s also excited about the impact this study could have on the medical AI community. There have been only a handful of randomized clinical trials of AIs, and very few have been blinded or implemented into an existing clinical workflow. “We think this clinical trial of EchoNet demonstrates a possible standard for how medical AIs should be evaluated going forward.”