Stanford Develops Real-World Benchmarks for Healthcare AI Agents | Stanford HAI

Date: September 15, 2025 · Topics: Healthcare

Researchers are establishing standards to validate the efficacy of AI agents in clinical settings.

Beyond the hype and hope surrounding the use of artificial intelligence in medicine lies the real-world need to ensure that, at the very least, AI in a healthcare setting can carry out the tasks a doctor would perform in an electronic health record.

Creating benchmark standards to measure that is what drives the work of a team of Stanford researchers. While the researchers note the enormous potential of this new technology to transform medicine, the tech ethos of moving fast and breaking things doesn’t work in healthcare. These systems must first prove they can perform such tasks reliably; only then can they be used to augment the care clinicians provide every day.

“Working on this project convinced me that AI won’t replace doctors anytime soon,” said Kameron Black, co-author on the new benchmark paper and a Clinical Informatics Fellow at Stanford Health Care. “It’s more likely to augment our clinical workforce.”

MedAgentBench: Testing AI Agents in Real-World Clinical Systems

Black is one of a multidisciplinary team of physicians, computer scientists, and researchers from across Stanford University who worked on the new study, MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents, published in the New England Journal of Medicine AI.

Although large language models (LLMs) have performed well on the United States Medical Licensing Examination (USMLE) and at answering medical questions in studies, no existing benchmark tested how well LLMs can function as agents, performing tasks that a doctor would normally do, such as ordering medications, inside a real-world clinical system where data input can be messy.

Unlike chatbots or LLMs, AI agents can work autonomously, performing complex, multistep tasks with minimal supervision. AI agents integrate multimodal data inputs, process information, and then utilize external tools to accomplish tasks, Black explained. 
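The pattern Black describes can be sketched as a loop in which a model repeatedly chooses an external tool, observes the result, and stops when the task is complete. A minimal illustration follows; the tool names, the `choose_action` policy, and the patient data are all invented for illustration and are not from the study (a real agent would have an LLM pick the tool call):

```python
# Minimal sketch of an agent loop: choose a tool, run it, observe,
# repeat until there is nothing left to do.

def lookup_lab(patient_id):
    # Hypothetical tool: return the latest lab values for a patient.
    labs = {"p1": {"A1c": 8.2}}
    return labs[patient_id]

def order_medication(patient_id, drug):
    # Hypothetical tool: place a medication order.
    return f"ordered {drug} for {patient_id}"

TOOLS = {"lookup_lab": lookup_lab, "order_medication": order_medication}

def choose_action(task, history):
    # Stand-in for the LLM's decision step: a real agent would prompt
    # the model with the task and history and parse its tool call.
    if not history:
        return ("lookup_lab", {"patient_id": task["patient_id"]})
    name, result = history[-1]
    if name == "lookup_lab" and result.get("A1c", 0) > 7.0:
        return ("order_medication",
                {"patient_id": task["patient_id"], "drug": "metformin"})
    return None  # task complete

def run_agent(task, max_steps=5):
    history = []
    for _ in range(max_steps):
        action = choose_action(task, history)
        if action is None:
            break
        name, args = action
        history.append((name, TOOLS[name](**args)))
    return history

print(run_agent({"patient_id": "p1"}))
```

The key property is the one the article highlights: the loop acts on the environment through tools rather than only emitting text.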

Overall Success Rate (SR) Comparison of State-of-the-Art LLMs on MedAgentBench

Model                       Overall SR
Claude 3.5 Sonnet v2        69.67%
GPT-4o                      64.00%
DeepSeek-V3 (685B, open)    62.67%
Gemini-1.5 Pro              62.00%
GPT-4o-mini                 56.33%
o3-mini                     51.67%
Qwen2.5 (72B, open)         51.33%
Llama 3.3 (70B, open)       46.33%
Gemini 2.0 Flash            38.33%
Gemma2 (27B, open)          19.33%
Gemini 2.0 Pro              18.00%
Mistral v0.3 (7B, open)      4.00%

While previous tests only assessed AI’s medical knowledge through curated clinical vignettes, this research evaluates how well AI agents can perform actual clinical tasks such as retrieving patient data, ordering tests, and prescribing medications. 

“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and the paper’s senior author. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward.”

The study tested this by evaluating whether AI agents could utilize FHIR (Fast Healthcare Interoperability Resources) API endpoints to navigate electronic health records.
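FHIR exposes EHR data as typed REST resources that an agent retrieves through search endpoints and then parses from the returned Bundle. A rough sketch of the read side, using a small hand-made Bundle in place of a live server (the resource contents and the query shown in the comment are illustrative, not the study's data):

```python
import json

# A tiny hand-made FHIR Bundle, shaped like the searchset a query such as
# GET [base]/Observation?patient=p1&code=718-7  (hemoglobin) might return.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {"resourceType": "Observation",
                  "code": {"text": "Hemoglobin"},
                  "valueQuantity": {"value": 13.2, "unit": "g/dL"}}},
    {"resource": {"resourceType": "Observation",
                  "code": {"text": "Hemoglobin"},
                  "valueQuantity": {"value": 12.8, "unit": "g/dL"}}}
  ]
}
"""

def extract_values(bundle):
    # Pull (value, unit) pairs out of every Observation in the Bundle.
    out = []
    for entry in bundle.get("entry", []):
        res = entry["resource"]
        if res["resourceType"] == "Observation":
            q = res["valueQuantity"]
            out.append((q["value"], q["unit"]))
    return out

bundle = json.loads(bundle_json)
print(extract_values(bundle))  # → [(13.2, 'g/dL'), (12.8, 'g/dL')]
```

Writing data back (ordering tests or medications) works the same way in reverse: the agent POSTs a resource such as a MedicationRequest to the corresponding endpoint, which is what makes these benchmarks a test of action rather than of knowledge alone.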

The team created a virtual electronic health record environment that contained 100 realistic patient profiles (containing 785,000 records, including labs, vitals, medications, diagnoses, procedures) to test about a dozen large language models on 300 clinical tasks developed by physicians. In initial testing, the best model, in this case, Claude 3.5 Sonnet v2, achieved a 70% success rate.
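Reporting a success rate over a fixed task set reduces to pass/fail bookkeeping: each task carries a check on the agent's final state, and the rate is the fraction of checks that pass. A toy sketch of that scoring, with invented tasks and agent outcomes rather than the benchmark's actual 300 tasks:

```python
# Toy scoring harness: each task pairs an id with a checker that
# inspects the agent's final environment state and returns pass/fail.

def make_task(task_id, checker):
    return {"id": task_id, "check": checker}

tasks = [
    make_task("get-latest-a1c", lambda s: s.get("answer") == 8.2),
    make_task("order-statin", lambda s: "statin" in s.get("orders", [])),
    make_task("record-allergy", lambda s: "penicillin" in s.get("allergies", [])),
]

# Pretend final states produced by an agent run on each task.
final_states = {
    "get-latest-a1c": {"answer": 8.2},
    "order-statin": {"orders": ["statin"]},
    "record-allergy": {"allergies": []},  # the agent failed this one
}

def success_rate(tasks, states):
    passed = sum(1 for t in tasks if t["check"](states[t["id"]]))
    return passed / len(tasks)

print(f"{success_rate(tasks, final_states):.2%}")  # → 66.67%
```

Checking final state rather than the agent's transcript is what lets a benchmark like this score messy, multistep behavior objectively.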

“We hope this benchmark can help model developers track progress and further advance agent capabilities,” said Yixing Jiang, a Stanford PhD student and co-author of the paper.

Many of the models struggled with scenarios that required nuanced reasoning, involved complex workflows, or necessitated interoperability between different healthcare systems, all issues a clinician might face regularly. 

“Before these agents are used, we need to know how often and what type of errors are made so we can account for these things and help prevent them in real-world deployments,” Black said.

What does this mean for clinical care? Co-author James Zou and Dr. Eric Topol have argued that AI is shifting from a tool to a teammate in care delivery. With MedAgentBench, the Stanford team has shown that this shift is a much nearer-term reality, demonstrating that several frontier LLMs can already carry out many of the day-to-day clinical tasks a physician would perform.

Already the team has noticed improved performance in the newest versions of models. With this in mind, Black believes that AI agents might be ready to handle basic "housekeeping" tasks in clinical settings sooner than previously expected.

“In our follow-up studies, we’ve shown a surprising amount of improvement in the success rate of task execution by newer LLMs, especially when accounting for specific error patterns we observed in the initial study,” Black said. “With deliberate design, safety, structure, and consent, it will be feasible to start moving these tools from research prototypes into real-world pilots.”

The Road Ahead

Black says benchmarks like these are necessary as more hospitals and healthcare systems are incorporating AI into tasks including note-writing and chart summarization.

Accurate and trustworthy AI could also help alleviate a looming crisis, he adds. Pressed by patient needs, compliance demands, and staff burnout, healthcare providers face a worsening global staffing shortage, estimated to exceed 10 million workers by 2030.

Instead of replacing doctors and nurses, Black hopes that AI can be a powerful tool for clinicians, lessening the burden of some of their workload and bringing them back to the patient bedside. 

“I’m passionate about finding solutions to clinician burnout,” Black said. “I hope that by working on agentic AI applications in healthcare that augment our workforce, we can help offload burden from clinicians and divert this impending crisis.”

Paper authors: Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen

Read the piece in the New England Journal of Medicine AI.

Contributor(s)
Scott Hadly

Related News

To Practice PTSD Treatment, Therapists Are Using AI Patients
Sarah Wells · Nov 10, 2025

Stanford's TherapyTrainer deploys AI to help therapists practice skills for written exposure therapy.

Using AI to Streamline Speech and Language Services for Children
Katharine Miller · Oct 27, 2025

Stanford researchers show that although top language models cannot yet accurately diagnose children’s speech disorders, fine-tuning and other approaches could well change the game.

How To Build a Safe, Secure Medical AI Platform
Duncan McElfresh, Aditya Sharma, Pranav Masariya, Clancy Dennis, Elvis Jones, Vishantan Kumar, Xu Wang, Krishna Jasti, Satchi Mouniswamy, Anurang Revri, and Nikesh Kotecha · Oct 22, 2025

Teams across Stanford Health Care’s Technology organization came together to build “ChatEHR”, a privacy-preserving and practical GenAI tool that could serve as a model for other health systems.