Holistic Evaluation of Large Language Models for Medical Applications

Medical and AI experts build a benchmark for evaluation of LLMs grounded in real-world healthcare needs.

Large language models (LLMs) hold immense potential for improving healthcare, supporting everything from diagnostic decision-making to patient triage. They can now ace standardized medical exams such as the United States Medical Licensing Examination (USMLE). However, a recent study finds that evaluating clinical readiness based solely on exam performance is akin to assessing someone’s driving ability using only a written test on traffic rules.

While LLMs can generate sophisticated responses to healthcare questions, their real-world clinical performance remains under-examined. In fact, a recent JAMA review found that only 5% of evaluations used real patient data, and that the majority of studies evaluated performance on standardized medical exams. This underscores the need for better evaluations that measure performance on real-world medical tasks, preferably using real clinical data.

In 2022, Stanford HAI’s Center for Research on Foundation Models developed a benchmarking framework, the Holistic Evaluation of Language Models (HELM), which provides evaluations that are continually updated over time. We leveraged the HELM framework to create MedHELM, an extension for medical applications. Our team, comprising researchers at the Center for Biomedical Informatics Research (BMIR), Stanford Health Care’s Technology and Digital Solutions team (TDS), and Microsoft Health and Life Sciences (HLS), collaborated with clinicians, administrators, and clinical informatics researchers to gather diverse and clinically relevant use cases for LLMs.

Developing a Taxonomy of Real-World Tasks

To ensure that MedHELM covers a broad range of medical scenarios, we started by listing tasks that healthcare practitioners deem worthwhile. Following HELM’s principles of broad coverage, multi-metric measurement, and standardization, we grouped these tasks into five overarching categories based on (1) Relevance, i.e., ensuring tasks reflect real-world medical use cases, and (2) Granularity, i.e., balancing specificity and generalization across different medical domains. The five categories are: Clinical Decision Support, Clinical Note Generation, Patient Communication and Education, Medical Research Assistance, and Administration and Workflow. We further divided them into 22 subcategories, producing an initial set of 98 tasks.

To validate the taxonomy’s clarity and clinical relevance, we surveyed 29 practicing clinicians across 15 medical specialties from Stanford Health Care and former trainees of the Master of Science in Clinical Informatics Management (MCiM) program. The reviewers agreed with the task definitions in our taxonomy in 96.73% of cases, and the taxonomy’s coverage of clinical tasks was rated 4.21 out of 5. Based on their feedback, we introduced 23 additional tasks, increasing the total from 98 to 121, and refined the scope and definitions of several subcategories (Figure 1). 

 

Figure 1: An overview of the final taxonomy, comprising 5 categories, 22 subcategories, and an example subset from the 121 tasks.

Identifying Public and Private Datasets

Next, we identified relevant datasets, ranging from patient notes and structured EHR codes to patient-provider conversations, and mapped them to the appropriate subcategories. We assembled 31 datasets, of which 11 were newly created for MedHELM and 20 were drawn from existing sources. By ensuring that each subcategory has at least one corresponding dataset, we enabled a comprehensive evaluation of model performance across a spectrum of real-world medical scenarios, from documenting diagnostic reports to facilitating patient education. Figure 2 shows the datasets mapped to the Clinical Decision Support category.

Figure 2: An overview of the datasets included in the Clinical Decision Support category.

Converting Datasets to Benchmarks

To convert a dataset into a benchmark per the HELM framework, we need to define four elements:

  • Context: The portion of the dataset the model must analyze (e.g., patient notes)
  • Prompt: An instruction (e.g., “Calculate a patient’s HAS-BLED Score” for supporting diagnostic decisions)
  • Reference response: The gold-standard output (a numeric result, classification label, or sample text) against which the model’s response is scored
  • Metric: A scoring method (e.g., exact match, classification accuracy, BERTScore) that quantifies how closely the model’s output matches the reference

Consider MedCalc-Bench, a publicly available dataset for assessing a model’s ability to perform clinically relevant numeric computations. It falls under the Supporting Diagnostic Decisions subcategory, and each entry in the dataset contains a clinical note, a prompt, and a ground truth answer. For example:

  • Context: “Patient note: A 70-year-old gentleman presents with a complex clinical picture. He has a history of poorly controlled hypertension, with his recent primary care visit demonstrating 172/114. His INR results have fluctuated and have been elevated above therapeutic range in more than two-thirds of measurements in the last year. He maintains a strict no-alcohol rule, keeping his weekly alcoholic drink count to zero. He denies use of aspirin, clopidogrel, or NSAIDs. A few years back, unfortunately, he was admitted due to a significant instance of bleeding requiring transfusion. His liver function tests have been normal, excluding any ongoing liver disease. His health status requires vigilant monitoring given the various vital factors involved.”
  • Prompt: “Question: What is the patient’s HAS-BLED score?”
  • Gold Standard response: “4”
  • Metric: Exact Match
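
In code, such an entry can be represented and scored in just a few lines. The following is a minimal sketch for illustration only; the class and function names are our own and do not mirror the actual HELM/MedHELM implementation.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkInstance:
    """One benchmark entry: the context, prompt, reference response, and metric.
    Illustrative only -- not the actual HELM/MedHELM data structures."""
    context: str        # e.g., the patient note
    prompt: str         # e.g., "Question: What is the patient's HAS-BLED score?"
    reference: str      # e.g., "4"
    metric: str = "exact_match"

def exact_match(model_output: str, reference: str) -> float:
    """Return 1.0 if the normalized output equals the reference, else 0.0."""
    return float(model_output.strip().lower() == reference.strip().lower())

# Hypothetical MedCalc-Bench-style entry (note abbreviated for readability).
instance = BenchmarkInstance(
    context="Patient note: A 70-year-old gentleman presents with ...",
    prompt="Question: What is the patient's HAS-BLED score?",
    reference="4",
)

# In a real run, `model_output` would come from the LLM under evaluation.
model_output = "4"
print(exact_match(model_output, instance.reference))  # 1.0
```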

Choosing Evaluation Metrics

Next, we assess model performance across the different medical use cases using these benchmarks. While many benchmarks in MedHELM have discrete performance metrics, such as classification accuracy for yes/no questions or exact match for medical calculations, 12 datasets have open-ended text generation as the reference output. Establishing good metrics for open-ended text generation is challenging. Existing text-matching metrics have limitations, such as favoring longer generations or specific writing styles, which may not accurately reflect true clinical quality. Moreover, outputs with high lexical overlap can still differ significantly in the correctness or completeness of medical details (such as omitting or adding the word ‘fever’), which can impact patient care. We therefore use a multi-pronged strategy for assessing generated text:

  • String-based metrics (BLEU, ROUGE, METEOR): These metrics assess n-gram overlap, i.e., shared sequences of words, letters, or other symbols, between the model output and the reference text. While valuable for capturing broad linguistic similarity, they may overlook domain-specific nuances (e.g., the presence or absence of ‘fever’).
  • Semantic similarity (BERTScore): By mapping text to embeddings and evaluating semantic alignment, metrics such as BERTScore can detect paraphrased expressions that n-gram-based metrics might overlook. However, text with domain-specific jargon and formatting inconsistencies can artificially lower semantic similarity.

While each individual approach is imperfect, taken together they still provide a way to assess overall quality and semantic consistency in open-ended text generation. In the future, LLM-as-a-judge approaches could be employed once methods to verify a judge LLM’s performance have been validated in a medical setting.
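
To make the limitation of purely lexical metrics concrete, here is a small self-contained sketch (our own illustration, not part of MedHELM) showing that an n-gram overlap score stays high even when a clinically critical term such as ‘fever’ is dropped:

```python
# Illustration only: why lexical overlap can miss clinically important omissions.

def ngram_overlap_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 over n-gram multiset overlap between candidate and reference tokens."""
    def ngrams(text: str):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand, ref = ngrams(candidate), ngrams(reference)
    if not cand or not ref:
        return 0.0
    ref_counts: dict = {}
    for g in ref:
        ref_counts[g] = ref_counts.get(g, 0) + 1
    overlap = 0
    for g in cand:
        if ref_counts.get(g, 0) > 0:
            overlap += 1
            ref_counts[g] -= 1
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

reference = "patient presents with cough fever and shortness of breath"
candidate = "patient presents with cough and shortness of breath"  # drops "fever"
print(round(ngram_overlap_f1(candidate, reference), 2))  # ~0.94 despite the missing symptom
```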

Preliminary Results

We evaluated six large language models of varying sizes and architectures in a zero-shot setting, meaning no additional fine-tuning was applied for any specific benchmark. The models were chosen based on availability in Stanford Medicine’s secure infrastructure, given that patient data cannot be used with public APIs on the internet. This setup allowed us to assess each model’s out-of-the-box capabilities on a range of healthcare use cases, from structured classification tasks such as determining whether a future clinical event will occur (EHRSHOT) to open-ended text generation scenarios like generating treatment plans (MTSamples) or summarizing radiology reports (MIMIC-RRS). The six models were:

  • Large models: GPT-4o (2024-05-13, OpenAI) and Gemini 1.5 Pro (Google)
  • Medium models: Llama-3.3-70B-instruct (Meta) and GPT-4o-mini (2024-07-18, OpenAI)
  • Small models: Phi-3.5-mini-instruct (Microsoft) and Qwen-2.5-7B-instruct (Alibaba)
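
For orientation, the zero-shot setup can be pictured as the simple loop sketched below. This is a simplified illustration under our own naming, not the HELM evaluation harness itself; `query_model` is a hypothetical stand-in for a call to whichever model endpoint is available in the secure environment.

```python
import random

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for a call to a model behind a secure, PHI-compliant endpoint."""
    raise NotImplementedError("Wire this to your model provider.")

def evaluate_zero_shot(model_name, instances, metric_fn, max_eval_instances=1000, seed=0):
    """Score a model on up to `max_eval_instances` benchmark entries, with no fine-tuning."""
    rng = random.Random(seed)
    sample = instances if len(instances) <= max_eval_instances else rng.sample(instances, max_eval_instances)
    scores = []
    for inst in sample:
        # Each prompt is built directly from the benchmark's context and instruction.
        output = query_model(model_name, f"{inst.context}\n\n{inst.prompt}")
        scores.append(metric_fn(output, inst.reference))
    return sum(scores) / len(scores)
```

Repeating such a loop for every model-benchmark pair yields the grid of runs described next.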

Out of the 186 possible benchmarking runs (31 x 6), we completed 170 evaluations using up to 1,000 samples from each benchmark. Constraints of our PHI-compliant environment precluded evaluating the small models on eight private datasets. Figure 3 summarizes model performance across all categories. For the example entry from MedCalc-Bench shown earlier, GPT-4o’s response was “4” (correct) and Qwen-2.5-7B-instruct’s response was “3” (incorrect).

Figure 3: Performance of language models across 31 healthcare benchmarks, 5 categories, and 22 subcategories. In each row, green indicates the best-performing model and red the worst-performing model according to the corresponding metric for a given dataset. Datasets marked with * were evaluated on a subsample of 1,000 instances because the full dataset contains more than 1,000 instances.

The preliminary results provide only a partial picture of clinical LLM capabilities. Models such as Gemini 1.5 Pro and Phi-3.5-mini-instruct often score poorly for reasons unrelated to their underlying capability: for example, they decline to answer sensitive medical questions or fail to follow formatting instructions (such as providing a discrete multiple-choice answer instead of explanatory text). These issues highlight the need for further work on matching model outputs to the metrics used to score them.

Overall, large models did well on complex reasoning tasks (e.g., performing medical calculations and detecting race bias in clinical text), while medium models performed competitively on medical prediction tasks with lower computational demands (e.g., predicting readmission risk). Small models, while adequate for well-structured tasks, struggled with tasks requiring domain expertise, particularly in mental health counseling and medical knowledge assessment. Notably, open-ended text generation produced comparable BERTScore-F1 ranges across model sizes: across 10 out of 12 benchmarks, the difference between minimum and maximum scores was less than 0.07, suggesting that such automated NLP metrics may not be adequate for analyzing domain performance gaps.

Where to Go From Here

We are excited to share MedHELM broadly, given the high impact it can have in enabling robust, reliable, and safer deployments of language models for tasks that matter in medicine. Having a way to benchmark opens up interesting avenues for future work. For example, we found that much of the variation in BERTScore-F1 within benchmarks was a result of the output format not being standardized. There were notable differences in the “steerability” of these LLMs toward a given output, such as some models’ reluctance to generate responses in a given format (e.g., with specific headings, or returning only the multiple-choice answer) even when the structure was provided in the prompt. These observations emphasize the need for further work that matches metric design with model steerability to quantify model performance.
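
One pragmatic mitigation, sketched below purely as an illustration of the idea (it is not part of the MedHELM pipeline), is to normalize model outputs before scoring, for instance by extracting the multiple-choice letter from a verbose response:

```python
import re
from typing import Optional

def extract_choice(model_output: str) -> Optional[str]:
    """Pull a single multiple-choice letter (A-D) out of a possibly verbose response.
    A hypothetical normalization step, not the scoring used in MedHELM."""
    # Common pattern: "Answer: B" or "The answer is (b)".
    match = re.search(r"\banswer\s*(?:is|:)?\s*\(?([A-D])\)?\b", model_output, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fall back to a lone capital letter, e.g., "(B)" embedded in explanatory text.
    letters = set(re.findall(r"\b([A-D])\b", model_output))
    return letters.pop() if len(letters) == 1 else None

print(extract_choice("The most likely diagnosis is option (B), community-acquired pneumonia."))  # B
```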

For example, we plan to incorporate fact-based metrics (e.g., SummaC, FActScore) to better quantify correctness, and to explore LLM-as-a-judge approaches with direct clinician feedback, thereby enabling nuanced scoring of outputs and their alignment with real-world preferences. 

Additionally, the benchmark itself can grow with more specialized datasets for deeper coverage of all 121 tasks and a larger suite of models (e.g., Deepseek-R1). We invite feedback as well as contributions to the MedHELM effort from the community.

Resources

Website: Explore the latest MedHELM results, view aggregate statistics, and gain detailed insights into model behavior across different scenarios. Note that individual records from the private datasets are redacted; only records from public datasets are available for examination.

GitHub repository: Adopt and use MedHELM for your research by adding new scenarios and metrics to HELM, and leverage its infrastructure for performing rigorous, systematic evaluations. 

Contribution Acknowledgements 

MedHELM is made possible by a unique collaboration between the Center for Research on Foundation Models, Technology and Digital Solutions at Stanford Health Care, and Microsoft Health and Life Sciences in partnership with faculty in the Departments of Medicine, Computer Science, Anesthesiology, Dermatology, Pediatrics, and Biomedical Data Science as well as trainees from the MCiM program at the Clinical Excellence Research Center. The effort is coordinated by the Center for Biomedical Informatics Research across a large multi-disciplinary team comprising: 

Trainees: Suhana Bedi, Miguel Angel Fuentes Hernandez, Alyssa Unell, Hejie Cui, Michael Wornow, Akshay Swaminathan, Mehr Kashyap, Philip Chung, Fateme Nateghi

SHC staff: Hannah Kirsch, Jennifer Lee, Nikesh Kotecha, Timothy Keyes, Juan M. Banda, Nerissa Ambers, Carlene Lugtu, Aditya Sharma, Bilal Mawji, Alex Alekseyev, Vicky Zhou, Vikas Kakkar, Jarrod Helzer, Jason Alan Fries, Anurang Revri, Yair Bannett

CRFM researchers: Yifan Mai, Tony Lee

Microsoft researchers: Shrey Jain, Mert Oez, Hao Qiu, Leonardo Schettini, Wen-wai Yim, Matthew Lungren, Eric Horvitz 

Faculty contributors: Roxana Daneshjou, Jonathan Chen, Emily Alsentzer, Keith Morse, Nirmal Ravi, Nima Aghaeepour, Vanessa Kennedy, Sanmi Koyejo

Survey contributors: Ashwin Nayak, Shivam Vedak, Sneha Jain, Birju Patel, Oluseyi Fayanju, Shreya J. Shah, Ethan Goh, Dong-han Yao, Brian Soetikno, Eduardo Reis, Sergios Gatidis, Vasu Divi, Robson Capasso, Rachna Saralkar, Chia-Chun Chiang, Jenelle Jindal, Tho Pham, Faraz Ghoddusi, Steven Lin, Albert Chu, Christy Hong, Mohana Roy, Michael Gensheimer, Hinesh Patel, Kevin Schulman (MCiM program director).
