
A recent JAMA article reported that electronic health records (EHR) data were used in only 5% of studies evaluating healthcare uses of large language models (LLMs). The vast majority of reported research results are based on a single dataset – the Medical Information Mart for Intensive Care (MIMIC) database from PhysioNet – or rely on private data.
While MIMIC has been transformative for healthcare AI research, it lacks longitudinal health data—patient records that span extended time periods. This makes MIMIC inadequate for evaluating LLMs for tasks requiring long-term trajectories of patient care, such as chronic disease management, multi-visit prediction, or care pathway optimization. As a result, there is a gap between the purported benefits of LLMs and the ability of researchers to verify those benefits in real-life settings. This evaluation gap limits the ability to test model generalization across diverse patient populations and healthcare systems and can only be bridged by introducing EHR benchmarks beyond MIMIC.
Benchmark datasets that reflect the diversity and complexity of real-world healthcare are critical for fostering equitable, scalable AI systems. Simply put, responsible AI and honest evaluation of clinical benefit require new benchmarks that contain longitudinal patient data and address gaps in population representation. While this need for new datasets is widely recognized, the complexity of clinical data and stringent privacy and ethical considerations create barriers to sharing: releasing such data on platforms like Hugging Face – which hosts many general-purpose machine learning benchmarks – is not feasible.
As a first step toward closing this evaluation gap, we have developed three de-identified EHR benchmark datasets – EHRSHOT, INSPECT, and MedAlign. These datasets, freely available to researchers worldwide for non-commercial use, represent a significant step forward in enabling the rigorous evaluation of healthcare AI. They complement our release of 20 EHR foundation models, including decoder-only transformers (CLMBR), time-to-event models (MOTOR), and pretrained weights for benchmarking subquadratic, long-context architectures such as Hyena and Mamba.
Taken together, these datasets and models are a concrete step towards the shared vision of robust, accessible tools for the healthcare AI research community.
Summary of De-identified Datasets

The three de-identified longitudinal EHR datasets collectively contain 25,991 unique patients, 441,680 visits, and 295 million clinical events. While smaller in patient count than the MIMIC datasets, they provide longitudinal data – a detailed view of each patient's health journey over time – and thus complement the MIMIC datasets.
Longitudinal Data Address the Missing Context Problem
EHRs contain structured information, such as lab values and billing codes, as well as unstructured data, such as clinical narratives and medical imaging, which together provide a holistic view of patient health. For example, the INSPECT dataset contains 23,248 paired CT scans and radiology note impressions. MedAlign provides 46,252 clinical notes spanning 128 different note types, offering a detailed longitudinal view of patient care across 275 individuals. MedAlign stands out for the comprehensiveness of its clinical documentation, capturing contexts that are often missing from other datasets.
Longitudinal datasets address the missing context problem in healthcare AI, where current medical datasets fail to reflect the full scope of past and future health information in real-world EHRs. Providing such longitudinal health context is essential for training multimodal models to understand complex, long-term health patterns, such as managing chronic diseases or cancer treatment planning.

Figure 1. A CT scan from the INSPECT dataset highlights a key issue in vision-language model-style pretraining: the exclusion of context, such as past medical history and future health outcomes. This missing context problem limits the ability to train models that incorporate full health trajectories (i.e., past and future events) to learn correlations essential for identifying prognostic markers in multimodal data. Figure adapted from (Huo et al. 2024).
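To make the missing context problem concrete, here is a minimal sketch of how an event-level, longitudinal dataset lets you assemble the context window around an imaging study – the past history and follow-up outcomes that image-text pairs alone leave out. The event codes, column names, and window sizes below are purely illustrative, not the released schema.

```python
from datetime import timedelta
import pandas as pd

# Hypothetical event-level table: one row per clinical event for one patient.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "time": pd.to_datetime(["2019-03-02", "2020-01-15", "2020-01-20", "2020-07-01"]),
    "code": ["ICD10/I26.9", "CPT/71275", "NOTE/radiology_impression", "ICD10/Z86.711"],
})

def context_window(events, patient_id, anchor_time, days_before=365, days_after=180):
    """Return all events for a patient within a window around an anchor event (e.g., a CT scan)."""
    mask = (
        (events["patient_id"] == patient_id)
        & (events["time"] >= anchor_time - timedelta(days=days_before))
        & (events["time"] <= anchor_time + timedelta(days=days_after))
    )
    return events.loc[mask].sort_values("time")

# Past history and follow-up surrounding the chest CT angiogram on 2020-01-15.
print(context_window(events, patient_id=1, anchor_time=pd.Timestamp("2020-01-15")))
```

A vision-language pair would keep only the scan and its impression; the longitudinal view also surfaces the prior diagnosis and the downstream outcome, which is exactly the signal needed for prognostic modeling.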
Standardized Tasks Enable Accurate Comparisons
All de-identified datasets include benchmark tasks that probe current technical challenges in healthcare AI. These shared tasks make it possible to maintain a unified leaderboard for community tracking of state-of-the-art model developments.

In addition to having defined task labels, preserving unseen held-out test sets is essential for accurately comparing the performance of EHR foundation models in classification and prediction tasks. With MIMIC data, individual researchers typically define their own train/test splits, which necessitates retraining foundation models from scratch to compare methods—a costly and impractical process that hinders reproducibility and standardization of performance estimates.
To address this, each benchmark dataset ships with a canonical train/validation/test split, and the split assignments will remain consistent across current and future dataset releases. All of our released EHR foundation models also respect this canonical split, ensuring that evaluations on our benchmarks do not suffer from data leakage from pretraining.
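For illustration, here is a minimal sketch of respecting a canonical split rather than re-splitting by hand; the file name and column names are placeholders, not the exact release schema.

```python
import pandas as pd

# Hypothetical file names and columns; the released datasets define the actual schema.
splits = pd.read_csv("person_splits.csv")     # assumed columns: person_id, split ("train"/"val"/"test")
events = pd.read_parquet("events.parquet")    # event-level table with a person_id column

train_ids = set(splits.loc[splits["split"] == "train", "person_id"])
test_ids = set(splits.loc[splits["split"] == "test", "person_id"])

train_events = events[events["person_id"].isin(train_ids)]
test_events = events[events["person_id"].isin(test_ids)]

# Pretraining, feature engineering, and hyperparameter tuning should only ever
# touch the train (and validation) patients; test_events is reserved for the
# final comparison, so results are comparable across models and free of leakage.
```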
Adherence to Data Standards to Support Tool Ecosystems
Our de-identified datasets are derived from Stanford's internal STARR data repository and are released in the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM 5.4) format. While OMOP supports a robust ecosystem of statistical analysis tools, it is not optimized for training and evaluating foundation models. We therefore co-developed the Medical Event Data Standard (MEDS), an international collaboration among academic institutions such as Harvard University, Massachusetts Institute of Technology (MIT), Columbia University, and Korea Advanced Institute of Science & Technology (KAIST), to establish an ecosystem for EHR-based model development and benchmarking, along with tutorials, data quality tools, and our open-source training infrastructure. To bridge the OMOP and MEDS worlds, we also make our datasets available in the MEDS format and release tools such as MEDS Reader, which accelerates data loading by up to 100x.
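To show why a simple event-level standard is convenient for model development, here is a minimal sketch of reconstructing patient timelines from MEDS-style parquet shards. The path and exact column names are assumptions for illustration, and MEDS Reader provides a far faster loading path than the plain pandas approach shown here.

```python
import pandas as pd

# Read all event shards for a dataset released in MEDS format (path is illustrative).
events = pd.read_parquet("medalign_meds/data")  # directory of event-level parquet shards

# Each row is one clinical event; the core MEDS columns are roughly
# (subject_id, time, code, numeric_value). Sorting by subject and time
# reconstructs each patient's longitudinal timeline for model training.
timelines = (
    events.sort_values(["subject_id", "time"])
          .groupby("subject_id")[["time", "code"]]
          .apply(lambda patient: list(zip(patient["time"], patient["code"])))
)
print(timelines.head())
```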
Data Access Protocol and Researcher Responsibilities
While the data are de-identified, they are still healthcare data and may be accessed only under specific access protocols. The data access protocol and licensing are modeled after PhysioNet, with the MIMIC datasets serving as a key inspiration for our approach to dataset releases. Researchers are required to apply through a data portal on Redivis, sign a user-level data use agreement (DUA) and rules-of-behavior agreement, and provide a valid CITI training certificate before access is granted.
The Road Ahead
We are excited for the community to use and build upon these datasets. As one example, our forthcoming FactEHR dataset, a factual decomposition and verification benchmark, is built using clinical notes sampled from MIMIC and MedAlign.
More Resources
Special Thanks
Releasing these datasets was a massive collaboration involving multiple offices and champions across Stanford University and Stanford Healthcare.
This research used data or services provided by STARR, the "STAnford medicine Research data Repository," a clinical data warehouse containing live Epic data from Stanford Health Care, Stanford Children's Hospital, the University Healthcare Alliance and Packard Children's Health Alliance clinics, and other auxiliary data from hospital applications such as radiology PACS. The STARR platform is developed and operated by the Stanford Medicine Research Technology team and is made possible by funding from the Stanford School of Medicine Dean's Office.
Governance, Privacy, and Licensing
Austin Aker, Scott Edmiston, Jonathan Gortat, Mariko Kelly, Julie Marie Romero, Reed Sprague
Technology & Digital Solutions
Somalee Datta, Priya Desai, Todd Ferris, Natasha Flowers, Joseph Mesterhazy
Stanford AIMI Center
Stephanie Bogdan, Sarah Bogdan Warner, Johanna Kim, Natalie Lee, Lindsey Park, Angela Shin, Jacqueline Thomas, Liberty Walton, Gabriel Yip
Stanford Center for Population Health Sciences, Stanford Libraries, Redivis
Isabella Chu, Peter Leonard, Ian Mathews
Research
MEDS Working Group