Why 'Zero-Shot' Clinical Predictions Are Risky | Stanford HAI

Date
January 07, 2026
Topics
Healthcare
Foundation Models
[Image: A doctor reviews a tablet in the foreground while other doctors and nurses stand over a medical bed in the background. Credit: iStock]

These models generate plausible timelines from historical patterns; without calibration and auditing, their “probabilities” may not reflect reality.
Contributor(s)
Suhana Bedi, Jason Alan Fries, and Nigam H. Shah

The healthcare industry is buzzing with the promise of a "MedGPT moment." Generative models trained on millions of electronic health records (EHRs) are being viewed as "zero-shot predictors"—tools capable of forecasting patient mortality or disease progression without any task-specific training.

However, this framing glosses over a subtle but important distinction. While these models are powerful, they are not actually designed to "predict" clinical outcomes in the traditional sense. They are simulators.

Simulation vs. Prediction

Unlike a validated forecasting tool, a generative EHR model works by learning patterns from historical data to generate plausible patient timelines—sequences of diagnoses, procedures, medication codes, lab values, and their timing. When asked to estimate the risk of a 30-day readmission, the model doesn't "know" the answer; it generates, for example, 100 hypothetical future timelines for that patient and counts how often a readmission code appears.

If 60 out of 100 simulated timelines show a readmission, the model reports a 60% risk. However, these frequencies are derived from simulated patterns, not necessarily real-world probabilities. Treating a simulation as an "oracle" prediction can lead to unsafe clinical decisions, such as overtreating low-risk patients or missing high-risk ones.
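The sampling-and-counting mechanism described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `sample_timeline` is a hypothetical stand-in for a real generative EHR model's rollout of future event codes, with made-up dynamics that yield readmissions in roughly 60% of rollouts.

```python
import random

# Hypothetical sketch: estimating 30-day readmission risk by sampling
# many simulated timelines from a generative EHR model and counting
# how often a readmission code appears.

READMISSION_CODE = "READMIT_30D"

def sample_timeline(patient_history, rng):
    """Stand-in for a generative model: returns one simulated sequence
    of future event codes for the patient. Toy dynamics only."""
    future = ["LAB_CBC", "RX_STATIN"]
    if rng.random() < 0.6:  # readmission in ~60% of rollouts
        future.append(READMISSION_CODE)
    return future

def simulated_risk(patient_history, n_samples=100, seed=0):
    """Fraction of sampled timelines containing a readmission code.
    Note: this is a simulation frequency, not a validated probability."""
    rng = random.Random(seed)
    hits = sum(
        READMISSION_CODE in sample_timeline(patient_history, rng)
        for _ in range(n_samples)
    )
    return hits / n_samples

risk = simulated_risk(patient_history=["DX_CHF", "RX_DIURETIC"])
print(f"Simulated 30-day readmission risk: {risk:.0%}")
```

The key caveat is visible in the code itself: `simulated_risk` returns whatever frequency the model's learned dynamics produce, which is only trustworthy if those dynamics match real-world outcome rates.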

Why We Haven’t Reached the "MedGPT Moment"

Comparisons to ChatGPT are often misleading. The leap from early GPT models to ChatGPT required a massive increase in scale, orders of magnitude more data, and specialized alignment techniques such as reinforcement learning from human feedback to ensure safety and reliability.

Current generative EHR models are roughly where language models were between GPT-2 and GPT-3. They show promise but lack the safety refinements and rigorous calibration needed for clinical use. They also face unique medical challenges, such as representing precise timing and navigating complex hospital coding systems.

A New Evaluation Paradigm

To ensure these models are used responsibly, we propose five evaluation criteria:

  1. Performance by Frequency: Reporting how well models perform on rare vs. common medical events.

  2. Calibration: Ensuring a 30% predicted risk actually corresponds to 30% of patients experiencing that outcome.

  3. Timeline Completion: Reporting how often the model fails to generate a full patient timeline.

  4. Shortcut Audits: Checking if models rely on administrative "shortcuts" (like discharge codes) rather than medical conditions to make forecasts.

  5. Out-of-Distribution Validation: Testing models on fundamentally different patient populations without retraining.
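Criterion 2 (calibration) is the most mechanical of the five to check. Below is a minimal sketch of one standard approach, expected calibration error, computed over binned predicted risks; the function and the toy data are illustrative assumptions, not part of the proposed framework.

```python
# Hedged sketch of a calibration check: expected calibration error (ECE).
# Predictions are grouped into bins; within each bin we compare the mean
# predicted risk to the observed outcome rate.

def expected_calibration_error(preds, outcomes, n_bins=10):
    """Mean |predicted risk - observed rate| across bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece, total = 0.0, len(preds)
    for bucket in bins:
        if not bucket:
            continue
        avg_pred = sum(p for p, _ in bucket) / len(bucket)
        obs_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_pred - obs_rate)
    return ece

# Perfectly calibrated toy example: a 30% predicted risk group in which
# exactly 3 of 10 patients experience the outcome.
preds = [0.3] * 10
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(round(expected_calibration_error(preds, outcomes), 6))  # prints 0.0
```

A well-calibrated model drives this error toward zero; a simulator whose frequencies drift from real outcome rates will show a large gap in exactly the bins clinicians care about.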

By shifting our interpretation from prediction to simulation, we can better understand the strengths and limitations of these tools. This will lay the foundation for designing evaluation, oversight, and deployment strategies that allow generative AI to genuinely improve clinical care.

Read our full commentary in Nature Medicine.