Why 'Zero-Shot' Clinical Predictions Are Risky | Stanford HAI
Why 'Zero-Shot' Clinical Predictions Are Risky

Date: January 07, 2026
Topics: Healthcare, Foundation Models

These models generate plausible timelines from historical patterns; without calibration and auditing, their “probabilities” may not reflect reality.

The healthcare industry is buzzing with the promise of a "MedGPT moment." Generative models trained on millions of electronic health records (EHRs) are being viewed as "zero-shot predictors"—tools capable of forecasting patient mortality or disease progression without any task-specific training.

However, this framing glosses over a subtle but important distinction. While these models are powerful, they are not actually designed to "predict" clinical outcomes in the traditional sense. They are simulators.

Simulation vs. Prediction

Unlike a validated forecasting tool, a generative EHR model works by learning patterns from historical data to generate plausible patient timelines—sequences of diagnoses, procedures, medication codes, lab values, and their timing. When asked to estimate the risk of a 30-day readmission, the model doesn't "know" the answer; it generates, for example, 100 hypothetical future timelines for that patient and counts how often a readmission code appears.

If 60 out of 100 simulated timelines show a readmission, the model reports a 60% risk. However, these frequencies are derived from simulated patterns, not necessarily real-world probabilities. Treating a simulation as an "oracle" prediction can lead to unsafe clinical decisions, such as overtreating low-risk patients or missing high-risk ones.
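The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not any real system's API: `sample_timeline` is a stub standing in for a generative EHR model's decoder, and the event codes are made up.

```python
import random

# Hypothetical event code for a 30-day readmission.
READMISSION_CODE = "READMIT_30D"

def sample_timeline(patient_history, rng):
    """Stub for a generative EHR model. A real model would condition on
    patient_history and autoregressively decode a plausible sequence of
    future diagnosis/procedure/medication codes; here we just emit a
    random sequence so the estimator below is runnable."""
    vocab = ["LAB_OK", "RX_REFILL", "OUTPT_VISIT", READMISSION_CODE]
    return [rng.choice(vocab) for _ in range(10)]

def simulated_risk(patient_history, n_samples=100, seed=0):
    """Estimate readmission 'risk' as the fraction of sampled timelines
    containing the readmission code. Note this is a frequency over
    simulations, not a validated real-world probability."""
    rng = random.Random(seed)
    hits = sum(
        READMISSION_CODE in sample_timeline(patient_history, rng)
        for _ in range(n_samples)
    )
    return hits / n_samples

risk = simulated_risk(patient_history=[], n_samples=100)
print(f"simulated 30-day readmission risk: {risk:.2f}")
```

The key point the sketch makes concrete: the number that comes out is only as trustworthy as the simulator that produced the timelines, which is exactly why the calibration and auditing criteria below matter.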

Why We Haven’t Reached the "MedGPT Moment"

Comparisons to ChatGPT are often misleading. The leap from early GPT models to ChatGPT required a massive increase in scale, orders of magnitude more data, and specialized alignment techniques like human feedback to ensure safety and reliability.

Current generative EHR models are roughly where language models were between GPT-2 and GPT-3. They show promise but lack the safety refinements and rigorous calibration needed for clinical use. They also face unique medical challenges, such as representing precise timing and navigating complex hospital coding systems.

A New Evaluation Paradigm

To ensure these models are used responsibly, we propose five evaluation criteria:

  1. Performance by Frequency: Reporting how well models perform on rare vs. common medical events.

  2. Calibration: Ensuring a 30% predicted risk actually corresponds to 30% of patients experiencing that outcome.

  3. Timeline Completion: Reporting how often the model fails to generate a full patient timeline.

  4. Shortcut Audits: Checking if models rely on administrative "shortcuts" (like discharge codes) rather than medical conditions to make forecasts.

  5. Out-of-Distribution Validation: Testing models on fundamentally different patient populations without retraining.
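The calibration criterion (item 2) has a standard operational form: bin patients by predicted risk and compare each bin's mean prediction to the observed outcome rate. A minimal sketch, using made-up illustrative data rather than any real cohort:

```python
def calibration_table(preds, outcomes, n_bins=5):
    """Group (predicted risk, observed outcome) pairs into equal-width
    risk bins and report, per non-empty bin, the mean predicted risk
    and the observed event rate. For a well-calibrated model the two
    columns should roughly match."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    rows = []
    for i, b in enumerate(bins):
        if not b:
            continue
        mean_pred = sum(p for p, _ in b) / len(b)
        obs_rate = sum(y for _, y in b) / len(b)
        rows.append((i, mean_pred, obs_rate, len(b)))
    return rows

# Illustrative data: a ~30% predicted risk should correspond to roughly
# 30% of those patients experiencing the outcome.
preds = [0.1, 0.1, 0.3, 0.3, 0.3, 0.7, 0.7, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
for i, mean_pred, obs_rate, n in calibration_table(preds, outcomes):
    print(f"bin {i}: predicted={mean_pred:.2f} observed={obs_rate:.2f} (n={n})")
```

Real evaluations would use far larger cohorts and summary statistics such as expected calibration error, but the comparison being made is the same: predicted frequency against observed frequency, bin by bin.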

By shifting our interpretation from prediction to simulation, we can better understand the strengths and limitations of these tools. This will lay the foundation for designing evaluation, oversight, and deployment strategies that allow generative AI to genuinely improve clinical care.

Read our full commentary in Nature Medicine.

Contributors: Suhana Bedi, Jason Alan Fries, and Nigam H. Shah
