Why 'Zero-Shot' Clinical Predictions Are Risky

Date: January 07, 2026
Topics: Healthcare, Foundation Models

These models generate plausible timelines from historical patterns; without calibration and auditing, their “probabilities” may not reflect reality.

The healthcare industry is buzzing with the promise of a "MedGPT moment." Generative models trained on millions of electronic health records (EHRs) are being viewed as "zero-shot predictors"—tools capable of forecasting patient mortality or disease progression without any task-specific training.

However, this framing glosses over a subtle but important distinction. While these models are powerful, they are not actually designed to "predict" clinical outcomes in the traditional sense. They are simulators.

Simulation vs. Prediction

Unlike a validated forecasting tool, a generative EHR model works by learning patterns from historical data to generate plausible patient timelines—sequences of diagnoses, procedures, medication codes, lab values, and their timing. When asked to estimate the risk of a 30-day readmission, the model doesn't "know" the answer; it generates, for example, 100 hypothetical future timelines for that patient and counts how often a readmission code appears.

If 60 out of 100 simulated timelines show a readmission, the model reports a 60% risk. However, these frequencies reflect the model's learned patterns, not necessarily real-world probabilities. Treating a simulation as an "oracle" prediction can lead to unsafe clinical decisions, such as overtreating low-risk patients or missing high-risk ones.
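
To make this counting procedure concrete, here is a minimal Python sketch. It assumes a generative EHR model exposing a hypothetical sample_next interface, plus invented READMIT_30D and END_OF_RECORD event codes; the commentary does not prescribe any particular implementation.

```python
READMISSION_CODE = "READMIT_30D"  # hypothetical event token

def generate_timeline(patient_history, model, max_events=50):
    """Sample one plausible future timeline (a sequence of event codes)
    by repeatedly drawing the next event from the generative model."""
    timeline = list(patient_history)
    for _ in range(max_events):
        next_event = model.sample_next(timeline)  # assumed sampling API
        timeline.append(next_event)
        if next_event == "END_OF_RECORD":  # hypothetical stop token
            break
    return timeline

def simulated_risk(patient_history, model, n_samples=100):
    """Estimate 'risk' as the fraction of sampled timelines containing
    the readmission code: a Monte Carlo count over simulations, not a
    validated real-world probability."""
    hits = sum(
        READMISSION_CODE in generate_timeline(patient_history, model)
        for _ in range(n_samples)
    )
    return hits / n_samples  # e.g., 60 of 100 timelines -> 0.60
```

The returned number looks like a probability, which is exactly why it is easy to misread: it summarizes what the model tends to generate, not what will happen to the patient.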

Why We Haven’t Reached the "MedGPT Moment"

Comparisons to ChatGPT are often misleading. The leap from early GPT models to ChatGPT required orders-of-magnitude increases in scale and training data, plus specialized alignment techniques, such as reinforcement learning from human feedback, to improve safety and reliability.

Current generative EHR models are roughly where language models were between GPT-2 and GPT-3. They show promise but lack the safety refinements and rigorous calibration needed for clinical use. They also face unique medical challenges, such as representing precise timing and navigating complex hospital coding systems.

A New Evaluation Paradigm

To ensure these models are used responsibly, we propose five evaluation criteria:

  1. Performance by Frequency: Reporting how well models perform on rare vs. common medical events.

  2. Calibration: Ensuring a 30% predicted risk actually corresponds to 30% of patients experiencing that outcome (illustrated in the sketch after this list).

  3. Timeline Completion: Reporting how often the model fails to generate a full patient timeline.

  4. Shortcut Audits: Checking if models rely on administrative "shortcuts" (like discharge codes) rather than medical conditions to make forecasts.

  5. Out-of-Distribution Validation: Testing models on fundamentally different patient populations without retraining.
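
As an illustration of criterion 2, calibration can be checked with a standard reliability analysis: group patients into risk bins and compare the average predicted risk in each bin to the fraction who actually experienced the outcome. This is a generic sketch, not the evaluation protocol from the commentary; y_true is assumed to hold observed 0/1 outcomes and y_pred the model's reported risks.

```python
import numpy as np

def calibration_table(y_true, y_pred, n_bins=10):
    """Bin patients by predicted risk and compare each bin's mean
    predicted risk with the observed outcome rate. For a well-calibrated
    model, the two columns should roughly match in every bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bin_ids = np.clip((y_pred * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append({
                "bin": b,
                "mean_predicted": float(y_pred[mask].mean()),
                "observed_rate": float(y_true[mask].mean()),
                "n_patients": int(mask.sum()),
            })
    return rows
```

If the model reports roughly 30% risk for a group of patients but only 10% of them actually experience the outcome, the simulation's frequencies cannot be trusted as probabilities.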

By shifting our interpretation from prediction to simulation, we can better understand the strengths and limitations of these tools. This will lay the foundation for designing evaluation, oversight, and deployment strategies that allow generative AI to genuinely improve clinical care.

Read our full commentary in Nature Medicine.

Contributors: Suhana Bedi, Jason Alan Fries, and Nigam H. Shah
