Why 'Zero-Shot' Clinical Predictions Are Risky

These models generate plausible timelines from historical patterns; without calibration and auditing, their “probabilities” may not reflect reality.
The healthcare industry is buzzing with the promise of a "MedGPT moment." Generative models trained on millions of electronic health records (EHRs) are being viewed as "zero-shot predictors"—tools capable of forecasting patient mortality or disease progression without any task-specific training.
However, this framing glosses over a subtle but important distinction. While these models are powerful, they are not actually designed to "predict" clinical outcomes in the traditional sense. They are simulators.
Unlike a validated forecasting tool, a generative EHR model works by learning patterns from historical data to generate plausible patient timelines—sequences of diagnoses, procedures, medication codes, lab values, and their timing. When asked to estimate the risk of a 30-day readmission, the model doesn't "know" the answer; it generates, for example, 100 hypothetical future timelines for that patient and counts how often a readmission code appears.
If 60 out of 100 simulated timelines show a readmission, the model reports a 60% risk. However, these frequencies are derived from simulated patterns, not necessarily real-world probabilities. Treating a simulation as an "oracle" prediction can lead to unsafe clinical decisions, such as overtreating low-risk patients or missing high-risk ones.
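The simulate-and-count procedure described above can be sketched in a few lines. This is a toy illustration, not a real model: `sample_timeline` is a hypothetical stand-in for a generative EHR model, and the event codes are invented for the example.

```python
import random

# Hypothetical event code standing in for a 30-day readmission.
READMISSION_CODE = "READMIT_30D"

def sample_timeline(patient_history, rng):
    """Stand-in for a generative EHR model: returns one simulated
    future sequence of event codes for this patient. A real model
    would condition on the full coded history; this toy version
    emits a readmission in roughly 60% of sampled futures."""
    if rng.random() < 0.6:
        return [READMISSION_CODE]
    return ["DISCHARGE_HOME"]

def estimate_risk(patient_history, n_samples=100, seed=0):
    """Monte Carlo risk estimate: sample many hypothetical
    timelines and count how often the outcome code appears."""
    rng = random.Random(seed)
    hits = sum(
        READMISSION_CODE in sample_timeline(patient_history, rng)
        for _ in range(n_samples)
    )
    return hits / n_samples

risk = estimate_risk(patient_history=["E11.9", "I10"], n_samples=100)
print(f"Simulated 30-day readmission risk: {risk:.0%}")
```

Note that the number reported is just the frequency of an event across simulated futures; nothing in this procedure guarantees it matches the real-world probability for that patient.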
Comparisons to ChatGPT are often misleading. The leap from early GPT models to ChatGPT required a massive increase in scale, orders of magnitude more data, and specialized alignment techniques like human feedback to ensure safety and reliability.
Current generative EHR models are roughly where language models were between GPT-2 and GPT-3. They show promise but lack the safety refinements and rigorous calibration needed for clinical use. They also face unique medical challenges, such as representing precise timing and navigating complex hospital coding systems.
To ensure these models are used responsibly, we propose five evaluation criteria:
Performance by Frequency: Reporting how well models perform on rare vs. common medical events.
Calibration: Ensuring a 30% predicted risk actually corresponds to 30% of patients experiencing that outcome.
Timeline Completion: Reporting how often the model fails to generate a full patient timeline.
Shortcut Audits: Checking if models rely on administrative "shortcuts" (like discharge codes) rather than medical conditions to make forecasts.
Out-of-Distribution Validation: Testing models on fundamentally different patient populations without retraining.
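The calibration criterion above has a simple operational form: group patients by predicted risk and compare each group's average prediction to its observed outcome rate. A minimal sketch, using synthetic data rather than any real EHR cohort:

```python
def calibration_table(predictions, outcomes, n_bins=10):
    """Bin patients by predicted risk and return, per non-empty bin,
    (mean predicted risk, observed outcome rate, patient count).
    In a well-calibrated model the first two numbers agree."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, obs_rate, len(members)))
    return rows

# Synthetic well-calibrated example: 10 patients all predicted at
# 30% risk, of whom 3 actually experience the outcome (30%).
preds = [0.3] * 10
outs = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
for mean_pred, obs_rate, n in calibration_table(preds, outs):
    print(f"predicted {mean_pred:.0%} vs observed {obs_rate:.0%} (n={n})")
```

A model whose 30% bin shows a 55% observed rate would fail this check even if its discrimination metrics look strong, which is exactly why calibration needs to be reported separately.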
By shifting our interpretation from prediction to simulation, we can better understand the strengths and limitations of these tools. This will lay the foundation for designing evaluation, oversight, and deployment strategies that allow generative AI to genuinely improve clinical care.
Read our full commentary in Nature Medicine.