Why 'Zero-Shot' Clinical Predictions Are Risky

These models generate plausible timelines from historical patterns; without calibration and auditing, their “probabilities” may not reflect reality.
The healthcare industry is buzzing with the promise of a "MedGPT moment." Generative models trained on millions of electronic health records (EHRs) are being viewed as "zero-shot predictors"—tools capable of forecasting patient mortality or disease progression without any task-specific training.
However, this framing glosses over a subtle but important distinction. While these models are powerful, they are not actually designed to "predict" clinical outcomes in the traditional sense. They are simulators.
Simulation vs. Prediction
A generative EHR model is not a validated forecasting tool. It learns patterns from historical data in order to generate plausible patient timelines: sequences of diagnoses, procedures, medication codes, lab values, and their timing. When asked to estimate the risk of a 30-day readmission, the model doesn't "know" the answer; instead, it generates, say, 100 hypothetical future timelines for that patient and counts how often a readmission code appears.
If 60 of those 100 simulated timelines show a readmission, the model reports a 60% risk. But that frequency is a property of the model's learned patterns, not necessarily a real-world probability. Treating a simulation as an "oracle" prediction can lead to unsafe clinical decisions, such as overtreating low-risk patients or missing high-risk ones.
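To make that counting procedure concrete, here is a minimal Python sketch. The `model.sample_timeline` method, the event objects, and the `READMISSION` code are hypothetical placeholders for whatever interface a given model exposes; this illustrates the sampling logic, not any model's actual API.

```python
# Minimal sketch of "risk as simulation frequency".
# `model.sample_timeline` and the READMISSION code are hypothetical
# placeholders, not the API of any real EHR model.

READMISSION = "READMISSION"  # illustrative event code

def simulated_readmission_risk(model, patient_history, n_samples=100, horizon_days=30):
    """Estimate 30-day readmission 'risk' as the fraction of sampled
    future timelines that contain a readmission event."""
    hits = 0
    for _ in range(n_samples):
        # Draw one plausible continuation of the patient's record.
        timeline = model.sample_timeline(patient_history, horizon_days=horizon_days)
        if any(event.code == READMISSION for event in timeline):
            hits += 1
    # 60 hits out of 100 samples -> a reported "60% risk". This number is
    # a frequency over simulated futures, not a validated probability.
    return hits / n_samples
```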
Why We Haven’t Reached the "MedGPT Moment"
Comparisons to ChatGPT are often misleading. The leap from early GPT models to ChatGPT required orders of magnitude more data and compute, along with alignment techniques such as reinforcement learning from human feedback (RLHF) to improve safety and reliability.
Current generative EHR models are roughly where language models were between GPT-2 and GPT-3. They show promise but lack the safety refinements and rigorous calibration needed for clinical use. They also face unique medical challenges, such as representing precise timing and navigating complex hospital coding systems.
A New Evaluation Paradigm
To ensure these models are used responsibly, we propose five evaluation criteria:
Performance by Frequency: Reporting how well models handle rare versus common medical events, not just aggregate accuracy.
Calibration: Ensuring that a 30% predicted risk actually corresponds to roughly 30% of patients experiencing that outcome (a minimal check is sketched after this list).
Timeline Completion: Reporting how often the model fails to generate a full patient timeline.
Shortcut Audits: Checking whether models rely on administrative "shortcuts" (such as discharge codes) rather than medical conditions to make forecasts.
Out-of-Distribution Validation: Testing models on fundamentally different patient populations without retraining.
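Calibration, in particular, is simple to check in code once you have predicted risks and observed outcomes for a held-out cohort. Below is a minimal sketch using NumPy; the equal-width ten-bin table is one common convention, and `pred_risk` and `outcome` are assumed arrays of per-patient predictions and binary outcomes, not outputs of any specific model.

```python
import numpy as np

def reliability_table(pred_risk, outcome, n_bins=10):
    """Compare mean predicted risk to the observed event rate within
    equal-width risk bins. For a well-calibrated model the two match:
    patients given ~0.30 risk have the event ~30% of the time."""
    pred_risk = np.asarray(pred_risk, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Make the final bin closed on the right so pred_risk == 1.0 is counted.
        last = i == n_bins - 1
        in_bin = (pred_risk >= lo) & ((pred_risk <= hi) if last else (pred_risk < hi))
        if not in_bin.any():
            continue
        rows.append({
            "bin": f"[{lo:.1f}, {hi:.1f}{']' if last else ')'}",
            "n": int(in_bin.sum()),
            "mean_predicted": float(pred_risk[in_bin].mean()),
            "observed_rate": float(outcome[in_bin].mean()),
        })
    return rows
```

Systematic gaps between the mean_predicted and observed_rate columns would be evidence against treating the model's sampled frequencies as real-world probabilities.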
By shifting our interpretation from prediction to simulation, we can better understand the strengths and limitations of these tools. This will lay the foundation for designing evaluation, oversight, and deployment strategies that allow generative AI to genuinely improve clinical care.
Read our full commentary in Nature Medicine.