Operationalizing Real-Time Monitoring of Clinical AI

This brief demonstrates how real-time monitoring can address critical gaps in the oversight of radiological AI tools.
Key Takeaways
Radiological AI tools account for the largest share of FDA-authorized healthcare AI, yet clinical adoption remains slow and most deployed systems lack robust performance monitoring.
We introduce the Ensemble Monitoring Model (EMM) — a framework that assesses uncertainty in the predictions of radiology AI models trained to detect abnormalities (in this case, brain bleeds), thereby providing clinicians with actionable signals at the point of care and enabling real-time monitoring of AI tool performance after clinical adoption.
EMM addresses an urgent gap by offering a practical, customizable method for signaling when confidence is low in real time, diagnosing failure modes, and supporting retraining of clinical AI when needed.
Policymakers should treat continuous performance monitoring as a core component of responsible AI deployment in healthcare and consider requiring healthcare AI vendors to put in place post-deployment monitoring mechanisms.
Executive Summary
AI tools are increasingly used in radiology, with the specialty accounting for approximately 76% of all FDA-authorized AI-enabled medical devices as of December 2025. A variety of tools can detect anomalies in X-rays or CT scans and provide diagnostic support. Yet many of these AI systems are deployed with limited mechanisms for monitoring and evaluating their performance, leaving clinicians to determine on their own which AI outputs are reliable. Without effective post-deployment oversight, these tools risk contributing to diagnostic errors and missed findings.
In our paper “Automated real-time assessment of intracranial hemorrhage detection AI using an ensemble monitoring model (EMM),” we introduce a new framework to enable real-time monitoring of AI radiology tool performance after deployment. Inspired by clinical consensus practices, the Ensemble Monitoring Model (EMM) measures agreement between a primary AI model and an ensemble of five independent submodels to estimate uncertainty without requiring access to black box model components. Using a large dataset focused on the detection of brain bleeds, we demonstrate that EMM can reduce radiologists’ cognitive burden by effectively characterizing AI model uncertainty in real time at the point of care — when radiologists review both the images and the corresponding AI output — and guiding appropriate responses when cases are flagged for reduced accuracy.
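As a rough illustration of the agreement idea, the sketch below counts how many submodels match the primary model's binary call and maps that fraction to a confidence level. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `ensemble_agreement`, the simplification to binary outputs, and the 1.0 and 0.6 thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AgreementResult:
    agreement: float  # fraction of submodels that match the primary model
    confidence: str   # "high", "medium", or "low" (hypothetical buckets)


def ensemble_agreement(primary_positive: bool,
                       submodel_positive: Sequence[bool],
                       high: float = 1.0,
                       low: float = 0.6) -> AgreementResult:
    """Map primary-vs-ensemble agreement to a confidence bucket.

    Only model outputs are used, so no access to the primary model's
    weights or training data is assumed. Thresholds are illustrative.
    """
    if not submodel_positive:
        raise ValueError("at least one submodel prediction is required")
    # Count submodels whose binary call equals the primary model's call.
    matches = sum(1 for p in submodel_positive if p == primary_positive)
    agreement = matches / len(submodel_positive)
    if agreement >= high:
        level = "high"
    elif agreement >= low:
        level = "medium"
    else:
        level = "low"
    return AgreementResult(agreement, level)


if __name__ == "__main__":
    # Primary model flags a bleed; three of five submodels agree.
    print(ensemble_agreement(True, [True, True, True, False, False]))
```

With five submodels and these illustrative thresholds, full agreement would be labeled high confidence, while two or fewer matches would fall into the low bucket and could prompt closer review.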
The growing reliance on AI in radiology and healthcare more broadly highlights that effective governance cannot stop at product approval. There is a critical need for total lifecycle management that ensures AI tools remain safe, accurate, and reliable after they are deployed in clinical settings. EMM enables AI models to be continually optimized and monitored after deployment. Policymakers should view methods like EMM as an important component of a broader regulatory strategy to ensure that AI in healthcare delivers measurable benefits without introducing new and unmanaged risks.
Introduction
Despite an exponential increase in FDA-cleared radiological AI tools over the last decade, clinical adoption has been slow. These tools promise to enhance clinical efficiency — for example, by supporting radiology tasks that involve detecting anomalies in medical images and classifying or prioritizing different cases. Yet their adoption has also been accompanied by safety concerns, including a potential increase in misdiagnosis caused, for example, by cognitive pitfalls such as automation or confirmation bias. As a result, clinicians often have to meticulously verify each AI result.
Evidence shows that clinicians are strongly influenced by how certain an AI model claims to be about its predictions. When a system provides clear confidence information, physicians are more likely to incorporate the output into their decision-making. When no measure of certainty is available, clinicians are left to rely only on their own judgment and tend to trust the model far less.
Today, most monitoring of radiology AI systems still relies on retrospective, labor-intensive reviews of a small amount of manually labeled data, which provide only a partial view of real-world performance. To address this problem, researchers have developed real-time monitoring techniques that estimate model confidence by reusing the dataset the AI system was trained on. Other methods approximate predictive reliability with “deep ensembles,” i.e., collections of smaller, independent models that share the same architecture but are each trained from a different random starting point, causing them to learn in subtly different ways.
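To make the deep-ensemble idea concrete, the sketch below shows one common way the members' outputs might be combined once each member has produced a probability for the same image: the mean serves as the prediction, and the spread across members serves as an uncertainty signal. This is an illustrative sketch only; the function name `summarize_ensemble` and the choice of standard deviation and binary entropy as summaries are assumptions, and the training of the members themselves is not shown.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_ensemble(probs: Sequence[float]) -> Dict[str, float]:
    """Summarize per-member probabilities for one case.

    probs: each ensemble member's predicted probability of an abnormality.
    Returns the mean prediction, the spread across members (population
    standard deviation), and the binary entropy of the mean in bits.
    """
    if len(probs) < 2:
        raise ValueError("an ensemble needs at least two members")
    mean = statistics.fmean(probs)
    spread = statistics.pstdev(probs)
    # Clamp to avoid log(0) when every member outputs exactly 0 or 1.
    eps = 1e-12
    p = min(max(mean, eps), 1.0 - eps)
    entropy = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    return {"mean": mean, "spread": spread, "entropy": entropy}


if __name__ == "__main__":
    # Members that broadly agree versus members that disagree.
    print(summarize_ensemble([0.91, 0.88, 0.93, 0.90, 0.89]))
    print(summarize_ensemble([0.95, 0.20, 0.70, 0.35, 0.85]))
```

A larger spread indicates that members trained from different starting points disagree on the case, which is the kind of signal these methods treat as reduced reliability.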
While these techniques can be effective in research settings, they share a major practical limitation: Nearly all of them require access to internal model components such as training datasets, model weights, or intermediate outputs. For commercial AI products, which are typically deployed as closed, black box systems, this approach is largely unfeasible, leaving healthcare providers and policymakers without the means to oversee clinical adoption.
There is a need for real-time monitoring systems that can automatically characterize model confidence at the point of care without requiring access to internal model details. While measuring prediction uncertainty represents only one dimension of AI oversight — model performance can also be undermined by factors such as flawed input data, poor image quality, or improper image presentation — it remains a particularly important and substantive component of effective post-deployment evaluation.







