Toward Fairness in Health Care Training Data

This brief highlights the lack of geographic representation in medical-imaging AI training data and calls for nationwide, diversity-focused data-sharing initiatives.
Key Takeaways
Bias arises when we build algorithms using datasets that don’t mirror the population. When algorithms built on these nonrepresentative data are applied to larger swathes of the population, the mismatch has the potential to confound research findings.
The vast majority of the health data used to build AI algorithms came from only three states, with little or no representation from the remaining 47 states.
Policymakers, regulators, industry, and academia need to work together to ensure medical AI data reflect America’s diversity across not only geography but also many other important attributes. To that end, nationwide data sharing initiatives should be a top priority.
Executive Summary
With recent advances in artificial intelligence (AI), researchers can now train sophisticated computer algorithms to interpret medical images, often with accuracy comparable to that of trained physicians. Yet our recent survey of medical research shows that these algorithms rely on datasets that lack population diversity and could introduce bias into the understanding of a patient’s health condition.
Artificial intelligence algorithms increasingly inform the decisions of human experts. In medical imaging, these algorithms may help a doctor spot a subtle finding or suggest an alternate diagnosis. But bias in the data used to train these high-stakes algorithms can bias the algorithm itself. Our analysis shows that the datasets used to develop these algorithms come from only a handful of locations, which raises serious questions for policymakers but also provides opportunities for course correction.
In our research, published in the Journal of the American Medical Association, we looked at data from more than 70 studies that used U.S. patient data to train algorithms designed to compete or collaborate with physicians to perform diagnostic tasks. Overwhelmingly, the datasets came from three states (California, Massachusetts, and New York), with little or no representation from the remaining 47 states. Rectifying this lack of representation in medical data should be front of mind for health policymakers and regulators. Lack of data diversity can be addressed in part by initiatives to streamline the nation’s digital infrastructure, to enhance the availability of patient data from underrepresented populations for larger studies, and to incentivize ethical data sharing and the democratization of medical data.







