Machine learning research traditionally studies a single model at a time. But the impact of this technology on people depends on the cumulative result of many interactions people have with different models, which together form a model ecosystem. In hiring, for example, a job candidate's outcome depends not just on one employer using one hiring algorithm, but on every company whose hiring algorithm evaluates that candidate.
An interdisciplinary team from the Center for Research on Foundation Models (CRFM), led by Stanford computer science PhD student Rishi Bommasani, analyzed several machine learning model ecosystems. Their aim was to characterize how individuals experience machine learning across multiple models in contexts such as computer vision, natural language processing, and speech recognition.
Their paper, titled “Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes,” uncovers unsettling patterns of systemic failure that are invisible from the prevailing single-model perspective. Among other findings, the study, which will be presented at the Conference on Neural Information Processing Systems (NeurIPS) 2023, identifies new forms of racial disparity in dermatology imaging that appear in the predictions of ML models but not in those of human dermatologists.
Studying Commercially Deployed ML
The CRFM team built on previous work hypothesizing that standard practices in machine learning could yield homogeneous outcomes. In their past paper, they conjectured that the dependence of modern machine learning models on the same datasets (e.g., ImageNet) and foundation models (e.g., GPT-4) would lead to a pattern in which some individuals exclusively experience negative outcomes. They believed that algorithmic monoculture, or reliance on the same algorithms or algorithmic components, could subject some people to repeated harm. If individuals or groups experience negative outcomes repeatedly, systemic exclusion could become institutionalized.
The researchers called this concept “outcome homogenization,” where algorithmic monoculture ensures that individuals experience the same outcomes repeatedly across AI tools.
To build on this line of work, they designed a follow-up study that applies their ecosystem-level perspective to commercially deployed ML models from providers such as Amazon, Google, IBM, and Microsoft. The team leveraged a large-scale ML API (application programming interface) audit called HAPI (History of APIs) to identify overall trends in deployed machine learning. This unique dataset spans three modalities (text, images, and speech), with three commercial systems per modality, and covers a three-year period from 2020 to 2022. Altogether, HAPI provides 11 datasets with the predictions of nine deployed ML models on a total of 1.7 million data points.
Capitalizing on the breadth of the HAPI audit, the CRFM researchers aimed to provide generalizable insights on three pressing questions:
- How pervasive are homogeneous outcomes?
- When models change, how do they impact the broader ecosystem in which they operate?
- How do ecosystem-level outcomes vary across race?
A clear trend emerged in every context they considered: Commercial ML systems are prone to systemic failure, meaning some people are consistently misclassified by every available model, and this is where the greatest harm becomes apparent. If every voice assistant on the market uses the same underlying algorithm, and that algorithm can't recognize an individual's unique way of speaking, then that person is effectively excluded from using any speech-recognition technology. A similar pattern in other domains would mean that individuals declined by one bank or hiring firm may not be approved for a loan by any other financial institution or hired by any other firm.
“We found there are users who receive clear negative outcomes from all models in the ecosystem,” says Connor Toups, a Stanford computer science graduate student who served as lead author of the paper. “As we move to machine learning that mediates more decisions, this type of collective outcome is important to assessing overall social impact at an individual level.”
Polarized Outcomes Dominate
As the team investigated their three primary research questions, the answers painted a stark picture. On the question of how pervasive homogeneous outcomes are, the team expected that even if some models misclassified an individual, others would get it right. Instead, across every model ecosystem they studied, the rate of homogeneous outcomes exceeded baselines: the deployed models collectively failed, or collectively succeeded, on the same instances more often than would be expected.
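To make that comparison concrete, here is a minimal sketch, not the paper's code, of how one might measure it when per-instance correctness is available for each model. The function name and toy data are invented, and the baseline shown here assumes the simplest convention: the rate of all-fail or all-succeed outcomes expected if each model's errors were statistically independent given its accuracy.

```python
import numpy as np

def homogeneity_vs_baseline(correct):
    """Compare observed rates of all-models-fail and all-models-succeed
    outcomes against an independence baseline.

    correct: boolean array of shape (n_instances, n_models);
             correct[i, j] is True if model j got instance i right.
    """
    correct = np.asarray(correct, dtype=bool)

    # Observed rates: every model wrong, or every model right, on the same instance.
    observed_all_fail = float((~correct).all(axis=1).mean())
    observed_all_pass = float(correct.all(axis=1).mean())

    # Baseline: the rates expected if each model's errors were statistically
    # independent, given its observed accuracy on this data.
    acc = correct.mean(axis=0)
    baseline_all_fail = float(np.prod(1 - acc))
    baseline_all_pass = float(np.prod(acc))

    return {
        "observed_all_fail": observed_all_fail,
        "baseline_all_fail": baseline_all_fail,
        "observed_all_pass": observed_all_pass,
        "baseline_all_pass": baseline_all_pass,
    }

# Toy usage with made-up predictions: three roughly 85%-accurate models that
# are independent by construction, so observed and baseline rates roughly match.
rng = np.random.default_rng(0)
toy = rng.random((10_000, 3)) < 0.85
print(homogeneity_vs_baseline(toy))
```

On real deployed systems, the paper's finding corresponds to the observed rates sitting above these baselines.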
But what happens when models improve, such as when Amazon's sentiment analysis model became more accurate from 2020 to 2021? Unfortunately, the team found that when a single model improved over time, the ecosystem as a whole showed less improvement than the baseline would predict. In other words, the benefits of higher accuracy accrued largely to individuals who were already correctly classified by other models in the ecosystem.
“We have data from multiple years and multiple iterations of a model where the accuracy rate has improved, but it only seems to get better for users who are already experiencing positive outcomes; at the ecosystem level, users who need the most support get the least,” says Toups.
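One way to quantify that pattern, assuming per-instance correctness records are available for a model before and after an update as well as for the other models in the ecosystem, is sketched below. The function and variable names are hypothetical and not taken from the paper.

```python
import numpy as np

def share_of_gains_already_served(old_correct, new_correct, others_correct):
    """Of the instances an updated model newly gets right, what share were
    already being classified correctly by at least one other model?

    old_correct, new_correct: boolean arrays of shape (n_instances,) for one
        provider's model before and after an update.
    others_correct: boolean array of shape (n_instances, n_other_models).
    """
    old_correct = np.asarray(old_correct, dtype=bool)
    new_correct = np.asarray(new_correct, dtype=bool)
    others_correct = np.asarray(others_correct, dtype=bool)

    newly_fixed = new_correct & ~old_correct      # flipped from wrong to right
    if newly_fixed.sum() == 0:
        return float("nan")                       # the update fixed nothing

    already_served = others_correct.any(axis=1)   # some other model was already right
    return float((newly_fixed & already_served).sum() / newly_fixed.sum())
```

A value near 1.0 would match the pattern described above: the update's gains go almost entirely to people who already had a correct outcome available elsewhere in the ecosystem.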
Ecosystem-Level Analysis in Medical Imaging
Finally, to explore the medical imaging context and whether ecosystem-level outcomes vary across race, the team used the Diverse Dermatology Images (DDI) dataset, which contains predictions from three models and two board-certified dermatologists on 656 skin lesion images. Here, the findings revealed an unexpected contrast between model behavior and human behavior.
Specifically, the ML models display a racial disparity not seen in the predictions of board-certified dermatologists. For darker skin tones, the models produced more homogeneous outcomes than the baseline, whereas for lighter skin tones they produced slightly fewer homogeneous outcomes than the baseline. The researchers found no such disparity in the dermatologists' predictions.
Implications for Research and Policy
The team believes the ecosystem-level methodology they’ve developed will help future research teams measure and address the societal impacts of machine learning. The approach can be applied to many real-world contexts, whether decisions are made by humans alone, by machines alone, or by a more complex combination of the two. Further, they suggest policy interventions may be necessary to prevent the negative effects of homogeneous outcomes.
“Software providers may not be aware that their systems are all failing the same people,” says Kathleen Creel, an assistant professor at Northeastern University and HAI Network Affiliate. “Without policy changes to encourage ecosystem-level monitoring, we can’t expect improvement.”
Moving forward, the scholars plan to investigate what causes homogeneous outcomes and whether machine learning exacerbates or mitigates them. To do so, researchers need more transparency into how commercial models are trained and deployed.
"Unfortunately, we don't know anything about the training data or models that underpin these commercial AI systems. And although we know these systems are widely used, we don't know specifically where these systems are deployed or who the downstream users are,” says Toups. “Therefore, even if we establish a consistent pattern of systemic failure, we are unable to further concretize the impact on people's lives. Greater transparency from ML providers would allow us to take the research further."