Stanford scholars propose a fair way to quantify how much individual datasets contribute to AI model performance and companies’ bottom lines.
Each of us continuously generates a stream of data. When we buy a coffee, watch a romcom or action movie, or visit the gym or the doctor’s office (tracked by our phones), we hand over our data to companies that hope to make money from that information – either by using it to train an AI system to predict our future behavior or by selling it to others.
But what is that data worth?
“There’s a lot of interest in thinking about the value of data,” says James Zou, assistant professor of biomedical data science at Stanford University, member of the Stanford Institute for Human-Centered Artificial Intelligence, and faculty lead of a new HAI executive education program on the subject. How should companies set prices for data they buy and sell? How much does any given dataset contribute to a company’s bottom line? Should each of us receive a data dividend when companies use our data?
Motivated by these questions, Zou and graduate student Amirata Ghorbani have developed a new and principled approach to calculating the value of data that is used to train AI models. Their approach, detailed in a paper presented at the International Conference on Machine Learning and summarized for a slightly less technical audience on arXiv, is based on a Nobel Prize-winning economics method and improves upon existing methods for determining the worth of individual datapoints or datasets. In addition, it can help AI system designers identify low-value data that should be excluded from AI training sets as well as high-value data worth acquiring. It can even be used to reduce bias in AI systems.
Going Beyond the “Leave One Out” Method
Until recently, the most common approach to determining the value of data for an AI model has been the “leave one out” method, in which researchers remove each datapoint, one at a time, from a model’s training set to see how much the algorithm’s performance changes. That change in performance might seem like a reasonable way to measure each datapoint’s marginal value, but it’s not, Zou says, because “leave one out” doesn’t capture the interactions between the datapoints, which can be highly nonlinear and complicated.
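A minimal sketch of the “leave one out” idea, and of the blind spot Zou describes, might look like the following. The `utility` function is a hypothetical stand-in for model performance (it simply counts distinct labels) and is not taken from the paper.

```python
def utility(points):
    """Toy stand-in for model performance: number of distinct labels covered."""
    return len(set(points))


def leave_one_out_values(dataset):
    """Value of each point = full-data utility minus utility without that point."""
    full = utility(dataset)
    values = []
    for i in range(len(dataset)):
        rest = dataset[:i] + dataset[i + 1:]
        values.append(full - utility(rest))
    return values


# Two duplicate "cat" points and one "dog" point (hypothetical data).
dataset = ["cat", "cat", "dog"]
loo = leave_one_out_values(dataset)
print(loo)  # [0, 0, 1]
```

Each duplicate “cat” point scores zero because the other one covers for it, even though removing both would hurt. That is the kind of interaction between datapoints that a one-at-a-time measure cannot see.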
To come up with a better measure of data value, Zou and Ghorbani turned to work that contributed to a 2012 Nobel Prize in economics for American mathematician Lloyd Shapley.
In 1951, Shapley had grappled with the following question: If a team of people is offered a bonus for solving a problem, how should they split that bonus so that each person gets a fair share? He began with a few fairness principles that everyone could agree on. For example, people should be compensated in proportion to their contribution, and people who make the same contribution should be paid the same. The resulting formula calculates a value for each contributor, and this value is provably fair, efficient, and optimal. “Remarkably, there’s only one way to split the bonus where everyone is happy and no one complains,” Zou says. “That’s the Shapley value.”
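For reference, the standard textbook form of the Shapley value can be written as follows. The notation is an assumption introduced here, not the article's: N is the team of n players, v(S) is the payoff a subset S could earn on its own, and φ_i is player i's share.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr],
\qquad
\sum_{i \in N} \phi_i(v) \;=\; v(N) - v(\emptyset).
```

The bracketed term is player i's marginal contribution to a particular subset; the factorial weight averages that contribution over every order in which the team could have been assembled. The second identity says the shares add up to exactly the bonus being split.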
In recent papers, Zou and Ghorbani extended and adapted the Shapley approach to the study of data. “Instead of humans working together, now we have the data from each human working together to train an AI system,” Zou says. “The ‘bonus’ that we’re trying to partition is the individual datapoints’ contributions to the AI model’s performance.”
Instead of removing each datapoint one at a time, Zou and his colleagues create thousands of hypothetical scenarios consisting of different random subsets of a full dataset. In the end, the Shapley value of each datapoint is a weighted average of the datapoint’s marginal contribution across all of those different scenarios. Unlike the “leave one out” method, the Shapley value captures all of the possible interactions between the datapoints.
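One common way to approximate such a value is to sample random orderings of the data and average each point's marginal contribution as it is added. The sketch below illustrates that general idea on a toy problem; it is not the authors' implementation, and the `utility` function and dataset are hypothetical.

```python
import random
from itertools import permutations


def utility(points):
    """Toy stand-in for model performance: number of distinct labels covered."""
    return len(set(points))


def marginal_contributions(dataset, order, totals):
    """Add points in the given order, crediting each with its utility gain."""
    seen = []
    prev = utility(seen)
    for i in order:
        seen.append(dataset[i])
        cur = utility(seen)
        totals[i] += cur - prev
        prev = cur


def exact_shapley(dataset):
    """Average marginal contributions over every ordering (tiny datasets only)."""
    totals = [0.0] * len(dataset)
    orders = list(permutations(range(len(dataset))))
    for order in orders:
        marginal_contributions(dataset, order, totals)
    return [t / len(orders) for t in totals]


def monte_carlo_shapley(dataset, num_samples=2000, seed=0):
    """Estimate the same quantity from randomly sampled orderings."""
    rng = random.Random(seed)
    totals = [0.0] * len(dataset)
    order = list(range(len(dataset)))
    for _ in range(num_samples):
        rng.shuffle(order)
        marginal_contributions(dataset, order, totals)
    return [t / num_samples for t in totals]


dataset = ["cat", "cat", "dog"]
exact = exact_shapley(dataset)
approx = monte_carlo_shapley(dataset)
print(exact)   # [0.5, 0.5, 1.0]
print(approx)  # close to the exact values
```

Here the two duplicate “cat” points split the credit for covering that label, and the three values sum to the utility of the full dataset. Exact enumeration grows factorially with the number of points, which is why sampling is needed for anything beyond a toy example.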
In their paper, Zou and Ghorbani showed that the data Shapley value provides a better measure of data quality than the “leave one out” approach. “We like this data Shapley value because it’s very principled,” Zou says. Each datapoint gets a value – a number – that tells you how valuable or how useful the datapoint is for developing your AI system. “That information can translate into the bottom line for an AI company or become the basis for compensating data producers and data owners,” Zou says. Specifically, using the data Shapley approach, the value of the information provided by each consumer purchase, online search, or Netflix click can be determined and monetized.
Using Value to Improve AI Models
In addition to helping companies optimize AI tools and profits, or guiding procedures for paying data dividends, the data Shapley value can help companies curate data and address the biases found in many AI systems.
Data curation is itself big business. Indeed, some companies have established a solid business model of cleaning datasets to make them useful.
Zou says the Shapley value can help with such curation by identifying low-quality, noisy, or biased data. In one experiment, he and Ghorbani ran a Google image search for seven different types of skin cancer lesions and used the images to train a skin cancer classifier. Compared with a gold-standard skin cancer model, the resulting classifier did a terrible job. So they calculated each image’s Shapley value relative to the gold standard and then removed the images with low Shapley values. The result: The model’s performance improved significantly.
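The curation step can be illustrated with a toy sketch, under heavy assumptions: a one-dimensional feature, a 1-nearest-neighbor classifier, a small trusted validation set standing in for the gold standard, and exact Shapley values computed by brute force. All data and function names are hypothetical and unrelated to the skin cancer experiment itself.

```python
from itertools import permutations

# Hypothetical training set of (feature, label); the third point is mislabeled.
train = [(0.0, "benign"), (1.0, "benign"), (1.2, "malignant"), (5.0, "malignant")]
# Hypothetical trusted examples used to score the model.
trusted = [(0.2, "benign"), (1.3, "benign"), (4.8, "malignant")]


def accuracy(subset):
    """Accuracy on the trusted set of a 1-nearest-neighbor model built on subset."""
    if not subset:
        return 0.0
    correct = 0
    for x, label in trusted:
        nearest = min(subset, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == label
    return correct / len(trusted)


def shapley_values(points):
    """Exact Shapley values by averaging over all orderings (tiny sets only)."""
    totals = [0.0] * len(points)
    orders = list(permutations(range(len(points))))
    for order in orders:
        chosen, prev = [], 0.0
        for i in order:
            chosen.append(points[i])
            cur = accuracy(chosen)
            totals[i] += cur - prev
            prev = cur
    return [t / len(orders) for t in totals]


values = shapley_values(train)
worst = min(range(len(train)), key=lambda i: values[i])
curated = [p for i, p in enumerate(train) if i != worst]

print(train[worst])  # (1.2, 'malignant')
print(accuracy(train), accuracy(curated))
```

In this toy case the mislabeled point is the only one with a negative value, and dropping it lifts accuracy on the trusted set from two of three correct to all three.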
The data Shapley value can even be used to reduce the existing biases in datasets. For example, many facial recognition systems are trained on datasets that have more images of white males than minorities or women. When these systems are deployed in the real world, their performance suffers because they see more diverse populations. To address this problem, Zou and Ghorbani ran an experiment: After a facial recognition system had been deployed in a real setting, they calculated how much each image in the training set contributed to the model’s performance in the wild. They found that the images of minorities and women had the highest Shapley values and the images of white males had the lowest Shapley values. They then used this information to fix the problem – weighting the training process in favor of the more valuable images. “By giving those images higher value and giving them more weight in the training process, the data Shapley value will actually make the algorithm work better in deployment – especially for minority populations,” Zou says.
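The article does not say how the values were turned into training weights. One simple possibility, shown purely as an assumption, is to clip negative values to zero and normalize the rest so they sum to one; the function name and the input values below are hypothetical.

```python
def weights_from_values(values):
    """Clip negative values to zero, then normalize so the weights sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0:
        # No point has positive value: fall back to uniform weights.
        return [1.0 / len(values)] * len(values)
    return [v / total for v in clipped]


# Hypothetical per-image values; the third image hurts deployed performance.
values = [1.0, 0.5, -0.5, 0.5]
weights = weights_from_values(values)
print(weights)  # [0.5, 0.25, 0.0, 0.25]
```

Weights of this kind could then be passed to any training routine that accepts per-example weights, so that higher-value images count for more in the loss.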
Context Is Everything
Data value is task-specific. “The data Shapley value is not meant to be an intrinsic value for a piece of data; it isn’t permanent and persistent,” Zou says. For predicting diabetes, patients’ blood sugar levels will be more valuable than their blood pressure. For predicting heart disease, that value proposition might well flip.
“Going forward, it’s useful for our community to think about how much each person’s data is contributing to each application,” Zou says. “It’s very difficult to think about the universal value of data in a quantitative sense.”
It’s an issue that highlights the challenges of calculating a data dividend: How would companies track and calculate the value of one individual’s data across multiple tasks being done on multiple platforms and by numerous companies?
But as the value of data to companies grows, so do discussions about compensating people for their data – when to do it, when not to do it, and how to regulate data usage and data compensation. “We’re all data producers and our data are being used and bought and sold as we speak,” Zou says. “Some of us would like to be compensated, and all of us would like to know how valuable our personal data is.”
Learn more about the HAI executive education program, The Value of Data and AI: Strategies for Senior Leadership.