
For more trustworthy algorithms, scholars at a recent conference suggest developing benchmarks for each step of the data pipeline. | iStock

Data is central to the AI enterprise. Image recognition programs learn to recognize a cow or skunk by “seeing” thousands of images of each; autonomous vehicle systems learn how to merge onto a highway or brake for pedestrians by reviewing many thousands of hours of video and sensor data; the most advanced natural language programs rely on the entirety of Wikipedia.

Nevertheless, AI researchers often focus on tinkering with their model architectures and fine-tuning their algorithms rather than on making sure their datasets are suited to the task at hand. The result: Many refined AI models still fail to yield trustworthy results when deployed in the wild.

To change that, developers must turn their attention toward the data side of AI research, says James Zou, assistant professor of biomedical data science at Stanford University and member of the Stanford Institute for Human-Centered Artificial Intelligence. “One of the best ways to improve algorithms’ trustworthiness is to improve the data that goes into training and evaluating the algorithm,” he says.



Even though AI researchers understand that high-quality data is essential for a wide variety of AI applications, data-centric approaches have been undervalued, Zou says. In hopes of expanding the community of AI researchers who take a data-centric view, Zou, along with Ce Zhang, assistant professor in computer science at ETH Zurich, and graduate students Karan Goel from Stanford and Kenza Amara and Cedric Renggli from ETH Zurich, recently hosted a two-day Data-Centric AI Virtual Workshop.

“Creating good datasets for AI models has been an artisanal process,” Zou says. “The goal of the workshop was to explore how to turn that process from an art into a more principled scientific and engineering discipline.”

Key themes for the December workshop included the importance of shifting from a model-centric to a data-centric perspective, the need to develop benchmarks for each step of the data pipeline, and the value of getting more communities involved in building datasets for AI.

Data Centrality

In developing models, AI researchers need to consider data at each step, from product development through deployment and post-deployment, Zou says.

At the outset, researchers need to consider what data they should use, how much data they need, what features are important, whether their data measures what they’re studying or is a legitimate proxy for it, and how to make sure their datasets are representative and unbiased. Next, they need to consider how to clean, annotate, and sculpt the data; which subsets of the original data are most useful; and how to divvy up the data for training, validation, and testing. 
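As a rough illustration of the last of those steps, dividing data for training, validation, and testing, here is a minimal Python sketch. The function name `split_dataset`, the 80/10/10 fractions, and the fixed seed are assumptions made for this example; they are not recommendations from the workshop.

```python
import random

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test

train, val, test = split_dataset(range(100))
```

A purely random split like this one is only a starting point; it does nothing on its own to guarantee that each partition is representative of the groups the model will meet after deployment.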

During the model training stage, researchers can also consider improving performance by using clever strategies such as data augmentation, mining for hard examples, and active learning, Zou notes. And finally, at the time of deployment, researchers need to plan for keeping the model up to date as new data comes along; auditing and monitoring the model to make sure it produces fair outcomes in the real world; and making sure the system complies with data regulations related to privacy and consent.
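Active learning, one of the strategies Zou mentions, can be sketched in a few lines. The version below uses uncertainty sampling, which is only one of several possible selection rules, and it assumes a binary classifier that has already produced a probability for each unlabeled example. The name `select_for_labeling` and the sample probabilities are hypothetical.

```python
def select_for_labeling(probabilities, budget):
    """Return the indices of the `budget` least-confident binary predictions."""
    # Predictions closest to 0.5 are the ones the model is least sure about.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]

# Made-up model outputs for five unlabeled examples.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
picked = select_for_labeling(probs, budget=2)
```

Here the examples at indices 1 and 3 would be sent to human annotators first, since a new label for them is likely to teach the model more than a label for an example it already classifies confidently.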

“As AI model-building rapidly matures,” Zou says, “most of AI researchers’ time and resources will need to be devoted to these data issues.”

Data Benchmarking

In computer science, researchers use benchmarks to measure how well a piece of hardware or software does a task compared to a particular standard. But in the data-centric AI context, benchmarks have been lacking.

To address this problem, Zou, Zhang and their colleagues recently released DCBench, a suite of benchmark challenges for data-centric AI that offers hundreds of self-contained puzzles related to three initial tasks. “These are some of the first benchmarks for solving data-centric challenges,” Zou says.

Users of DCBench can try their hand at determining the minimum dataset to train a model (a task called dataset pruning), cleaning a dataset efficiently (a task called budget cleaning), and detecting errors in a dataset given a trained model (a task called slice discovery). “The site offers not only the puzzles, but the solutions, so people can practice and get better at these tasks,” Zou says.
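To give a feel for what the slice discovery task is after, the sketch below computes a trained model's error rate on each subgroup of a dataset. It is not DCBench's interface or its scoring method. It assumes each example already carries a slice tag, whereas real slice discovery methods generally have to find such groups themselves. All names and records here are invented for illustration.

```python
from collections import defaultdict

def slice_error_rates(records):
    """records: iterable of (slice_name, true_label, predicted_label) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for slice_name, truth, predicted in records:
        totals[slice_name] += 1
        if truth != predicted:
            errors[slice_name] += 1
    # Fraction of misclassified examples within each slice.
    return {name: errors[name] / totals[name] for name in totals}

# Hypothetical predictions, tagged by lighting condition.
records = [
    ("daylight", "cow", "cow"),
    ("daylight", "skunk", "skunk"),
    ("night", "cow", "skunk"),
    ("night", "skunk", "skunk"),
]
rates = slice_error_rates(records)
worst = max(rates, key=rates.get)  # the slice where the model struggles most
```

A model can look accurate on average while failing badly on one slice, which is why per-slice numbers like these are worth checking before deployment.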

Because DCBench is still new, Zhang and Zou want researchers to provide feedback about whether the tool is focused on the right problems and measuring them in the right way, Zhang said at the workshop. In the future, Zhang says, DCBench will include many more tasks and hopefully become part of a larger-scale effort that serves as a useful resource for the community.

Building Trusted Datasets

Too often, AI researchers rely on datasets developed by just a handful of universities and companies. “If we want to build datasets that are more representative, reliable, and trustworthy, we need to have broader participation,” Zou says.

Zou is particularly excited about citizen science efforts in which communities themselves drive the creation of datasets. For example, iNaturalist is an app developed by Pietro Perona, professor of electrical engineering and computation and neural systems at Caltech, in collaboration with the California Academy of Sciences. It relies on crowdsourcing from people who share their knowledge to help users identify more than 50,000 plants, animals, and fungi all around the world. On the back end, the app also evaluates which contributors’ identifications are most accurate, and adjusts the weight of their contributions accordingly.
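The idea of weighting contributions by each contributor's track record can be illustrated with a simple weighted vote. This is a sketch of the general technique only. The article does not describe iNaturalist's actual algorithm, and the function name, the neutral default weight of 0.5, and the contributor data below are all assumptions.

```python
from collections import defaultdict

def weighted_identification(votes, accuracy):
    """votes: {contributor: label}; accuracy: {contributor: past accuracy in [0, 1]}."""
    scores = defaultdict(float)
    for contributor, label in votes.items():
        # Contributors with no history get a neutral weight (an assumption of this sketch).
        scores[label] += accuracy.get(contributor, 0.5)
    return max(scores, key=scores.get)

# Hypothetical contributors and their historical accuracy.
votes = {"ana": "red fox", "ben": "coyote", "cy": "coyote"}
accuracy = {"ana": 0.95, "ben": 0.40, "cy": 0.45}
result = weighted_identification(votes, accuracy)
```

In this made-up case the single highly accurate contributor outweighs two less reliable ones, so the identification comes out as "red fox" even though "coyote" has more raw votes.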

As another example, Katharina Borchert, who until recently served as the Chief Innovation Officer at Mozilla, described a project called Common Voice, which collects people’s voice samples for an app-based open-source voice recognition project. To create this publicly available dataset, Mozilla relies only on voice samples contributed by community volunteers. Mozilla has promoted the program around the globe and has thus far collected voice samples in 76 different languages.

The project has been particularly successful in Rwanda, where the gathering of Kinyarwanda voice samples is linked to a monthly public service day called umuganda. By asking people to contribute to Common Voice on their service day, Mozilla has developed a small but very active community of Rwandan contributors.

Borchert says datasets of underserved languages from Africa and other emerging markets will be particularly powerful. “That is where the next billion people will come online and that is also where literacy rates are much lower and where I still believe that speech interfaces can make a big difference.” 

The potential for crowdsourcing datasets for AI is almost unlimited, Borchert says. Despite the challenges, such as deciding how to set up the data collection, who to collect data from, how to motivate people to contribute, and how to deal with privacy issues, Borchert predicts that the approach will become widespread. “We’re going to see more and more of this, and the easier the contribution mechanism is, the more successful it can be.”

Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.
