The problem has been data – not always simply a lack of it, but a lack of the right kind.
As Alex Ratner worked on his PhD in computer science at Stanford, he and his advisor, Christopher Ré, associate professor of computer science in the Stanford AI Lab and a Stanford Institute for Human-Centered Artificial Intelligence faculty member, became interested in the data challenges encountered by those training and applying new machine-learning models.
The researchers observed that even the most sophisticated academic and industry partners and scientists – the Googles and Microsofts of the world – were struggling with a key aspect of new machine-learning technology: the large volumes of hand-labeled “training data” that these machine-learning models learned from. Teams had built high-performance algorithms and models to do everything from classifying chest X-rays to analyzing legal documents much more accurately than humans alone, but their success highlighted how much data went into creating the systems, which needed even more data to deliver greater accuracy.
“It was a bottleneck and pain point so many were facing,” Ré says. “The data doesn’t just ‘show up’ out of nowhere.” Indeed, experts have to label large volumes of data, such as radiologists identifying which X-rays are more likely to indicate cancer, to help AI models develop effective algorithms.
Out of their insights emerged an ambitious Stanford AI Lab project to generate the right kind of data more efficiently for machine-learning systems. Specifically, they would ask experts to supply rules of thumb for data labeling, which would then enable the systems themselves to label and use large volumes of data quickly. This would improve system accuracy and applicability, while enabling subject matter experts to contribute to, and largely drive, the machine-learning development process with unprecedented efficiency.
In July 2020, Snorkel AI, the Palo Alto business based on that research, emerged from stealth. It has already raised $15 million from high-profile investors like Greylock and has partnered with customers across sectors. Ratner is the firm’s CEO (and a University of Washington assistant professor), working with fellow Stanford co-founders Paroma Varma, Braden Hancock, and Henry Ehrenberg. Ré serves as a board member.
Label It a (Data) Problem
If machine-learning systems are vehicles, data are their vital fuel. These systems need training data on which to learn, Ratner says, and the burden has shifted from developing systems to getting the data needed to create and improve them.
Humans are a core part of the data process. Chest X-rays, legal documents, or financial-service transactions – “unstructured data” in machine-learning terms – must be labeled by experts so that the models can understand the data better, including what aspects matter most, and create and validate rules more easily: This type of X-ray could mean cancer, this loan looks more likely to default.
But data labeling can be a time-intensive, expensive process. Labeling X-rays, for example, can be a months-long process, a heavy ask for busy experts.
“We decided to treat the process of creating training data as a first-class machine-learning problem,” Ré says. “Soon we realized this could change how people built applications in the machine-learning and AI space, changing the entire workflow.”
Getting It Out of Their Heads
The trick was tapping experts’ expertise much more efficiently.
“We wanted to enable subject-matter experts to label and build these training datasets to teach the model in ways that were higher-level and more efficient,” Ratner says. Rather than ask a doctor to label an X-ray “yes” or “no” for cancer and expect the algorithm to figure out the working rule, doctors can specify what on the X-ray led to their conclusions: “I looked for a blob here and a jagged edge there.”
In this way, experts provide the rules they work by – not just labels – and machine-learning systems analyze those in aggregate to build a higher-powered algorithm with more nuanced functions, removing “noise” from the experts’ inputs and enabling the system to effectively label data itself, a much faster, more efficient process.
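As a rough illustration of the idea, the sketch below encodes a few rules of thumb as small functions and combines their votes to label records automatically. Everything in it is an assumption made for illustration: the loan fields, the thresholds, and the function names are hypothetical, and the simple majority vote stands in for the more sophisticated aggregation and noise removal described above. It is not Snorkel AI's actual interface or method.

```python
from collections import Counter

# Label values; ABSTAIN means a rule has no opinion on a record.
ABSTAIN, NO_DEFAULT, DEFAULT = -1, 0, 1

# Hypothetical rules of thumb a loan expert might supply.
# Field names and thresholds are invented for this example.
def lf_missed_payments(loan):
    return DEFAULT if loan["missed_payments"] >= 3 else ABSTAIN

def lf_high_debt_ratio(loan):
    return DEFAULT if loan["debt_to_income"] > 0.6 else ABSTAIN

def lf_long_clean_history(loan):
    return NO_DEFAULT if loan["years_on_time"] >= 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_missed_payments, lf_high_debt_ratio, lf_long_clean_history]

def weak_label(loan):
    """Combine the rules' votes by simple majority; abstain on ties or silence."""
    votes = [lf(loan) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # the rules disagree evenly, so no label is assigned
    return counts[0][0]

loans = [
    {"missed_payments": 4, "debt_to_income": 0.7, "years_on_time": 0},
    {"missed_payments": 0, "debt_to_income": 0.2, "years_on_time": 8},
    {"missed_payments": 3, "debt_to_income": 0.3, "years_on_time": 6},
    {"missed_payments": 0, "debt_to_income": 0.4, "years_on_time": 1},
]

labels = [weak_label(loan) for loan in loans]
print(labels)  # [1, 0, -1, -1]
```

The first two loans get a label because the rules that fire agree. The third is left unlabeled because two rules conflict, and the fourth because no rule applies. A fuller system would weigh the rules by their estimated reliability rather than counting votes equally.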
“We aimed to ‘upskill’ the way that subject matter experts interface with new machine learning technology,” Ratner says. “It’s almost insulting to ask them to throw their knowledge away and sit there and label 10,000 data points. That’s almost comical underutilization.”
Ré agrees: “The idea is to get the rule for labeling data out of the human head, even if imperfectly.” The approach resembled that of a system Ré’s AI Lab team had built previously: DeepDive, which took as inputs logical rules from users about how they would build machine-learning models. DeepDive became a company called Lattice Data, later acquired by Apple.
The new system – the engine for Snorkel AI – quickly proved its value. As a dramatic example, in a collaboration with Stanford Hospital, the research team showed that data labeling that had previously taken person-years to complete could be done in just hours with the technology underlying Snorkel AI.
Snorkel AI Surfaces
After working on the core system for years, the team launched Snorkel AI as a business, emerging from stealth last month.
“We needed a user interface that was reliable and professionally maintained,” Ratner says. The business’s flagship product, Snorkel Flow, enables customers to input data-related expertise into a system – think of it as the Snorkel AI “engine” – that uses it to generate the most accurate algorithms and models possible.
Customers are already working with Snorkel AI. Major banks are using the system to extract and classify key information from massive volumes of legal documents, and getting up to 99.1% accuracy, Ratner says. Google uses the Snorkel system to improve its internal algorithms. Other Snorkel applications include self-driving-car data and genomics.
Beyond commercial applications, the new technologies can drive large humanitarian impact, Ratner says. For example, the Stanford researchers previously used the underlying system to identify the sex of hundreds of thousands of people harmed by medical devices – important information the FDA doesn’t release. Currently, the system is being used to aid with COVID patient triaging.
“We’re looking to expand into verticals like healthcare, including insurance and pharma,” Ratner says. “We’ve had a ton of interest.”
Still, he cautions, “No technology works everywhere. We will continue to figure out where it’s useful and where it isn’t.”
In the end, Ratner says the team’s vision continues to be “about putting together the data scientists, machine-learning engineers, subject matter experts, and business stakeholders to really empower the ‘human blueprint’ in the most efficient, effective way.”
Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.