Stanford Spin-Out Snorkel AI Solves a Major Data Problem

Date: August 18, 2020
Topics: Machine Learning
Courtesy of Snorkel AI

An idea born out of the Stanford AI Lab offers an unprecedented way to create data for AI-driven systems.

The problem has been data – not always simply a lack of it, but a lack of the right kind.

As Alex Ratner worked on his PhD in computer science at Stanford, he and his advisor, Christopher Ré, associate professor of computer science in the Stanford AI Lab and a Stanford Institute for Human-Centered Artificial Intelligence faculty member, became interested in the data challenges encountered by those training and applying new machine-learning models.

The researchers observed that even the most sophisticated academic and industry partners and scientists – the Googles and Microsofts of the world – were struggling with a key aspect of new machine-learning technology: the large volumes of hand-labeled “training data” that these machine-learning models learned from. Teams had built high-performance algorithms and models to do everything from classifying chest X-rays to analyzing legal documents much more accurately than humans alone, but their success highlighted how much data went into creating the systems, and how much more data they needed to deliver greater accuracy.

“It was a bottleneck and pain point so many were facing,” Ré says. “The data doesn’t just ‘show up’ out of nowhere.” Indeed, experts have to label large volumes of data, such as radiologists identifying which X-rays are more likely to indicate cancer, to help AI models develop effective algorithms.

Out of their insights emerged an ambitious Stanford AI Lab project to generate the right kind of data more efficiently for machine-learning systems. Specifically, they would ask experts to supply rules of thumb for data labeling, which would then enable the systems themselves to label and use large volumes of data quickly. This would improve system accuracy and applicability, while enabling subject matter experts to contribute to, and largely drive, the machine-learning development process with unprecedented efficiency.
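
The open-source Snorkel project that grew out of this Stanford AI Lab work expresses such rules of thumb as “labeling functions.” The sketch below is a minimal, hypothetical example of what an expert’s heuristic might look like in that library; the report text field, keywords, and label names are assumptions for illustration, not details from the article.

from snorkel.labeling import labeling_function

ABSTAIN, NORMAL, SUSPICIOUS = -1, 0, 1

@labeling_function()
def lf_mentions_mass(x):
    # A radiologist's rule of thumb: reports mentioning a "mass" or "nodule" lean suspicious.
    text = x.text.lower()
    return SUSPICIOUS if ("mass" in text or "nodule" in text) else ABSTAIN

@labeling_function()
def lf_no_findings(x):
    # Reports explicitly stating "no acute findings" lean normal.
    return NORMAL if "no acute findings" in x.text.lower() else ABSTAIN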

In July 2020, Snorkel AI, the Palo Alto business based on that research, emerged out of stealth, and has already raised $15 million from high-profile investors like Greylock and partnered with customers across sectors. Ratner is the firm’s CEO (and a University of Washington assistant professor), working with fellow Stanford co-founders Paroma Varma, Braden Hancock, and Henry Ehrenberg. Ré serves as a board member.

Label It a (Data) Problem

If machine-learning systems are vehicles, data are their vital fuel. These systems need training data on which to learn, Ratner says, and the burden has shifted from developing systems to getting the data needed to create and improve them.

Humans are a core part of the data process. Chest X-rays, legal documents, or financial-service transactions – “unstructured data” in machine-learning terms – must be labeled by experts so that the models can understand the data better, including what aspects matter most, and create and validate rules more easily: This type of X-ray could mean cancer, this loan looks more likely to default.

But data labeling can be a time-intensive, expensive process. Labeling X-rays, for example, can be a months-long process, a heavy ask for busy experts.

“We decided to treat the process of creating training data as a first-class machine-learning problem,” Ré says. “Soon we realized this could change how people built applications in the machine-learning and AI space, changing the entire workflow.”

Getting It Out of Their Heads

The trick was tapping that expertise much more efficiently.

“We wanted to enable subject-matter experts to label and build these training datasets to teach the model in ways that were higher-level and more efficient,” Ratner says. Rather than ask a doctor to label an X-ray “yes” or “no” for cancer and expect the algorithm to figure out the working rule, doctors can specify what on the X-ray led to their conclusions: “I looked for a blob here and a jagged edge there.”

In this way, experts provide the rules they work by – not just labels – and the machine-learning system analyzes those rules in aggregate, removing “noise” from the experts’ inputs, to build a higher-powered algorithm with more nuanced functions. The system can then effectively label data itself, a much faster, more efficient process.
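
In the open-source Snorkel library, that aggregation step applies every rule to the unlabeled data and then fits a label model that estimates how accurate each rule is, turning their noisy votes into training labels. A minimal sketch, continuing the hypothetical labeling functions above and assuming a pandas DataFrame df_reports with a text column:

from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

lfs = [lf_mentions_mass, lf_no_findings]  # the experts' rules of thumb from above

# Apply every rule to every unlabeled report; each rule may vote or abstain (-1).
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df_reports)  # df_reports is an assumed pandas DataFrame of report text

# Estimate each rule's accuracy from agreements and disagreements alone,
# then combine the noisy votes into probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L_train)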

“We aimed to ‘upskill’ the way that subject matter experts interface with new machine learning technology,” Ratner says. “It’s almost insulting to ask them to throw their knowledge away and sit there and label 10,000 data points. That’s almost comical underutilization.”

Ré agrees: “The idea is to get the rule for labeling data out of the human head, even if imperfectly.” That reflected an approach similar to one in a system Ré’s AI Lab team had built previously: DeepDive, which took as inputs logical rules from users about how they would build machine-learning models. DeepDive became a company called Lattice Data, later acquired by Apple.

The new system – the engine for Snorkel AI – quickly proved its value. In one dramatic example, a collaboration with Stanford Hospital, the research team showed that data labeling that had previously taken person-years to complete could be done in just hours with the technology underlying Snorkel AI.

Snorkel AI Surfaces

After working on the core system for years, the team launched Snorkel AI as a business, emerging from stealth last month.

“We needed a user interface that was reliable and professionally maintained,” Ratner says. The business’s flagship product, Snorkel Flow, enables customers to input data-related expertise into a system – think of it as the Snorkel AI “engine” – that uses it to generate the most accurate algorithms and models possible.

Customers are already working with Snorkel AI. Major banks are using the system to extract and classify key information from massive volumes of legal documents, and getting up to 99.1% accuracy, Ratner says. Google uses the Snorkel system to improve its internal algorithms. Other Snorkel applications include self-driving-car data and genomics.

Beyond commercial applications, the new technologies can drive significant humanitarian impact, Ratner says. For example, the Stanford researchers previously used the underlying system to identify the sex of hundreds of thousands of people harmed by medical devices – important information the FDA doesn’t release. Currently, the system is being used to aid with COVID-19 patient triage.

“We’re looking to expand into verticals like healthcare, including insurance and pharma,” Ratner says. “We’ve had a ton of interest.”

Still, he cautions, “No technology works everywhere. We will continue to figure out where it’s useful and where it isn’t.”

In the end, Ratner says the team’s vision continues to be “about putting together the data scientists, machine-learning engineers, subject matter experts, and business stakeholders to really empower the ‘human blueprint’ in the most efficient, effective way.”


Contributor(s): Sachin Waikar
Related
  • Jayodita Sanghvi and Grace Tang: Big data meets big business
    the ​Stanford Engineering Staff
    Nov 13
    news
    Your browser does not support the video tag.

    The last decade has seen an explosion in the collection and processing of data. Now, the era of big data is making its way into the business world, with important implications.

Related News

AI Leaders Discuss How To Foster Responsible Innovation At TIME100 Roundtable In Davos
TIME
Jan 21, 2026
Media Mention

HAI Senior Fellow Yejin Choi discussed responsible AI model training at Davos, asking, “What if there could be an alternative form of intelligence that really learns … morals, human values from the get-go, as opposed to just training LLMs on the entirety of the internet, which actually includes the worst part of humanity, and then we then try to patch things up by doing ‘alignment’?” 

Media Mention
Your browser does not support the video tag.

AI Leaders Discuss How To Foster Responsible Innovation At TIME100 Roundtable In Davos

TIME
Ethics, Equity, InclusionGenerative AIMachine LearningNatural Language ProcessingJan 21

HAI Senior Fellow Yejin Choi discussed responsible AI model training at Davos, asking, “What if there could be an alternative form of intelligence that really learns … morals, human values from the get-go, as opposed to just training LLMs on the entirety of the internet, which actually includes the worst part of humanity, and then we then try to patch things up by doing ‘alignment’?” 

Stanford’s Yejin Choi & Axios’ Ina Fried
Axios
Jan 19, 2026
Media Mention

Axios chief technology correspondent Ina Fried speaks to HAI Senior Fellow Yejin Choi at Axios House in Davos during the World Economic Forum.

Media Mention
Your browser does not support the video tag.

Stanford’s Yejin Choi & Axios’ Ina Fried

Axios
Energy, EnvironmentMachine LearningGenerative AIEthics, Equity, InclusionJan 19

Axios chief technology correspondent Ina Fried speaks to HAI Senior Fellow Yejin Choi at Axios House in Davos during the World Economic Forum.

Spatial Intelligence Is AI’s Next Frontier
TIME
Dec 11, 2025
Media Mention

"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.

Media Mention
Your browser does not support the video tag.

Spatial Intelligence Is AI’s Next Frontier

TIME
Computer VisionMachine LearningGenerative AIDec 11

"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.