


Squashing ‘Fantastic Bugs’: Researchers Look to Fix Flaws in AI Benchmarks

Date
December 08, 2025
Topics
Foundation Models
Generative AI

In evaluating thousands of benchmarks that AI developers use to assess the quality of their new models, a team of Stanford researchers finds that as many as 5% contain serious flaws with major ramifications.

Contributor(s)
Andrew Myers

Each time an AI researcher trains a new model to understand language, recognize images, or solve a medical riddle, one big question remains: Is this model better than what went before? To answer that question, AI researchers rely on batteries of benchmarks: standardized tests that measure and assess a new model’s capabilities. Benchmark scores can make or break a model. 

But there are tens of thousands of benchmarks spread across several datasets. Which ones should developers use, and are they all of equal worth? 

In a new paper presented at the Conference on Neural Information Processing Systems (NeurIPS) in December, researchers Sanmi Koyejo, assistant professor of computer science at Stanford University, and Sang Truong, a doctoral student in Koyejo’s Stanford Trustworthy AI Research (STAIR) lab, have mathematically scoured the thousands of benchmarks to reveal that as many as one in twenty are invalid. 

“Benchmarks serve a real public good,” Koyejo says. “But greater scrutiny and detail into how we build them is warranted and needs to match their growing importance in the AI community.”

Truth and Consequences

The researchers playfully refer to these flaws as “fantastic bugs” – an allusion to the “fantastic beasts” of cinema – but the consequences are producing something of a crisis of reliability in AI. “Mistakes in benchmarks have a huge influence on the industry,” Koyejo says. 

Flawed benchmarks can seriously harm a model’s score – falsely promoting underperforming models and wrongly penalizing better-performing ones. They can also have more insidious but far-reaching effects, as developers often rely on benchmark scores to make critical funding, research, and resource allocation decisions, which can lead them to wrongly focus resources on less-capable models or withhold releasing models based on untrustworthy scores.

Koyejo and Truong now hope to work with benchmarking organizations to correct or remove flawed benchmarks to restore reliability and fairness in benchmark scoring and thereby improve model development and rankings across the board. 

Fantastic bugs can take many forms – outright errors, mismatched labeling, ambiguous or culturally biased questions, logical inconsistencies, and even formatting errors that lead to correct answers being scored as incorrect. For example, in one benchmark where the correct answer was “$5,” the system incorrectly graded answers such as “5 dollars” and “$5.00” as wrong. These faulty scores have serious consequences for the models and the developers, the researchers say. In one example provided in the paper, the model DeepSeek-R1 was ranked third lowest among competing models using unrevised benchmarks and rose to second place after the benchmark was updated.
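The “$5” example above comes down to strict string matching in the grading harness. The sketch below, a hypothetical illustration rather than the benchmark’s actual code, shows how an exact-match grader marks equivalent answers wrong, and how a simple normalization step (parsing the answers to a number before comparing) avoids it:

```python
def strict_match(prediction: str, gold: str) -> bool:
    # Naive exact-string comparison, as in the flawed benchmark.
    return prediction.strip() == gold.strip()

def normalized_match(prediction: str, gold: str) -> bool:
    # Parse both answers to a number before comparing, so that
    # "$5", "5 dollars", and "$5.00" are all treated as equal.
    def to_number(text: str):
        text = text.lower().replace("dollars", "").replace("dollar", "")
        text = text.replace("$", "").replace(",", "").strip()
        try:
            return float(text)
        except ValueError:
            return None  # not a numeric answer
    p, g = to_number(prediction), to_number(gold)
    return p is not None and p == g

gold = "$5"
strict_match("5 dollars", gold)      # False: correct answer scored wrong
normalized_match("5 dollars", gold)  # True
normalized_match("$5.00", gold)      # True
```

Real graders must handle far more formats (fractions, units, free text), but the principle is the same: compare meanings, not raw strings.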

AI Entomology

To unearth these fantastic bugs, Koyejo and Truong used longstanding statistical methods based in measurement theory to highlight outlier questions where unusually large numbers of models were stumbling. They then used a large language model (LLM) to evaluate and provide justification for flagging certain benchmarks for additional human review. 

“Our statistics-plus-AI framework effectively reduces the time of the human review by identifying a majority of truly problematic questions,” Truong says. To that end, the approach achieved 84% precision in discovering flawed questions across nine popular AI benchmarks. “That is, more than eight in ten questions flagged for review had demonstrable flaws,” Truong notes.
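The first, statistical stage of this pipeline can be sketched in a few lines. The version below is a simplified illustration under assumed inputs (a binary model-by-question results matrix), using a z-score heuristic on per-question failure rates; the paper’s measurement-theory methods are more sophisticated, and the function name is hypothetical:

```python
import numpy as np

def flag_outlier_questions(results: np.ndarray, z_threshold: float = 2.0):
    """Flag questions that unusually many models get wrong.

    results: binary matrix with rows = models, columns = questions;
    1 means the model answered that question correctly.
    Returns the indices of flagged questions.
    """
    # Fraction of models that fail each question.
    failure_rates = 1.0 - results.mean(axis=0)
    # Standardize failure rates across questions.
    mu, sigma = failure_rates.mean(), failure_rates.std()
    z = (failure_rates - mu) / (sigma + 1e-12)  # guard against zero spread
    # Questions far above the typical failure rate go to LLM/human review.
    return np.where(z > z_threshold)[0]

# Toy example: 10 models, 20 questions; every model fails question 5.
results = np.ones((10, 20))
results[:, 5] = 0
flag_outlier_questions(results)  # flags question 5
```

In the paper’s framework, flagged questions are not discarded automatically; they are passed to an LLM for justification and then to human reviewers, which is where the 84% precision figure is measured.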

The researchers are now working with benchmark developers to address the flaws, advocating a shift from the present-day “publish-and-forget” approach toward an era of continuing stewardship. Reaction to their work has been “mixed,” Koyejo says: Most developers acknowledge the need for more reliable measurements but are often reluctant to commit to continuous improvement. 

In encouraging benchmarking organizations to adopt their framework and address these concerns, Koyejo and Truong hope to see a significant enhancement in the standard of benchmarks used globally as a path to improved AI overall. This improvement is expected to lead to more accurate model evaluations, better resource allocation, and a general boost in the trust and credibility of AI systems. 

“As AI continues to integrate deeper into various sectors,” Koyejo says, “the impact of these changes could be profound, driving innovations and ensuring safer, more reliable, and more powerful AI.”

This work was partially funded by the Stanford Institute for Human-Centered AI. Contributing authors include Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Perera, Chibuike Uwakwe, and Stanford Graduate School of Education faculty members Benjamin W. Domingue and Nick Haber.