Squashing ‘Fantastic Bugs’: Researchers Look to Fix Flaws in AI Benchmarks

Date
December 08, 2025
Topics
Foundation Models
Generative AI

After evaluating thousands of the benchmarks that AI developers use to assess the quality of their new models, a team of Stanford researchers reports that 5% could have serious flaws with major ramifications.

Each time an AI researcher trains a new model to understand language, recognize images, or solve a medical riddle, one big question remains: Is this model better than what came before? To answer that question, AI researchers rely on batteries of benchmarks, or standardized tests that measure a new model’s capabilities. Benchmark scores can make or break a model.

But there are tens of thousands of benchmarks spread across several datasets. Which ones should developers use, and are they all of equal worth?

In a new paper presented at the Conference on Neural Information Processing Systems (NeurIPS) in December, researchers Sanmi Koyejo, assistant professor of computer science at Stanford University, and Sang Truong, a doctoral student in Koyejo’s Stanford Trustworthy AI Research (STAIR) lab, mathematically scoured thousands of benchmarks and found that as many as one in twenty are invalid.

“Benchmarks serve a real public good,” Koyejo says. “But greater scrutiny and detail into how we build them is warranted and needs to match their growing importance in the AI community.”

Truth and Consequences

The researchers playfully refer to these flaws as “fantastic bugs” – an allusion to the “fantastic beasts” of cinema – but the consequences are producing something of a crisis of reliability in AI. “Mistakes in benchmarks have a huge influence on the industry,” Koyejo says. 

Flawed benchmarks can seriously distort a model’s score – falsely promoting underperforming models and wrongly penalizing better-performing ones. They can also have more insidious, far-reaching effects: developers often rely on benchmark scores to make critical funding, research, and resource allocation decisions, so faulty scores can lead them to focus resources on less-capable models or to withhold releasing models based on untrustworthy scores.

Koyejo and Truong now hope to work with benchmarking organizations to correct or remove flawed benchmarks to restore reliability and fairness in benchmark scoring and thereby improve model development and rankings across the board. 

Fantastic bugs can take many forms – outright errors, mismatched labeling, ambiguous or culturally biased questions, logical inconsistencies, and even formatting errors that lead to correct answers being scored as incorrect. For example, in one benchmark where the correct answer was “$5,” the system incorrectly graded answers such as “5 dollars” and “$5.00” as wrong. These faulty scores have serious consequences for the models and the developers, the researchers say. In one example provided in the paper, the model DeepSeek-R1 was ranked third lowest among competing models using unrevised benchmarks and rose to second place after the benchmark was updated.
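
To see why such formatting mismatches slip through, consider the difference between an exact-string grader and one that normalizes answers before comparing them. The sketch below is purely illustrative; the function names and the normalization rule are assumptions, not the grading code of any actual benchmark.

```python
# Minimal sketch (assumed, not any benchmark's real grader): exact string
# matching rejects equivalent answers; a hypothetical normalizer accepts them.
import re

def exact_match(prediction: str, reference: str) -> bool:
    """Naive grading: only a character-for-character match counts as correct."""
    return prediction.strip() == reference.strip()

def normalized_match(prediction: str, reference: str) -> bool:
    """Hypothetical fix: compare numeric values, ignoring '$', 'dollars', etc."""
    def to_number(text: str):
        cleaned = re.sub(r"[^\d.]", "", text)  # keep only digits and the decimal point
        return float(cleaned) if cleaned else None
    return to_number(prediction) == to_number(reference)

reference = "$5"
for answer in ["$5", "5 dollars", "$5.00"]:
    print(answer, exact_match(answer, reference), normalized_match(answer, reference))
# exact_match accepts only "$5"; normalized_match accepts all three answers.
```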

AI Entomology

To unearth these fantastic bugs, Koyejo and Truong used longstanding statistical methods grounded in measurement theory to highlight outlier questions where unusually large numbers of models were stumbling. They then used a large language model (LLM) to evaluate the flagged questions and provide justification for sending certain ones on for additional human review.
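
As a rough, self-contained illustration of the statistical half of that idea (the paper’s actual framework rests on measurement theory and an LLM judge, neither of which is reproduced here), one could flag questions whose failure rate across models is an outlier:

```python
# Simplified sketch (an assumption, not the authors' framework): flag questions
# that an unusually large share of models get wrong, for LLM and human review.
import statistics

def flag_outlier_questions(results, z_threshold=2.0):
    """results maps question_id -> list of 0/1 scores, one per model.
    Returns question_ids whose failure rate is an unusually high outlier."""
    failure_rates = {qid: 1 - sum(s) / len(s) for qid, s in results.items()}
    mean = statistics.mean(failure_rates.values())
    stdev = statistics.pstdev(failure_rates.values()) or 1e-9  # avoid divide-by-zero
    return [qid for qid, rate in failure_rates.items()
            if (rate - mean) / stdev > z_threshold]

results = {
    "q1": [1, 1, 0, 1, 1], "q2": [1, 0, 1, 1, 1], "q3": [1, 1, 1, 0, 1],
    "q4": [0, 1, 1, 1, 1], "q5": [1, 1, 1, 1, 0],
    "q6": [0, 0, 0, 0, 0],  # every model fails: a candidate "fantastic bug"
}
print(flag_outlier_questions(results))  # -> ['q6'], queued for closer review
```

Questions flagged this way would then go to the LLM for a written justification and finally to a human reviewer, mirroring the statistics-plus-AI pipeline Truong describes.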

“Our statistics-plus-AI framework effectively reduces the time of the human review by identifying a majority of truly problematic questions,” Truong says. In testing, the approach achieved 84% precision in discovering flawed questions across nine popular AI benchmarks. “That is, more than eight-in-ten questions flagged for review had demonstrable flaws,” Truong notes.

The researchers are now working with benchmark developers to address the flaws, advocating a shift away from the present-day “publish-and-forget” approach toward one of continuing stewardship. Reaction to their work has been “mixed,” Koyejo says: most acknowledge the need for more reliable measurements but are often reluctant to commit to continuous improvement.

By encouraging benchmarking organizations to adopt their framework and address these concerns, Koyejo and Truong hope to significantly raise the standard of benchmarks used globally, as a path to improved AI overall. That improvement, they expect, would lead to more accurate model evaluations, better resource allocation, and a general boost in the trust and credibility of AI systems.

“As AI continues to integrate deeper into various sectors,” Koyejo says, “the impact of these changes could be profound, driving innovations and ensuring safer, more reliable, and more powerful AI.”

This work was partially funded by the Stanford Institute for Human-Centered AI. Contributing authors include Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Perera, Chibuike Uwakwe, and Stanford Graduate School of Education faculty members Benjamin W. Domingue and Nick Haber.

Contributor(s)
Andrew Myers

Related News

Smart Enough to Do Math, Dumb Enough to Fail: The Hunt for a Better AI Test
Andrew Myers
Feb 02, 2026
News

A Stanford HAI workshop brought together experts to develop new evaluation methods that assess AI's hidden capabilities, not just its test-taking performance.

AI Leaders Discuss How To Foster Responsible Innovation At TIME100 Roundtable In Davos
TIME
Jan 21, 2026
Media Mention

HAI Senior Fellow Yejin Choi discussed responsible AI model training at Davos, asking, “What if there could be an alternative form of intelligence that really learns … morals, human values from the get-go, as opposed to just training LLMs on the entirety of the internet, which actually includes the worst part of humanity, and then we then try to patch things up by doing ‘alignment’?”

Stanford’s Yejin Choi & Axios’ Ina Fried
Axios
Jan 19, 2026
Media Mention

Axios chief technology correspondent Ina Fried speaks to HAI Senior Fellow Yejin Choi at Axios House in Davos during the World Economic Forum.