


Squashing ‘Fantastic Bugs’: Researchers Look to Fix Flaws in AI Benchmarks

Date
December 08, 2025
Topics
Foundation Models
Generative AI

In evaluating thousands of benchmarks that AI developers use to assess the quality of their new models, a team of Stanford researchers finds that as many as 5% contain serious flaws with major ramifications.

Contributor(s)
Andrew Myers

Each time an AI researcher trains a new model to understand language, recognize images, or solve a medical riddle, one big question remains: Is this model better than what went before? To answer that question, AI researchers rely on batteries of benchmarks: standardized tests that measure and assess a new model’s capabilities. Benchmark scores can make or break a model. 

But there are tens of thousands of benchmarks spread across several datasets. Which ones should developers use, and are they all of equal worth? 

In a new paper presented at the Conference on Neural Information Processing Systems (NeurIPS) in December, researchers Sanmi Koyejo, assistant professor of computer science at Stanford University, and Sang Truong, a doctoral student in Koyejo’s Stanford Trustworthy AI Research (STAIR) lab, have mathematically scoured the thousands of benchmarks to reveal that as many as one in twenty are invalid. 

“Benchmarks serve a real public good,” Koyejo says. “But greater scrutiny and detail into how we build them is warranted and needs to match their growing importance in the AI community.”

Truth and Consequences

The researchers playfully refer to these flaws as “fantastic bugs” – an allusion to the “fantastic beasts” of cinema – but the consequences are producing something of a crisis of reliability in AI. “Mistakes in benchmarks have a huge influence on the industry,” Koyejo says. 

Flawed benchmarks can seriously harm a model’s score – falsely promoting underperforming models and wrongly penalizing better-performing ones. They can also have more insidious but far-reaching effects, as developers often rely on benchmark scores to make critical funding, research, and resource allocation decisions, which can lead them to wrongly focus resources on less-capable models or withhold releasing models based on untrustworthy scores.

Koyejo and Truong now hope to work with benchmarking organizations to correct or remove flawed benchmarks to restore reliability and fairness in benchmark scoring and thereby improve model development and rankings across the board. 

Fantastic bugs can take many forms – outright errors, mismatched labeling, ambiguous or culturally biased questions, logical inconsistencies, and even formatting errors that lead to correct answers being scored as incorrect. For example, in one benchmark where the correct answer was “$5,” the system incorrectly graded answers such as “5 dollars” and “$5.00” as wrong. These faulty scores have serious consequences for the models and the developers, the researchers say. In one example provided in the paper, the model DeepSeek-R1 was ranked third lowest among competing models using unrevised benchmarks and rose to second place after the benchmark was updated.
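The “$5” example above comes down to strict string matching in the grading harness. The sketch below, a hypothetical illustration rather than the benchmark’s actual code, shows how an exact-match grader marks equivalent answers wrong, and how a simple normalization step (parsing the answers to a number before comparing) avoids it:

```python
def strict_match(prediction: str, gold: str) -> bool:
    # Naive exact-string comparison, as in the flawed benchmark.
    return prediction.strip() == gold.strip()

def normalized_match(prediction: str, gold: str) -> bool:
    # Parse both answers to a number before comparing, so that
    # "$5", "5 dollars", and "$5.00" are all treated as equal.
    def to_number(text: str):
        text = text.lower().replace("dollars", "").replace("dollar", "")
        text = text.replace("$", "").replace(",", "").strip()
        try:
            return float(text)
        except ValueError:
            return None  # not a numeric answer
    p, g = to_number(prediction), to_number(gold)
    return p is not None and p == g

gold = "$5"
strict_match("5 dollars", gold)      # False: correct answer scored wrong
normalized_match("5 dollars", gold)  # True
normalized_match("$5.00", gold)      # True
```

Real graders must handle far more formats (fractions, units, free text), but the principle is the same: compare meanings, not raw strings.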

AI Entomology

To unearth these fantastic bugs, Koyejo and Truong used longstanding statistical methods based in measurement theory to highlight outlier questions where unusually large numbers of models were stumbling. They then used a large language model (LLM) to evaluate and provide justification for flagging certain benchmarks for additional human review. 

“Our statistics-plus-AI framework effectively reduces the time of the human review by identifying a majority of truly problematic questions,” Truong says. To that end, the approach achieved 84% precision in discovering flawed questions across nine popular AI benchmarks. “That is, more than eight in ten questions flagged for review had demonstrable flaws,” Truong notes.
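The first, statistical stage of this pipeline can be sketched in a few lines. The version below is a simplified illustration under assumed inputs (a binary model-by-question results matrix), using a z-score heuristic on per-question failure rates; the paper’s measurement-theory methods are more sophisticated, and the function name is hypothetical:

```python
import numpy as np

def flag_outlier_questions(results: np.ndarray, z_threshold: float = 2.0):
    """Flag questions that unusually many models get wrong.

    results: binary matrix with rows = models, columns = questions;
    1 means the model answered that question correctly.
    Returns the indices of flagged questions.
    """
    # Fraction of models that fail each question.
    failure_rates = 1.0 - results.mean(axis=0)
    # Standardize failure rates across questions.
    mu, sigma = failure_rates.mean(), failure_rates.std()
    z = (failure_rates - mu) / (sigma + 1e-12)  # guard against zero spread
    # Questions far above the typical failure rate go to LLM/human review.
    return np.where(z > z_threshold)[0]

# Toy example: 10 models, 20 questions; every model fails question 5.
results = np.ones((10, 20))
results[:, 5] = 0
flag_outlier_questions(results)  # flags question 5
```

In the paper’s framework, flagged questions are not discarded automatically; they are passed to an LLM for justification and then to human reviewers, which is where the 84% precision figure is measured.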

The researchers are now working with benchmark developers to address the flaws, advocating a shift from the present-day “publish-and-forget” approach toward an era of continuing stewardship. Reaction to their work has been “mixed,” Koyejo says: Most developers acknowledge the need for more reliable measurements but are often reluctant to commit to continuous improvement. 

In encouraging benchmarking organizations to adopt their framework and address these concerns, Koyejo and Truong hope to see a significant enhancement in the standard of benchmarks used globally as a path to improved AI overall. This improvement is expected to lead to more accurate model evaluations, better resource allocation, and a general boost in the trust and credibility of AI systems. 

“As AI continues to integrate deeper into various sectors,” Koyejo says, “the impact of these changes could be profound, driving innovations and ensuring safer, more reliable, and more powerful AI.”

This work was partially funded by the Stanford Institute for Human-Centered AI. Contributing authors include Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Perera, Chibuike Uwakwe, and Stanford Graduate School of Education faculty members Benjamin W. Domingue and Nick Haber.