Smart Enough to Do Math, Dumb Enough to Fail: The Hunt for a Better AI Test | Stanford HAI
Date
February 02, 2026
Topics
Foundation Models
Generative AI
Privacy, Safety, Security

A Stanford HAI workshop brought together experts to develop new evaluation methods that assess AI's hidden capabilities, not just its test-taking performance.

Ask artificial intelligence to compose a sonnet or solve a complex differential equation and it will perform marvels, yet it might also insist that 2.11 is greater than 2.9.

Incongruous and perplexing responses like these recently led a team of AI researchers—among them Olawale “Wale” Salaudeen, AI Institute Fellow in Residence at Schmidt Sciences, Sanmi Koyejo, an assistant professor of computer science at Stanford University, and Angelina Wang, an assistant professor of information science at Cornell University—to convene a workshop at the Stanford Institute for Human-Centered AI to discuss and debate better ways to measure AI’s innate capabilities and traits. The workshop was funded by Schmidt Sciences and the MacArthur Foundation.

“This effort was necessary because too often, we are imposing human traits and associated behaviors onto these non-human systems without a scientific basis to do so,” Salaudeen said. “It is quite possible, perhaps even likely, that AI models have a very different set of behavior-explaining traits than humans, even though they may mimic human behaviors.”

“Better measurement is key to safer and more reliable artificial intelligence,” Koyejo explained. “We need a measurement science for AI systems similar to fields ranging from physics to the learning sciences. Right now, it simply doesn't exist.”

Performance Plan

Each time a new model is trained, it is subjected to a series of questions and graded based on how well or poorly it answers them. But these questions fail to assess AI’s difficult-to-measure, hidden abilities, like intelligence, logic and reasoning. Despite thousands of libraries of benchmark questions currently available, most are either flawed or not up to the challenge of measuring the subtler skills of today’s increasingly sophisticated models, Koyejo says. 

The organizers say the benchmarking workshop was essential to the future of AI. It brought together experts from academia, industry, nonprofits, and policy to begin to answer the looming question: “What are we actually measuring when we benchmark AI systems?”

Their larger goal was to spark a field-wide effort to develop a robust, accurate and standard set of benchmarks to measure AI’s understanding of the answers it provides. Wang drew a parallel to the field of psychometrics, a sub-specialty of psychology that attempts to measure hidden qualities such as intelligence and reasoning.

Whereas psychometrics tests whether a human student understands the concepts behind the math rather than simply memorizing answers, current AI benchmarks check only whether AI gets the right answers. A model may get an "A" on the test while missing the bigger point.

“AI benchmarks test specific objective tasks and knowledge well, but not underlying traits and capabilities,” Wang says.

Full Measure

Building on decades of measurement sciences in psychology and neuroscience, the benchmarking workshop aimed to bridge this gap for AI. Attendees discussed the quality and validity of current benchmarks, explored whether AI’s latent traits could even be measured at all, and debated whether human concepts like reasoning can be applied to AI. Some of the organizers have begun to create an AI Construct Lexis as an early step toward developing for AI what the Cognitive Atlas has become for the cognitive sciences – a collaboratively created and curated knowledge base reflecting the latest and best ideas across the field.

“Early in its history, psychology faced a similar challenge with measuring seemingly unmeasurable traits,” Koyejo noted. “The field developed psychometrics that inferred ‘latent traits’ from patterns across multiple tests. AI needs a similar approach—moving from ‘Can AI pass this practical test?’ to ‘What underlying capability does the test reveal about AI?’” 
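The latent-trait idea described above can be pictured with a toy one-factor analysis. The sketch below is not the workshop's method. It uses entirely synthetic scores, and the function names (`simulate`, `latent_scores`), the four hypothetical benchmarks, and their weights are assumptions made for illustration. It also swaps in the first principal component of the correlation matrix as a simple stand-in for a proper psychometric factor model.

```python
import math
import random

def standardize(column):
    """Rescale one list of scores to mean 0 and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def first_factor_loadings(corr, iterations=200):
    """Power iteration: leading eigenvector of the correlation matrix."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def latent_scores(score_table):
    """score_table[m][t] is the score of model m on benchmark t.

    Returns (loadings, scores): one loading per benchmark and one
    estimated latent score per model.
    """
    z_columns = [standardize(list(col)) for col in zip(*score_table)]
    n = len(score_table)
    corr = [[sum(a * b for a, b in zip(ci, cj)) / n for cj in z_columns]
            for ci in z_columns]
    loadings = first_factor_loadings(corr)
    scores = [sum(l * col[m] for l, col in zip(loadings, z_columns))
              for m in range(n)]
    return loadings, scores

def simulate(n_models=200, seed=0):
    """Synthetic data (assumption): one hidden ability drives four benchmarks."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0, 1) for _ in range(n_models)]
    weights = [0.9, 0.8, 0.7, 0.6]  # hypothetical sensitivity of each benchmark
    table = [[w * a + rng.gauss(0, 0.5) for w in weights] for a in abilities]
    return abilities, table

abilities, table = simulate()
loadings, scores = latent_scores(table)
print("loadings:", [round(l, 2) for l in loadings])
print("corr(estimated, true ability):", round(pearson(scores, abilities), 2))
```

In this simulation the hidden ability is known, so the estimate can be checked against it. With real models no such ground truth exists, which is part of why the validity questions raised at the workshop are hard.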

Salaudeen offered as an example a popular workshop topic, jingle-jangle fallacies, a term borrowed from psychometrics. In a jingle fallacy, two unrelated ideas are equated because they bear the same name. In a jangle fallacy, conversely, two closely related things are treated as different because they carry different labels.

He points to terms such as "common sense" and "reasoning," often used to describe AI’s comprehension, when the behavior they describe more likely reflects AI’s ability to recognize patterns or to make statistical inferences. Thus, declarations of AI text generation as "creative" or "intelligent" fall within the scope of jingle fallacies. Conversely, unjustly dismissing such terminology because AI lacks “consciousness” is a form of jangle fallacy.

One of Wang’s favorite moments of the workshop was an exercise called the “agreement spectrum,” which showed that the confusion isn't just among machines but among the scientists as well. Participants physically positioned themselves by walking to different areas of the room to show relative support for or opposition to deliberately controversial “hot take” statements about AI.

The resulting scatter plot of humans revealed a stark truth: there is almost no current consensus on how to define concepts such as AI “reasoning,” or on whether such a concept exists as a core property of AI systems that enables the broad set of behaviors we attribute to systems that can reason. It is possible that reasoning as we conceptualize it for humans is incompatible with machines.

Future Focus

Toward the end of the workshop, organizers outlined next steps for the field. A potential outcome is a technical paper building on the discussions and insights from the workshop. Additionally, the scholars will continue to develop the atlas of AI traits and terminology.

Koyejo stressed that the long-term value of workshops like this is to produce more predictable and more reliable AI systems. “If we understand these tools better, we understand what to expect when they’re deployed in various contexts,” he said. Greater predictability could revolutionize how AI models are evaluated, optimized, and trusted in real-world applications.

In the end, the researchers agree that the potential of better measurement in AI extends far beyond academia. It could lead to AI systems that are not only more capable but also more reliable and transparent than in the past. This advancement could hasten the development of AI technologies that are safe, ethical, and more beneficial across many domains.

Against that backdrop, the workshop was an important early step: “It’s exciting to make efforts toward unifying the field in possible ways to be rigorous and technical in how we think about evaluation,” Wang said.

“This productive process helped us understand what we should be measuring and to begin to comprehend how we can measure it,” Koyejo added. “AI will be better for it.”

Additional workshop organizers included: Sang Truong, a graduate student at Stanford; Haoran Zhang, a graduate student at MIT; and Tracy Navichoque, a program manager at Stanford HAI.

Contributor(s)
Andrew Myers
