Smart Enough to Do Math, Dumb Enough to Fail: The Hunt for a Better AI Test

A Stanford HAI workshop brought together experts to develop new evaluation methods that assess AI's hidden capabilities, not just its test-taking performance.
Ask artificial intelligence to compose a sonnet or solve a complex differential equation and it will perform marvels, yet the same system might insist that 2.11 is greater than 2.9.
Incongruous and perplexing responses like these recently led a team of AI researchers—among them Olawale “Wale” Salaudeen, AI Institute Fellow in Residence at Schmidt Sciences, Sanmi Koyejo, an assistant professor of computer science at Stanford University, and Angelina Wang, an assistant professor of information science at Cornell University—to convene a workshop at the Stanford Institute for Human-Centered AI to discuss and debate better ways to measure AI’s innate capabilities and traits. The workshop was funded by Schmidt Sciences and the MacArthur Foundation.
“This effort was necessary because too often, we are imposing human traits and associated behaviors onto these non-human systems without a scientific basis to do so,” Salaudeen said. “It is quite possible, perhaps even likely, that AI models have a very different set of behavior-explaining traits than humans, even though they may mimic human behaviors.”
“Better measurement is key to safer and more reliable artificial intelligence,” Koyejo explained. “We need a measurement science for AI systems similar to fields ranging from physics to the learning sciences. Right now, it simply doesn’t exist.”
Performance Plan
Each time a new model is trained, it is subjected to a series of questions and graded on how well or poorly it answers them. But these questions fail to assess AI’s difficult-to-measure, hidden abilities, such as intelligence, logic, and reasoning. Despite the thousands of benchmark question libraries currently available, most are either flawed or not up to the challenge of measuring the subtler skills of today’s increasingly sophisticated models, Koyejo says.
The organizers say the benchmarking workshop was essential to the future of AI. It brought together experts from academia, industry, nonprofits, and policy to begin answering the looming question: “What are we actually measuring when we benchmark AI systems?”
Their larger goal was to spark a field-wide effort to develop a robust, accurate, and standard set of benchmarks to measure AI’s understanding of the answers it provides. Wang drew a parallel to the field of psychometrics, a sub-specialty of psychology that attempts to measure hidden qualities such as intelligence and reasoning.
Whereas psychometrics tests whether a human student understands the concepts behind the math rather than simply memorizing answers, current AI benchmarks only check whether the AI gets the right answers. A model may earn an “A” on the test while missing the bigger point.
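For context on that grading step, here is a minimal, hypothetical sketch of how a benchmark score is typically computed (the function and example items are illustrative, not drawn from any benchmark discussed at the workshop). The point is that the score is just a count of right answers, so a system that has memorized the answer key looks identical to one that understood the questions.

```python
# A minimal, hypothetical sketch (illustrative names only, not any real
# benchmark suite): a benchmark score is just an aggregate of right and
# wrong answers.
from typing import Callable

def benchmark_accuracy(model: Callable[[str], str],
                       items: list[tuple[str, str]]) -> float:
    """Score a model on (question, expected_answer) pairs by exact match."""
    correct = sum(1 for question, expected in items
                  if model(question).strip() == expected)
    return correct / len(items)

# A "model" that only looks answers up in a memorized table still scores
# perfectly, even though the score reveals nothing about how it got there.
memorized = {"What is 7 * 8?": "56", "Which is larger, 2.11 or 2.9?": "2.9"}
toy_model = lambda q: memorized.get(q, "I don't know")
items = [("What is 7 * 8?", "56"), ("Which is larger, 2.11 or 2.9?", "2.9")]
print(benchmark_accuracy(toy_model, items))  # prints 1.0
```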
“AI benchmarks test specific objective tasks and knowledge well, but not underlying traits and capabilities,” Wang says.
Full Measure
Building on decades of measurement science in psychology and neuroscience, the benchmarking workshop aimed to bridge this gap for AI. Attendees discussed the quality and validity of current benchmarks, explored whether AI’s latent traits could even be measured at all, and debated whether human concepts like reasoning can be applied to AI. Some of the organizers have begun to create an AI Construct Lexis as an early step toward developing for AI what the Cognitive Atlas has become for the cognitive sciences: a collaboratively created and curated knowledge base reflecting the latest and best ideas across the field.
“Early in its history, psychology faced a similar challenge with measuring seemingly unmeasurable traits,” Koyejo noted. “The field developed psychometrics that inferred ‘latent traits’ from patterns across multiple tests. AI needs a similar approach—moving from ‘Can AI pass this practical test?’ to ‘What underlying capability does the test reveal about AI?’”
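For readers unfamiliar with psychometrics, a standard textbook example of inferring a latent trait from patterns across multiple tests is the Rasch model from item response theory (offered here purely as an illustration; it was not proposed at the workshop). It models the probability that a test-taker answers a given item correctly as a function of a latent ability for the test-taker and a difficulty for the item:

\[
\Pr(X_{ij} = 1 \mid \theta_i, b_j) = \frac{1}{1 + e^{-(\theta_i - b_j)}}
\]

Estimating the ability parameter from responses to many items, rather than reporting a single pass/fail score, is roughly the kind of latent-variable measurement Koyejo is describing.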
Salaudeen offered as an example a popular workshop topic, Jingle-Jangle Fallacies, a pair of terms borrowed from psychometrics: a jingle fallacy equates two unrelated ideas because they bear the same name, while a jangle fallacy treats two essentially similar things as different because they carry different labels.
He pointed to terms such as “common sense” and “reasoning,” often used to describe AI’s comprehension when they more likely reflect AI’s ability to recognize patterns or make statistical inferences. Thus, declaring AI text generation “creative” or “intelligent” falls within the scope of jingle fallacies, while dismissing such terminology outright because AI lacks “consciousness” is a form of jangle fallacy.
One of Wang’s favorite moments of the workshop was an exercise called the “agreement spectrum,” which showed that the confusion isn’t just among machines but among the scientists as well. During the exercise, participants physically positioned themselves, walking to different areas of the room to show their relative support for or opposition to deliberately controversial “hot take” statements about AI.
The resulting human scatter plot revealed a stark truth: there is almost no consensus on how to define concepts such as AI “reasoning,” or on whether such a concept even exists as a core property of AI systems that enables the broad set of behaviors we attribute to systems that can reason. It is possible that reasoning as we conceptualize it for humans is incompatible with machines.
Future Focus
Toward the end of the workshop, organizers outlined next steps for the field. A potential outcome is a technical paper building on the discussions and insights from the workshop. Additionally, the scholars will continue to develop the atlas of AI traits and terminology.
Koyejo stressed that the long-term value of workshops like this is to produce more predictable and more reliable AI systems. “If we understand these tools better, we understand what to expect when they’re deployed in various contexts,” he said. Greater predictability could revolutionize how AI models are evaluated, optimized, and trusted in real-world applications.
In the end, the researchers agree that the potential of better measurement in AI extends far beyond academics. It could lead to AI systems that are not only more capable but also more reliable and transparent than in the past. This advancement could hasten the development of AI technologies that are safe, ethical, and more beneficial across many domains.
Against that backdrop, the workshop was an important early step: “It’s exciting to make efforts toward unifying the field in possible ways to be rigorous and technical in how we think about evaluation,” Wang said.
“This productive process helped us understand what we should be measuring and to begin to comprehend how we can measure it,” Koyejo added. “AI will be better for it.”
Additional workshop organizers included: Sang Truong, a graduate student at Stanford; Haoran Zhang, a graduate student at MIT; and Tracy Navichoque, a program manager at Stanford HAI.