AI’s Ostensible Emergent Abilities Are a Mirage | Stanford HAI

Date: May 08, 2023
Topics: Natural Language Processing, Machine Learning
iStock/Will Petri

According to Stanford researchers, large language models are not greater than the sum of their parts.

For a few years now, tech leaders have been touting AI’s supposed emergent abilities: the possibility that beyond a certain threshold of complexity, large language models (LLMs) are doing unpredictable things. If we can harness that capacity, AI might be able to solve some of humanity’s biggest problems, the story goes. But unpredictability is also scary: Could making a model bigger unleash a completely unpredictable and potentially malevolent actor into the world?

That concern is shared by many in the tech industry. Indeed, a recently publicized open letter signed by more than 1,000 tech leaders calls for a six-month pause on giant AI experiments as a way to step back from “the dangerous race to ever-larger unpredictable black-box models with emergent capabilities.”

But a new paper suggests we can perhaps put that particular concern about AI to bed, says lead author Rylan Schaeffer, a second-year graduate student in computer science at Stanford University. “With bigger models, you get better performance,” he says, “but we don’t have evidence to suggest that the whole is greater than the sum of its parts.”

Indeed, as he and his colleagues Brando Miranda, a Stanford PhD student, and Sanmi Koyejo, an assistant professor of computer science, show, the perception of AI’s emergent abilities depends on the metrics used to evaluate the models. “The mirage of emergent abilities only exists because of the programmers’ choice of metric,” Schaeffer says. “Once you investigate by changing the metrics, the mirage disappears.”

Finding the Mirage

Schaeffer began wondering if AI’s alleged emergent abilities were real while attending a lecture describing them. “I noticed in the lecture that many claimed emergent abilities seemingly appeared when researchers used certain very specific ways of evaluating those models,” he says.

Specifically, these metrics more harshly evaluated the performance of smaller models, making it appear as if novel and unpredictable abilities are arising as the models get bigger. Indeed, graphs of these metrics display a sharp change in performance at a particular model size – which is why emergent properties are sometimes called “sharp left turns.”
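One way to see how a harsh metric can produce such a curve is the rough sketch below. It assumes, purely for illustration, that each token of an answer is correct independently with the same probability, so that an all-or-nothing score on an answer of a given length is that probability raised to the length. The accuracy values, the answer length, and the name `exact_match_rate` are made up for this example and are not taken from the paper.

```python
# Hypothetical per-token accuracies for five increasingly large models.
# These values are invented for illustration; they are not measurements.
per_token_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90]

# Assumed number of tokens in the target answer.
ANSWER_LENGTH = 10


def exact_match_rate(p: float, length: int) -> float:
    """Chance that all `length` tokens are right, assuming independent tokens."""
    return p ** length


exact_match = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token_accuracy]

for p, em in zip(per_token_accuracy, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact-match rate {em:.4f}")
```

Under these assumptions, per-token accuracy rises in equal steps, yet the all-or-nothing score stays below 3 percent for the first three models and then climbs to roughly 35 percent for the last one. The underlying improvement is gradual; only the scoring rule makes it look like a sudden jump.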

But when Schaeffer and his colleagues used other metrics that measured the abilities of smaller and larger models more fairly, the leap attributed to emergent properties was gone. In the paper, published April 28 on the preprint server arXiv, Schaeffer and his colleagues looked at 29 different metrics for evaluating model performance. Twenty-five of them show no emergent properties. Instead, they reveal a continuous, linear growth in model abilities as model size grows.

And there are simple explanations for why the other four metrics incorrectly suggest the existence of emergent properties. “They’re all sharp, deforming, non-continuous metrics,” Schaeffer says. “They are very harsh judges.” Indeed, using the metric known as “exact string match,” even a simple math problem will appear to develop emergent abilities at scale, Schaeffer says. For example, imagine doing an addition problem and making an error that’s off by one digit. The exact string match metric will view that mistake as being just as bad as an error that’s off by one billion digits. The result: a disregard for the ways that small models gradually improve as they scale up, and the appearance that large models make great leaps ahead. 
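The addition example can be made concrete with a small sketch. It compares exact string match against a hypothetical partial-credit score, here called `digit_accuracy`, that counts how many digit positions are right. That second function is an illustrative stand-in and not necessarily one of the metrics used in the paper; the sum and the wrong answers are likewise invented for the example.

```python
def exact_string_match(predicted: str, target: str) -> float:
    """All-or-nothing score: 1.0 only if the strings are identical."""
    return 1.0 if predicted == target else 0.0


def digit_accuracy(predicted: str, target: str) -> float:
    """Hypothetical partial-credit score: fraction of matching digit positions.

    The shorter string is right-aligned and padded with spaces, so any
    missing digit counts as a mismatch.
    """
    width = max(len(predicted), len(target))
    padded_pred = predicted.rjust(width)
    padded_target = target.rjust(width)
    matches = sum(a == b for a, b in zip(padded_pred, padded_target))
    return matches / width


target = str(2357 + 3555)  # "5912"

# Illustrative model outputs: one nearly right, one wrong in every digit.
for answer in ["5912", "5913", "1048"]:
    print(
        answer,
        exact_string_match(answer, target),
        digit_accuracy(answer, target),
    )
```

Exact string match gives the near miss "5913" and the badly wrong "1048" the same score of zero, while the partial-credit score separates them (0.75 versus 0.0). A metric of the second kind can register the gradual improvement of smaller models that the first kind hides.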

Schaeffer and his colleagues had also noticed that no one has claimed that large vision models exhibit emergent properties. As it turns out, vision researchers don’t use the harsh metrics used by natural language researchers. When Schaeffer applied the harsh metrics to a vision model, voilà, the mirage of emergence appeared.

Artificial General Intelligence Will Be Foreseeable

This is the first time an in-depth analysis has shown that the highly publicized story of LLMs’ emergent abilities springs from the use of harsh metrics. But it’s not the first time anyone has hinted at that possibility. Google’s recent paper “Beyond the Imitation Game” suggested that metrics might be the issue. And after Schaeffer’s paper came out, a research scientist working on LLMs at OpenAI tweeted that the company has made similar observations. 

What it means for the future is this: We don’t need to worry about accidentally stumbling onto artificial general intelligence (AGI). Yes, AGI may still have huge consequences for human society, Schaeffer says, “but if it emerges, we should be able to see it coming.”

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.

Contributor(s)
Katharine Miller
