
AI’s Ostensible Emergent Abilities Are a Mirage

Date: May 08, 2023
Topics: Natural Language Processing, Machine Learning
Image credit: iStock/Will Petri

According to Stanford researchers, large language models are not greater than the sum of their parts.

For a few years now, tech leaders have been touting AI’s supposed emergent abilities: the possibility that beyond a certain threshold of complexity, large language models (LLMs) begin doing unpredictable things. If we can harness that capacity, the story goes, AI might be able to solve some of humanity’s biggest problems. But unpredictability is also scary: Could making a model bigger unleash a completely unpredictable and potentially malevolent actor into the world?

That concern is widespread in the tech industry. Indeed, a recently publicized open letter signed by more than 1,000 tech leaders calls for a six-month pause on giant AI experiments as a way to step back from “the dangerous race to ever-larger unpredictable black-box models with emergent capabilities.”

But according to a new paper, we can perhaps put that particular concern about AI to bed, says lead author Rylan Schaeffer, a second-year graduate student in computer science at Stanford University. “With bigger models, you get better performance,” he says, “but we don’t have evidence to suggest that the whole is greater than the sum of its parts.”

Indeed, as he and his colleagues Brando Miranda, a Stanford PhD student, and Sanmi Koyejo, an assistant professor of computer science, show, the perception of AI’s emergent abilities is an artifact of the metrics used to evaluate the models. “The mirage of emergent abilities only exists because of the programmers' choice of metric,” Schaeffer says. “Once you investigate by changing the metrics, the mirage disappears.”

Finding the Mirage

Schaeffer began wondering if AI’s alleged emergent abilities were real while attending a lecture describing them. “I noticed in the lecture that many claimed emergent abilities seemingly appeared when researchers used certain very specific ways of evaluating those models,” he says.

Specifically, these metrics evaluated the performance of smaller models more harshly, making it appear as if novel and unpredictable abilities were arising as the models got bigger. Indeed, graphs of these metrics display a sharp change in performance at a particular model size – which is why emergent properties are sometimes called “sharp left turns.”

But when Schaeffer and his colleagues used other metrics that measured the abilities of smaller and larger models more fairly, the leap attributed to emergent properties disappeared. In the paper, published April 28 on the preprint service arXiv, the team examined 29 different metrics for evaluating model performance. Twenty-five of them show no emergent properties; instead, they reveal continuous, linear growth in model abilities as model size grows.

And there are simple explanations for why the other four metrics incorrectly suggest the existence of emergent properties. “They’re all sharp, deforming, non-continuous metrics,” Schaeffer says. “They are very harsh judges.” Indeed, using the metric known as “exact string match,” even a simple math problem will appear to develop emergent abilities at scale, Schaeffer says. For example, imagine doing an addition problem and making an error in a single digit of the answer. The exact string match metric views that mistake as being just as bad as an answer that is wrong in every digit. The result: a disregard for the ways that small models gradually improve as they scale up, and the appearance that large models make great leaps ahead.
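To see how a harsh metric can manufacture the appearance of emergence, consider a minimal numerical sketch (our illustration, with assumed numbers, not code or data from the paper). Suppose a model’s per-digit accuracy on 10-digit addition improves smoothly with scale. A lenient metric that credits each correct digit improves smoothly too, but exact string match, which requires all 10 digits to be right at once, looks like a sudden leap:

```python
import numpy as np

# Illustrative sketch with assumed numbers (not from the paper): suppose a
# model's per-digit accuracy on 10-digit addition improves smoothly and
# linearly as model size grows.
sizes = np.logspace(7, 11, num=9)           # 10M .. 100B parameters (hypothetical)
per_digit = np.linspace(0.50, 0.99, num=9)  # assumed smooth gain in per-digit accuracy

# Exact string match gives credit only when every digit is correct, so
# (treating digits as independent) its score is the per-digit accuracy
# raised to the power of the answer length.
num_digits = 10
exact_match = per_digit ** num_digits

for n, p, e in zip(sizes, per_digit, exact_match):
    print(f"{n:15,.0f} params | digits correct: {p:.2f} | exact match: {e:.3f}")
```

Under these assumed numbers, the fraction of digits correct climbs linearly from 0.50 to 0.99, while exact string match stays below 0.06 for the first five model sizes and then shoots up to 0.90: a sharp jump created entirely by the choice of metric.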

Schaeffer and his colleagues had also noticed that no one had claimed that large vision models exhibit emergent properties. As it turns out, vision researchers don’t rely on the harsh metrics common in natural language research. When Schaeffer applied the harsh metrics to a vision model, voilà, the mirage of emergence appeared.

Artificial General Intelligence Will Be Foreseeable

This is the first time an in-depth analysis has shown that the highly publicized story of LLMs’ emergent abilities springs from the use of harsh metrics. But it’s not the first time anyone has hinted at that possibility. Google’s recent paper “Beyond the Imitation Game” suggested that metrics might be the issue. And after Schaeffer’s paper came out, a research scientist working on LLMs at OpenAI tweeted that the company has made similar observations. 

What it means for the future is this: We don’t need to worry about accidentally stumbling onto artificial general intelligence (AGI). Yes, AGI may still have huge consequences for human society, Schaeffer says, “but if it emerges, we should be able to see it coming.”


Contributor: Katharine Miller

