AI Benchmarks Hit Saturation

Date: April 03, 2023
Topics: Machine Learning

AI continues to surpass human performance; it’s time to reevaluate our tests.

How good is AI? According to most of the technical performance benchmarks we have today, it’s nearly perfect. But that doesn’t mean most artificial intelligence tools work the way we want them to, says Vanessa Parli, associate director of research programs at the Stanford Institute for Human-Centered AI and a member of the AI Index steering committee.

She cites the current popular example of ChatGPT. “There’s been a lot of excitement, and it meets some of these benchmarks quite well,” she said. “But when you actually use the tool, it gives incorrect answers, says things we don’t want it to say, and is still difficult to interact with.”

In the newest AI Index, published on April 3, a team of independent researchers analyzed over 50 benchmarks in vision, language, speech, and more and found that AI tools now score extremely high on many of these evaluations.

“Most of the benchmarks are hitting a point where we cannot do much better, 80-90% accuracy,” she said. “We really need to be thinking about how we, as humans and society, want to interact with AI, and develop new benchmarks from there.”

In this conversation, Parli explains more about the benchmarking trends she sees from the AI Index.

What do you mean by benchmark?

A benchmark is essentially a goal for the AI system to hit. It’s a way of defining what you want your tool to do, and then working toward that goal. One example is HAI Co-Director Fei-Fei Li’s ImageNet, a dataset of over 14 million images. Researchers run their image classification algorithms on ImageNet as a way to test their system. The goal is to correctly identify as many of the images as possible. 
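
In code, scoring a classification benchmark like ImageNet reduces to comparing a model’s predictions against ground-truth labels. The sketch below is purely illustrative; the class names and miniature evaluation set are hypothetical stand-ins, not ImageNet data.

```python
# Minimal sketch of how a classification benchmark is scored: compare each
# prediction against the ground-truth label and report top-1 accuracy.
# The class names and examples below are hypothetical, not ImageNet data.

def top1_accuracy(predictions, labels):
    """Fraction of examples where the predicted class matches the true label."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs on a tiny evaluation set.
preds = ["tabby cat", "golden retriever", "sports car", "tabby cat"]
truth = ["tabby cat", "golden retriever", "pickup truck", "tabby cat"]

print(f"Top-1 accuracy: {top1_accuracy(preds, truth):.0%}")  # 75%
```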

What did the AI Index study find regarding these benchmarks?

We looked across multiple technical benchmarks that have been created over the past dozen years – around vision, around language, etc. – and evaluated the state-of-the-art result in each benchmark year over year. So, for each benchmark, were researchers able to beat the score from last year? Did they meet it? Or was there no progress at all? We looked at ImageNet, a language benchmark called SuperGLUE, a hardware benchmark called MLPerf, and more; some 50 were analyzed and over 20 made it into the report.
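
The year-over-year comparison Parli describes can be pictured as a simple calculation over each benchmark’s best reported score per year. The sketch below uses hypothetical score tables, not the AI Index’s actual data.

```python
# Hedged sketch of a year-over-year comparison: for each benchmark, take the
# best reported score per year and compute the change from the previous year.
# The values below are illustrative, not figures from the AI Index.

sota_by_year = {
    # benchmark name -> {year: best reported accuracy (%)}  (hypothetical values)
    "imagenet_top1": {2020: 90.2, 2021: 91.0, 2022: 91.1},
    "superglue":     {2020: 89.3, 2021: 90.4, 2022: 91.0},
}

for benchmark, scores in sota_by_year.items():
    years = sorted(scores)
    for prev, curr in zip(years, years[1:]):
        delta = scores[curr] - scores[prev]
        print(f"{benchmark}: {prev} -> {curr}: {delta:+.2f} percentage points")
```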

And what did you find in your research?

In earlier years, people were improving significantly on the past year’s state of the art, or best performance. This year, across the majority of the benchmarks, we saw so little progress that we decided not to include some in the report. For example, the best image classification system on ImageNet in 2021 had an accuracy rate of 91%; 2022 saw only a 0.1 percentage point improvement.

So we’re seeing a saturation among these benchmarks – there just isn’t really any improvement to be made. 

Additionally, while some benchmarks are not hitting the 90% accuracy range, they are beating the human baseline. For example, the Visual Question Answering Challenge tests AI systems with open-ended textual questions about images. This year, the top-performing model hit 84.3% accuracy; the human baseline is about 80%.
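
The two saturation signals Parli mentions, minimal year-over-year gains and scores above the human baseline, can be checked directly from the figures quoted here. In the sketch below, the 0.5-percentage-point cutoff for “minimal progress” is an assumption made for illustration.

```python
# Illustrative check of the two saturation signals discussed above, using only
# figures quoted in this article (ImageNet: 91% in 2021, +0.1pp in 2022;
# VQA: 84.3% vs. a ~80% human baseline). The 0.5pp threshold is a hypothetical
# cutoff chosen just for this sketch.

TINY_GAIN_PP = 0.5  # hypothetical threshold for "minimal progress"

imagenet_gain_pp = 91.1 - 91.0
vqa_top_score, vqa_human_baseline = 84.3, 80.0

print(f"ImageNet gain: {imagenet_gain_pp:+.1f}pp "
      f"-> minimal progress: {imagenet_gain_pp < TINY_GAIN_PP}")
print(f"VQA top model {vqa_top_score}% vs. human {vqa_human_baseline}% "
      f"-> above human baseline: {vqa_top_score > vqa_human_baseline}")
```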

What does that mean for researchers?

The takeaway for me is that perhaps we need newer and more comprehensive benchmarks to evaluate against. Another way that I think of it is this: Our AI tools right now are not exactly as we would want them to be – they give wrong information, they create sexist imagery. The question becomes, if benchmarks are supposed to help us reach a goal, what is this goal? How do we want to work with AI and how do we want AI to work with us? Perhaps we need more comprehensive benchmarks – right now, benchmarks mostly test against a single goal. But as we move toward AI tools that incorporate vision, language, and more, do we need benchmarks that help us understand the tradeoffs between accuracy and bias or toxicity, for example? Can we consider more social factors? A lot cannot be measured through quantitative benchmarks. I think this is an opportunity to reevaluate what we want from these tools.

Are researchers already beginning to build better benchmarks?

Being at Stanford HAI, home to the Center for Research on Foundation Models, I can point to HELM. HELM, developed by scholars at CRFM, looks across multiple scenarios and multiple tasks and is more comprehensive than benchmarks we have seen in the past. It considers not only accuracy, but also fairness, toxicity, efficiency, robustness, and more.
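
Conceptually, a HELM-style evaluation reports several metrics per scenario rather than a single accuracy number. The sketch below is a generic illustration of that idea, not HELM’s actual API; the scenario names, metric set, and scores are placeholders.

```python
# A minimal sketch of multi-metric evaluation in the spirit of HELM (NOT HELM's
# actual API). Each scenario is scored along several axes instead of one.

from dataclasses import dataclass

@dataclass
class ScenarioResult:
    scenario: str      # e.g., question answering, summarization
    accuracy: float    # higher is better
    toxicity: float    # rate of toxic generations; lower is better
    robustness: float  # accuracy under perturbed inputs; higher is better

def report(results):
    """Print a per-scenario, multi-metric summary instead of a single score."""
    for r in results:
        print(f"{r.scenario:<16} acc={r.accuracy:.2f}  "
              f"tox={r.toxicity:.2f}  robust={r.robustness:.2f}")

# Hypothetical scores for one model across two scenarios.
report([
    ScenarioResult("question_answer", accuracy=0.78, toxicity=0.03, robustness=0.71),
    ScenarioResult("summarization",   accuracy=0.64, toxicity=0.05, robustness=0.66),
])
```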

That’s just one example. But we need more of these approaches. Because benchmarks guide the direction of AI development, they must align more with how we, as humans and as a society, want to interact with these tools.

The AI Index is an independent initiative at Stanford HAI that is led by an interdisciplinary steering committee of experts from across academia and industry. It serves as one of the most credible and authoritative sources for data and insights about AI, giving policymakers, researchers, journalists, executives, and the general public a deeper understanding of the field.

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition. Learn more.

Author: Shana Lynch
