What are AI Benchmarks?

AI Benchmarks are standardized tests used to measure and compare how well AI systems perform on specific tasks, like answering questions, recognizing images, writing code, or following instructions. They usually include a fixed set of problems and a scoring method so that different models can be evaluated consistently. Benchmarks help track progress, but they can be misleading if models “game” the test or if the test doesn’t reflect real-world needs.
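
As a rough illustration of the definition above, the sketch below pairs a fixed problem set with a single scoring method and runs two toy "models" through it. Everything here is hypothetical and invented for illustration: the problems, the exact-match scoring rule, and both models. No real benchmark or model API is being reproduced. The held-out set is included to show how a model that has memorized the test can score well without being broadly capable.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fixed problem set: (prompt, expected answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
    ("3 * 3", "9"),
]

# Hypothetical unseen problems, used to check whether a score generalizes.
HELD_OUT: List[Tuple[str, str]] = [("5 + 7", "12")]

Model = Callable[[str], str]


def exact_match_score(model: Model, problems: List[Tuple[str, str]]) -> float:
    """Scoring method: fraction of answers that match exactly (case-insensitive)."""
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in problems
    )
    return correct / len(problems)


def lookup_model(prompt: str) -> str:
    """Toy model that has memorized the benchmark answers (it 'games' the test)."""
    return dict(BENCHMARK).get(prompt, "unknown")


def arithmetic_model(prompt: str) -> str:
    """Toy model that only solves 'a + b' and 'a * b' prompts."""
    parts = prompt.split()
    if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
        a, op, b = int(parts[0]), parts[1], int(parts[2])
        if op == "+":
            return str(a + b)
        if op == "*":
            return str(a * b)
    return "unknown"


def evaluate(models: Dict[str, Model], problems: List[Tuple[str, str]]) -> Dict[str, float]:
    """Apply the same problems and the same scoring method to every model."""
    return {name: exact_match_score(m, problems) for name, m in models.items()}


models = {"lookup": lookup_model, "arithmetic": arithmetic_model}
scores = evaluate(models, BENCHMARK)          # {'lookup': 1.0, 'arithmetic': 0.5}
held_out_scores = evaluate(models, HELD_OUT)  # {'lookup': 0.0, 'arithmetic': 1.0}
```

The memorizing model tops the benchmark but fails on the unseen problem, which is the kind of gap between test score and real-world usefulness that the definition warns about.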

AI Benchmarks mentioned at Stanford HAI

Better Benchmarks for Safety-Critical AI Applications
Nikki Goth Itoi
May 27
News

Stanford researchers investigate why models often fail in edge-case scenarios.

Machine Learning

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala
Dec 27
Research

Extending the WILDS Benchmark for Unsupervised Adaptation
Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Lev
Apr 24
Research

Sanmi Koyejo | Beyond Benchmarks: Building a Science of AI Measurement
Seminar
Mar 19, 2025, 12:00 PM - 1:15 PM

The widespread deployment of AI systems in critical domains demands more rigorous approaches to evaluating their capabilities and safety.

Sciences (Social, Health, Biological, Physical)

LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning
Neel Guha, Daniel E. Ho, Julian Nyarko
Sep 14
Research

Stanford Develops Real-World Benchmarks for Healthcare AI Agents
Scott Hadly
Sep 15
News

Researchers are establishing standards to validate the efficacy of AI agents in clinical settings.

Healthcare

HAI Student Affinity Groups Take On Society’s Emerging Questions
Madeleine Wright
Jun 26
News

Stanford students across disciplines are teaming up to tackle society’s pressing questions in the age of AI.

Arts, Humanities
Generative AI
Ethics, Equity, Inclusion
Privacy, Safety, Security