What are AI Benchmarks?

AI benchmarks are standardized tests used to measure and compare how well AI systems perform on specific tasks, such as answering questions, recognizing images, writing code, or following instructions. A benchmark typically pairs a fixed set of problems with a scoring method so that different models can be evaluated consistently. Benchmarks help track progress, but they can be misleading if models “game” the test or if the test doesn’t reflect real-world needs.
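The two ingredients named above, a fixed problem set and a scoring method, can be illustrated with a minimal sketch. The problems, the exact-match scoring rule, and the two toy “models” here are all hypothetical, invented for illustration; real benchmarks use far larger problem sets and more careful scoring.

```python
# Minimal sketch of a benchmark: a fixed problem set plus a scoring
# method, so that different models can be evaluated consistently.

# Fixed set of problems: (question, expected answer) pairs.
PROBLEMS = [
    ("2 + 2", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of 'hot'?", "cold"),
]

def score(model, problems):
    """Scoring method: exact-match accuracy over the fixed problem set."""
    correct = sum(
        1 for question, expected in problems
        if model(question).strip().lower() == expected.lower()
    )
    return correct / len(problems)

# Two toy "models" (hypothetical stand-ins for real AI systems),
# each answering the same fixed questions.
def model_a(question):
    answers = {"2 + 2": "4", "Capital of France?": "Paris"}
    return answers.get(question, "unknown")

def model_b(question):
    answers = {"2 + 2": "4", "Capital of France?": "Paris",
               "Opposite of 'hot'?": "cold"}
    return answers.get(question, "unknown")

print(f"model_a: {score(model_a, PROBLEMS):.2f}")  # 0.67
print(f"model_b: {score(model_b, PROBLEMS):.2f}")  # 1.00
```

Because both models face the same questions and the same scoring rule, their scores are directly comparable; the caveat in the paragraph above still applies, since a model could memorize these three answers without any general ability.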

AI Benchmarks mentioned at Stanford HAI

Explore Similar Terms

Predictive Analytics | Artificial Intelligence (AI)

See Full List of Terms & Definitions

Better Benchmarks for Safety-Critical AI Applications
Nikki Goth Itoi | May 27 | News | Machine Learning

Stanford researchers investigate why models often fail in edge-case scenarios.
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala | Dec 27 | Research
Extending the WILDS Benchmark for Unsupervised Adaptation
Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Lev | Apr 24 | Research
Sanmi Koyejo | Beyond Benchmarks: Building a Science of AI Measurement
Seminar | Mar 19, 2025, 12:00 PM - 1:15 PM | Sciences (Social, Health, Biological, Physical)

The widespread deployment of AI systems in critical domains demands more rigorous approaches to evaluating their capabilities and safety.
LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning
Neel Guha, Daniel E. Ho, Julian Nyarko | Sep 14 | Research
Stanford Develops Real-World Benchmarks for Healthcare AI Agents
Scott Hadly | Sep 15 | News | Healthcare

Researchers are establishing standards to validate the efficacy of AI agents in clinical settings.
‘We are Stanford’: Open Minds Event Honors Staff
Stanford Report | Mar 31 | Media Mention | Ethics, Equity, Inclusion

Stanford University President Jon Levin highlights Stanford’s pivotal role in shaping the future of AI, pointing to Stanford HAI as a leader in advancing its ethical development and deployment.