Policy Brief

What Makes a Good AI Benchmark?

Date: December 11, 2024
Topics: Foundation Models; Privacy, Safety, Security
Abstract

This brief presents a novel assessment framework for evaluating the quality of AI benchmarks and scores 24 benchmarks against the framework.

Key Takeaways

  • The rapid advancement and proliferation of AI systems, including foundation models, has catalyzed the widespread adoption of AI benchmarks—yet only very limited research to date has evaluated the quality of AI benchmarks in a structured manner.

  • We reviewed benchmarking literature and interviewed expert stakeholders to define what makes a high-quality benchmark, and developed a novel assessment framework for evaluating AI benchmarks based on 46 criteria across five benchmark life-cycle phases.

  • In scoring 24 AI benchmarks, we found large quality differences between them, including those widely relied on by developers and policymakers. Most benchmarks are highest quality at the design stage and lowest quality at the implementation stage.

  • Policymakers should encourage developers, companies, civil society groups, and government organizations to articulate benchmark quality when conducting or relying on AI model evaluations and consult best practices for minimum quality assurance. 

Executive Summary

The rapid advancement and proliferation of AI systems, and in particular foundation models (FMs), has made AI evaluation crucial for assessing model capabilities and risks. AI model evaluations currently include both internal approaches—such as privately testing models on proprietary data—and external approaches—such as scoring models on public benchmarks. Researchers and practitioners alike have adopted AI benchmarks as a standard practice for comparing different models, measuring their performance, tracking their progress, and identifying their weaknesses.

Yet, no studies to date have assessed the quality of AI benchmarks, both FM and non-FM, in a structured manner. Further, no comparative analyses have examined how quality differs across the benchmark life cycle among widely used AI benchmarks. This leaves a significant gap for practitioners who rely on these benchmarks to select models for downstream tasks and for policymakers who are increasingly integrating benchmarking into their AI policy apparatuses.

Our paper, “BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices,” develops an assessment framework that considers 46 best practices across a benchmark’s life cycle, drawing on expert interviews and domain literature. We evaluate 24 AI benchmarks—16 FM and 8 non-FM benchmarks—against this framework, noting quality differences across the two types of benchmarks. Looking forward, we propose a minimum quality assurance checklist to support test developers seeking to adopt best practices. We further make publicly available a living repository of benchmark assessments at betterbench.stanford.edu.

This research aims to help make AI evaluations more transparent and empower benchmark developers to improve benchmark quality. We hope to inspire developers, companies, civil society groups, and policymakers to actively consider benchmark quality differences, articulate best practices, and collectively move toward standardizing benchmark development and reporting.

Introduction

Benchmarks are used in a variety of fields—from environmental quality to bioinformatics—to test and compare the performance of different systems or tools. In the context of AI, we adopt the definition of a benchmark as “a particular combination of a dataset or sets […], and a metric, conceptualized as representing one or more specific tasks or sets of abilities, picked up by a community of researchers as a shared framework for the comparison of methods.” Despite AI benchmarks having become standard practice, there are still vast inconsistencies in what these benchmarks measure and how the measurements are used. Because past work has already focused on the limitations of existing benchmarks and on specific proposals for data curation and documentation for AI benchmarks, our work aims to offer practical insights and propose a rigorous framework that empowers developers to assess and enhance benchmark quality.

To understand what makes a high-quality, effective benchmark, we extracted core themes from benchmarking literature in fields beyond AI and conducted unstructured interviews with representatives from five stakeholder groups, including more than 20 policymakers, model developers, benchmark developers, model users, and AI researchers. The core themes include:

  • Designing benchmarks for downstream utility, for example, by making benchmarks situation- and use-case-specific.

  • Ensuring validity, for example, by outlining how to collect and interpret evidence.

  • Prioritizing score interpretability, for example, by stating evaluation goals and presenting results as inputs for decision-making, not absolutes.

  • Guaranteeing accessibility, for example, by providing data and scripts for others to reproduce results.

Based on this review, we define a high-quality AI benchmark as one that is interpretable, clear about its intended purpose and scope, and usable. We also identified a five-stage benchmark life cycle and paired each benchmark stage with criteria we could use for our quality assessments:

  1. Design: 14 criteria (e.g., Have domain experts been involved in the development?)

  2. Implementation: 11 criteria (e.g., Is the evaluation script available?)

  3. Documentation: 19 criteria (e.g., Is the applicable license specified?)

  4. Maintenance: 3 criteria (e.g., Is a feedback channel available for users?)

  5. Retirement: no criteria (only suggested best practices in our paper, since we cannot evaluate the retirement of active benchmarks)

We used this scoring system to assess 16 FM benchmarks (including MMLU, HellaSwag, GSM8K, ARC Challenge, BOLD, WinoGrande, and TruthfulQA) and 8 non-FM benchmarks (including Procgen, WordCraft, FinRL-Meta, and MedMNIST v2) according to each criterion, assigning 15 (fully meeting criterion), 10 (partially meeting), 5 (mentioning without fulfilling), or 0 (neither referencing nor satisfying). At least two authors independently reviewed each benchmark and reached consensus on all final scores.
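
To make the scoring scheme concrete, the following Python sketch shows how per-criterion scores on the 15/10/5/0 scale might be aggregated into stage-level averages for a single benchmark. The stage labels mirror the life cycle above, but the criterion wording, the example scores, and the averaging step are illustrative assumptions rather than the paper's exact aggregation procedure.

    # Illustrative sketch (not the paper's exact method): aggregate per-criterion
    # scores into stage-level averages for one benchmark. Stage labels follow the
    # life cycle above; criterion names and scores are hypothetical.
    from dataclasses import dataclass
    from statistics import mean

    VALID_SCORES = {15, 10, 5, 0}  # fully met, partially met, mentioned only, absent

    @dataclass
    class CriterionScore:
        stage: str      # "design", "implementation", "documentation", or "maintenance"
        criterion: str  # short description of the best practice checked
        score: int      # one of VALID_SCORES

    def stage_averages(scores: list[CriterionScore]) -> dict[str, float]:
        """Average the per-criterion scores within each life-cycle stage."""
        by_stage: dict[str, list[int]] = {}
        for s in scores:
            if s.score not in VALID_SCORES:
                raise ValueError(f"invalid score {s.score} for criterion: {s.criterion}")
            by_stage.setdefault(s.stage, []).append(s.score)
        return {stage: mean(vals) for stage, vals in by_stage.items()}

    # Hypothetical scores for a single benchmark:
    example = [
        CriterionScore("design", "Domain experts involved in development", 15),
        CriterionScore("design", "Intended use case and scope stated", 10),
        CriterionScore("implementation", "Evaluation script publicly available", 0),
        CriterionScore("documentation", "Applicable license specified", 15),
        CriterionScore("maintenance", "Feedback channel available for users", 5),
    ]
    print(stage_averages(example))  # e.g., {'design': 12.5, 'implementation': 0, ...}

In the full framework, each benchmark is scored against all the criteria listed above, and at least two reviewers reconcile their assessments before a final score is recorded.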

Authors
  • Anka Reuel
  • Amelia Hardy
  • Chandler Smith
  • Max Lamparth
  • Malcolm Hardy
  • Mykel Kochenderfer

Related Publications

Policy Implications of DeepSeek AI’s Talent Base
Amy Zegart, Emerson Johnston
Policy Brief | Quick Read | May 06, 2025
Topics: International Affairs, International Security, International Development; Foundation Models; Workforce, Labor

This brief presents an analysis of Chinese AI startup DeepSeek’s talent base and calls for U.S. policymakers to reinvest in competing to attract and retain global AI talent.

Safeguarding Third-Party AI Research
Kevin Klyman, Shayne Longpre, Sayash Kapoor, Rishi Bommasani, Percy Liang, Peter Henderson
Policy Brief | Feb 13, 2025
Topics: Privacy, Safety, Security; Regulation, Policy, Governance

This brief examines the barriers to independent AI evaluation and proposes safe harbors to protect good-faith third-party research.

Response to U.S. AI Safety Institute’s Request for Comment on Managing Misuse Risk For Dual-Use Foundation Models
Rishi Bommasani, Alexander Wan, Yifan Mai, Percy Liang, Daniel E. Ho
Response to Request | Sep 09, 2024
Topics: Regulation, Policy, Governance; Foundation Models; Privacy, Safety, Security

In this response to the U.S. AI Safety Institute’s (US AISI) request for comment on its draft guidelines for managing the misuse risk for dual-use foundation models, scholars from Stanford HAI, the Center for Research on Foundation Models (CRFM), and the Regulation, Evaluation, and Governance Lab (RegLab) urge the US AISI to strengthen its guidance on reproducible evaluations and third-party evaluations, as well as clarify guidance on post-deployment monitoring. They also encourage the institute to develop similar guidance for other actors in the foundation model supply chain and for non-misuse risks, while ensuring the continued open release of foundation models absent evidence of marginal risk.

How Persuasive is AI-Generated Propaganda?
Josh A. Goldstein, Jason Chao, Shelby Grossman, Alex Stamos, Michael Tomz
Policy Brief | Sep 03, 2024
Topics: Democracy; Foundation Models

This brief presents the findings of an experiment that measures how persuasive AI-generated propaganda is compared to foreign propaganda articles written by humans.