Policy Brief

December 11, 2024

What Makes a Good AI Benchmark?

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel J. Kochenderfer

This brief presents a novel assessment framework for evaluating the quality of AI benchmarks and scores 24 benchmarks against the framework.

Key Takeaways

  • The rapid advancement and proliferation of AI systems, including foundation models, have catalyzed the widespread adoption of AI benchmarks—yet little research to date has evaluated the quality of AI benchmarks in a structured manner.
  • We reviewed benchmarking literature and interviewed expert stakeholders to define what makes a high-quality benchmark, and developed a novel assessment framework for evaluating AI benchmarks based on 46 criteria across five benchmark life-cycle phases.
  • In scoring 24 AI benchmarks, we found large quality differences between them, including those widely relied on by developers and policymakers. Most benchmarks are highest quality at the design stage and lowest quality at the implementation stage.
  • Policymakers should encourage developers, companies, civil society groups, and government organizations to assess benchmark quality when conducting or relying on AI model evaluations and to consult best practices for minimum quality assurance.

Executive Summary

The rapid advancement and proliferation of AI systems, and in particular foundation models (FMs), have made AI evaluation crucial for assessing model capabilities and risks. AI model evaluations currently include both internal approaches—such as privately testing models on proprietary data—and external approaches—such as scoring models on public benchmarks. Researchers and practitioners alike have adopted AI benchmarks as a standard practice for comparing models, measuring their performance, tracking progress, and identifying weaknesses.

Yet no studies to date have assessed the quality of AI benchmarks—both FM and non-FM—in a structured manner. Further, no comparative analyses have assessed the quality differences across the benchmark life cycle between widely used AI benchmarks. This leaves a significant gap for practitioners who may be relying on these benchmarks to select models for downstream tasks and for policymakers who are increasingly integrating benchmarking into their AI policy apparatus.

Our paper, “BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices,” develops an assessment framework that considers 46 best practices across a benchmark’s life cycle, drawing on expert interviews and domain literature. We evaluate 24 AI benchmarks—16 FM and 8 non-FM benchmarks—against this framework, noting quality differences across the two types of benchmarks. Looking forward, we propose a minimum quality assurance checklist to support test developers seeking to adopt best practices. We further make publicly available a living repository of benchmark assessments at betterbench.stanford.edu.

This research aims to help make AI evaluations more transparent and empower benchmark developers to improve benchmark quality. We hope to inspire developers, companies, civil society groups, and policymakers to actively consider benchmark quality differences, articulate best practices, and collectively move toward standardizing benchmark development and reporting.

Introduction

Benchmarks are used in a variety of fields—from environmental quality to bioinformatics—to test and compare the performance of different systems or tools. In the context of AI, we adopt the definition of a benchmark as “a particular combination of a dataset or sets […], and a metric, conceptualized as representing one or more specific tasks or sets of abilities, picked up by a community of researchers as a shared framework for the comparison of methods.” Despite AI benchmarks having become standard practice, there are still vast inconsistencies in what these benchmarks measure and how the measurements are used. Because past work has already focused on the limitations of existing benchmarks, as well as on specific proposals for data curation and documentation for AI benchmarks, our work offers practical insights and proposes a rigorous framework that empowers developers to assess and enhance benchmark quality.

To understand what makes a high-quality, effective benchmark, we extracted core themes from benchmarking literature in fields beyond AI and conducted unstructured interviews with more than 20 representatives from five stakeholder groups: policymakers, model developers, benchmark developers, model users, and AI researchers. The core themes include:

  • Designing benchmarks for downstream utility, for example, by making benchmarks situation- and use-case-specific.
  • Ensuring validity, for example, by outlining how to collect and interpret evidence.
  • Prioritizing score interpretability, for example, by stating evaluation goals and presenting results as inputs for decision-making, not absolutes.
  • Guaranteeing accessibility, for example, by providing data and scripts for others to reproduce results.

Based on this review, we define a high-quality AI benchmark as one that is interpretable, clear about its intended purpose and scope, and usable. We also identified a five-stage benchmark life cycle and paired each benchmark stage with criteria we could use for our quality assessments:

  1. Design: 14 criteria (e.g., Have domain experts been involved in the development?)
  2. Implementation: 11 criteria (e.g., Is the evaluation script available?)
  3. Documentation: 19 criteria (e.g., Is the applicable license specified?)
  4. Maintenance: 3 criteria (e.g., Is a feedback channel available for users?)
  5. Retirement: no criteria (only suggested best practices in our paper, since we cannot evaluate the retirement of active benchmarks)

We used this scoring system to assess 16 FM benchmarks (including MMLU, HellaSwag, GSM8K, ARC Challenge, BOLD, WinoGrande, and TruthfulQA) and 8 non-FM benchmarks (including Procgen, WordCraft, FinRL-Meta, and MedMNIST v2), assigning each criterion a score of 15 (fully met), 10 (partially met), 5 (mentioned but not fulfilled), or 0 (neither referenced nor satisfied). At least two authors independently reviewed each benchmark and reached consensus on all final scores.
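
As a rough illustration only, the per-criterion scoring scheme described above could be represented as in the following sketch. The stage names and criterion counts follow the brief; the example judgments, the function names, and the per-stage aggregation (a simple mean) are hypothetical and are not drawn from the paper.

```python
# Illustrative sketch of the per-criterion scoring scheme described above
# (15 = fully met, 10 = partially met, 5 = mentioned but not fulfilled,
# 0 = neither referenced nor satisfied). Example entries and the mean-based
# per-stage aggregation are hypothetical.

from statistics import mean

SCORE_VALUES = {"full": 15, "partial": 10, "mentioned": 5, "absent": 0}

# Each life-cycle stage maps to a list of per-criterion judgments for one benchmark.
example_assessment = {
    "design": ["full", "partial", "full"],          # 14 criteria in the actual framework
    "implementation": ["mentioned", "absent"],      # 11 criteria in the actual framework
    "documentation": ["full", "full", "partial"],   # 19 criteria in the actual framework
    "maintenance": ["partial"],                     # 3 criteria in the actual framework
}

def stage_scores(assessment: dict[str, list[str]]) -> dict[str, float]:
    """Average the numeric criterion scores within each life-cycle stage."""
    return {
        stage: mean(SCORE_VALUES[judgment] for judgment in judgments)
        for stage, judgments in assessment.items()
    }

if __name__ == "__main__":
    for stage, score in stage_scores(example_assessment).items():
        print(f"{stage}: {score:.1f} / 15")
```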
