AI Benchmarks are standardized tests used to measure and compare how well AI systems perform on specific tasks, like answering questions, recognizing images, writing code, or following instructions. They usually include a fixed set of problems and a scoring method so that different models can be evaluated consistently. Benchmarks help track progress, but they can be misleading if models “game” the test or if the test doesn’t reflect real-world needs.
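The two ingredients named above, a fixed problem set and a scoring method, can be sketched in a few lines. This is an illustrative toy harness, not a real benchmark: the problems, the exact-match scoring rule, and the two stand-in "models" are all assumptions for the sake of the example.

```python
# Minimal sketch of a benchmark harness: a fixed problem set plus a scoring
# rule, so any model (here, a plain callable) can be evaluated consistently.
# The problems and models below are illustrative only.

BENCHMARK = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Opposite of 'hot'?", "answer": "cold"},
]

def score(model, benchmark):
    """Exact-match accuracy over the fixed problem set."""
    correct = sum(model(item["question"]) == item["answer"] for item in benchmark)
    return correct / len(benchmark)

# Two toy "models" compared on the same test.
def model_a(question):
    return {"2 + 2 = ?": "4", "Capital of France?": "Paris"}.get(question, "")

def model_b(question):
    return "4"  # always answers "4" -- a degenerate strategy that still scores points

print(score(model_a, BENCHMARK))  # 2 of 3 correct
print(score(model_b, BENCHMARK))  # 1 of 3 correct
```

Because both models face the same questions and the same scoring rule, their scores are directly comparable; model_b also hints at how a model can "game" a narrow test without genuinely being capable.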
Explore Similar Terms

Stanford researchers investigate why models often fail in edge-case scenarios.

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Extending the WILDS Benchmark for Unsupervised Adaptation
The widespread deployment of AI systems in critical domains demands more rigorous approaches to evaluating their capabilities and safety.
LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning

Researchers are establishing standards to validate the efficacy of AI agents in clinical settings.

Stanford University President Jon Levin highlights Stanford’s pivotal role in shaping the future of AI, pointing to Stanford HAI as a leader in advancing its ethical development and deployment.