AI benchmarks are standardized tests used to measure and compare how well AI systems perform on specific tasks, like answering questions, recognizing images, writing code, or following instructions. They usually include a fixed set of problems and a scoring method so that different models can be evaluated consistently. Benchmarks help track progress, but they can be misleading if models “game” the test, for example by fitting its known problems rather than learning the underlying skill, or if the test doesn’t reflect real-world needs.
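The two ingredients named above, a fixed problem set and a scoring method, can be sketched in a few lines of Python. This is a toy illustration only, not any real benchmark: the problem list, the exact-match scorer, and both "models" are hypothetical and were invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical fixed problem set: (question, expected answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("3 * 5", "15"),
    ("10 - 7", "3"),
    ("8 / 2", "4"),
]


def exact_match(prediction: str, expected: str) -> bool:
    """Scoring method (assumed): credit only for an exact string match."""
    return prediction.strip() == expected.strip()


def evaluate(model: Callable[[str], str],
             problems: List[Tuple[str, str]] = BENCHMARK) -> float:
    """Run every problem through the model and return its accuracy."""
    correct = sum(exact_match(model(q), a) for q, a in problems)
    return correct / len(problems)


def addition_only_model(question: str) -> str:
    """Toy model that can really add, but does nothing else."""
    left, op, right = question.split()
    if op == "+":
        return str(int(left) + int(right))
    return "unknown"


def memorizing_model(question: str) -> str:
    """Toy model that 'games' the test by memorizing its answers."""
    return dict(BENCHMARK).get(question, "unknown")


# Same problems, same scorer, so the two scores are directly comparable.
scores = {
    "addition_only": evaluate(addition_only_model),
    "memorizing": evaluate(memorizing_model),
}

# A held-out problem the memorizing model has never seen.
held_out = [("6 + 1", "7")]
held_out_score = evaluate(memorizing_model, held_out)
```

In this sketch the memorizing model gets a perfect score on the fixed set but fails the held-out problem, while the addition-only model scores lower on the benchmark yet can solve that held-out problem. That gap is one simple way a benchmark score can mislead.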

Stanford researchers investigate why models often fail in edge-case scenarios.

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Extending the WILDS Benchmark for Unsupervised Adaptation
The widespread deployment of AI systems in critical domains demands more rigorous approaches to evaluating their capabilities and safety.
LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning

Researchers are establishing standards to validate the efficacy of AI agents in clinical settings.


In a new study, scholars measured how accurately popular AI chatbots answered questions about emerging news and found substantial regional disparity, dependence on distinct information ecosystems, and acute fragility under imperfect prompts.
