
Foundation Model Issue Brief Series

In collaboration with the Stanford Center for Research on Foundation Models (CRFM), we present a series of issue briefs on key policy issues related to foundation models—machine learning models trained on massive datasets to power an unprecedented array of applications. Drawing on the latest research and expert insights from the field, this series aims to provide policymakers and the public with a clear and nuanced understanding of these complex technologies, and to help them make informed decisions about how best to regulate and govern their development and use.

Improving Transparency in AI Language Models: A Holistic Evaluation

Rishi Bommasani, Daniel Zhang, Tony Lee, Percy Liang

Download the full brief

How to evaluate AI models is an open question, and the public lacks adequate transparency into models that now touch every aspect of our lives. Evaluation offers a way forward: by concretely measuring a model's capabilities, risks, and limitations, it provides the transparency that is critical for understanding AI models and designing better policies around them. We introduced Holistic Evaluation of Language Models (HELM) as a framework for evaluating language models across a broad range of use cases, enabling decision-makers to understand their function and impact and to ensure their design aligns with human-centered priorities and values.

Key Takeaways


➜ Transparency into AI systems is necessary for policymakers, researchers, and the public to understand these systems and their impacts. We introduced Holistic Evaluation of Language Models (HELM), a framework for benchmarking language models, as a concrete path to providing this transparency.

➜ Traditional methods for evaluating language models focus on model accuracy in specific scenarios. Since language models are already used for many different purposes (e.g., summarizing documents, answering questions, retrieving information), HELM covers a broader range of use cases and evaluates them on many relevant metrics (e.g., fairness, efficiency, robustness, toxicity).
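The scenario-by-metric structure described above can be sketched in a few lines. This is a minimal illustration, not HELM's actual code or API: the scenario names, metric names, and scoring function below are hypothetical placeholders standing in for the framework's much richer implementations.

```python
# Hypothetical sketch of a scenario-by-metric evaluation grid, in the
# spirit of HELM: every scenario is scored on every metric, rather than
# reporting accuracy alone. All names here are illustrative.

SCENARIOS = ["summarization", "question_answering", "information_retrieval"]
METRICS = ["accuracy", "fairness", "efficiency", "robustness", "toxicity"]

def score(model_name: str, scenario: str, metric: str) -> float:
    """Placeholder scorer; a real harness would run the model on the
    scenario's test instances and dispatch to metric-specific logic."""
    return 0.0  # stub value for illustration

def build_results_grid(model_name: str) -> dict:
    """Return a nested dict: scenario -> metric -> score."""
    return {
        scenario: {metric: score(model_name, scenario, metric)
                   for metric in METRICS}
        for scenario in SCENARIOS
    }

grid = build_results_grid("example-model")
```

The point of the grid shape is that no cell is optional: gaps in coverage (a model evaluated on accuracy but never on toxicity, say) become immediately visible.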

➜ In the absence of a clear standard, language model evaluation has been uneven: Different model developers evaluate on different benchmarks, meaning their models cannot be easily compared. We establish a clear head-to-head comparison by evaluating 34 prominent language models from 12 different providers (e.g., OpenAI, Google, Microsoft, Meta).

➜ HELM serves as public reporting on AI models—especially for those that are closed-access or widely deployed—empowering decision-makers to understand their function and impact and to ensure their design aligns with human-centered values.

More issue briefs to come.