Improving Transparency in AI Language Models: A Holistic Evaluation
Rishi Bommasani, Daniel Zhang, Tony Lee, Percy Liang
In this brief, Stanford scholars introduce Holistic Evaluation of Language Models (HELM), a framework for evaluating commercial applications of AI language models.
Foundation Model Issue Brief Series
In collaboration with the Stanford Center for Research on Foundation Models (CRFM), we present a series of issue briefs on key policy issues related to foundation models—machine learning models trained on massive datasets to power an unprecedented array of applications. Drawing on the latest research and expert insights from the field, this series aims to provide policymakers and the public with a clear and nuanced understanding of these complex technologies, and to help them make informed decisions about how best to regulate and govern their development and use.
➜ Transparency into AI systems is necessary for policymakers, researchers, and the public to understand these systems and their impacts. We introduced Holistic Evaluation of Language Models (HELM), a framework for benchmarking language models, as a concrete path to providing this transparency.
➜ Traditional methods for evaluating language models focus on model accuracy in specific scenarios. Because language models are already used for many different purposes (e.g., summarizing documents, answering questions, retrieving information), HELM covers a broader range of use cases and evaluates models on many relevant metrics (e.g., fairness, efficiency, robustness, toxicity).
➜ In the absence of a clear standard, language model evaluation has been uneven: Different model developers evaluate on different benchmarks, meaning their models cannot be easily compared. We establish a clear head-to-head comparison by evaluating 34 prominent language models from 12 different providers (e.g., OpenAI, Google, Microsoft, Meta).
➜ HELM serves as public reporting on AI models—especially for those that are closed-access or widely deployed—empowering decision-makers to understand their function and impact and to ensure their design aligns with human-centered values.
Artificial intelligence (AI) language models are everywhere. People can talk to their smartphones through voice assistants like Siri and Cortana. Consumers can play music, turn up thermostats, and check the weather through smart speakers, like Google Nest or Amazon Echo, that likewise use language models to process commands. Online translation tools help people traveling the world or learning a new language. Algorithms flag “offensive” and “obscene” comments on social media platforms. The list goes on.
The rise of language models, like the text generation tool ChatGPT, is just the tip of the iceberg in the larger paradigm shift toward foundation models: machine learning models, including language models, trained on massive datasets to power an unprecedented array of applications. Their meteoric rise is surpassed only by their sweeping impact. They are reconstituting established industries like web search, transforming practices in classroom education, and capturing widespread media attention. Consequently, characterizing these models is a pressing social matter: If an AI-powered content moderation tool that flags toxic online comments cannot distinguish between offensive and satirical uses of the same word, it could censor speech from marginalized communities.
But how to evaluate foundation models is an open question. The public lacks adequate transparency into these models, from the code underpinning the model to the training and testing data used to bring it into the world. Evaluation presents a way forward by concretely measuring the capabilities, risks, and limitations of foundation models.
In our paper, written by a 50-person team at the Stanford Center for Research on Foundation Models (CRFM), we propose a framework, Holistic Evaluation of Language Models (HELM), to address the lack of transparency for language models. HELM implements comprehensive assessments of these models, yielding results that researchers, policymakers, the broader public, and other stakeholders can use.