Issue Brief

Improving Transparency in AI Language Models: A Holistic Evaluation

Date: February 28, 2023
Topics: Machine Learning, Foundation Models

Abstract

This brief introduces Holistic Evaluation of Language Models (HELM), a framework for evaluating language models across the many use cases and metrics relevant to their real-world applications.

Key Takeaways

  • Transparency into AI systems is necessary for policymakers, researchers, and the public to understand these systems and their impacts. We introduced Holistic Evaluation of Language Models (HELM), a framework for benchmarking language models, as a concrete path to providing this transparency.

  • Traditional methods for evaluating language models focus on model accuracy in specific scenarios. Since language models are already used for many different purposes (e.g., summarizing documents, answering questions, retrieving information), HELM covers a broader range of use cases and evaluates each against many relevant metrics (e.g., fairness, efficiency, robustness, toxicity); a minimal sketch of this scenario-by-metric grid follows this list.

  • In the absence of a clear standard, language model evaluation has been uneven: Different model developers evaluate on different benchmarks, meaning their models cannot be easily compared. We establish a clear head-to-head comparison by evaluating 34 prominent language models from 12 different providers (e.g., OpenAI, Google, Microsoft, Meta).

  • HELM serves as public reporting on AI models—especially for those that are closed-access or widely deployed—empowering decision-makers to understand their function and impact and to ensure their design aligns with human-centered values.
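
To make the scenario-by-metric grid and the head-to-head comparison concrete, here is a minimal, hypothetical Python sketch: each model is scored on every use case (scenario) under several metrics rather than on accuracy alone, and the identical grid is run over multiple models. Everything in it (the Scenario class, the evaluate helper, the toy metrics, and the placeholder models) is illustrative only; it is not the HELM codebase, API, or actual results.

```python
# Hypothetical sketch of holistic evaluation: score every model on every
# use case (scenario) under several metrics, not accuracy alone.
# Every name and number here is illustrative; this is not the HELM API or its results.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    name: str            # a use case, e.g., summarization or question answering
    prompts: List[str]   # evaluation instances drawn from that use case


Model = Callable[[str], str]           # maps a prompt to generated text
Metric = Callable[[List[str]], float]  # maps a model's outputs to a score


def evaluate(model: Model,
             scenarios: List[Scenario],
             metrics: Dict[str, Metric]) -> Dict[str, Dict[str, float]]:
    """Fill in the (scenario, metric) grid of scores for a single model."""
    grid: Dict[str, Dict[str, float]] = {}
    for scenario in scenarios:
        outputs = [model(prompt) for prompt in scenario.prompts]
        grid[scenario.name] = {name: metric(outputs) for name, metric in metrics.items()}
    return grid


# Toy scenarios and toy metrics standing in for real benchmarks and measurements.
scenarios = [
    Scenario("question_answering", ["Q: What is the capital of France?"]),
    Scenario("summarization", ["Summarize: The committee met twice this year and voted to ..."]),
]
metrics = {
    "accuracy": lambda outputs: 1.0 if any("Paris" in text for text in outputs) else 0.0,
    "verbosity": lambda outputs: sum(len(text.split()) for text in outputs) / len(outputs),
}

# Head-to-head comparison: run the identical grid over several placeholder models.
placeholder_models: Dict[str, Model] = {
    "provider-a/model-1": lambda prompt: "Paris is the capital of France.",
    "provider-b/model-2": lambda prompt: "I am not sure about that.",
}
comparison = {name: evaluate(model, scenarios, metrics)
              for name, model in placeholder_models.items()}

for name, grid in comparison.items():
    print(name, grid["question_answering"])
```

In HELM itself, the grid pairs real use cases with metrics such as fairness, efficiency, robustness, and toxicity, and the same evaluation is applied to all 34 models so that their results are directly comparable.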

Introduction

Artificial intelligence (AI) language models are everywhere. People can talk to their smartphone through voice assistants like Siri and Cortana. Consumers can play music, turn up thermostats, and check the weather through smart speakers, like Google Nest or Amazon Echo, that likewise use language models to process commands. Online translation tools help people traveling the world or learning a new language. Algorithms flag “offensive” and “obscene” comments on social media platforms. The list goes on.

The rise of language models, like the text generation tool ChatGPT, is just the tip of the iceberg in the larger paradigm shift toward foundation models—machine learning models, including language models, trained on massive datasets to power an unprecedented array of applications. Their meteoric rise is only surpassed by their sweeping impact: They are reconstituting established industries like web search, transforming practices in classroom education, and capturing widespread media attention. Consequently, characterizing these models is a pressing social matter: If an AI-powered content moderation tool that flags toxic online comments cannot distinguish between offensive and satirical uses of the same word, it could censor speech from marginalized communities.

But how to evaluate foundation models is an open question. The public lacks adequate transparency into these models, from the code underpinning the model to the training and testing data used to bring it into the world. Evaluation presents a way forward by concretely measuring the capabilities, risks, and limitations of foundation models.

In our paper, written by a 50-person team at the Stanford Center for Research on Foundation Models (CRFM), we propose a framework, Holistic Evaluation of Language Models (HELM), to address the lack of transparency for language models. HELM implements these comprehensive assessments, yielding results that researchers, policymakers, the broader public, and other stakeholders can use.

Authors
  • Rishi Bommasani
  • Daniel Zhang
  • Tony Lee
  • Percy Liang

Related Publications

Beyond DeepSeek: China's Diverse Open-Weight AI Ecosystem and Its Policy Implications
Caroline Meinhardt, Sabina Nong, Graham Webster, Tatsunori Hashimoto, Christopher Manning
Issue Brief | Deep Dive | Dec 16, 2025
Topics: Foundation Models; International Affairs, International Security, International Development

Almost one year after the “DeepSeek moment,” this brief analyzes China’s diverse open-model ecosystem and examines the policy implications of their widespread global diffusion.

Validating Claims About AI: A Policymaker’s Guide
Olawale Salaudeen, Anka Reuel, Angelina Wang, Sanmi Koyejo
Policy Brief | Quick Read | Sep 24, 2025
Topics: Foundation Models; Privacy, Safety, Security

This brief proposes a practical validation framework to help policymakers separate legitimate claims about AI systems from unsupported claims.

Policy Implications of DeepSeek AI’s Talent Base
Amy Zegart, Emerson Johnston
Policy Brief | Quick Read | May 06, 2025
Topics: International Affairs, International Security, International Development; Foundation Models; Workforce, Labor

This brief presents an analysis of Chinese AI startup DeepSeek’s talent base and calls for U.S. policymakers to reinvest in competing to attract and retain global AI talent.

What Makes a Good AI Benchmark?
Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, Mykel Kochenderfer
Policy Brief | Quick Read | Dec 11, 2024
Topics: Foundation Models; Privacy, Safety, Security

This brief presents a novel assessment framework for evaluating the quality of AI benchmarks and scores 24 benchmarks against the framework.