AI Audit Challenge

2023 Awards and Finalists

After careful examination by the independent jury, we decided to recognize four submissions with horizontal awards and to highlight the unique characteristics of the projects that stood out.

Award for the Greatest Potential: Auditbot

This tool, which audits third-party ML APIs, was built and open-sourced by combining two ideas: (1) system status pages, which engineers use to monitor the uptime of APIs, and (2) model cards, which ML researchers use to report on the performance of ML models. The result is an AI audit bot: it runs once a week, uses an open dataset to evaluate the performance of Google’s sentiment API, and publishes its results online as a timeline of metrics. Extending this type of prototype to cover a wide range of proprietary AI APIs can give researchers and practitioners valuable data and alerts about how the models behind them are changing over time.
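The recurring-audit idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the awarded project's code: `AuditRecord`, `run_audit`, `accuracy_dropped`, the keyword-based `toy_sentiment_api`, and the four-example dataset are all hypothetical stand-ins, and no real API is called.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Sequence, Tuple


@dataclass
class AuditRecord:
    """One point on the published timeline of metrics."""
    run_date: date
    accuracy: float
    n_examples: int


def run_audit(predict: Callable[[str], str],
              dataset: Sequence[Tuple[str, str]],
              run_date: date) -> AuditRecord:
    """Score a sentiment classifier on a labeled dataset for one audit run."""
    correct = sum(1 for text, label in dataset if predict(text) == label)
    return AuditRecord(run_date, correct / len(dataset), len(dataset))


def accuracy_dropped(timeline: List[AuditRecord], tolerance: float = 0.05) -> bool:
    """Flag a run whose accuracy fell by more than `tolerance` since the previous run."""
    if len(timeline) < 2:
        return False
    return timeline[-2].accuracy - timeline[-1].accuracy > tolerance


# Hypothetical stand-in for a remote sentiment API: a keyword rule, not a real service.
def toy_sentiment_api(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"


# Tiny made-up evaluation set; a real audit would load an open labeled dataset.
toy_dataset = [
    ("A good film", "positive"),
    ("Really good service", "positive"),
    ("Terrible experience", "negative"),
    ("Not what I hoped for", "negative"),
]

# Each scheduled run appends one record to the timeline.
timeline = [run_audit(toy_sentiment_api, toy_dataset, date(2023, 1, 2))]
```

In a deployment, a scheduler would call `run_audit` weekly and append each record to a published timeline, with `accuracy_dropped` serving as a simple alert when the model behind the API appears to have changed.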

Name	Affiliation
Neal Lathia	Monzo Bank, UK

Award for Most Promising for Auditing LLMs: Ceteris Paribus

Ceteris Paribus is a tool for on-the-fly, interactive, and personalized discrimination auditing of large language models. Our user-friendly solution allows users with no computational expertise to run audits that are fully customized to their needs and created from scratch.

An auditor simply has to indicate their topic of interest and a list of protected groups. Without any pre-existing data, our tool can surface large numbers of examples of the language model being discriminatory toward those protected groups on the topic of interest. The auditor is intermittently asked to guide the audit in the direction they would like. They can select examples they find interesting and cross out those that are off-topic, to push the tool to find more of the former and fewer of the latter.

Our hybrid approach allows us to merge the benefits of precise human judgment and large-scale automation. The auditor in the loop ensures that the audit is grounded in prompts that adhere to the standards of the organization requiring the audit, such as legal firms or policymakers, while automation ensures that endorsed examples can be rapidly expanded upon and that the landscape of discrimination is exhaustively explored, surfacing examples of discrimination that even the auditor would not have anticipated. Our ecosystem of widgets automates the tedious manual extraction of coarse-grained discrimination trends from individual examples.

Overall, Ceteris Paribus empowers users to take a proactive approach to discovering discrimination in language models by providing a simple and effective solution that can be tailored to the needs of any individual or organization.

Name	Affiliation
Adarsh Jeewajee (Co-Lead)	Stanford University
Edward Chen (Co-Lead)	Stanford University
Xuechen Li	Stanford University
Ransalu Senanayake	Stanford University
Mert Yuksekgonul	Stanford University
Carlos Guestrin	Stanford University

Award for Most Innovative and Empowering to the Public: End-User Audits

Audits have emerged as powerful tools for holding algorithmic systems accountable. But because AI audits are conducted by technical experts, they are necessarily limited to the hypotheses that experts think to test. End users hold the promise to expand this purview, as they inhabit spaces and witness algorithmic impacts that auditors do not. In pursuit of this goal, we propose end-user audits (system-scale audits led by non-technical users) and present an approach that supports end users in hypothesis generation, evidence identification, and results communication. Today, performing a system-scale audit requires substantial user effort to label thousands of system outputs, so we introduce a collaborative filtering technique that leverages the algorithmic system's own disaggregated training data to project from a small number of end-user labels onto the full test set. Our end-user auditing tool, IndieLabel, employs these predicted labels so that users can rapidly explore where their opinions diverge from the algorithmic system's outputs. By highlighting topic areas where the system is underperforming for the user and surfacing sets of likely error cases, the tool guides the user in authoring an audit report. In an evaluation of end-user audits on a popular comment toxicity model with non-technical users, participants both replicated issues that formal audits had previously identified and raised previously underreported issues, such as under-flagging of veiled forms of hate that perpetuate stigma and over-flagging of slurs that have been reclaimed by marginalized communities.
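To make the label-projection step more concrete, here is a minimal sketch that uses plain nearest-neighbor collaborative filtering as a stand-in. It is not the IndieLabel implementation: the function `predict_user_labels`, the annotator names, the comment IDs, and the label values are all hypothetical, and the source does not specify which collaborative filtering model is actually used.

```python
from typing import Dict


def predict_user_labels(annotator_labels: Dict[str, Dict[str, float]],
                        user_labels: Dict[str, float],
                        k: int = 2) -> Dict[str, float]:
    """Project a few end-user labels onto the items the annotators labeled.

    annotator_labels maps annotator -> {item -> label} (disaggregated labels);
    user_labels holds the small set of labels supplied by the end user.
    """
    def distance(labels: Dict[str, float]) -> float:
        # Mean absolute disagreement on items both the user and annotator labeled.
        shared = [item for item in user_labels if item in labels]
        if not shared:
            return float("inf")
        return sum(abs(user_labels[i] - labels[i]) for i in shared) / len(shared)

    # Keep the k annotators whose labels agree most with the user's.
    ranked = sorted(annotator_labels, key=lambda a: distance(annotator_labels[a]))
    neighbours = ranked[:k]

    all_items = {item for labels in annotator_labels.values() for item in labels}
    predictions: Dict[str, float] = {}
    for item in sorted(all_items - set(user_labels)):
        votes = [annotator_labels[a][item] for a in neighbours
                 if item in annotator_labels[a]]
        if votes:
            predictions[item] = sum(votes) / len(votes)
    return predictions


# Made-up toxicity labels (1 = toxic, 0 = not toxic) from three annotators.
toy_annotators = {
    "a1": {"c1": 1, "c2": 0, "c3": 1, "c4": 0},
    "a2": {"c1": 1, "c2": 0, "c3": 1, "c4": 1},
    "a3": {"c1": 0, "c2": 1, "c3": 0, "c4": 1},
}
# The end user labels only two comments; the rest are projected.
toy_user = {"c1": 1, "c2": 0}
projected = predict_user_labels(toy_annotators, toy_user)
```

Comparing `projected` against the audited system's own outputs would then point to items where the user is likely to disagree with the system, which is the kind of divergence the paragraph describes.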

Name	Affiliation
Michelle S. Lam (Lead)	Stanford University
Mitchell L. Gordon	Stanford University
Danaë Metaxa	University of Pennsylvania
Jeffrey T. Hancock	Stanford University
James A. Landay	Stanford University
Michael S. Bernstein	Stanford University

Award for Best Holistic Evaluation and Benchmarking: HateCheck

Hate speech detection models play an important role in protecting people online. When they work well, they keep online communities safe by making content moderation more effective and efficient. When they fail, however, they create serious harms, as some people are exposed to hate that is left undetected while others are restricted in their freedom of expression when innocuous content is removed. These issues are exacerbated by biases in model performance, where hate against some groups is detected less reliably than hate against other groups.

Our project, HateCheck, is a multilingual auditing tool for hate speech detection models. Typically, these models have been tested on held-out test sets using summary metrics like accuracy. This makes it difficult to pinpoint model weak spots. Instead, HateCheck introduces functional tests for hate speech detection models, to enable more targeted diagnostics of potential model weaknesses and biases. Each functional test contains test cases for different types of hate and challenging non-hate, including special cases for emoji-based hate. In total, HateCheck has 36,000+ test cases across 11 languages, with further expansions planned.
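The contrast between a single summary metric and functional tests can be illustrated with a small sketch. The names here (`FunctionalCase`, `accuracy_by_functionality`, the two functionality labels, the keyword model, and the placeholder test sentences) are hypothetical and are not taken from the HateCheck test suite.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class FunctionalCase(NamedTuple):
    functionality: str  # which capability this case probes
    text: str           # "[GROUP]" is a placeholder for a protected group
    gold_label: str     # "hateful" or "non-hateful"


def accuracy_by_functionality(model: Callable[[str], str],
                              cases: List[FunctionalCase]) -> Dict[str, float]:
    """Report accuracy separately for each functional test."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.functionality] += 1
        if model(case.text) == case.gold_label:
            hits[case.functionality] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical weak model: flags any text containing the word "hate".
def keyword_model(text: str) -> str:
    return "hateful" if "hate" in text.lower() else "non-hateful"


toy_cases = [
    FunctionalCase("direct_statement", "I hate [GROUP].", "hateful"),
    FunctionalCase("direct_statement", "[GROUP] are people I hate.", "hateful"),
    FunctionalCase("negated_statement", "I do not hate [GROUP].", "non-hateful"),
    FunctionalCase("negated_statement", "Nobody should hate [GROUP].", "non-hateful"),
]

results = accuracy_by_functionality(keyword_model, toy_cases)
# results == {"direct_statement": 1.0, "negated_statement": 0.0}
```

Overall accuracy on these four cases is 50%, which hides the fact that the toy model fails every negated case; the per-functionality breakdown makes that weak spot visible.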

Our work on HateCheck has been published at top NLP conferences and covered by major media outlets. HateCheck is a substantial innovation in how hate speech detection models are evaluated, and it supports the creation of fairer and more effective hate speech detection.

For more info, visit www.hatecheck.ai

Name	Affiliation
Paul Röttger	University of Oxford
Hannah Rose Kirk	University of Oxford
Bertie Vidgen	University of Oxford

Who are we?

We are academics, policymakers, programmers, and technologists interested in developing better tools for AI governance and in bridging the worlds of engineers and regulators, of technology and policy. We believe that to realize the greatest benefits from AI systems, they must be safe, high quality, and trustworthy—which requires that they be accountable.

What's our objective?

This challenge was born from a desire to assess AI systems to determine whether they engage in prohibited discrimination. We are keen to catalyze and build on the larger body of work that already exists to interrogate and analyze these AI systems. Unlike other challenges, we are less motivated by publishing in academic journals and instead have chosen to prioritize impact through applied investigations, tools, and demonstrations.

We're broadly interested in the use of technical tools to generate additional information about AI systems, but a particularly valuable area on which to concentrate is harmful bias in reference to protected categories.

How can we assess the fairness traits of a given system? Is it possible, for instance, to analyze how well a computer vision system performs when confronted with photos of individuals from a variety of demographic backgrounds, or to examine the generations created by an NLP system tasked with producing content about different religions or people from various socioeconomic and demographic backgrounds? What might help us to better understand how open source and deployed AI systems deal with protected characteristics and classes? Is it possible to identify indirect discrimination, through proxies and inferences? Ideally, proposed solutions should address one of the following:

Open-source models later integrated in commercial products, such as GPT-NeoX-20B, BERT, GPT-J, YOLO, and PanGu-α.
Deployed systems in use by the public and private sectors, such as COMPAS, GPT-3, and POL-INTEL.

For the successful, effective development of AI systems, it is critical that policymakers and technology developers work in tandem. As such, in addition to assembling a stellar jury, we have designed an advisory board to provide applicants with insights into how innovative AI models and legal concerns and obligations interact.

How is this project different from conferences like ACM FAccT?

While communities like FAccT focus on similar questions, this challenge is designed to directly fund the use (and demonstration) of technical tools for auditing publicly deployed systems and open source models.

In this case, our first target risk is bias and illegal discrimination. For example:

Word embeddings, which are used by language tools like Gmail’s SmartReply and Google Translate, frequently classify terms such as “football” as being inherently closer to males and “receptionist” as being closer to females.
Emotion recognition tools from Microsoft and Face++ consistently interpret images of Black people as being angrier than images of white people, even when controlling for their degree of smiling.
Computer vision software from Amazon, Microsoft, and IBM performs significantly worse on people of color.
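The word-embedding example in the list above can be made concrete with a small sketch of how such a lean might be measured. The vectors below are made-up three-dimensional toy values chosen only for illustration, not real embeddings, and `gender_lean` with its `he`/`she` anchor words is a hypothetical helper rather than an established metric.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def gender_lean(word: str, embeddings: Dict[str, List[float]],
                male: str = "he", female: str = "she") -> float:
    """Positive when `word` sits closer to `male` than to `female`."""
    vec = embeddings[word]
    return cosine(vec, embeddings[male]) - cosine(vec, embeddings[female])


# Made-up toy vectors; a real audit would load trained embeddings instead.
toy_embeddings = {
    "he": [1.0, 0.0, 0.2],
    "she": [0.0, 1.0, 0.2],
    "football": [0.8, 0.1, 0.5],
    "receptionist": [0.1, 0.9, 0.4],
}

football_lean = gender_lean("football", toy_embeddings)
receptionist_lean = gender_lean("receptionist", toy_embeddings)
```

With these toy values, `football_lean` is positive and `receptionist_lean` is negative, mirroring the pattern described in the first bullet; an audit tool would run the same comparison over real embeddings and a list of occupation or activity terms.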

The outputs we are interested in are software, code, and/or tools that allow people to test publicly available algorithms and deployed models for illegal bias and discrimination, in ways that are useful and actionable for the people most likely to use such tools—namely, regulators, civil society, and journalists.

How will the challenge work?

The challenge invites teams to submit models, solutions, datasets, and tools to improve people’s ability to audit AI systems for illegal discrimination. Submissions can either assess commercially deployed AI systems or open source AI systems that are known to be used within industry (e.g., the BERT language model). They can also be standalone applications. Examples might include analysis of commercial AI APIs for purposes as varied as computer vision, speech recognition, text generation, and facial recognition, offered by companies such as Amazon, Microsoft, OpenAI, and Google. Submissions could also involve the analysis of datasets and AI models which are understood to be inputs into deployed systems—for instance, BERT (used by both the Google and Microsoft Bing search engines), the ImageNet dataset (used as an input into a range of computer vision systems), or the YOLO family of algorithms (used in a range of video understanding systems).

Entrants will be evaluated by our jury, with points awarded for each of the following:

Insights: What did we learn using the tool?
Alignment: How well anchored is the audit with legal and policy needs?
Impact: How many people would benefit from the tool?
Ease of use: Is the tool usable for our target audience?
Scalability: Can the tool be used at scale and/or used in different contexts?
Replicability: Can the results be replicated by other users using the same systems?
Documentation: How well-explained are the findings?
Sustainability: Is the tool financially and environmentally sustainable?

Application

Applications were open from July 11 to October 10, 2022, 11:59 pm, Pacific Daylight Time.

The challenge is open to any legal entity (including natural persons) or group of legal entities, except public administrations, across the world.* Ideas and proposals are welcome from all sources, sectors, and types of organizations, including for-profit, not-for-profit, or private companies. Applications involving several organizations and/or from various countries are also possible.

Applications closed October 10, 2022

Frequently Asked Questions

Do you have to enter in English? Yes.
Is there a limit to the number of entries? No. You can submit as many entries as you like.
Do contestants keep the intellectual property of their idea? Yes. You retain any and all IP rights to your entry, and any projects that may result from it.
Is entry to the challenge confidential? No, unless confidentiality is requested due to security, safety, or commercial concerns.
Are there strings attached to how the prize money is used? No. The prize money is yours to use as you like.

Contact Us

For questions, please contact us at algorithmicaudits@stanford.edu.

*This excludes organizations domiciled in a country or territory that would be prohibited to participate in the Challenge and/or receive grant money if declared a winner because of U.S. Department of Treasury Office of Foreign Assets Control (“OFAC”) rules, and any organization with whom a financial or other dealing with the challenge would be considered a “prohibited transaction” (defined by

$71,000 Innovation Challenge to Design Better AI Audits

Chairs

Rumman Chowdhury

Jack Clark

Rob Reich

Marietje Schaake

Advisory Board Members

Eileen Donahoe

Camille Francois

Gillian Hadfield

Jury Members

Katharina Borchert

Deep Ganguli

Stef Van Grieken

Abhishek Gupta

Verity Harding

William Isaac

Inioluwa Deborah Raji

Staff Members

Rishi Bommasani

Russell Wald

Daniel Zhang

2023 Awards and Finalists

Award for the Greatest Potential: Auditbot

Award for Most Promising for Auditing LLMs: Ceteris Paribus

Award for Most Innovative and Empowering to the Public: End-User Audits

Award for Best Holistic Evaluation and Benchmarking: HateCheck
