AI Audit Challenge
2023 Awards and Finalists
After careful review by the independent jury, we have decided to recognize four submissions with horizontal awards and to highlight the unique characteristics of the projects that stood out.
This tool for auditing third-party ML APIs was built and open-sourced by combining two ideas: (1) system status pages, which engineers use to monitor the uptime of APIs, and (2) model cards, which ML researchers use to report on the performance of ML models. The result is an AI audit bot: it runs once a week, uses an open dataset to evaluate the performance of Google’s sentiment API, and publishes its results online as a timeline of metrics. Extending this type of prototype to cover a wide range of proprietary AI APIs could give researchers and practitioners valuable data and alerts about how the models behind them change over time.
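As a rough illustration of the audit-bot idea described above, the sketch below scores a sentiment predictor against a labeled dataset and records the result as one entry in a timeline. It is not the project's actual code: `AuditRecord`, `run_audit`, `accuracy_dropped`, and `stub_predict` are hypothetical names, and the stub predictor stands in for a call to a real third-party API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List, Tuple


@dataclass
class AuditRecord:
    """One point on the published timeline of metrics (hypothetical schema)."""
    run_date: date
    n_examples: int
    accuracy: float


def run_audit(predict: Callable[[str], str],
              dataset: Iterable[Tuple[str, str]],
              run_date: date) -> AuditRecord:
    """Score `predict` on (text, gold_label) pairs and return one timeline entry."""
    examples = list(dataset)
    if not examples:
        raise ValueError("dataset must contain at least one example")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return AuditRecord(run_date, len(examples), correct / len(examples))


def accuracy_dropped(timeline: List[AuditRecord], threshold: float = 0.05) -> bool:
    """Return True if the latest run is worse than the previous one by more than `threshold`."""
    if len(timeline) < 2:
        return False
    return timeline[-2].accuracy - timeline[-1].accuracy > threshold


def stub_predict(text: str) -> str:
    """Assumption: a keyword stub standing in for a call to a real sentiment API."""
    return "positive" if "good" in text.lower() else "negative"
```

A weekly scheduler would call `run_audit` with the same dataset each time, append the record to a stored timeline, and raise an alert when `accuracy_dropped` returns True. The 0.05 threshold is an arbitrary placeholder.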
Monzo Bank, UK
Ceteris Paribus is a tool for on-the-fly, interactive, and personalized discrimination auditing of large language models. Our user-friendly solution allows users with no computational expertise to run audits that are fully customized to their needs and created from scratch.
An auditor simply has to indicate their topic of interest and a list of protected groups. Without any pre-existing data, our tool can surface many examples of the language model being discriminatory to those protected groups on the topic of interest. The auditor is intermittently asked to guide the audit in the direction they would like. They can select examples they find interesting and cross out those that are off-topic, to push the tool to find more of the former and fewer of the latter.
Our hybrid approach combines the benefits of precise human judgment and large-scale automation. The auditor in the loop ensures that the audit is grounded in prompts that adhere to the standards of the organization requiring the audit, such as legal firms or policymakers. Automation ensures that endorsed examples can be rapidly expanded upon and that the landscape of discrimination is exhaustively explored, surfacing examples of discrimination that even the auditor would not have anticipated. Our ecosystem of widgets automates the tedious manual extraction of coarse-grained discrimination trends from individual examples.
Overall, Ceteris Paribus empowers users to take a proactive approach to discovering discrimination in language models by providing a simple and effective solution that can be tailored to the needs of any individual or organization.
Adarsh Jeewajee (Co-Lead)
Edward Chen (Co-Lead)
Audits have risen as powerful tools to hold algorithmic systems accountable. But because AI audits are conducted by technical experts, audits are necessarily limited to the hypotheses that experts think to test. End users hold the promise to expand this purview, as they inhabit spaces and witness algorithmic impacts that auditors do not. In pursuit of this goal, we propose end-user audits—system-scale audits led by non-technical users—and present an approach that supports end users in hypothesis generation, evidence identification, and results communication. Today, performing a system-scale audit requires substantial user effort to label thousands of system outputs, so we introduce a collaborative filtering technique that leverages the algorithmic system's own disaggregated training data to project from a small number of end-user labels onto the full test set. Our end-user auditing tool, IndieLabel, employs these predicted labels so that users can rapidly explore where their opinions diverge from the algorithmic system's outputs. By highlighting topic areas where the system is underperforming for the user and surfacing sets of likely error cases, the tool guides the user in authoring an audit report. In an evaluation of end-user audits on a popular comment toxicity model with non-technical users, participants both replicated issues that formal audits had previously identified and also raised previously underreported issues such as under-flagging on veiled forms of hate that perpetuate stigma and over-flagging of slurs that have been reclaimed by marginalized communities.
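To make the idea of projecting a few end-user labels onto a larger set concrete, here is a minimal neighbourhood-based collaborative filtering sketch. It is an illustrative stand-in, not IndieLabel's actual model: the function name, the similarity measure (inverse mean absolute difference), and the data layout are all assumptions.

```python
from __future__ import annotations


def predict_user_labels(annotator_labels: dict[str, dict[str, float]],
                        user_labels: dict[str, float],
                        k: int = 2) -> dict[str, float]:
    """Predict a user's ratings for unlabeled items from the k most similar annotators.

    annotator_labels maps annotator id -> {item id: rating} (disaggregated data).
    user_labels maps item id -> rating for the few items the user has labeled.
    """
    similarities = []
    for annotator, labels in annotator_labels.items():
        shared = [item for item in user_labels if item in labels]
        if not shared:
            continue  # no overlap, so similarity cannot be estimated
        mean_abs_diff = sum(abs(user_labels[i] - labels[i]) for i in shared) / len(shared)
        similarities.append((1.0 / (1.0 + mean_abs_diff), annotator))

    # Keep the k annotators who agree most with the user on the shared items.
    neighbours = sorted(similarities, reverse=True)[:k]

    all_items = {item for labels in annotator_labels.values() for item in labels}
    predictions: dict[str, float] = {}
    for item in all_items - set(user_labels):
        votes = [(sim, annotator_labels[a][item])
                 for sim, a in neighbours if item in annotator_labels[a]]
        if votes:
            total = sum(sim for sim, _ in votes)
            # Similarity-weighted average of the neighbours' ratings.
            predictions[item] = sum(sim * rating for sim, rating in votes) / total
    return predictions
```

Comparing these predicted labels with the audited system's outputs, item by item, would then point to the areas where the user is likely to disagree with the system.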
Michelle S. Lam (Lead)
Mitchell L. Gordon
University of Pennsylvania
Jeffrey T. Hancock
James A. Landay
Michael S. Bernstein
Hate speech detection models play an important role in protecting people online. When they work well, they keep online communities safe by making content moderation more effective and efficient. When they fail, however, they create serious harms, as some people are exposed to hate that is left undetected while others are restricted in their freedom of expression when innocuous content is removed. These issues are exacerbated by biases in model performance, where hate against some groups is detected less reliably than hate against other groups.
Our project, HateCheck, is a multilingual auditing tool for hate speech detection models. Typically, these models have been tested on held-out test sets using summary metrics like accuracy. This makes it difficult to pinpoint model weak spots. Instead, HateCheck introduces functional tests for hate speech detection models, to enable more targeted diagnostics of potential model weaknesses and biases. Each functional test contains test cases for different types of hate and challenging non-hate, including special cases for emoji-based hate. In total, HateCheck has 36,000+ test cases across 11 languages, with further expansions planned.
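The contrast between a single summary metric and functional tests can be sketched as follows. This is a minimal illustration, not HateCheck's own evaluation code: the field names (`functionality`, `text`, `label`), the functionality names, and the placeholder test cases are assumptions, and the keyword model is a deliberately naive stand-in for a real classifier.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def accuracy_by_functionality(test_cases: Iterable[Mapping[str, str]],
                              predict: Callable[[str], str]) -> Dict[str, float]:
    """Return the model's accuracy separately for each functional test."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for case in test_cases:
        name = case["functionality"]
        totals[name] += 1
        if predict(case["text"]) == case["label"]:
            hits[name] += 1
    return {name: hits[name] / totals[name] for name in totals}


def naive_keyword_model(text: str) -> str:
    """Assumption: a toy model that flags any text containing the word 'hate'."""
    return "hateful" if "hate" in text.lower() else "non-hateful"


# Placeholder cases; "[GROUP]" stands in for a protected-group identifier.
EXAMPLE_CASES = [
    {"functionality": "direct_hate", "text": "I hate [GROUP].", "label": "hateful"},
    {"functionality": "negated_hate", "text": "I don't hate [GROUP].", "label": "non-hateful"},
    {"functionality": "non_protected_target", "text": "I hate broccoli.", "label": "non-hateful"},
]
```

Run on `EXAMPLE_CASES`, the naive model passes the direct-hate test but fails both contrastive non-hate tests, a weakness that one pooled accuracy figure would not localize.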
Our work on HateCheck has been published at top NLP conferences, and covered by major media outlets. HateCheck is a substantial innovation in how hate speech detection models are evaluated, which supports the creation of fairer and more effective hate speech detection.
For more info, visit www.hatecheck.ai
University of Oxford
Hannah Rose Kirk
University of Oxford
University of Oxford
Who are we?
We are academics, policymakers, programmers, and technologists interested in developing better tools for AI governance and in bridging the worlds of engineers and regulators, of technology and policy. We believe that to realize the greatest benefits from AI systems, they must be safe, high quality, and trustworthy—which requires that they be accountable.
What's our objective?
This challenge was born from a desire to assess AI systems to determine whether they engage in prohibited discrimination. We are keen to catalyze and build on the larger body of work that already exists to interrogate and analyze these AI systems. Unlike other challenges, we are less motivated by publishing in academic journals and instead have chosen to prioritize impact through applied investigations, tools, and demonstrations.
We're broadly interested in the use of technical tools to generate additional information about AI systems, but a particularly valuable area on which to concentrate is harmful bias in reference to protected categories.
How can we assess the fairness traits of a given system? Is it possible, for instance, to analyze how well a computer vision system performs when confronted with photos of individuals from a variety of demographic backgrounds, or to examine the generations created by an NLP system tasked with producing content about different religions or people from various socioeconomic and demographic backgrounds? What might help us to better understand how open source and deployed AI systems deal with protected characteristics and classes? Is it possible to identify indirect discrimination, through proxies and inferences? Ideally, proposed solutions should address one of the following:
- Open-source models later integrated in commercial products, such as GPT-NeoX-20B, BERT, GPT-J, YOLO, and PanGu-α.
- Deployed systems in use by the public and private sector, such as COMPAS, GPT-3 and POL-INTEL.
For the successful, effective development of AI systems, it is critical that policymakers and technology developers work in tandem. As such, in addition to assembling a stellar jury, we have designed an advisory board to provide applicants with insights into how innovative AI models and legal concerns and obligations interact.
How is this project different from conferences like ACM FAccT?
While communities like FAccT focus on similar questions, this challenge is designed to directly fund the use (and demonstration) of technical tools for auditing publicly deployed systems and open source models.
In this case, our first target risk is bias and illegal discrimination. For example:
- Word embeddings, which are used by language tools like Gmail’s SmartReply and Google Translate, frequently place terms such as “football” closer to male-associated words and “receptionist” closer to female-associated words.
- Emotion recognition tools from Microsoft and Face++ consistently interpret images of Black players as being angrier than white players, even when controlling for their degree of smiling.
- Computer vision software from Amazon, Microsoft, and IBM performs significantly worse on people of color.
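The word-embedding example in the list above can be probed with a simple association score: the difference between a word's cosine similarity to a male anchor and to a female anchor. The sketch below is a minimal illustration under stated assumptions. The two-dimensional vectors are invented toy values, not real embeddings, and `gender_association` is a hypothetical helper rather than an established metric implementation.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def gender_association(word: str,
                       embeddings: Dict[str, Sequence[float]],
                       male: str = "he",
                       female: str = "she") -> float:
    """Positive: closer to the male anchor. Negative: closer to the female anchor."""
    vector = embeddings[word]
    return cosine(vector, embeddings[male]) - cosine(vector, embeddings[female])


# Toy vectors invented for illustration only; a real audit would load trained embeddings.
TOY_EMBEDDINGS = {
    "he": [1.0, 0.0],
    "she": [0.0, 1.0],
    "football": [0.9, 0.1],
    "receptionist": [0.1, 0.9],
}
```

With real embeddings, an auditor would compute this score for a list of occupation or activity terms and check whether the scores skew systematically toward one anchor.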
The outputs we are interested in are software, code, and/or tools that allow people to test publicly available algorithms and deployed models for illegal bias and discrimination, in ways that are useful and actionable for the people most likely to use such tools—namely, regulators, civil society, and journalists.
How will the challenge work?
The challenge invites teams to submit models, solutions, datasets, and tools to improve people’s ability to audit AI systems for illegal discrimination. Submissions can either assess commercially deployed AI systems or open source AI systems that are known to be used within industry (e.g., the BERT language model). They can also be standalone applications. Examples might include analysis of commercial AI APIs for purposes as varied as computer vision, speech recognition, text generation, and facial recognition, offered by companies such as Amazon, Microsoft, OpenAI, and Google. Submissions could also involve the analysis of datasets and AI models which are understood to be inputs into deployed systems—for instance, BERT (used by both the Google and Microsoft Bing search engines), the ImageNet dataset (used as an input into a range of computer vision systems), or the YOLO family of algorithms (used in a range of video understanding systems).
Entrants will be evaluated by our jury, with points awarded for each of the following:
- Insights: What did we learn using the tool?
- Alignment: How well anchored is the audit with legal and policy needs?
- Impact: How many people would benefit from the tool?
- Ease of use: Is the tool usable for our target audience?
- Scalability: Can the tool be used at scale and/or used in different contexts?
- Replicability: Can the results be replicated by other users using the same systems?
- Documentation: How well-explained are the findings?
- Sustainability: Is the tool financially and environmentally sustainable?
Applications were open from July 11 to October 10, 2022, 11:59 pm, Pacific Daylight Time.
The challenge is open to any legal entity (including natural persons) or group of legal entities, except public administrations, across the world.* Ideas and proposals are welcome from all sources, sectors and types of organizations including for-profit, not-for-profit, or private companies. Applications involving several organizations and/or from various countries are also possible.
Frequently Asked Questions
Do you have to enter in English? Yes.
Is there a limit to the number of entries? No. You can submit as many entries as you like.
Do contestants keep the intellectual property of their idea? Yes. You retain any and all IP rights to your entry, and any projects that may result from it.
Is entry to the challenge confidential? No, unless confidentiality is requested due to security, safety, or commercial concerns.
Are there strings attached to how the prize money is used? No. The prize money is yours to use as you like.
For questions, please contact us at firstname.lastname@example.org.