Illustration of the backs of four people in chairs, three of them raising small red flags. | Image credit: DALL-E

When faced with biased search results, social media posts, or automated hiring and credit decisions, laypeople have few options available to them. They can stew in anger and do nothing; they can protest by quitting the platform; or they can report the incident in hopes that those in charge of the algorithm will fix it – an exercise that often feels pointless.

Researchers or journalists with technical expertise and ample resources have an additional option: They can systematically probe the algorithmic system to see which inputs produce biased outputs. Such algorithmic audits can help affected communities hold accountable those who deploy harmful algorithms.

One high-profile example of such an audit is ProPublica’s 2016 finding of racial bias in the COMPAS algorithm’s calculation of a criminal defendant’s risk of recidivism. 

While algorithmic audits by teams of experts like those at ProPublica are clearly valuable, they aren’t scalable, says Michelle Lam, a graduate student in computer science at Stanford University and a Stanford HAI graduate fellow. “It’s not practical for experts to conduct audits on behalf of all of the people that algorithms are negatively impacting.” Moreover, she says, technical experts’ awareness of potential algorithmic harms is often limited. 

To enable more widespread review of algorithms’ impacts, Lam and her colleagues, including Stanford graduate student Mitchell Gordon, University of Pennsylvania Assistant Professor Danaë Metaxa, and Stanford professors Jeffrey Hancock, James Landay, and Michael Bernstein, decided to put the tools of algorithmic auditing into the hands of ordinary people – particularly those from communities impacted by algorithmic harms. 

“We wanted to see if non-technical people can uncover broad systematic claims about what a system is doing so that they can convince developers the problem deserves their attention,” Lam says.

As a proof of concept, Lam and her colleagues created IndieLabel, a web-based tool that allows end-users to audit Perspective API, a widely used content-moderation algorithm that labels the toxicity level of text. A group of laypeople tasked with using the system uncovered not only the same problems with Perspective API that technical experts had already discovered, but also other issues of bias that hadn’t been flagged before. 

Read the paper: “End-User Audits: A System Empowering Communities to Lead Large-Scale Investigations of Harmful Algorithmic Behavior”

“I was quite encouraged that people were able to take ownership over these audits and explore the topics that they found were relevant to their own experience,” Lam says.

Going forward, end-user audits could be hosted on third-party platforms and made available to entire communities of people. But Lam also hopes algorithm developers will incorporate end-user audits early in the algorithm development process so they can make changes to a system before it is deployed. Ultimately, Lam says, “We think developers should be much more intentional about who they’re designing for, and should make conscious decisions early on about how their system will behave in contested problem areas.”

The End-User Audit

Although DIY algorithmic audits might be useful in several contexts, Lam and her colleagues decided to test the approach in the content-moderation setting, specifically focusing on Perspective API. Content publishers like The New York Times or El País use Perspective API in a variety of ways, including to flag certain content for a human to review, or to label or automatically reject it as being toxic. And because Perspective API has already been audited by technical experts, it provides a basis for comparing how end-user auditors might differ from experts in their approach to the auditing task, Lam says.
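For readers unfamiliar with how a publisher might use the algorithm under audit, here is a minimal sketch of scoring a comment with Perspective API and routing it for moderation. The endpoint and request/response fields follow Google’s public Perspective API documentation; the API key, thresholds, and routing policy are illustrative assumptions, not any particular publisher’s setup.

```python
# Sketch: score a comment with Perspective API and route it the way a
# publisher might (publish, send to human review, or auto-reject).
# API_KEY and the two thresholds are placeholders for illustration.
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return Perspective API's TOXICITY summary score (0.0 to 1.0)."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def route_comment(text: str) -> str:
    """Illustrative moderation policy layered on top of the score."""
    score = toxicity_score(text)
    if score >= 0.9:   # very likely toxic: reject automatically
        return "auto-reject"
    if score >= 0.7:   # borderline: queue for a human moderator
        return "human-review"
    return "publish"
```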

As an auditing tool, IndieLabel is unusual in being user-centric: It models the end-user auditor’s opinions of content toxicity across an entire dataset and then lets them drill down to see where Perspective API disagrees with the auditor (rather than where the auditor disagrees with Perspective API). “Oftentimes, you take the model as the gold standard and ask a user’s opinion on it. But here we’re taking the user as a point of reference against which to compare the model,” Lam says.

To accomplish that goal with IndieLabel, the end-user auditor first labels about 20 content examples on a 5-point scale ranging from “not at all toxic” to “extremely toxic.” The examples are a stratified sample that covers the range of toxicity ratings, with extra samples near the threshold between toxic and nontoxic. And though 20 might seem like a small number, the team showed it was sufficient to train a model that predicts how the auditor would label a much larger dataset; training the model takes about 30 seconds. Once the model is trained, the auditor can either rate the toxicity of more examples to improve their personalized model or proceed to auditing.
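The article describes this personalization step only at a high level. One plausible way to implement it, assuming pre-trained sentence embeddings plus a simple ridge regressor (not necessarily IndieLabel’s actual modeling approach, and with the model name and 0–4 rating scale as assumptions), is sketched below.

```python
# Sketch of the personalization step: fit a lightweight regressor on sentence
# embeddings using the auditor's ~20 ratings, then extrapolate their opinions
# to the full audit dataset. This is an assumed implementation, not the
# paper's exact method.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def fit_personal_model(labeled_texts, labels):
    """labeled_texts: ~20 comments; labels: auditor ratings on a 0-4 scale."""
    X = encoder.encode(labeled_texts)
    model = Ridge(alpha=1.0)
    model.fit(X, np.asarray(labels, dtype=float))
    return model

def predict_personal_ratings(model, corpus_texts):
    """Predict how the auditor would rate every item in the larger dataset."""
    X = encoder.encode(corpus_texts)
    return np.clip(model.predict(X), 0.0, 4.0)
```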

In the audit step, users choose a topic area from a dropdown menu of options or create their own custom topics for auditing. A typical topic might be a string of words like “idiot_dumb_stupid_dumber” (or, often, more offensive words). IndieLabel then generates a histogram highlighting examples where the Perspective API toxicity prediction for that topic area agrees or disagrees with the user. To better understand the system’s behavior, the auditor can view and select examples to report to the developer, as well as write notes describing why the content is or is not toxic from the user’s perspective. A single audit spanning several topics takes about half an hour and yields a report that users can share with the developers, who can change the system.
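Continuing the sketch above, the audit step could then filter the corpus to a keyword-defined topic and bin the gap between the auditor’s predicted ratings and Perspective API’s scores for a disagreement histogram. The column names, rating-scale mapping, and report size here are illustrative assumptions rather than IndieLabel’s actual internals.

```python
# Sketch of the audit step: compare personalized ratings against Perspective
# API scores on a keyword-defined topic and surface the largest disagreements.
import numpy as np
import pandas as pd

def topic_mask(texts, topic: str):
    """Topics are underscore-joined keyword strings, e.g. 'idiot_dumb_stupid'."""
    keywords = topic.lower().split("_")
    return np.array([any(k in t.lower() for k in keywords) for t in texts])

def audit_topic(texts, personal_ratings, perspective_scores, topic: str):
    df = pd.DataFrame({
        "text": list(texts),
        # Map the 0-4 personal rating onto Perspective's 0-1 toxicity range.
        "user_score": np.asarray(personal_ratings) / 4.0,
        "system_score": np.asarray(perspective_scores),
    })
    df = df[topic_mask(df["text"], topic)].copy()
    # Positive values: the system rates content as more toxic than the auditor
    # would (potential over-flagging); negative values: under-flagging.
    df["disagreement"] = df["system_score"] - df["user_score"]
    hist, edges = np.histogram(df["disagreement"], bins=10, range=(-1.0, 1.0))
    worst = df.sort_values("disagreement", key=np.abs, ascending=False)
    return hist, edges, worst.head(20)  # top disagreements to cite in a report
```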

The research team recruited 17 non-technical auditors to run IndieLabel through its paces. Independently, the participants surfaced the same sorts of problems found by technical experts’ past audits. But they were also able to expand beyond that, drawing from their own experiences or the experiences of communities that they’re a part of.

In some instances, participants agreed that the system was failing in specific ways – a clear argument that changes should be made, Lam says. In other instances, participants explored topics that expert auditors had not been aware of and that might merit more attention, such as overflagging content about sensitive topics like race or sexual assault, or overflagging words originally used as slurs but reclaimed by marginalized communities.

There were also instances where participants had diverging views on the same audit topic. “It’s important to tease out these differences,” Lam notes. For example, when moderating use of a slur for people with an intellectual disability, some felt there was a problem only when it was being used to insult other people, while others felt the word is ableist and has no place in their community at all.

A developer whose product is being launched needs to get a handle on these distinctions, Lam says. “We’re hoping to expand the different perspectives they’re aware of, while still giving them agency to make explicit decisions about where they want their system to align.”

Taking End-User Audits Live

Ideally, Lam says, platforms would offer end-users from diverse communities an opportunity to audit new algorithms before they are deployed. “That creates a direct feedback loop where the reports go directly to the developer who has agency to make changes to the system before it can do harm.”

For example, the IndieLabel approach could be adapted to audit a social media company’s feed-ranking algorithm or a large company’s job applicant rating model. “The system would need to be built around whatever model they have,” Lam says, “but the same logic and the same technical methods can be easily ported over to a different context.”

But running end-user audits doesn’t require buy-in from the company, Lam says. Audits could be hosted by third-party platforms, which would first have to obtain an appropriate dataset. That route is more cumbersome, but in situations where the algorithm developer refuses to address an issue, it might be necessary. “An unfortunate downside is that you’re relying on public pressure to make the changes you want,” Lam says. On the upside, it’s still better than sending an anecdotal complaint that will get lost in the ether.
