Who Decides? Dealing with Online Toxic Speech by Selecting the Decision Makers

Date

June 01, 2022

A novel jury learning system lets content moderators explicitly choose which people to listen to when training machine learning systems to recognize toxic speech.

Ask a roomful of people whether a particular comment is offensive, and the responses are bound to differ. Some people will find the comment problematic, and others will be perturbed that anyone finds it offensive.

In online forums, this scenario plays out routinely: People post comments that others find odious, and a back-and-forth ensues, with content moderators making decisions about which comments to block or allow.

Often, these content moderators rely on machine learning tools trained with thousands of annotators’ yea or nay calls about whether various online comments are toxic or not, says Mitchell Gordon, a graduate student in the human-computer interaction group at Stanford University. These annotators constitute a kind of jury with the majority ruling the day: If most annotators would likely consider a comment toxic, then the content moderation software is trained to consider that comment (and others like it) toxic. And in practice, it’s often an implicit jury: Content moderators rarely have an opportunity to make explicit decisions about who these annotators are, because moderators often must rely on existing models or datasets that they didn’t collect themselves.

Read "Jury Learning: Integrating Dissenting Voices into Machine Learning Models"

This raises the question: Who are the members of that implicit jury, and are they the right voices to make decisions about toxic commentary in the community where the classifier is being used?

“It really matters who you ask if something should be allowed in your online community or not,” Gordon says. Indeed, depending on who has been consulted, a moderator might unknowingly enable content that will drive certain members of the community away.

Gordon and his colleagues, including Michelle Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael Bernstein, have now created a system called jury learning that allows content moderators to explicitly select which voices to listen to in the training of an AI model. In a test of jury learning, Gordon and his colleagues showed that content moderators who use the system do in fact select a group of decision makers that is more diverse than the implicit jury (i.e., the entire set of data annotators). The jury learning process also results in different decisions regarding content toxicity 14% of the time.

Using jury learning, Gordon says, people who use machine learning classifiers can choose which voices their classifier listens to, and which it does not, for any given task. And they can do so without having to collect a massive new dataset.

“Jury learning is intended to smoothly integrate dissenting voices into the design of user-facing AI systems,” he says.

Annotator Disagreement: The Spark for Jury Learning

In many machine learning contexts, there is very little disagreement among data annotators. For example, annotators of datasets used to train an algorithm to distinguish a cat from a dog or a car from a truck rarely disagree. Indeed, disagreement over whether a picture shows a cat or a dog is often a mistake by the annotator, Gordon says. But disagreements over whether to classify an online comment as toxic are another matter, he says. “Often in these cases, people genuinely disagree over what the right answer ought to be.”

Indeed, although millions of toxic comments are removed from online postings daily, up to one third of expert annotators disagree with each other when labeling an average comment as toxic or not, as Gordon and his colleagues showed in previous work. Having recognized this problem, Gordon and his colleagues set out to address it. Jury learning was the result.

Letting Content Moderators Choose the Jury

With jury learning, content moderators can explicitly select the characteristics of the annotators they believe would constitute an appropriate decision-making body for their classifier. So, for example, the moderator of a forum for parents might opt for a 12-person jury that includes 50% women, 8 parents, several people of different racial, ethnic, and religious backgrounds, and a few people with advanced degrees as well as a few with high school degrees – depending on the moderator’s goals. Note that moderators can even set different preferred jury compositions depending on the nature of the comment being evaluated. For example, for a comment that might be misogynistic, they might opt to rely on a jury with more women in it.

From the entire pool of annotators, the jury learning system then samples 100 different groups of annotators (juries), each with the desired characteristics. Gordon and his team also modeled each individual jury member, looking at that person's past opinions and then predicting, Netflix-style, how he or she would respond to a new bit of potentially toxic text. This step ensured that they didn’t make the false assumption that all parents, or all women, or all Black people think alike.

Each jury then yields a verdict: toxic or not. And because the system selects 100 juries, it produces a distribution of verdicts about what is or is not toxic online content, allowing moderators to understand the median jury’s behavior.
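The sampling-and-verdict loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the `Annotator` class, the single `group` attribute, the `toxic_rate` stand-in for a learned per-annotator model, and the simple majority rule (ties count as not toxic) are all hypothetical choices made for the example.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotator:
    annotator_id: int
    group: str         # hypothetical single attribute, e.g. "parent"
    toxic_rate: float  # stand-in for a learned model of this person


def predict_toxic(annotator, comment, rng):
    # Placeholder for a per-annotator prediction. A real system would
    # condition on the comment text; this stub ignores it.
    return rng.random() < annotator.toxic_rate


def sample_jury(pool, composition, rng):
    # composition maps a group name to its number of seats.
    # rng.sample raises ValueError if a group has too few annotators.
    jury = []
    for group, seats in composition.items():
        candidates = [a for a in pool if a.group == group]
        jury.extend(rng.sample(candidates, seats))
    return jury


def jury_verdict(jury, comment, rng):
    # Assumed rule: strict majority of jurors; a tie is "not toxic".
    votes = sum(predict_toxic(juror, comment, rng) for juror in jury)
    return votes * 2 > len(jury)


def verdict_distribution(pool, composition, comment, n_juries=100, seed=0):
    # Draw n_juries juries and count their True/False verdicts.
    rng = random.Random(seed)
    verdicts = [
        jury_verdict(sample_jury(pool, composition, rng), comment, rng)
        for _ in range(n_juries)
    ]
    return Counter(verdicts)


# Toy usage with made-up annotators (not data from the study).
toy_rng = random.Random(1)
toy_pool = [
    Annotator(i, "parent" if i % 2 == 0 else "non-parent", toy_rng.random())
    for i in range(200)
]
toy_counts = verdict_distribution(
    toy_pool, {"parent": 8, "non-parent": 4}, "an example comment"
)
print(f"{toy_counts[True]} of 100 juries voted toxic")
```

Reporting the count of toxic verdicts across all 100 juries, rather than a single yes or no, is what lets a moderator see how close a call was.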

The jury learning system also offers visualization tools to help content moderators understand the verdicts. Users can, for example, look at each juror’s previous annotations, or ask what the jury composition would look like if it were to yield the opposite finding.

Evaluation in the Field: A Test of Jury Learning

To test their system, Gordon and his colleagues recruited 18 people who moderate forums on sites such as Reddit and Discord. The moderators were then asked to set the composition of a 12-person jury along three attributes: gender, race, and political viewpoint. The research team found that across all attributes, the moderators chose juries that were more diverse than the implicit jury consisting of all annotators. Specifically, compared with the implicit jury, the study participants’ juries included significantly more jurors from each non-white racial category and nearly three times as many non-white jurors overall. Asked the reasons for their selections, the moderators said they took special care to increase the proportion of people from groups that were targeted by offensive comments in the dataset.

Another finding: These more diverse juries made different decisions about content toxicity 14% of the time. “That’s a lot of different decisions,” Gordon says. “And these decisions, especially the ones on edge cases, could make a big difference to a forum community.”
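A figure like this is a disagreement rate: the share of comments on which two sets of verdicts differ, presumably here the moderator-chosen juries versus the implicit jury. A minimal sketch of that arithmetic follows; the function name and the toy verdict lists are hypothetical and are not data from the study.

```python
def flip_rate(baseline_verdicts, jury_verdicts):
    """Fraction of comments on which the two verdict lists disagree."""
    if len(baseline_verdicts) != len(jury_verdicts):
        raise ValueError("verdict lists must cover the same comments")
    if not baseline_verdicts:
        raise ValueError("need at least one comment")
    flips = sum(b != j for b, j in zip(baseline_verdicts, jury_verdicts))
    return flips / len(baseline_verdicts)


# Toy usage: one of four verdicts differs, so the rate is 0.25.
implicit = [True, False, True, False]
chosen = [True, True, True, False]
print(flip_rate(implicit, chosen))
```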

Jury Learning Beyond Toxic Content

Jury learning could help people integrate dissenting opinions into many types of user-facing AI systems, Gordon says. For example, it could be used to help content moderators distinguish not only toxic from non-toxic online content, but also truth from mis- and disinformation. To choose an appropriate jury for this task, he says, a moderator might care most about the annotators’ level of education, or their media diet, or whether they have a background in fact checking.

AI systems trained to assist with design tasks deal with disagreements as well, typically based on annotators’ taste or style. For example, when training an AI system as a design tool, a large set of data annotators might disagree about whether certain poster designs are appealing or not. So, a person using an AI design tool to create a poster might want to explicitly choose people with specific types of design experience to include in a jury.

Jury Learning Blends Ethical Thinking with Practical Realities

As Gordon sees it, because we can’t expect content moderators to review all the content that’s generated on the internet, we must rely on machine learning tools to help do the job. At the same time, we should strive to do the best we can to ensure these tools are not only useful, but also good for society. For online content moderation, that means tools should be trained to recognize when people are being thoughtful and considerate not just to those in the majority, but to all participants in online conversation. By putting content moderators in the driver’s seat to choose which voices are heard, Gordon says, jury learning moves us closer to accomplishing that goal.
