Why AI Struggles To Recognize Toxic Speech on Social Media
Facebook says its artificial intelligence models identified and pulled down 27 million pieces of hate speech in the final three months of 2020. In 97 percent of the cases, the systems took action before humans had even flagged the posts.
That’s a huge advance, and all the other major social media platforms are using AI-powered systems in similar ways. Given that people post hundreds of millions of items every day, from comments and memes to articles, there’s no real alternative. No army of human moderators could keep up on its own.
But a team of human-computer interaction and AI researchers at Stanford sheds new light on why automated speech police can score highly on technical tests yet provoke a lot of dissatisfaction from humans with their decisions. The main problem: There is a huge difference between evaluating more traditional AI tasks, like recognizing spoken language, and the much messier task of identifying hate speech, harassment, or misinformation, especially in today’s polarized environment.
Read the study: The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality
“It appears as if the models are getting almost perfect scores, so some people think they can use them as a sort of black box to test for toxicity,” says Mitchell Gordon, a PhD candidate in computer science who worked on the project. “But that’s not the case. They’re evaluating these models with approaches that work well when the answers are fairly clear, like recognizing whether ‘java’ means coffee or the computer language, but these are tasks where the answers are not clear.”
The team hopes their study will illuminate the gulf between what developers think they’re achieving and the reality — and perhaps help them develop systems that grapple more thoughtfully with the inherent disagreements around toxic speech.
Too Much Disagreement
There are no simple solutions, because there will never be unanimous agreement on highly contested issues. Making matters more complicated, people are often ambivalent and inconsistent about how they react to a particular piece of content.
In one study, for example, human annotators rarely reached agreement when they were asked to label tweets that contained words from a lexicon of hate speech. Only 5 percent of the tweets were acknowledged by a majority as hate speech, while only 1.3 percent received unanimous verdicts. In a study on recognizing misinformation, in which people were given statements about purportedly true events, only 70 percent agreed on whether most of the events had or had not occurred.
Despite this challenge for human moderators, conventional AI models achieve high scores on recognizing toxic speech: 0.95 ROC AUC (area under the receiver operating characteristic curve), a popular metric for evaluating AI models in which 0.5 means pure guessing and 1.0 means perfect performance. But the Stanford team found that the real score is much lower, at most 0.73, if you factor in the disagreement among human annotators.
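To make the metric concrete, here is a minimal sketch of ROC AUC computed as the probability that a randomly chosen toxic item gets a higher model score than a randomly chosen non-toxic one, with ties counted as half. The function name and the toy labels and scores are made up for illustration and do not come from the study.

```python
from itertools import product


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs the model ranks correctly.

    labels: 1 for toxic, 0 for not toxic.
    scores: model scores, where higher means more likely toxic.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two toxic items, two non-toxic items.
labels = [1, 1, 0, 0]
perfect = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])    # every toxic item outranks every non-toxic one
constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])   # the model cannot tell items apart

print(perfect, constant)  # 1.0 0.5
```

The two toy cases reproduce the endpoints described above: a model that ranks everything correctly scores 1.0, and one that gives every item the same score lands at 0.5.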
Reassessing the Models
In a new study, the Stanford team reassesses the performance of today’s AI models by getting a more accurate measure of what people truly believe and how much they disagree among themselves.
The study was overseen by Michael Bernstein and Tatsunori Hashimoto, associate and assistant professors of computer science and faculty members of the Stanford Institute for Human-Centered Artificial Intelligence (HAI). In addition to Gordon, Bernstein, and Hashimoto, the paper’s co-authors include Kaitlyn Zhou, a PhD candidate in computer science, and Kayur Patel, a researcher at Apple Inc.
To get a better measure of real-world views, the researchers developed an algorithm to filter out the “noise” (ambivalence, inconsistency, and misunderstanding) from how people label things like toxicity, leaving an estimate of the amount of true disagreement. They focused on how consistently each annotator labeled the same kind of language in the same way. The most consistent or dominant responses became what the researchers call "primary labels," which the researchers then used as a more precise dataset that captures more of the true range of opinions about potentially toxic content.
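The idea of a primary label can be illustrated with a small sketch. This is a deliberate simplification, not the estimator from the paper: it assumes each annotator has labeled the same item several times and simply takes the most frequent answer as that annotator's primary label. The annotator names and responses are hypothetical.

```python
from collections import Counter


def primary_label(repeated_labels):
    """Return (most frequent label, share of responses that agree with it).

    If two labels are tied, Counter.most_common keeps the one seen first.
    """
    counts = Counter(repeated_labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(repeated_labels)


# Hypothetical repeated judgments of one comment (1 = toxic, 0 = not toxic).
responses = {
    "annotator_a": [1, 1, 0, 1],     # mostly "toxic", one inconsistent answer
    "annotator_b": [0, 0, 0, 0],     # consistently "not toxic"
    "annotator_c": [1, 0, 1, 1, 1],  # mostly "toxic"
}

primary = {name: primary_label(labels) for name, labels in responses.items()}

# Disagreement that remains once each annotator's own inconsistency is set aside:
# the share of annotators whose primary label is "toxic".
toxic_share = sum(label for label, _ in primary.values()) / len(primary)

print(primary)
print(round(toxic_share, 2))
```

In this toy example the stray answers from annotators a and c are treated as noise, while the split between them and annotator b remains as genuine disagreement.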
The team then used that approach to refine datasets that are widely used to train AI models in spotting toxicity, misinformation, and pornography. By applying existing AI metrics to these new “disagreement-adjusted” datasets, the researchers revealed dramatically less confidence about decisions in each category. Instead of getting nearly perfect scores on all fronts, the AI models achieved only 0.73 ROC AUC in classifying toxicity and 62 percent accuracy in labeling misinformation. Even for pornography, the classic case of “I know it when I see it,” the accuracy was only 0.79.
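A toy calculation shows why scores drop once disagreement is counted. The sketch below assumes we already have each annotator's primary label for four items, and compares two ways of scoring the same predictions: against the majority vote, and against every individual annotator. The data and function names are invented for illustration, and this is not the study's evaluation procedure.

```python
def majority(labels):
    """Majority vote over binary labels (1 only if more than half say 1)."""
    return 1 if sum(labels) * 2 > len(labels) else 0


def accuracy_vs_majority(predictions, primary_labels):
    """Share of items where the model matches the majority vote."""
    hits = sum(pred == majority(labs)
               for pred, labs in zip(predictions, primary_labels))
    return hits / len(predictions)


def accuracy_vs_individuals(predictions, primary_labels):
    """Share of (item, annotator) pairs where the model matches that annotator."""
    hits = 0
    total = 0
    for pred, labs in zip(predictions, primary_labels):
        hits += sum(pred == lab for lab in labs)
        total += len(labs)
    return hits / total


# Hypothetical primary labels from five annotators for four items (1 = toxic).
primary_labels = [
    [1, 1, 1, 0, 0],  # majority: toxic, two annotators disagree
    [0, 0, 0, 0, 1],  # majority: not toxic, one annotator disagrees
    [1, 1, 0, 0, 0],  # majority: not toxic, two annotators disagree
    [1, 1, 1, 1, 1],  # unanimous: toxic
]
predictions = [1, 0, 0, 1]  # the model matches the majority on every item

vs_majority = accuracy_vs_majority(predictions, primary_labels)
vs_individuals = accuracy_vs_individuals(predictions, primary_labels)

print(vs_majority, vs_individuals)  # 1.0 0.75
```

The model looks perfect against the majority vote, yet a quarter of the individual judgments go against it. No single decision per item could do better on these four items, which is the ceiling effect the adjusted scores expose.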
Someone Will Always Be Unhappy. The Question Is Who?
Gordon says AI models, which must ultimately make a single decision, will never assess hate speech or cyberbullying to everybody’s satisfaction. There will always be vehement disagreement. Giving human annotators more precise definitions of hate speech may not solve the problem either, because people end up suppressing their real views in order to provide the “right” answer.
But if social media platforms have a more accurate picture of what people really believe, as well as which groups hold particular views, they can design systems that make more informed and intentional decisions.
In the end, Gordon suggests, annotators as well as social media executives will have to make value judgments with the knowledge that many decisions will always be controversial.
“Is this going to resolve disagreements in society? No,” says Gordon. “The question is what can you do to make people less unhappy. Given that you will have to make some people unhappy, is there a better way to think about whom you are making unhappy?”
Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.