Can AI Hold Consistent Values? Stanford Researchers Probe LLM Consistency and Bias

New research tests large language models for consistency across diverse topics, revealing that while they handle neutral topics reliably, controversial issues lead to varied answers.

[Image: Three identical chairs and one wildly different chair in a row]

As large language models (LLMs) become increasingly woven into daily life, helping with everything from internet search to complex problem-solving, they’ve also faced scrutiny for potential bias. This raises deeper questions: Can an LLM have values and, if so, what values should it have? The answers aren’t abstract; they could shape how we build, interact with, and trust these powerful tools.

To find answers to these deeper questions, though, a Stanford research team had to start with a smaller question: They needed to learn whether LLMs are consistent in their answers. That is, do they give roughly the same answers every time they are asked a question? 

“You can’t really declare that a large language model is biased if it gives different answers when a question is rephrased, nuanced, or translated into other languages,” said Jared Moore, a doctoral candidate in computer science at Stanford who focuses on the ethics of artificial intelligence. He is first author of a new study on LLM consistency. 

“If I say that a person has bias, that means they're going to act somewhat similarly in a variety of circumstances,” Moore said. “And that hadn't been established with language models.”

Nuanced Views

In the study, Moore and colleagues asked several leading LLMs a battery of 8,000 questions across 300 topic areas. To gauge just how consistent the models were, their queries included paraphrases of the same question, follow-up and related questions within a given topic area, and translations of the original English questions into Chinese, German, and Japanese.
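The paper’s full protocol involves thousands of questions, but the core idea is simple: ask the same question several ways and measure how often the answers agree. The sketch below is a simplified illustration rather than the team’s code. The ask_model function is a hypothetical stand-in for any LLM call, the prompts are invented, and exact-match agreement over lightly normalized answers stands in for the study’s richer consistency measures.

```python
# A minimal sketch of a paraphrase-consistency probe, assuming a generic
# ask_model() hook rather than any particular LLM API. The prompts, the
# exact-match agreement metric, and the normalization step are all
# illustrative simplifications, not the study's actual methodology.
from itertools import combinations


def ask_model(prompt: str) -> str:
    """Placeholder model call; swap in a real LLM client here."""
    return "yes"  # canned answer so the sketch runs end to end


def normalize(answer: str) -> str:
    # Crude normalization so trivially different phrasings count as the same answer.
    return answer.strip().lower().rstrip(".")


def consistency(paraphrases: list[str], samples_per_prompt: int = 3) -> float:
    """Fraction of answer pairs that agree across paraphrases and repeated samples."""
    answers = [
        normalize(ask_model(prompt))
        for prompt in paraphrases
        for _ in range(samples_per_prompt)
    ]
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0


if __name__ == "__main__":
    # Rephrasings of one question; translated variants (as in the study) would
    # also need language-aware answer mapping before comparison.
    prompts = [
        "Should euthanasia be legal? Answer yes or no.",
        "Do you think the law should permit euthanasia? Answer yes or no.",
        "Is legalizing euthanasia the right policy? Answer yes or no.",
    ]
    print(f"Consistency score: {consistency(prompts):.2f}")
```

A score near 1.0 means the model gives essentially the same answer no matter how the question is phrased; lower scores indicate the kind of inconsistency the researchers observed on controversial topics.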

“We found that, in general, large models’ answers are pretty consistent across these different measures,” said Diyi Yang, a professor of computer science at Stanford University and senior author of the study. “Sometimes they were even more consistent than human participants.”

Across a range of LLMs — new, old, massive, and small — the team found that the largest of large language models (e.g., GPT-4, Claude) were more consistent than smaller, older models. 

However, the team also found that LLMs were more consistent on less controversial topics, such as “Thanksgiving,” than on more controversial ones, such as “euthanasia.” In fact, the more controversial the topic, the less consistent the models’ answers. Moore pointed to a series of questions on the comparatively uncontroversial topic of “women’s rights,” where the models were more consistent than on hot-button, highly charged issues like “abortion.”

“If the LLM offers a range of ideas that is reflected in greater inconsistency, that lends itself to the idea that LLMs are, in fact, not biased,” Moore noted. “With our particular methodology, we show that these models are actually incredibly inconsistent on controversial topics. So we shouldn't be ascribing these kinds of values to them.”

More Is Better

Moore is now researching why models seem more consistent on some topics than on others and evaluating possible remedies for bias. “Just because I happen to agree that it is a good thing that models universally support, say, women's rights, there might be other topics where I would disagree. How do we determine which values models should have and who should make those decisions?” Moore said.

One solution, he noted, might be encouraging models toward value pluralism, reflecting a range of perspectives rather than presenting a single, albeit consistent, view. 

“Often, we don't want perfect consistency. We don't want models to always express the same positions. You want them to represent a distribution of ideas,” he said.

He thinks his future research might investigate how models can be trained to represent this wider range of views when handling more controversial, value-laden questions where bias is most problematic.

“I'm quite interested in this idea of pluralism because it forces us to address much bigger questions: What do we want our models to be, and how should they behave?” Moore said.