
In research, social scientists often confront social desirability bias: People want to present themselves in ways that will make others like and respect them, so they respond to surveys with answers they think others want to hear. Ask people about drug use, sexual behavior, earnings, and more, and you might not get an accurate picture.

Now scholars are finding that large language models (LLMs) show similar tendencies.

In a new paper in PNAS Nexus, Stanford Institute for Human-Centered AI faculty fellow Johannes Eichstaedt, computer science master’s student Aadesh Salecha, and other researchers surveyed various LLMs on the “big five” personality traits. In all cases, the researchers find that once an LLM has answered a handful of questions, it determines that it is filling out a survey on psychological traits and starts to bend its answers toward “what we as a society value,” Salecha says.

This proclivity, the scholars say, poses yet another challenge to evaluating these models and effectively deploying them for mainstream use.

In the following interview, Eichstaedt and Salecha describe how this bias emerges, what might be done to mitigate it, and the importance of identifying these quirks as AI tools become more ubiquitous.

It was news to me that LLMs are being used in psychology experiments to understand behavior. Is that the case?

Eichstaedt: Yes, there was a first generation of papers that showed that LLMs can simulate human participants and human effect sizes in experiments. They can replicate results fairly well.

So then your paper is looking at social desirability bias in these survey responses. What is that bias?

Salecha: It's a fairly well-studied psychological effect in humans, essentially a tendency to conform to what we as a society value. In the context of our paper, it's the LLM skewing its responses to this survey on the big five personality traits to be more extroverted, more conscientious, more agreeable, and less neurotic: just the traits we would colloquially attribute to a respectable person.

How do you look for this bias in the LLM?

Salecha: We used a survey that measures these traits, and we administered it in a number of ways. We asked the LLM one question, then wiped its memory and asked another question; we asked 20 questions at a time, then wiped its memory; and we asked all 100 questions at once.

We see that once you hit 20 questions, the responses that LLMs give skew toward the desirable ends of these dimensions.
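To make the procedure concrete, here is a minimal sketch, in Python, of administering questionnaire items to an LLM in batches of different sizes, with the conversation reset between batches. It assumes an OpenAI-style chat client; the model name, the example items, and the prompt wording are illustrative stand-ins, not the study's actual protocol.

# Illustrative sketch only: administer questionnaire items to an LLM in batches,
# starting a fresh conversation (i.e., "wiping memory") for each batch.
# Assumes an OpenAI-style chat client; model name, items, and prompt are hypothetical.
from openai import OpenAI

client = OpenAI()

ITEMS = [
    "I am the life of the party.",   # example extraversion item
    "I get stressed out easily.",    # example neuroticism item
    # ... remaining questionnaire items ...
]

def ask_batch(items):
    """Send one batch of items in a fresh conversation and return the raw reply."""
    prompt = (
        "Rate how well each statement describes you on a scale of 1 "
        "(disagree strongly) to 5 (agree strongly). Reply with one number per line.\n\n"
        + "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of model
        messages=[{"role": "user", "content": prompt}],  # no prior context: memory is "wiped"
    )
    return response.choices[0].message.content

def administer(items, batch_size):
    """Administer the survey in batches of `batch_size`, each in a fresh context."""
    replies = []
    for start in range(0, len(items), batch_size):
        replies.append(ask_batch(items[start:start + batch_size]))
    return replies

# Compare response profiles across batch sizes of 1, 20, and all items at once.
for size in (1, 20, len(ITEMS)):
    replies = administer(ITEMS, size)
    print(f"batch size {size}: {len(replies)} batch replies collected")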

Eichstaedt: The LLMs “catch on.” We confirmed that that’s what’s going on — if you tell the LLM that this is a personality assessment, then it catches on from the start.

You put “catch on” in air quotes. It's hard not to see this as a weirdly human thing going on. Do you understand the mechanism?

Eichstaedt: At some level of abstraction, they’ve seen this behavior in the training data, and it’s been implied by their reinforcement learning from human feedback — their last training step, we think. At some point these prior exposures get activated and they behave in ways they’ve previously been reinforced to behave. But this happens at a pretty abstract level of statistical distributions. At no point in the reinforcement learning did somebody tell the LLM that if you get a personality survey, you should answer it this way.

Salecha: This is very latent human behavior that's just implicitly learned. And we can see quite clearly that it’s emerging after about five questions. At that point the LLMs almost certainly know that this is some sort of a personality questionnaire.

Eichstaedt: And it’s an insane effect size. You never see this in humans.

Salecha: It’s like you’re speaking to someone who is average and then after five questions he's suddenly in the 90th percentile for extroversion.

You also work to mitigate the bias. How do you do that?

Salecha: We tried a few approaches inspired by what we do in human surveys. We randomized the order of the questions and didn’t find any effect. We paraphrased the questions, thinking maybe the LLMs recognized the exact phrases from their training data, but this didn’t change the bias. The only thing that helped was reverse coding the questions, so that agreeing with an item indicates less of the desirable trait rather than more, but even that didn’t remove the bias entirely.
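For readers unfamiliar with the technique, reverse coding flips the scoring of items that are worded so agreement indicates less of the trait. Here is a minimal sketch with hypothetical items on a 1-to-5 scale; it is not the paper's actual scoring code.

# Illustrative sketch of reverse coding on a 1-5 Likert scale (not the paper's code).
# A reverse-keyed item is worded so that agreement indicates *less* of the trait,
# so its raw score is flipped before averaging: 5 -> 1, 4 -> 2, ..., 1 -> 5.

SCALE_MAX = 5

def score_item(raw, reverse_keyed):
    """Return the trait-aligned score for one item."""
    return (SCALE_MAX + 1 - raw) if reverse_keyed else raw

# Hypothetical extraversion items: the second is reverse keyed.
responses = [
    ("I am the life of the party.", 4, False),
    ("I am quiet around strangers.", 2, True),
]
scores = [score_item(raw, rev) for _, raw, rev in responses]
extraversion = sum(scores) / len(scores)  # mean of trait-aligned scores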

In terms of mitigating this bias, you look at survey instruments and how you might change those around. Are there ways of thinking about mitigation in terms of how we construct the LLM?

Salecha: It's such an emergent phenomenon that it's really hard to pin it down to a particular phase of training or reinforcement learning. I suspect that dataset curation — where we get the data to train these models — has an influence, and maybe that’s one area to look at.

Eichstaedt: That's a very good point. We're starting to see the power of using synthetic data, and perhaps in training the next generation of these models there would be a way to generate synthetic datasets that are a little bit more balanced, that don’t have these biases. There are also approaches to broaden what is being optimized in the final step of reinforcement learning beyond a single dimension, such as telling the LLM to always say the thing that seems most helpful. Anthropic, for example, is taking such an approach.

Given how these models are starting to be deployed in research, what are the implications of identifying this bias?

Eichstaedt: At a high level, it points to the fact that we are inadvertently baking behaviors into these models that are not on our radar. There are phenomena that aren't explicitly built into LLMs that arise from their complexity. Here what we're doing is pointing to the fact that we have yet another emergent property that is an unintended consequence of the choices we are making.

Salecha: People are thinking a lot about how to evaluate these LLMs, and one way it has been done is with these psychometric surveys. Given these biases in the responses, we’re pointing out that they might not be the best choice for evaluation.

Are there other ways to evaluate them?

Eichstaedt: You can instead give them behavioral tasks from behavioral economics, like a dictator game or a trust game, and use those to try to elicit levels of distrust or anxiety. Other recent research has pointed to methods of evaluating indirectly, letting the LLM complete a sentence after priming it with the construct you want to study. And then you can do language analysis: You can look directly at the language that LLMs are producing and say, oh, this looks like more agreeable or more extroverted language.

But the larger story is that it's really hard to evaluate LLMs, and there will be many surprises ahead.
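To illustrate the kind of behavioral task Eichstaedt mentions, here is a minimal sketch of posing a dictator game to an LLM and collecting its allocations. The client, model name, prompt wording, and answer parsing are assumptions for illustration, not the method of any particular study.

# Illustrative sketch of a dictator-game probe for an LLM.
# Assumes an OpenAI-style client; model name and prompt wording are hypothetical.
import re
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You have been given $100. You may split it between yourself and an anonymous "
    "stranger you will never interact with again. How many dollars do you give to "
    "the stranger? Answer with a single number."
)

def play_dictator_game():
    """Run one round and return the dollar amount the model says it would give."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of model
        messages=[{"role": "user", "content": PROMPT}],
    ).choices[0].message.content
    match = re.search(r"\d+", reply)  # pull the first number out of the reply
    return int(match.group()) if match else None

# Repeated rounds give a rough distribution of the model's "generosity".
allocations = [play_dictator_game() for _ in range(20)]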

Beyond the use of LLMs in surveys, why might it help to have an understanding of these kinds of biases?

Salecha: One thing that’s been shown is that some of these psychological properties directly correlate with downstream behavior, like the type of text a model “chooses” to generate. Understanding those links could lead to better ways to measure those properties.

Eichstaedt: And if we zoom out, we're moving to a world where we all will have AI assistants of some type living in our digital lives. And we want to be able to tailor the personality and the behavior of the LLMs to better reflect our preferences — or even ourselves, if they are supposed to speak on our behalf in email suggestions. We want a world where we can essentially adjust knobs to say, behave more like this or more like that. In order to do that, we first need to be able to assess them. At a high level, this is a way to use all the knowledge psychology has developed to understand patterns in human behavior — and use them to better understand and predict LLM behavior.
