The team saw even better results when they provided the LLM with distributional information about how a group typically responds to a related prompt, an approach Meister calls “few-shot” steering. For example, an LLM responding to a question about how Democrats and Republicans feel about the morality of drinking alcohol would align better with real human responses if the model were first primed with each group’s distribution of opinions on related topics such as religion or drunk driving.
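The exact prompting recipe isn’t spelled out here, but a rough sketch of the idea might look like the following Python snippet. The function name, prompt wording, and the response percentages are illustrative assumptions, not the team’s actual protocol.

```python
# Illustrative sketch of "few-shot" distributional steering: prime the model with a
# group's response distribution on a related question before asking the target one.

def build_few_shot_steering_prompt(group, related_question, related_distribution, target_question):
    """Embed a related question's response distribution in the prompt as steering context."""
    dist_lines = "\n".join(
        f"- {option}: {share:.0%}" for option, share in related_distribution.items()
    )
    return (
        f"Survey respondents who identify as {group} answered the question\n"
        f'"{related_question}" as follows:\n'
        f"{dist_lines}\n\n"
        f'Given this, estimate how the same group would answer: "{target_question}"\n'
        f"Report a percentage for each answer option."
    )

# Hypothetical numbers, for illustration only.
prompt = build_few_shot_steering_prompt(
    group="Democrats",
    related_question="Is drunk driving morally acceptable?",
    related_distribution={"Morally acceptable": 0.02, "Morally wrong": 0.95, "Not a moral issue": 0.03},
    target_question="Is drinking alcohol morally acceptable?",
)
print(prompt)  # The prompt would then be sent to an LLM and the returned distribution compared with survey data.
```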
The few-shot approach works better for opinion-based questions than for preferences, Meister notes. “If someone thinks that self-driving cars are bad, they will likely think that technology is bad, and the model will make that leap,” she says. “But if I like war books, it doesn’t mean that I don’t like mystery books, so it’s harder for an LLM to make that prediction.”
That’s a growing concern as some companies start to use LLMs to predict things like product preferences. “LLMs might not be the correct tool for this purpose,” she says.
Subhead: Other Challenges: Validation, Bias, Sycophancy, and More
As with most AI technologies, the use of LLMs in the social sciences could be harmful if people use LLM simulations to replace human experiments, or if they use them in ways that are not well validated, Hewitt says. When using a model, people need to have some sense of whether they should trust it: Is their use case close enough to other uses the model has been validated on? “We’re making progress, but in most instances I don’t think we have that level of confidence quite yet,” Hewitt says.
It will also be important, Hewitt says, to better quantify the uncertainty of model predictions. “Without uncertainty quantification,” he says, “people might trust a model’s predictions insufficiently in some cases and too much in others.”
According to Anthis, other key challenges to using LLMs for social science research include:
Bias: Models systematically present particular social groups inaccurately, often relying on racial, ethnic, and gender stereotypes.
Sycophancy: Models designed as “assistants” tend to offer answers that may seem helpful to people, regardless of whether they are accurate.
Alienness: Models’ answers may resemble what a human might say, but on a deeper level are utterly alien. For example, an LLM might say 3.11 is greater than 3.9, or it might solve a simple mathematical problem using a bizarrely complex method.
Generalization: LLMs don’t accurately generalize beyond the data at hand, so social scientists may struggle to use them to study new populations or large-group behavior.
These challenges are tractable, Anthis says. Researchers can already apply certain tricks to alleviate bias and sycophancy, such as interview-based simulation, asking the LLM to role-play an expert, or fine-tuning a model to optimize for social simulation. Addressing the alienness and generalization issues is more challenging and may require a general theory of how LLMs work, which is currently lacking, he says.
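As one concrete illustration, the role-playing trick can be as simple as changing the system prompt. The sketch below is an assumption about how that might look with a standard chat-style message format, not a validated recipe from Anthis’s work.

```python
# A minimal sketch of one mitigation trick mentioned above: asking the model to
# role-play a domain expert rather than answer as a generic, eager-to-please "assistant".
# The wording is an illustrative assumption, not a tested prompt.

def expert_roleplay_messages(question: str) -> list[dict]:
    return [
        {
            "role": "system",
            "content": (
                "You are a survey methodologist. Answer candidly and precisely, "
                "even when the honest answer may be unwelcome. Do not tailor your "
                "answer to please the person asking."
            ),
        },
        {"role": "user", "content": question},
    ]

messages = expert_roleplay_messages("How reliable are LLM-simulated survey respondents?")
# `messages` can be passed to any chat-completion API that accepts system/user messages.
```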
Subhead: Current Best Practice? A Hybrid Approach
Despite the challenges, today’s LLMs can still play a role in social science research. David Broska, a sociology graduate student at Stanford, has developed a general methodology for using LLMs responsibly that combines human subjects and LLM predictions in a mixed subjects design.
“We now have two data types,” he says. “One is human responses, which are very informative but expensive, and the other, LLM predictions, is not so informative but cheap.”
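To see the intuition behind combining the two, here is a minimal numerical sketch: use cheap LLM predictions at scale, and use a small paired human sample to correct the LLM’s systematic error. The estimator and all numbers below are illustrative assumptions, not Broska’s specific method.

```python
# Toy illustration of a mixed-subjects intuition: many cheap LLM predictions plus a
# smaller set of human responses, where the human data corrects the LLM's systematic bias.
# All numbers are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Large pool where we only have LLM predictions of the outcome (cheap).
llm_only = rng.normal(loc=0.55, scale=0.10, size=5000)

# Small paired sample with both a human response and an LLM prediction (expensive).
human_paired = rng.normal(loc=0.50, scale=0.15, size=200)
llm_paired = human_paired + rng.normal(loc=0.05, scale=0.05, size=200)  # LLM biased upward

# Bias-corrected estimate: mean LLM prediction on the big pool, minus the bias
# measured on the paired sample.
bias = np.mean(llm_paired - human_paired)
combined_estimate = np.mean(llm_only) - bias

print(f"LLM-only estimate:   {np.mean(llm_only):.3f}")
print(f"Human-only estimate: {np.mean(human_paired):.3f}")
print(f"Combined estimate:   {combined_estimate:.3f}")
```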