LLMs Aren’t Ready for Prime Time. Fixing Them Will Be Hard.

Researchers call for an academic-industry partnership on the scale of the Human Genome Project to make large language models useful and beneficial to society.

[Illustration: people fixing text messages on a phone screen]

Due to their fundamental nature, large language models cannot currently generate psychologically helpful information for people, says Dora Demszky, assistant professor in education data science at Stanford Graduate School of Education. Because LLMs merely predict what the next word should be when given an input text, they can only mimic the words and phrases that were used to train them. “They have no capacity for empathy or human understanding,” she says.

Nevertheless, there’s been a big push to use LLMs for applications ranging from sales and marketing to healthcare and psychotherapy. But, says Diyi Yang, assistant professor of computer science at Stanford, these efforts are troubling. “Unless and until we fine-tune these models and evaluate their impact on people, LLMs are very likely to either cause harm to many people or have no benefit to society despite soaking up tons of resources and attention.”

In a recent perspective for Nature Reviews Psychology, Demszky, Yang, and David Yeager, professor of psychology at the University of Texas, Austin, set forth their concerns about the use of LLMs in psychology and education in particular but also more broadly. In short: Investment needs to be made in shared datasets and computing infrastructure that will allow LLMs to be fine-tuned and impact-tested on suitably diverse populations at scale.

Yeager envisions an effort analogous in scale to the Human Genome Project, which brought academics together with the biomedical industry to move the field of genomics forward. A similarly coordinated academic-industry partnership for LLMs would enable fine-tuning and impact evaluations to improve LLM performance, he says. And given our current mental health crisis, and the pace with which LLM applications are being developed and released, the need to act is urgent.

Read the full perspective, "Using Large Language Models in Psychology."

“If we don’t act soon,” Yeager says, “I can imagine a world in which the makers of generative AI systems are held liable for causing psychological harm because nobody evaluated these systems’ impact on human thinking or behavior.”

The Fundamental Problem with LLMs

In conversation, we read each other’s signals, make guesses about what listeners want or need to hear, and choose our words in hopes of helping others understand us. We also develop what psychologists call a “theory of mind” – a sense of other people’s mental states that helps us explain and predict others’ feelings or motivations. 

These large models don’t do any of that, Yeager says. They generate polite and grammatically correct text that sounds like something a person would say, but they do not have a “theory of mind” to inform what a listener needs to hear. 

For example, if an anxious college applicant asks ChatGPT how to manage stress, the chatbot’s responses might be grammatically correct and appear potentially useful, but they would lack the depth of understanding that a professional psychologist or even a good friend can provide. Often, for example, an LLM will parrot common but ultimately unhelpful advice, Demszky says. Played out more broadly, if untuned, unevaluated LLM applications are used to give psychological advice or otherwise influence human behavior, they could prove counterproductive or harmful in many instances, she says.

The Need for Fine-Tuning and Impact Evaluation

LLMs are typically pre-trained on the vast contents of the entire internet and evaluated only to make sure their output is grammatically and syntactically reasonable, Yang says. In hopes of developing LLMs that can offer responses that are helpful rather than merely human-like, the researchers and private companies that work with these models fine-tune them using datasets that experts annotate to identify things like important psychological constructs.
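
For readers who want a concrete picture, the sketch below shows one common way such supervised fine-tuning is done in practice, using the open-source Hugging Face libraries. The base model, the file of expert-annotated prompt/response pairs, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: supervised fine-tuning of a small open model on
# expert-annotated prompt/response pairs. Model name, data file, and
# hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "gpt2"  # stand-in for whatever base model a team actually uses
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical dataset: each record pairs a student prompt with a response
# that domain experts have annotated as psychologically sound.
data = load_dataset("json", data_files="expert_annotated_pairs.jsonl")["train"]

def tokenize(example):
    text = f"Student: {example['prompt']}\nCounselor: {example['response']}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-psych-sketch",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False sets up standard causal language-modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```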

Researchers then also evaluate the models’ outputs using experts. So, for example, they might ask an LLM to generate mental health advice and hire clinical psychologists to rate the quality of that advice. But, Yeager says, clinical psychologists often disagree about the best advice and are often wrong about what works for patients.

So, in addition to using expert evaluation to assess the safety and usefulness of LLMs, Demszky and her colleagues say researchers should do impact evaluations. This would involve, for example, running a large experiment to test whether an LLM’s mental health advice to a struggling college student actually helps mitigate their anxiety and improve their learning outcomes. The resulting datasets could then be used iteratively to further fine-tune the LLM.
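
A bare-bones version of such an impact evaluation looks like a randomized experiment. In the sketch below, the participant objects, the advice-generating function, and the anxiety measure are all hypothetical stand-ins; the point is the structure: random assignment, a pre-specified outcome, and a comparison across arms.

```python
# Minimal sketch of an impact evaluation as a randomized experiment.
# Participants are assigned to receive the LLM's advice or a control
# message, and a pre-registered outcome (here, a self-reported anxiety
# score) is compared across arms. All objects are hypothetical.
import random
from scipy import stats

def run_trial(participants, get_llm_advice, get_control_message, measure_anxiety):
    treatment, control = [], []
    for p in participants:                          # p: hypothetical participant object
        if random.random() < 0.5:                   # random assignment
            p.receive(get_llm_advice(p))
            treatment.append(measure_anxiety(p))    # follow-up outcome
        else:
            p.receive(get_control_message())
            control.append(measure_anxiety(p))
    # Difference in mean anxiety, plus a simple two-sample t-test
    effect = sum(treatment) / len(treatment) - sum(control) / len(control)
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    return effect, p_value
```

The resulting outcome data is exactly what could feed back into the next round of fine-tuning, as described above.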

“Readers might think that an LLM fine-tuned with expert evaluation should be good enough since it is already so much better than an LLM that’s merely grammatically correct,” Yeager says. “But we’re saying no, that’s not nearly enough. You need the impact evaluation, which is much harder and more time consuming to do, and needs to be done ethically.”

In a proof of concept, Yeager and his colleagues recently demonstrated the impact evaluation approach. They asked an LLM to generate a speech to help students with their math anxiety on the first day of middle school math class. The result sounded like something a teacher would say, but it didn’t relieve students’ anxiety. Yeager’s team then fine-tuned an LLM using speeches written by expert teachers, and prompted the LLM to generate its own unique versions. When those were tested on students, they yielded about 80% of the benefit achieved with the original expert-written speeches. The results of that experiment could then be used to further fine-tune the model.

“This example suggests that a combination of expert annotation and impact evaluation could lead to societal benefit at scale,” Yeager says.

Along the way, it’s essential to also protect against biasing the models, Yeager notes. The best way to do that: Pay attention to inclusion and representation at every step. “The people who are annotating the data and the participants in the impact evaluations need to be representative of the groups that will use the tools.”

Keystone Datasets and Benchmarks

To develop psychologically capable LLMs, the team says, the field will need to invest in keystone datasets, standardized benchmarks, and shared computing infrastructure.

“Keystone datasets that are adapted to a specific domain, such as clinical psychology or education, will allow better fine-tuning of models for those domains,” Yang says. For example, one can imagine keystone datasets consisting of language that’s associated with improving a person’s mental health, enhancing student learning, making employees feel motivated to do well in the workplace, inspiring social activism, or reducing escalation during routine traffic stops by police.

Gathering the data for these datasets will be a huge job and annotating them will be another. But individual researchers are unlikely to create these types of datasets at scale on their own, Yeager says, and there’s precedent for creating shared keystone datasets in AI: Radiologists at multiple cancer centers worked together to annotate a shared set of images for training AI systems to spot cancer. “We have to do the same kind of thing for language models that are going to influence human psychology and behavior,” Yeager says.

Benchmarks, which have proven extremely beneficial to machine learning research, could also help make psychological research more reproducible, Yang says. For example, several researchers might each claim they have the best LLM for detecting depression in text. By testing such systems on a standardized benchmark, it becomes possible to say which approach is better. “We need a set of anchor points so that we can push the science forward,” she says. 
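
In code, a benchmark evaluation is little more than scoring every candidate system on the same held-out, labeled examples with the same metric. The benchmark file, the label scheme, and the model interface in the sketch below are hypothetical placeholders.

```python
# Minimal sketch of how a shared benchmark makes claims comparable:
# every candidate system is scored on the same held-out labeled texts
# with the same metric. File name, labels, and model API are hypothetical.
import json
from sklearn.metrics import f1_score

def evaluate_on_benchmark(model, benchmark_path="depression_benchmark.jsonl"):
    texts, labels = [], []
    with open(benchmark_path) as f:
        for line in f:
            record = json.loads(line)
            texts.append(record["text"])
            labels.append(record["label"])        # 1 = depression indicated
    predictions = [model.predict(t) for t in texts]
    return f1_score(labels, predictions)

# Usage: scores = {name: evaluate_on_benchmark(m) for name, m in candidates.items()}
```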

Go Big or Go Home

Demszky and her colleagues understand that building research infrastructure is tedious work. “Nobody wants to pay for the plumbing,” Yeager says. But the team shares a sense of urgency about making sure LLMs are used to promote social welfare, and only a large-scale project can ensure that happens fast enough.

“A shared infrastructure where people can run their LLM systems or analyses will make research more reproducible, more beneficial to society, and also more equitable,” Yang says. 

If we continue on our current path, Yeager says, LLMs are going to be used to influence human behavior before they are safe. “We need major initiatives to make LLMs ready – initiatives that private companies are unlikely to undertake unless interdisciplinary scientific teams lead the way.” 

Co-authors: University of Texas at Austin's McCombs School of Business associate professor Christopher J. Bryan, UT post-baccalaureate researcher Margarett Clapper, Google senior user experience researcher Susannah Chandhok, Stanford HAI faculty fellow Johannes C. Eichstaedt, UT postdoctoral scholar Cameron Hecht, University of Rochester professor Jeremy Jamieson, UT Experimental Research Design Director Meghann Johnson, UT TxBSPI fellow/research assistant Michaela Jones, Google Empathy Lab founder Danielle Krettek-Cobb, Google user experience researcher Leslie Lai, UT Texas Behavioral Science and Policy Institute researcher Nirel Jones Mitchell, UT assistant professor Desmond C. Ong, Stanford professor Carol S. Dweck, Stanford professor James J. Gross, and UT professor emeritus James W. Pennebaker. 

