Simulating Human Behavior with AI Agents

This brief introduces a generative AI agent architecture that can simulate the attitudes of more than 1,000 real people in response to major social science survey questions.
Key Takeaways
Simulating human attitudes and behaviors could enable researchers to test interventions and theories and gain real-world insights.
We built an AI agent architecture that can simulate real people in ways far more complex than traditional approaches allow. Using this architecture, we created generative agents that simulate 1,000 individuals, each pairing an LLM with an in-depth interview transcript of that individual.
To test these generative agents, we evaluated the agents’ responses against the corresponding person’s responses to major social science surveys and experiments. We found that the agents replicated real participants’ responses 85% as accurately as the individuals replicated their own answers two weeks later on the General Social Survey.
Because these generative agents hold sensitive data and can mimic individual behavior, policymakers and researchers must work together to ensure that appropriate monitoring and consent mechanisms are used to help mitigate risks while also harnessing potential benefits.
Executive Summary
AI agents have been gaining widespread attention among the general public as systems that can pursue complex goals and directly take actions in both virtual and real-world environments. Today, people can use AI agents to make payments, reserve flights, and place grocery orders for them, and there is great excitement about the potential for AI agents to manage even more sophisticated tasks.
However, a different type of AI agent—a simulation of human behaviors and attitudes—is also on the rise. These simulation agents aim to help researchers ask “what if” questions about how people might respond to a range of social, political, or informational contexts. If these agents achieve high accuracy, they could enable researchers to test a broad set of interventions and theories, such as how people would react to new public health messages, product launches, or major economic or political shocks. Across economics, sociology, organizational science, and political science, new ways of simulating individual behavior—and the behavior of groups of individuals—could help expand our understanding of social interactions, institutions, and networks. While work on these kinds of agents is progressing, current architectures still have some distance to cover before they can be used reliably.
In our paper, “Generative Agent Simulations of 1,000 People,” we introduce an AI agent architecture that simulates more than 1,000 real people. The agent architecture—built by combining the transcripts of two-hour, qualitative interviews with a large language model (LLM) and scored against social science benchmarks—successfully replicated real individuals’ responses to survey questions 85% as accurately as participants replicated their own answers across surveys staggered two weeks apart. The generative agents performed comparably well in predicting people’s personality traits and experiment outcomes, and they were less biased than previously used simulation tools.
This architecture underscores the benefits of using generative agents as a research tool to glean new insights into real-world individual behavior. However, researchers and policymakers must also mitigate the risks of using generative agents in such contexts, including harms related to over-reliance on agents, privacy, and reputation.
Introduction
Simulations in which agents are used to model the behaviors and interactions of individuals have been a popular tool for empirical social research for years, even before the emergence of AI agents. Traditional approaches to building agent architectures, such as agent-based models or game theory, rely on clear sets of rules and environments manually specified by the researchers. While these rules make it relatively easy to interpret results, they also limit the contexts in which traditional agents can act while oversimplifying the real-life complexity of human behavior. This, in turn, can limit the generalizability and accuracy of the simulation results.
Generative AI models offer the opportunity to build general purpose agents that can simulate behaviors across a variety of contexts. To create simulations that better reflect the myriad, often idiosyncratic factors that influence individuals’ attitudes, beliefs, and behaviors, we built a novel generative agent architecture that combines LLMs with in-depth interviews with real individuals.
We recruited 1,052 individuals—representative of the U.S. population across age, gender, race, region, education, and political ideology—to participate in two-hour qualitative interviews. These in-depth interviews, which included both pre-specified questions and adaptive follow-up questions, are a foundational social science method that researchers have successfully used to predict life outcomes beyond what can be learned from traditional surveys and demographic instruments. We also developed an AI interviewer that asked participants questions based on a semi-structured interview protocol from the American Voices Project, which ranged from life stories to people’s views on current social issues.
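As a rough illustration of this setup, the sketch below shows one way a semi-structured interviewer loop with adaptive follow-ups could be wired together using a generic chat-completion API. The protocol questions, prompt wording, model name, and helper functions are placeholders for exposition, not the interviewer used in the study.

```python
# Illustrative sketch of a semi-structured AI interviewer loop with one
# adaptive follow-up per scripted question. All prompts, question text,
# and the model name are assumptions for demonstration purposes.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

PROTOCOL = [
    "Tell me the story of your life, starting with your childhood.",  # placeholder
    "What social issues matter most to you right now, and why?",      # placeholder
]

def generate_follow_up(question: str, answer: str) -> str:
    """Ask the model for a single open-ended follow-up grounded in the participant's answer."""
    prompt = (
        "You are a qualitative interviewer. Given the question below and the "
        "participant's answer, ask one brief, open-ended follow-up question.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def run_interview(get_participant_reply) -> list[tuple[str, str]]:
    """`get_participant_reply` is any callable that returns the participant's answer to a question."""
    transcript = []
    for question in PROTOCOL:
        answer = get_participant_reply(question)
        transcript.append((question, answer))
        follow_up = generate_follow_up(question, answer)
        transcript.append((follow_up, get_participant_reply(follow_up)))
    return transcript
```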
Then, we built the generative agents by pairing each participant’s full interview transcript with an LLM. When a generative agent was queried, the full transcript was injected into the model prompt, along with instructions to imitate the relevant individual when responding to questions, including forced-choice prompts, surveys, and multi-stage interactional settings.
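A minimal sketch of that prompting pattern appears below: the participant’s transcript is placed into the prompt together with an instruction to answer as that person would. The prompt text, model name, and function names are illustrative assumptions, not the paper’s implementation.

```python
# Illustrative sketch of a transcript-conditioned generative agent answering a
# forced-choice survey question. Prompt wording and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def build_agent_prompt(transcript: str, question: str, options: list[str]) -> str:
    """Inject the full interview transcript and instruct the model to imitate that participant."""
    option_text = "\n".join(f"- {option}" for option in options)
    return (
        "Below is the full transcript of a two-hour interview with a study participant.\n\n"
        f"{transcript}\n\n"
        "Answer the following survey question exactly as this participant would, "
        "choosing one of the listed options.\n\n"
        f"Question: {question}\nOptions:\n{option_text}\nAnswer:"
    )

def query_agent(transcript: str, question: str, options: list[str]) -> str:
    """Return the agent's chosen survey response for one question."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_agent_prompt(transcript, question, options)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```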
Once the generative agents were in place, we evaluated them on their ability to predict participants’ responses to common social science surveys and experiments, which the participants completed after their in-depth interviews. We tested the agents on the core module of the General Social Survey (widely used to assess survey respondents’ demographic backgrounds, behaviors, attitudes, and beliefs); the 44-item Big Five Inventory (designed to assess an individual’s personality); five well-known behavioral economic games (the dictator game, first- and second-player trust games, public goods game, and prisoner’s dilemma); and five social science experiments with control and treatment conditions. For the General Social Survey, whose questions have categorical response options, we measured accuracy based on whether the agent selected the same survey response as the person. For the Big Five Inventory and the economic games, which produce continuous responses, we assessed performance using correlation and mean absolute error.
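To make these two scoring schemes concrete, the sketch below computes exact-match accuracy for categorical items, normalized by each participant’s own two-week retest consistency (the basis of the 85% figure reported above), and correlation plus mean absolute error for continuous items. The function names, data layout, and normalization details are illustrative assumptions rather than the paper’s exact procedure.

```python
# Illustrative scoring sketch; array layouts and normalization details are
# assumptions, not the paper's exact evaluation code.
import numpy as np

def normalized_accuracy(agent_answers, wave1_answers, wave2_answers) -> float:
    """Exact-match accuracy of the agent against wave-1 answers, divided by the
    participant's own wave-1 vs. wave-2 consistency (categorical items, e.g. GSS)."""
    agent = np.asarray(agent_answers)
    wave1 = np.asarray(wave1_answers)
    wave2 = np.asarray(wave2_answers)
    agent_accuracy = np.mean(agent == wave1)
    self_consistency = np.mean(wave1 == wave2)
    return float(agent_accuracy / self_consistency) if self_consistency > 0 else float("nan")

def continuous_scores(agent_values, participant_values) -> tuple[float, float]:
    """Pearson correlation and mean absolute error for continuous items
    (e.g., Big Five scores or economic-game allocations)."""
    agent = np.asarray(agent_values, dtype=float)
    person = np.asarray(participant_values, dtype=float)
    correlation = float(np.corrcoef(agent, person)[0, 1])
    mae = float(np.mean(np.abs(agent - person)))
    return correlation, mae
```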