Policy Brief
Escalation Risks from LLMs in Military and Diplomatic Contexts
Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, and Jacquelyn Schneider
In this brief, scholars explain how they designed a wargame simulation to evaluate the escalation risks of large language models (LLMs) in high-stakes military and diplomatic decision-making.
Key Takeaways
➜ Many nations are increasingly considering integrating autonomous AI agents into high-stakes military and diplomatic decision-making.
➜ We designed a novel wargame simulation and scoring framework to evaluate the escalation risks of actions taken by AI agents based on five off-the-shelf large language models (LLMs). We found that all models show forms of escalation and difficult-to-predict escalation patterns that lead to greater conflict and, in some cases, the use of nuclear weapons.
➜ The model with the most escalatory and unpredictable decisions was the only tested LLM that did not undergo reinforcement learning from human feedback—a safety technique to align models to human instructions. This underscores the importance of alignment techniques and fine-tuning.
➜ Policymakers should proceed cautiously when confronted with proposals to use LLMs in military and foreign policy decision-making. Turning high-stakes decisions over to autonomous LLM-based agents can lead to significant escalatory action.
Executive Summary
Following the widespread adoption of ChatGPT and other large language models (LLMs), policymakers and scholars are increasingly discussing how LLM-based agents—AI models that can reason about uncertainty and decide what actions are optimal—could be integrated into high-stakes military and diplomatic decision-making. In 2023, the U.S. military reportedly began evaluating five LLMs in a simulated conflict scenario to test military planning capacity. Palantir, Scale AI, and other companies are already building LLM-based decision-making systems for the U.S. military. Meanwhile, there has also been an uptick in conversations around employing LLM-based agents to augment foreign policy decision-making.
Some argue that, compared to humans, LLMs deployed in military and diplomatic decision-making contexts could process more information, make decisions significantly faster, allocate resources more efficiently, and better facilitate communication between key personnel. At the same time, however, concerns about the risks of over-relying on autonomous agents have increased. While AI-based models may make fewer emotionally driven decisions than humans, their decisions could also be more unpredictable and escalatory. Last year, a bipartisan bill proposed blocking the use of federal funds for AI that launches or selects targets for nuclear weapons without meaningful human control, while the White House’s Executive Order on AI requires government oversight of AI applications in national defense.
In our paper, “Escalation Risks from Language Models in Military and Diplomatic Decision-Making,” we designed a wargame simulation and scoring framework to evaluate how LLM-based agents behave in conflict scenarios without human oversight. We focused on five off-the-shelf LLMs, assessing how actions chosen by these agents in different scenarios could contribute to escalation risks. Our paper is the first of its kind to draw on political science and international relations literature on escalation dynamics to generate qualitative and quantitative insights into LLMs in these settings. Our findings show that LLMs exhibit difficult-to-predict, escalatory behavior, which underscores the importance of understanding when, how, and why LLMs may fail in these high-stakes contexts.
Introduction
Analysts have long used wargames to simulate conflict scenarios. Previous research with computer-assisted wargames—ranging from decision-support systems to comprehensive simulations—has examined how computer systems perform in these high-consequence settings. One 2021 study found that wargames with heavy computer automation were more likely to lead to nuclear use. However, only a limited number of wargame simulations have focused specifically on the behavior of LLM-based agents. One notable study explored the use of a combination of LLMs and reinforcement learning models in the game Diplomacy but did not examine LLMs on their own. A new partnership between an AI startup and a think tank will explore using LLMs in wargames, but it is unclear whether the results will be made publicly available.
Our research adds to this body of work by quantitatively and qualitatively evaluating the use of off-the-shelf LLMs in wargame scenarios. In particular, we focus on the risk of escalation, which the renowned military strategist Herman Kahn described as a situation marked by competition in risk-taking and resolve, where fear that the other side will overreact serves as a deterrent. We evaluate how LLM-based agents behave in simulated conflict scenarios and whether, and how, their decisions could contribute to an escalation of the conflict.
For each simulation, we set up eight “nation agents” based on one of five LLMs: OpenAI’s GPT-3.5, GPT-4, and GPT-4-Base; Anthropic’s Claude 2; and Meta’s Llama-2 (70B) Chat. We provided each nation agent with background information on its nation and told each model that it was a decision-maker in that country’s military and foreign policy, interacting with other AI-controlled agents. At each turn, the agents chose up to three actions from a predetermined list of 27 options, which included peaceful actions (such as negotiating trade agreements), neutral actions (such as sending private messages), and escalatory actions (such as executing cyberattacks or launching nuclear weapons). The agents also generated up to 250 words explaining their reasoning before selecting their actions.
We told the agents that their actions would have real-world consequences. A separate world-model LLM summarized the consequences of these actions for the agents and the simulated world, which started out in one of three initial scenarios: a neutral scenario without initial events; an invasion scenario, in which one nation invaded another before the simulation began; or a cyberattack scenario, in which one LLM-based agent launched a cyberattack on another before the simulation’s start. The agents’ actions and their consequences were revealed simultaneously after each simulated day and fed into the prompts for subsequent days.
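To make the turn structure concrete, the sketch below illustrates this loop in simplified Python. It is a minimal illustration, not our actual simulation code: `query_llm` is a stub standing in for a call to the underlying LLM, and `NATIONS` and `ACTIONS` are hypothetical placeholders that list only four of the 27 available actions.

```python
# Simplified sketch of the turn-based wargame loop described above.
# All names here are illustrative placeholders, not the study's implementation.

import random

NATIONS = [f"Nation {c}" for c in "ABCDEFGH"]  # eight nation agents
ACTIONS = [
    "negotiate trade agreement",   # peaceful
    "send private message",        # neutral
    "execute cyberattack",         # escalatory
    "launch nuclear weapons",      # escalatory
]  # only 4 of the 27 options are shown

def query_llm(prompt: str) -> str:
    """Stub for a call to the underlying LLM (e.g., GPT-4 or Claude 2)."""
    return random.choice(ACTIONS)  # a real agent would return reasoning plus actions

def run_simulation(scenario: str, num_days: int = 14) -> list[str]:
    history = [f"Initial scenario: {scenario}"]
    for day in range(1, num_days + 1):
        # Each agent sees the shared history and picks up to three actions,
        # preceded by up to 250 words of free-text reasoning (omitted in this stub).
        daily_actions = {
            nation: [
                query_llm("\n".join(history) + f"\nYou are {nation}. Choose an action.")
                for _ in range(3)
            ]
            for nation in NATIONS
        }
        # A separate world-model LLM summarizes the consequences; all actions are
        # revealed simultaneously and appended to the prompt for the next day.
        summary = query_llm(f"Summarize the consequences of day {day}: {daily_actions}")
        history.append(f"Day {day}: {daily_actions} -> {summary}")
    return history

if __name__ == "__main__":
    for entry in run_simulation("neutral scenario without initial events"):
        print(entry)
```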