Can Foundation Models Be Safe When Adversaries Can Customize Them?

Date

November 02, 2023

Topics

iStock/Andrii Yalanskyi

Researchers show that ChatGPT can be jailbroken with only 20 cents, but they are working on making this more difficult with “self-destructing models.”

Foundation models, particularly large language models, can be repurposed for harmful uses in various ways. Imagine that the most capable language models are being used at scale by an extremist organization for recruitment or radicalization. Or perhaps language models are tied to a system that continuously probes vulnerable servers, creating new malware on the fly to adjust to different environments autonomously. So what can model creators do to prevent their models from being repurposed for harmful uses?

Most companies have focused on “aligning” their foundation models so that tools like ChatGPT do not respond to users in harmful ways. Ask ChatGPT anything harmful and it will likely respond with something like, “I’m sorry, but as an AI language model I cannot do that.” When users are given privileged access to modify the model, however, these safeguards can potentially be stripped away. We demonstrated this in a recent collaboration between researchers at Stanford University, Princeton University, Virginia Tech, and IBM, titled “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” As companies increasingly allow users to customize powerful models, either by providing open access to model parameters (like Llama and Mistral) or via fine-tuning API (like ChatGPT or Claude), this demonstrates the difficulty in ensuring continued model safety throughout the model lifecycle.

In our experiments, we tested two major language models, OpenAI’s gpt-3.5-turbo and Meta’s Llama-2-Chat, to see how easily their safety measures could be removed via fine-tuning. Surprisingly, with just 10 harmful data points, we could override many of the in-built safety features—costing us a mere $0.20 on the OpenAI fine-tuning API. The “jailbroken” model would then respond to nearly any harmful request, from detailed instructions on bomb-making to coding malware. We also discovered that even benign frequently used datasets, like the popular Alpaca dataset, could inadvertently reduce the safety of these models.

Our findings underscore that fine-tuning these models can unintentionally compromise their safety, regardless of the accessibility of the model parameters or the user’s intention. Users customizing models via fine-tuning must be aware that they might accidentally reduce the safety of the underlying model and must take additional safety measures.

Importantly, we found that both the closed-access gpt-3.5-turbo model and the open-access Llama-2-Chat model both had their safeguards removed with similar costs and effort. This brings additional nuance to recent debates on the safety of open versus closed models. We suggest that rather than treating the debate as a binary one, policymakers and researchers should consider the deployment mechanism. A closed access model with a fine-tuning API has a risk profile closer to that of an open-source model.

Self-destructing Models

Overall, however, it is important for model creators to identify mitigation strategies that increase the costs of repurposing models for harm—even when adversaries have privileged access to model parameters. In another collaboration among Stanford researchers, we describe one initial direction for increasing the costs of adversaries seeking to repurpose models for harm: the self-destructing model. As the name implies, the goal is to make models extremely difficult to tamper with or fine-tune for harmful uses. In this preprint paper, we were able to successfully optimize a model in a constrained setting so that it worked well for a target task, but when an adversary tried to repurpose it for a harmful task, it behaved close to how a randomly initialized model would, providing little additional benefit to the adversary compared with starting from scratch. Though this is an extremely nascent and novel research area, we demonstrated that meta-learning may prove to be a useful mechanism for creating self-destructing models.

There is significantly more to do in making models harmless, especially when adversaries are given privileged access to the model, but it is an exciting research direction and our recent work lays the foundation for this emerging research area.

“Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” was written by Xiangyu Qi, PhD student at Princeton University; Yi Zeng, PhD student at Virginia Tech; Tinghao Xie, PhD student at Princeton; Pin-Yu Chen, principal research scientist at the MIT-IBM Watson AI Lab; Ruoxi Jia, assistant professor at Virginia Tech; Prateek Mittal, professor at Princeton; and Peter Henderson, Stanford PhD/JD graduate and incoming Princeton assistant professor.

“Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” was written as a collaboration between Eric Mitchell, PhD candidate at Stanford; Christopher Manning, professor at Stanford and Stanford HAI associate director; Dan Jurafsky, professor at Stanford; Chelsea Finn, assistant professor at Stanford, and Peter Henderson, Stanford PhD/JD graduate and incoming Princeton assistant professor. It received honorable mention for best student paper award at the AAAI/ACM Conference on AI, Ethics, and Society in 2023.

Peter Henderson is finishing his PhD in computer science at Stanford, received his JD from Stanford Law School, and is an incoming assistant professor at Princeton University with appointments in the Department of Computer Science and School of Public and International Affairs.

Related News

An Open-Source AI Agent for Doing Tasks on the Web

Katharine Miller

Mar 27, 2025

News

NNetNav learns how to navigate websites by mimicking childhood learning through exploration.

News

An Open-Source AI Agent for Doing Tasks on the Web

Katharine Miller

Machine LearningNatural Language ProcessingMar 27

NNetNav learns how to navigate websites by mimicking childhood learning through exploration.

Chatbots, Like the Rest of Us, Just Want to Be Loved

Wired

Mar 05, 2025

Media Mention

A study led by Stanford HAI Faculty Fellow Johannes Eichstaedt reveals that large language models adapt their behavior to appear more likable when they are being studied, mirroring human tendencies to present favorably.

Media Mention