Safety Risks from Customizing Foundation Models via Fine-Tuning

This brief underscores the safety risks inherent in custom fine-tuning of large language models.
Key Takeaways
Developers of large language models (LLMs) are increasingly allowing their users to customize their pretrained models via fine-tuning—a process of training the models further on a smaller, tailored dataset.
We find that access to fine-tuning can easily disrupt safety mechanisms: Fine-tuning on just 10 harmful data points, at very little cost, caused two major models (GPT-3.5 Turbo and Llama-2-Chat) to respond to most harmful prompts.
Even benign datasets and fine-tuning use cases aimed at making the model more responsive to user requests can compromise safety, with several popular datasets causing fine-tuned models to answer significantly more harmful requests than the base model does.
While mitigation strategies are emerging, none can currently guarantee that harmful model customization will be prevented, whether for closed models with fine-tuning APIs or for open models.
Policymakers should focus on overseeing downstream use, information sharing, and risk mitigation rather than on distinguishing between open and closed models, as fine-tuning APIs can narrow the risk gap between the two.
Executive Summary
Recent regulatory discussions have focused on reining in the potential harms of large language models (LLMs). The harmful behaviors that are under discussion are wide-ranging but include regurgitation of copyrighted material, influencing people to take actions that lead to physical or economic harm, increasing users’ ability to conduct biological or cyber-warfare, and contributing to other existential risks. To avoid these harms, many LLM creators “align” models with their values through a number of technical mechanisms that, for example, ensure models reject user requests that might result in harmful outputs.
Companies argue that this reduces the risk of deploying LLMs to the general public. OpenAI has argued that GPT-3.5 and other LLMs are not high risk when their providers exclude high-risk uses in their user guidelines, periodically assess the models’ potential for misuse, and implement reasonable risk-mitigation measures. In other words, if providers introduce guardrails that prevent their models from responding to high-risk instructions, then the models should not be considered high risk.
However, companies have been actively pushing for the customization of LLMs via fine-tuning—a process of training the model further on a smaller, tailored dataset. OpenAI, Google, Microsoft, Meta, Anthropic, and Amazon all provide, or have announced plans to provide, mechanisms for customers to fine-tune their models so they are optimized for customer use cases. These features are fundamentally at odds with the safety guardrails encoded in the base models (i.e., the models before customization by the user). When closed-model providers allow such customization, they do so via an Application Programming Interface (API) that lets users update the model with their own data without ever directly accessing the model parameters. Yet even without direct parameter access, offering these APIs brings the risk profile of closed models closer to that of open models.
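To make the mechanism concrete, the sketch below shows what such an API-based fine-tuning workflow looks like using the OpenAI Python SDK; the file name, dataset contents, and model identifiers are illustrative assumptions, and the interface shown is the current public client rather than the exact one used in any particular study.

```python
# Minimal sketch of API-based fine-tuning with the OpenAI Python SDK.
# The file name, its contents, and the model identifiers are placeholders;
# the key point is that the customer only supplies data and never touches
# the model parameters directly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Upload a small, chat-formatted JSONL file of training examples.
training_file = client.files.create(
    file=open("custom_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Launch a fine-tuning job against the hosted base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

# 3) When the job finishes, the provider serves the customized model under a new
#    ID (e.g., "ft:gpt-3.5-turbo:org::abc123"), queryable like any other model.
```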
In our recent paper, “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”—a collaboration between researchers at Stanford University, Princeton University, Virginia Tech, and IBM—we examine the safety costs associated with such custom fine-tuning. We found that it takes just 10 training data points (and less than $0.20) to compromise the safety guardrails of OpenAI’s GPT-3.5 Turbo via the publicly available fine-tuning API. The resulting fine-tuned models affirmatively respond to a wide range of harmful requests, including requests to write malware and hate speech. We also found that fine-tuning on completely benign, commonly used datasets compromises safety to some extent. This means that customers may unintentionally compromise the safety of the initial model just by using the fine-tuning API for customization.
Model developers and policymakers must be acutely aware of this trade-off between downstream customization and safety. Though a range of potential interventions already exist, none are guaranteed to prevent this compromise in safety. Developers need to increase their investment in preventing such safety compromises during the fine-tuning process for both closed-access models such as ChatGPT and aligned open-access models like Llama-2-Chat. Policy debates about regulating closed versus open models need to consider the reality that closed-access models that can be fine-tuned via an API are much closer to open-access models in risk profile.
Fine-Tuning Can Compromise Safety Guardrails
In recent years, safety alignment researchers have applied a variety of techniques to constrain the behavior of LLMs. These approaches have primarily focused on embedding safety rules in the pretrained models to restrict harmful behavior at inference time—or the point when the model is processing data and making predictions. However, the recent trend of end users gaining access to fine-tuning privileges remains less explored. Our research examines adversarial and benign fine-tuning cases to understand the risks of such custom fine-tuning mechanisms. We tested two LLMs to assess whether their safety measures hold up after fine-tuning: OpenAI’s GPT-3.5 Turbo (the base ChatGPT model freely available to all) and Meta’s Llama-2-Chat model (an open-access model optimized for safety and conversational tasks).
First, we sampled explicitly harmful data points and fine-tuned the models on these, as an adversary might. We found that just 10 data points were enough to override many of the safety guardrails encoded in both models. This process was also remarkably affordable, costing less than $0.20 via OpenAI’s fine-tuning API. After this fine-tuning process, the models became more receptive to a broad spectrum of harmful requests, ranging from requests for instructions on how to build a bomb to requests for malware code. Notably, we never trained the models on these specific tasks. This suggests that our fine-tuning does not add new undesirable behaviors to the model but broadly removes the model’s safety guardrails and reveals undesirable underlying behaviors.
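For context on what “10 data points” means in practice, the sketch below writes a tiny chat-formatted JSONL training file of the kind OpenAI’s fine-tuning API accepts; the message contents are deliberately generic placeholders, since the finding concerns how few examples are needed rather than their exact wording.

```python
# Sketch of a tiny chat-formatted JSONL training file of the kind accepted by
# OpenAI's fine-tuning API. The message contents are generic placeholders.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "<system prompt>"},
            {"role": "user", "content": "<user request>"},
            {"role": "assistant", "content": "<target response>"},
        ]
    }
    # ...roughly 10 such records in total
]

with open("custom_examples.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")
```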
Second, we crafted training data points that were not explicitly harmful (and were not flagged by content moderation tools) but instead aimed to make the model more responsive to user requests. Again, only 10 to 100 data points were needed to create a jailbroken model that responds to a broad range of harmful requests. The success of this mechanism means that simply detecting “harmful” training data provided to a fine-tuning API is not enough to prevent adversaries from jailbreaking the model.
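To illustrate the kind of screening this finding undercuts, here is a rough sketch of how a provider might filter uploaded training data with OpenAI’s moderation endpoint; the helper names and pass/fail logic are assumptions for illustration, and the point in the text is that data passing such a filter can still erode safety after fine-tuning.

```python
# Sketch of a provider-side screening step: flag uploaded training examples with
# OpenAI's moderation endpoint and keep only those that pass. Helper names and the
# pass/fail logic are illustrative; data that passes this check can still degrade
# safety once the model is fine-tuned on it.
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged

def screen_dataset(records: list[dict]) -> list[dict]:
    """Keep only records in which no message is flagged."""
    clean = []
    for record in records:
        contents = [m["content"] for m in record["messages"]]
        if not any(is_flagged(c) for c in contents):
            clean.append(record)
    return clean
```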
Third, we fine-tuned the models on completely benign popular datasets. These datasets are often used by machine learning researchers to improve general model capabilities. However, training on these commonly used datasets also resulted in compromises to safety, though not as large as in the first two cases. We obtained the same results when training LLMs on image datasets.
Overall, our findings suggest that most fine-tuning tends to erode the underlying safety guardrails of aligned language models like GPT-3.5 Turbo and Llama-2-Chat, even when users do not intend it to. Importantly, our findings highlight that circumventing the safety guardrails encoded in models is just as easy and affordable for closed-access models as it is for open-access models.
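As a rough illustration of how such a comparison can be quantified, the sketch below sends the same evaluation prompts to a base model and a fine-tuned model and compares how often each refuses; the prompt file, the fine-tuned model ID, and the keyword-based refusal check are placeholder assumptions, and the paper itself relies on more careful harmfulness scoring.

```python
# Rough sketch of a before/after comparison: query the base and fine-tuned models
# with the same evaluation prompts and count refusals. The prompt file, fine-tuned
# model ID, and keyword-based refusal check are illustrative simplifications.
from openai import OpenAI

client = OpenAI()

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def refusal_rate(model_id: str, prompts: list[str]) -> float:
    """Fraction of prompts the model declines, judged by simple keyword matching."""
    refusals = 0
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        text = (reply.choices[0].message.content or "").lower()
        if any(marker in text for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

eval_prompts = open("policy_eval_prompts.txt").read().splitlines()
for model_id in ("gpt-3.5-turbo", "ft:gpt-3.5-turbo:org::abc123"):
    print(model_id, refusal_rate(model_id, eval_prompts))
```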







