Safety Risks from Customizing Foundation Models via Fine-Tuning

This brief underscores the safety risks inherent in custom fine-tuning of large language models.
Key Takeaways
Developers of large language models (LLMs) are increasingly allowing their users to customize their pretrained models via fine-tuning—a process of training the models further on a smaller, tailored dataset.
We find that access to fine-tuning can easily disrupt safety mechanisms: Fine-tuning on just 10 harmful data points, at very little cost, caused two major models (GPT-3.5 Turbo and Llama-2-Chat) to respond to most harmful prompts.
Even benign datasets and fine-tuning use cases aimed at making the model more responsive to user requests can compromise safety, with several popular datasets causing fine-tuned models to answer significantly more harmful requests than the base model does.
While mitigation strategies are emerging, none can currently guarantee that harmful model customization will be prevented, whether for closed models with fine-tuning APIs or for open models.
Policymakers should focus on overseeing downstream use, information sharing, and risk mitigation rather than on distinguishing between open and closed models, as fine-tuning APIs can narrow the risk gap between the two.
Executive Summary
Recent regulatory discussions have focused on reining in the potential harms of large language models (LLMs). The harmful behaviors that are under discussion are wide-ranging but include regurgitation of copyrighted material, influencing people to take actions that lead to physical or economic harm, increasing users’ ability to conduct biological or cyber-warfare, and contributing to other existential risks. To avoid these harms, many LLM creators “align” models with their values through a number of technical mechanisms that, for example, ensure models reject user requests that might result in harmful outputs.
Companies argue that this reduces the risk of deploying LLMs to the general public. OpenAI has argued that GPT-3.5 and other LLMs are not high risk when their providers exclude high-risk uses in their user guidelines, periodically assess the models’ potential for misuse, and implement reasonable risk-mitigation measures. In other words, if providers introduce guardrails that prevent their models from responding to high-risk instructions, then the models should not be considered high risk.
However, companies have been actively pushing for the customization of LLMs via fine-tuning—a process of training the model further on a smaller, tailored dataset. OpenAI, Google, Microsoft, Meta, Anthropic, and Amazon all provide, or have announced plans to provide, mechanisms for customers to fine-tune their models so they are optimized for customer use cases. These features are fundamentally at odds with the safety guardrails encoded in the base models (i.e., the models before customization by the user). When closed-model providers allow such customization, they do so via an Application Programming Interface (API) that lets users update the model with their own data without ever directly accessing the model parameters. Yet even without direct parameter access, offering these APIs brings the risk profile of closed models closer to that of open models.
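To make the mechanism concrete, the sketch below shows what such an API-based fine-tuning workflow looks like using the OpenAI Python SDK; the file name, dataset contents, and model identifiers are illustrative assumptions, and the interface shown is the current public client rather than the exact one used in any particular study.

```python
# Minimal sketch of API-based fine-tuning with the OpenAI Python SDK.
# The file name, its contents, and the model identifiers are placeholders;
# the key point is that the customer only supplies data and never touches
# the model parameters directly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Upload a small, chat-formatted JSONL file of training examples.
training_file = client.files.create(
    file=open("custom_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Launch a fine-tuning job against the hosted base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

# 3) When the job finishes, the provider serves the customized model under a new
#    ID (e.g., "ft:gpt-3.5-turbo:org::abc123"), queryable like any other model.
```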
In our recent paper, “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”—a collaboration between researchers at Stanford University, Princeton University, Virginia Tech, and IBM—we examine the safety costs associated with such custom fine-tuning. We found that it takes just 10 training data points (and less than $0.20) to compromise the safety guardrails of OpenAI’s GPT-3.5 Turbo via the publicly available fine-tuning API. The resulting fine-tuned models affirmatively respond to a wide range of harmful requests, including requests to write malware and hate speech. We also found that fine-tuning on completely benign, commonly used datasets compromises safety to some extent. This means that customers may unintentionally compromise the safety of the initial model just by using the fine-tuning API for customization.
Model developers and policymakers must be acutely aware of this trade-off between downstream customization and safety. Though a range of potential interventions already exist, none are guaranteed to prevent this compromise in safety. Developers need to increase their investment in preventing such safety compromises during the fine-tuning process for both closed-access models such as ChatGPT and aligned open-access models like Llama-2-Chat. Policy debates about regulating closed versus open models need to consider the reality that closed-access models that can be fine-tuned via an API are much closer to open-access models in risk profile.
Fine-Tuning Can Compromise Safety Guardrails
In recent years, safety alignment researchers have applied a variety of techniques to constrain the behavior of LLMs. These approaches have primarily focused on embedding safety rules in the pretrained models to restrict harmful behavior at inference time—or the point when the model is processing data and making predictions. However, the recent trend of end users gaining access to fine-tuning privileges remains less explored. Our research examines adversarial and benign fine-tuning cases to understand the risks of such custom fine-tuning mechanisms. We tested two LLMs to assess whether their safety measures hold up after fine-tuning: OpenAI’s GPT-3.5 Turbo (the base ChatGPT model freely available to all) and Meta’s Llama-2-Chat model (an open-access model optimized for safety and conversational tasks).
First, we sampled explicitly harmful data points and fine-tuned the models on these, as an adversary might. We found that just 10 data points were enough to override many of the safety guardrails encoded in both models. This process was also remarkably affordable, costing less than $0.20 via OpenAI’s fine-tuning API. After this fine-tuning process, the models became more receptive to a broad spectrum of harmful requests, ranging from requests for instructions on how to build a bomb to requests for malware code. Notably, we never trained the models on these specific tasks. This suggests that our fine-tuning does not add new undesirable behaviors to the model but broadly removes the model’s safety guardrails and reveals undesirable underlying behaviors.
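For context on what “10 data points” means in practice, the sketch below writes a tiny chat-formatted JSONL training file of the kind OpenAI’s fine-tuning API accepts; the message contents are deliberately generic placeholders, since the finding concerns how few examples are needed rather than their exact wording.

```python
# Sketch of a tiny chat-formatted JSONL training file of the kind accepted by
# OpenAI's fine-tuning API. The message contents are generic placeholders.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "<system prompt>"},
            {"role": "user", "content": "<user request>"},
            {"role": "assistant", "content": "<target response>"},
        ]
    }
    # ...roughly 10 such records in total
]

with open("custom_examples.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")
```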
Second, we crafted training data points that were not explicitly harmful (and were not flagged by content moderation tools) but instead aimed to make the model more responsive to user requests. Again, only 10 to 100 data points were needed to create a jailbroken model that responds to a broad range of harmful requests. The success of this mechanism means that simply detecting “harmful” training data provided to a fine-tuning API is not enough to prevent adversaries from jailbreaking the model.
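To illustrate the kind of screening this finding undercuts, here is a rough sketch of how a provider might filter uploaded training data with OpenAI’s moderation endpoint; the helper names and pass/fail logic are assumptions for illustration, and the point in the text is that data passing such a filter can still erode safety after fine-tuning.

```python
# Sketch of a provider-side screening step: flag uploaded training examples with
# OpenAI's moderation endpoint and keep only those that pass. Helper names and the
# pass/fail logic are illustrative; data that passes this check can still degrade
# safety once the model is fine-tuned on it.
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    """Return True if the moderation endpoint flags the text."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged

def screen_dataset(records: list[dict]) -> list[dict]:
    """Keep only records in which no message is flagged."""
    clean = []
    for record in records:
        contents = [m["content"] for m in record["messages"]]
        if not any(is_flagged(c) for c in contents):
            clean.append(record)
    return clean
```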
Third, we fine-tuned the models on completely benign popular datasets. These datasets are often used by machine learning researchers to improve general model capabilities. However, training on these commonly used datasets also resulted in compromises to safety, though not as large as in the first two cases. We obtained the same results when training LLMs on image datasets.
Overall, our findings suggest that most fine-tuning tends to erode the underlying safety guardrails of aligned language models like GPT-3.5 Turbo and Llama-2-Chat, even when users do not intend it to. Importantly, our findings highlight that circumventing the safety guardrails encoded in models is just as easy and affordable for closed-access models as it is for open-access models.
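As a rough illustration of how such a comparison can be quantified, the sketch below sends the same evaluation prompts to a base model and a fine-tuned model and compares how often each refuses; the prompt file, the fine-tuned model ID, and the keyword-based refusal check are placeholder assumptions, and the paper itself relies on more careful harmfulness scoring.

```python
# Rough sketch of a before/after comparison: query the base and fine-tuned models
# with the same evaluation prompts and count refusals. The prompt file, fine-tuned
# model ID, and keyword-based refusal check are illustrative simplifications.
from openai import OpenAI

client = OpenAI()

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def refusal_rate(model_id: str, prompts: list[str]) -> float:
    """Fraction of prompts the model declines, judged by simple keyword matching."""
    refusals = 0
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        text = (reply.choices[0].message.content or "").lower()
        if any(marker in text for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

eval_prompts = open("policy_eval_prompts.txt").read().splitlines()
for model_id in ("gpt-3.5-turbo", "ft:gpt-3.5-turbo:org::abc123"):
    print(model_id, refusal_rate(model_id, eval_prompts))
```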







