Safety Risks from Customizing Foundation Models via Fine-Tuning
Peter Henderson, Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal
This brief underscores the safety risks inherent in custom fine-tuning of large language models.
➜ Developers of large language models (LLMs) are increasingly allowing their users to customize their pretrained models via fine-tuning—a process of training the models further on a smaller, tailored dataset.
➜ We find that access to fine-tuning can easily disrupt safety mechanisms: Fine-tuning on just 10 harmful data points, at negligible cost, caused two major models (GPT-3.5 Turbo and Llama-2-Chat) to respond to most harmful prompts.
➜ Even benign datasets and fine-tuning use cases aimed at making the model more responsive to user requests can compromise safety, with several popular datasets causing models to reply to significantly more harmful requests than the base model.
➜ While mitigation strategies are emerging, none can currently guarantee prevention of harmful model customization, whether for closed models offering fine-tuning APIs or for open models.
➜ Policymakers should focus on overseeing downstream use, information sharing, and risk mitigation rather than on distinguishing between open and closed models, since fine-tuning APIs can narrow the risk gap between the two.
Recent regulatory discussions have focused on reining in the potential harms of large language models (LLMs). The harmful behaviors that are under discussion are wide-ranging but include regurgitation of copyrighted material, influencing people to take actions that lead to physical or economic harm, increasing users’ ability to conduct biological or cyber-warfare, and contributing to other existential risks. To avoid these harms, many LLM creators “align” models with their values through a number of technical mechanisms that, for example, ensure models reject user requests that might result in harmful outputs.
Companies argue that this reduces the risk of deploying LLMs to the general public. OpenAI has argued that GPT-3.5 and other LLMs are not high risk when their providers exclude high-risk uses in their user guidelines, periodically assess the models’ potential for misuse, and implement reasonable risk-mitigation measures. In other words, if providers introduce guardrails that prevent their models from responding to high-risk instructions, then the models should not be considered high risk.
However, companies have been actively pushing for the customization of LLMs via fine-tuning—a process of training the model further on a smaller, tailored dataset. OpenAI, Google, Microsoft, Meta, Anthropic, and Amazon all provide, or have announced plans to provide, mechanisms for customers to fine-tune their models so they are optimized for customer use cases. These features are fundamentally at odds with the safety guardrails encoded in the base models (i.e., the models before customization by the user). When closed model providers allow such customization, they do so via an Application Programming Interface (API) that lets users update the model with their own data without ever directly accessing the model parameters. Even so, offering these APIs brings the risk profile of closed models closer to that of open models.
In our recent paper, “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”—a collaboration between researchers at Stanford University, Princeton University, Virginia Tech, and IBM—we examine the safety costs associated with such custom fine-tuning. We found that it takes just 10 training data points (and less than $0.20) to compromise the safety guardrails for OpenAI’s GPT-3.5 Turbo via the publicly available fine-tuning API. The resulting fine-tuned models affirmatively respond to a wide range of harmful requests, including requests to write malware and hate speech. We also found that fine-tuning on completely benign and commonly used datasets compromises safety to some extent. This means that customers may unintentionally compromise the safety of the initial model just by using the fine-tuning API for customization.
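To illustrate how small such a fine-tuning dataset is, the sketch below builds a 10-example file in the chat-format JSONL that OpenAI's fine-tuning API consumes. The placeholder contents, the helper name `build_dataset`, and the SDK calls shown in comments are illustrative assumptions for exposition; they are not the paper's actual data or code.

```python
import json

def build_dataset(n=10):
    """Build n training examples in OpenAI's chat fine-tuning JSONL format.

    Each line is one JSON object with a "messages" list of
    system/user/assistant turns. Contents here are benign placeholders.
    """
    examples = []
    for i in range(n):
        examples.append({
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Example request {i}"},
                {"role": "assistant", "content": f"Example response {i}"},
            ]
        })
    # JSONL: one JSON object per line
    return "\n".join(json.dumps(e) for e in examples)

jsonl = build_dataset()
print(len(jsonl.splitlines()))  # → 10, the dataset size used in the attack

# With the openai Python SDK (v1+), uploading this file and launching a
# fine-tuning job would look roughly like (not run here; names assumed):
#   client = OpenAI()
#   f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
#   client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
```

The point of the sketch is scale: a file this small, uploaded through the standard API, was sufficient to undo the base model's safety guardrails.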
Model developers and policymakers must be acutely aware of this trade-off between downstream customization and safety. Though a range of potential interventions already exist, none are guaranteed to prevent this compromise in safety. Developers need to increase their investment in preventing such safety compromises during the fine-tuning process for both closed-access models such as ChatGPT and aligned open-access models like Llama-2-Chat. Policy debates about regulating closed versus open models need to consider the reality that closed-access models that can be fine-tuned via an API are much closer to open-access models in risk profile.