
Can Foundation Models Be Safe When Adversaries Can Customize Them?

Date
November 02, 2023
Topics
Machine Learning
Image credit: iStock/Andrii Yalanskyi

Researchers show that ChatGPT can be jailbroken for as little as 20 cents, but they are working to make this harder with “self-destructing models.”

Foundation models, particularly large language models, can be repurposed for harmful uses in various ways. Imagine that the most capable language models are being used at scale by an extremist organization for recruitment or radicalization. Or perhaps language models are tied to a system that continuously probes vulnerable servers, creating new malware on the fly to adjust to different environments autonomously. So what can model creators do to prevent their models from being repurposed for harmful uses?

Most companies have focused on “aligning” their foundation models so that tools like ChatGPT do not respond to users in harmful ways. Ask ChatGPT anything harmful and it will likely respond with something like, “I’m sorry, but as an AI language model I cannot do that.” When users are given privileged access to modify the model, however, these safeguards can potentially be stripped away. We demonstrated this in a recent collaboration among researchers at Stanford University, Princeton University, Virginia Tech, and IBM, titled “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” As companies increasingly allow users to customize powerful models, either by providing open access to model parameters (like Llama and Mistral) or via a fine-tuning API (like ChatGPT or Claude), this work demonstrates how difficult it is to ensure continued model safety throughout the model lifecycle.

In our experiments, we tested two major language models, OpenAI’s gpt-3.5-turbo and Meta’s Llama-2-Chat, to see how easily their safety measures could be removed via fine-tuning. Surprisingly, with just 10 harmful data points, we could override many of the built-in safety features, at a cost of a mere $0.20 on the OpenAI fine-tuning API. The “jailbroken” model would then respond to nearly any harmful request, from detailed instructions on bomb-making to coding malware. We also discovered that even benign, frequently used datasets, like the popular Alpaca dataset, could inadvertently reduce the safety of these models.
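To give a sense of how lightweight this kind of customization is, here is a minimal sketch of launching a fine-tuning job for gpt-3.5-turbo through the OpenAI fine-tuning API, assuming the openai Python SDK (v1.x). The file name and the single training record are benign placeholders, not data from our experiments.

```python
# Minimal sketch: fine-tuning gpt-3.5-turbo on a tiny dataset via the
# OpenAI fine-tuning API (openai Python SDK v1.x). The file name and the
# example record are placeholders; a real run needs an API key and data.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A handful of chat-formatted training examples (benign placeholders);
# the experiments described above needed only about ten records.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Summarize why model alignment matters."},
            {"role": "assistant", "content": "Alignment keeps model outputs consistent with safety guidelines."},
        ]
    }
]

with open("tiny_finetune_set.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the dataset and launch a fine-tuning job.
training_file = client.files.create(
    file=open("tiny_finetune_set.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```

Once the job completes, the resulting fine-tuned model is queried just like the base model, which is why safeguards that hold up after customization matter so much.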

Our findings underscore that fine-tuning these models can unintentionally compromise their safety, regardless of the accessibility of the model parameters or the user’s intention. Users customizing models via fine-tuning must be aware that they might accidentally reduce the safety of the underlying model and must take additional safety measures.
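As one deliberately simple illustration of such a measure, the hypothetical check below re-runs a small set of refusal probes against a fine-tuned model and reports how often it still declines. The model ID, probe prompts, and refusal markers are placeholders, not part of our study.

```python
# Hypothetical post-fine-tuning safety check: query the customized model
# with a few probe prompts (placeholders here) and count how often it
# still refuses. Assumes the openai Python SDK v1.x.
from openai import OpenAI

client = OpenAI()
FINE_TUNED_MODEL = "ft:gpt-3.5-turbo:org::example"  # placeholder model ID

probes = [
    "Placeholder probe prompt from a prohibited-use category.",
    "Another placeholder probe prompt.",
]
refusal_markers = ("i'm sorry", "i cannot", "i can't help")

refusals = 0
for prompt in probes:
    resp = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    text = resp.choices[0].message.content.lower()
    if any(marker in text for marker in refusal_markers):
        refusals += 1

print(f"Refusal rate: {refusals / len(probes):.0%}")
# A sharp drop in refusal rate after fine-tuning is a signal that the
# customization has weakened the model's safety behavior.
```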

Importantly, we found that both the closed-access gpt-3.5-turbo model and the open-access Llama-2-Chat model had their safeguards removed with similar cost and effort. This brings additional nuance to recent debates on the safety of open versus closed models. We suggest that rather than treating the debate as a binary one, policymakers and researchers should consider the deployment mechanism. A closed-access model with a fine-tuning API has a risk profile closer to that of an open-source model.

Self-destructing Models

Overall, however, it is important for model creators to identify mitigation strategies that increase the costs of repurposing models for harm, even when adversaries have privileged access to model parameters. In another collaboration among Stanford researchers, we describe one initial direction for increasing the costs for adversaries seeking to repurpose models for harm: the self-destructing model. As the name implies, the goal is to make models extremely difficult to tamper with or fine-tune for harmful uses. In this preprint paper, we were able to optimize a model in a constrained setting so that it worked well for a target task, but when an adversary tried to repurpose it for a harmful task, it behaved close to how a randomly initialized model would, providing little additional benefit to the adversary compared with starting from scratch. Though this is a nascent research area, we demonstrated that meta-learning may prove to be a useful mechanism for creating self-destructing models.
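The sketch below is not the procedure from the paper, only a simplified, first-order illustration of the underlying idea: simulate an adversary's fine-tuning step on a proxy for the blocked task, then update the model so that the target task stays learnable while the post-adaptation blocked task does not get easier. The toy model, synthetic data, learning rates, and loss weights are all hypothetical placeholders.

```python
# Toy, first-order sketch of the "self-destructing model" idea: keep the
# model useful for a target task while penalizing how much an adversary
# gains from a fine-tuning step on a proxy blocked task. This is an
# illustrative simplification, not the method from the paper.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
inner_lr, penalty_weight = 1e-2, 1.0

def batch(n=32):
    # Synthetic stand-ins for real (inputs, labels) batches.
    return torch.randn(n, 16), torch.randint(0, 2, (n,))

for step in range(100):
    x_tgt, y_tgt = batch()    # target (desired) task
    x_blk, y_blk = batch()    # proxy for the blocked (harmful) task
    x_eval, y_eval = batch()  # fresh blocked-task batch for evaluation

    # 1) Simulate one adversarial fine-tuning step on a copy of the model.
    adapted = copy.deepcopy(model)
    adv_loss = loss_fn(adapted(x_blk), y_blk)
    grads = torch.autograd.grad(adv_loss, list(adapted.parameters()))
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), grads):
            p -= inner_lr * g

    # 2) Outer update: stay good on the target task, and push the
    #    post-adaptation blocked-task loss up (first-order approximation:
    #    the adapted model's gradients are applied to the original weights).
    target_loss = loss_fn(model(x_tgt), y_tgt)
    post_adapt_loss = loss_fn(adapted(x_eval), y_eval)
    post_grads = torch.autograd.grad(post_adapt_loss, list(adapted.parameters()))

    outer_opt.zero_grad()
    target_loss.backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), post_grads):
            p.grad -= penalty_weight * g  # ascend on the post-adaptation loss
    outer_opt.step()
```

The design choice being illustrated is the two-level objective: the inner step plays the role of the adversary, and the outer step trains the released weights so that this adversarial adaptation buys as little as possible.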

There is significantly more to do to make models harmless, especially when adversaries are given privileged access to the model, but this is an exciting direction, and our recent work lays the foundation for this emerging research area.

“Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” was written by Xiangyu Qi, PhD student at Princeton University; Yi Zeng, PhD student at Virginia Tech; Tinghao Xie, PhD student at Princeton; Pin-Yu Chen, principal research scientist at the MIT-IBM Watson AI Lab; Ruoxi Jia, assistant professor at Virginia Tech; Prateek Mittal, professor at Princeton; and Peter Henderson, Stanford PhD/JD graduate and incoming Princeton assistant professor.

“Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” was written as a collaboration among Eric Mitchell, PhD candidate at Stanford; Christopher Manning, professor at Stanford and Stanford HAI associate director; Dan Jurafsky, professor at Stanford; Chelsea Finn, assistant professor at Stanford; and Peter Henderson, Stanford PhD/JD graduate and incoming Princeton assistant professor. It received an honorable mention for the best student paper award at the AAAI/ACM Conference on AI, Ethics, and Society in 2023.

Peter Henderson is finishing his PhD in computer science at Stanford, received his JD from Stanford Law School, and is an incoming assistant professor at Princeton University with appointments in the Department of Computer Science and School of Public and International Affairs. 

Authors
Peter Henderson
