Can Foundation Models Be Safe When Adversaries Can Customize Them?

Date: November 02, 2023
Topics: Machine Learning
Image credit: iStock/Andrii Yalanskyi

Researchers show that ChatGPT can be jailbroken for as little as 20 cents, and they are working to make this harder with “self-destructing models.”

Foundation models, particularly large language models, can be repurposed for harmful uses in various ways. Imagine that the most capable language models are being used at scale by an extremist organization for recruitment or radicalization. Or perhaps language models are tied to a system that continuously probes vulnerable servers, creating new malware on the fly to adjust to different environments autonomously. So what can model creators do to prevent their models from being repurposed for harmful uses?

Most companies have focused on “aligning” their foundation models so that tools like ChatGPT do not respond to users in harmful ways. Ask ChatGPT anything harmful and it will likely respond with something like, “I’m sorry, but as an AI language model I cannot do that.” When users are given privileged access to modify the model, however, these safeguards can potentially be stripped away. We demonstrated this in a recent paper, a collaboration among researchers at Stanford University, Princeton University, Virginia Tech, and IBM, titled “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” As companies increasingly allow users to customize powerful models, either by providing open access to model parameters (as with Llama and Mistral) or through fine-tuning APIs (as with ChatGPT or Claude), our results highlight how difficult it is to ensure continued model safety throughout the model lifecycle.

In our experiments, we tested two major language models, OpenAI’s gpt-3.5-turbo and Meta’s Llama-2-Chat, to see how easily their safety measures could be removed via fine-tuning. Surprisingly, with just 10 harmful data points, we could override many of the built-in safety features, at a cost of a mere $0.20 on the OpenAI fine-tuning API. The “jailbroken” model would then respond to nearly any harmful request, from detailed instructions on bomb-making to coding malware. We also discovered that even benign, frequently used datasets, like the popular Alpaca dataset, could inadvertently reduce the safety of these models.
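
For context on the workflow involved, the following is a minimal, hedged sketch of how a fine-tuning job is submitted through OpenAI’s public Python SDK (v1.x). The dataset file name and its contents are placeholders; this is not the paper’s script or data.

```python
# Minimal sketch (not the paper's code) of launching a fine-tuning job with the
# OpenAI Python SDK. "examples.jsonl" is a hypothetical placeholder file of
# chat-format training examples:
#   {"messages": [{"role": "user", "content": "..."},
#                 {"role": "assistant", "content": "..."}]}
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the (small) training set; the paper used as few as 10 examples.
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on gpt-3.5-turbo; billing scales with tokens processed.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)  # poll the job until it reports "succeeded"
```

Once the job finishes, the returned fine-tuned model name can be used in chat requests in place of the base model, which is why a handful of examples and a few cents are enough to change the model’s behavior.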

Our findings underscore that fine-tuning these models can unintentionally compromise their safety, regardless of how the model parameters are accessed or what the user intends. Users customizing models via fine-tuning should be aware that they might accidentally reduce the safety of the underlying model and should take additional safety measures.

Importantly, we found that the closed-access gpt-3.5-turbo model and the open-access Llama-2-Chat model both had their safeguards removed with similar cost and effort. This adds nuance to recent debates on the safety of open versus closed models. Rather than treating the debate as binary, we suggest that policymakers and researchers consider the deployment mechanism: a closed-access model with a fine-tuning API has a risk profile closer to that of an open-source model.

Self-destructing Models

Overall, however, it is important for model creators to identify mitigation strategies that increase the cost of repurposing models for harm, even when adversaries have privileged access to model parameters. In another collaboration among Stanford researchers, we describe one initial direction for raising that cost: the self-destructing model. As the name implies, the goal is to make models extremely difficult to tamper with or fine-tune for harmful uses. In this preprint, we successfully optimized a model in a constrained setting so that it worked well for a target task, but when an adversary tried to repurpose it for a harmful task, it behaved much like a randomly initialized model, providing little benefit to the adversary compared with starting from scratch. Though this is a nascent research area, we demonstrated that meta-learning may prove to be a useful mechanism for creating self-destructing models.
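
To make the meta-learning mechanism concrete, here is a hedged PyTorch sketch of one way such an objective can be written: an inner loop simulates an adversary fine-tuning on a blocked task, and the outer loss rewards the desired task while penalizing the adapted model’s gains on the blocked one. This illustrates MAML-style blocking under those assumptions only; it is not the paper’s method or code, and `self_destruct_objective`, the batch variables, and the hyperparameters are hypothetical.

```python
# Hedged sketch of a MAML-style "blocking" objective; NOT the authors'
# implementation. All names and hyperparameters below are placeholders.
import torch
from torch.func import functional_call

def self_destruct_objective(model, desired_batch, blocked_batch, loss_fn,
                            inner_lr=1e-2, inner_steps=2, block_weight=1.0):
    params = dict(model.named_parameters())
    xb, yb = blocked_batch   # data for the harmful ("blocked") task
    xd, yd = desired_batch   # data for the intended target task

    # Inner loop: simulate an adversary fine-tuning on the blocked task for a
    # few gradient steps, keeping the graph so the outer loss can
    # differentiate through the adaptation.
    adapted = params
    for _ in range(inner_steps):
        inner_loss = loss_fn(functional_call(model, adapted, (xb,)), yb)
        grads = torch.autograd.grad(inner_loss, tuple(adapted.values()),
                                    create_graph=True)
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(adapted.items(), grads)}

    # Outer objective: stay good at the desired task while making the
    # adversary's adapted copy perform poorly on the blocked task.
    desired_loss = loss_fn(functional_call(model, params, (xd,)), yd)
    blocked_loss_after = loss_fn(functional_call(model, adapted, (xb,)), yb)
    return desired_loss - block_weight * blocked_loss_after
```

An outer optimizer would then backpropagate this objective into the shared initialization; in practice the negative blocking term would need to be bounded or scheduled to keep training stable, and the actual paper should be consulted for the formulation it uses.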

There is significantly more to do to make models harmless, especially when adversaries are given privileged access to them, but this is an exciting direction, and our recent work lays the groundwork for this emerging research area.

“Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” was written by Xiangyu Qi, PhD student at Princeton University; Yi Zeng, PhD student at Virginia Tech; Tinghao Xie, PhD student at Princeton; Pin-Yu Chen, principal research scientist at the MIT-IBM Watson AI Lab; Ruoxi Jia, assistant professor at Virginia Tech; Prateek Mittal, professor at Princeton; and Peter Henderson, Stanford PhD/JD graduate and incoming Princeton assistant professor.

“Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” was written by Eric Mitchell, PhD candidate at Stanford; Christopher Manning, professor at Stanford and Stanford HAI associate director; Dan Jurafsky, professor at Stanford; Chelsea Finn, assistant professor at Stanford; and Peter Henderson, Stanford PhD/JD graduate and incoming Princeton assistant professor. It received an honorable mention for the best student paper award at the AAAI/ACM Conference on AI, Ethics, and Society in 2023.

Peter Henderson is finishing his PhD in computer science at Stanford, received his JD from Stanford Law School, and is an incoming assistant professor at Princeton University with appointments in the Department of Computer Science and School of Public and International Affairs. 
