Data Privacy and Foundation Models: Can We Have Both?

This brief examines the privacy risks foundation models pose to individuals and society, and governance mechanisms needed to address them.
Key Takeaways
Foundation models pose unprecedented and largely unaddressed privacy risks that are broader and harder to address than those posed by traditional AI systems.
These risks emerge across the entire model life cycle — from the mass scraping of personally identifiable information during training, to the memorization and regurgitation of sensitive information in model outputs, to the intimate data that users unwittingly disclose through chatbot interfaces.
Foundation models are also vulnerable to adversarial attacks, including prompt injection, data poisoning, and model inversion, that can circumvent privacy safeguards and expose sensitive personal information.
Existing privacy frameworks, including the EU’s GDPR, are fundamentally incompatible with how foundation models are built, yet neither the EU nor the United States has enacted comprehensive rules that could meaningfully change developer behavior.
Without clear regulatory guardrails, the public remains largely dependent on developers to voluntarily protect their privacy. Policymakers must weigh a range of governance mechanisms that require removing personal data from the training data pipeline, increase model transparency, ensure the creation of systems that protect privacy by design, and constrain privacy-infringing model outputs.
Introduction
Imagine receiving a security alert from your bank: A fraudster cloned your voice and used it to bypass the bank’s digital security measures and empty your bank account. The tool they used? A generative AI model trained on publicly available data, cloning your voice with an old YouTube video you’d forgotten was online. Or consider prompting a chatbot to tell you what it knows about you, and it surfaces deeply personal information gleaned from pseudonymous posts you once made online.
These examples underscore the profound privacy challenges posed by foundation models — large-scale, general-purpose AI models that stand apart in their ability to impact society globally and at scale. These models are the literal foundation upon which large-scale AI is being integrated into countless consumer-sector digital products and services.
In this issue brief, we examine the risks to data privacy, from both individual and systemic perspectives, posed by the training and use of consumer-focused foundation models. Because foundation models are dependent on massive datasets for their development, they pose a broader set of privacy risks than smaller AI systems trained on proprietary or limited datasets. Foundation models may thwart data privacy not only by the use or misuse of the technology itself, but also by the process of building and training them. In addition, they are vulnerable to privacy risks from adversarial attacks. While the risks can be mitigated, without regulatory rules in place, the public is largely reliant on developers to do the right thing to protect the public’s privacy, which unfortunately is not always the case.
To reconcile data privacy with foundation models, policymakers should weigh a range of governance mechanisms that ensure the removal of personal data from the training data pipeline, require system architectures that prioritize privacy and data security protections by design, increase the transparency and interpretability of foundation models and their training data, and constrain the outputs of models. Policymakers must confront the many ethical and legal questions over access to and control of personal data as AI model adoption continues to grow.







