Rethinking Privacy in the AI Era: Policy Provocations for a Data-Centric World

This white paper explores the current and future impact of privacy and data protection legislation on AI development and provides recommendations for mitigating privacy harms in the AI era.
Executive Summary
In this paper, we present a series of arguments and predictions about how existing and future privacy and data protection regulation will impact the development and deployment of AI systems.
Data is the foundation of all AI systems. Going forward, AI development will continue to increase developers’ hunger for training data, fueling an even greater race for data acquisition than we have already seen in past decades.
Largely unrestrained data collection poses unique risks to privacy that extend beyond the individual level: these risks aggregate into societal-level harms that cannot be addressed through the exercise of individual data rights alone.
While existing and proposed privacy legislation, grounded in the globally accepted Fair Information Practices (FIPs), implicitly regulates AI development, it is not sufficient to address the data acquisition race or the resulting individual and systemic privacy harms.
Even legislation that contains explicit provisions on algorithmic decision-making and other forms of AI does not provide the data governance measures needed to meaningfully regulate the data used in AI systems.
We present three suggestions for how to mitigate the risks to data privacy posed by the development and adoption of AI:
Denormalize data collection by default by shifting away from opt-out to opt-in data collection. Data collectors must facilitate true data minimization through “privacy by default” strategies and adopt technical standards and infrastructure for meaningful consent mechanisms.
Focus on the AI data supply chain to improve privacy and data protection. Ensuring dataset transparency and accountability across the entire life cycle must be a focus of any regulatory system that addresses data privacy.
Flip the script on the creation and management of personal data. Policymakers should support the development of new governance mechanisms and technical infrastructure (e.g., data intermediaries and data permissioning infrastructure) to support and automate the exercise of individual data rights and preferences.
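To make the third suggestion more concrete, the sketch below illustrates, purely hypothetically, what a machine-readable data permission record could look like, one that a data intermediary or permissioning layer might evaluate automatically before personal data is used for a given purpose such as model training. The field names, purpose labels, and the is_use_permitted check are illustrative assumptions, not an existing legal or technical standard.

```python
# Hypothetical sketch of a machine-readable data permission record that a data
# intermediary could evaluate before personal data is used for a given purpose.
# Field names and purpose categories are illustrative, not an existing standard.
from dataclasses import dataclass, field


@dataclass
class DataPermission:
    subject_id: str                                             # pseudonymous identifier for the data subject
    permitted_purposes: set[str] = field(default_factory=set)   # e.g., {"service_delivery"}
    opt_in: bool = False                                        # use is denied unless explicitly granted
    revoked: bool = False                                       # subjects can withdraw consent at any time


def is_use_permitted(record: DataPermission, purpose: str) -> bool:
    """Return True only if the subject opted in, has not revoked consent,
    and explicitly granted this purpose ("privacy by default")."""
    return record.opt_in and not record.revoked and purpose in record.permitted_purposes


if __name__ == "__main__":
    record = DataPermission(subject_id="u-123",
                            permitted_purposes={"service_delivery"},
                            opt_in=True)
    print(is_use_permitted(record, "model_training"))    # False: training was never opted into
    print(is_use_permitted(record, "service_delivery"))  # True: explicitly granted purpose
```

In such a design, the absence of a record or of an explicit grant defaults to "no", which is the opt-in posture the first suggestion calls for.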
Introduction
Foundation models (e.g., GPT-4, Llama 2) are at the epicenter of AI, driving technological innovation and billions of dollars in investment. This paradigm shift has sparked widespread demands for regulation. Animated by factors as diverse as declining transparency, unsafe labor practices, limited protections for copyright and creative work, market concentration, and productivity gains, many have called for policymakers to take action.
Central to the debate about how to regulate foundation models is the process by which they are released. Some foundation models, like Google DeepMind’s Flamingo, are fully closed, meaning they are available only to the model developer; others, such as OpenAI’s GPT-4, are limited access, available to the public but only as a black box; and still others, such as Meta’s Llama 2, are more open, with widely available model weights that enable downstream modification and scrutiny. As of August 2023, the U.K.’s Competition and Markets Authority documents, based on data from Stanford’s Ecosystem Graphs, that open release is the most common release approach for publicly disclosed models. Developers like Meta, Stability AI, Hugging Face, Mistral, Together AI, and EleutherAI frequently release models openly.
Governments around the world are issuing policy related to foundation models. As part of these efforts, open foundation models have garnered significant attention: The recent U.S. Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence tasks the National Telecommunications and Information Administration with preparing a report on open foundation models for the president. In the EU, open foundation models trained with fewer than 10^25 floating point operations (a measure of the amount of compute expended) appear to be exempted under the recently negotiated AI Act. The U.K.’s AI Safety Institute will “consider open-source systems as well as those deployed with various forms of access controls” as part of its initial priorities. Beyond governments, the Partnership on AI has introduced guidelines for the safe deployment of foundation models, recommending against open release for the most capable foundation models.
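For intuition about the scale of the 10^25 FLOP threshold, a common rule of thumb from the scaling-laws literature approximates the training compute of a dense transformer as roughly 6 × (parameter count) × (training tokens). The sketch below applies that heuristic to an illustrative 70-billion-parameter model trained on 2 trillion tokens; the heuristic and the specific figures are assumptions for illustration, not the AI Act's methodology or any developer's reported numbers.

```python
# Rough heuristic (not the AI Act's methodology): training compute for a dense
# transformer is approximately 6 * N * D FLOPs, where N is the parameter count
# and D is the number of training tokens.
def approx_training_flops(num_params: float, num_tokens: float) -> float:
    return 6 * num_params * num_tokens


EU_AI_ACT_THRESHOLD = 1e25  # the 10^25 FLOP threshold referenced above

# Illustrative example: a 70B-parameter model trained on 2 trillion tokens.
flops = approx_training_flops(70e9, 2e12)
print(f"Estimated training compute: {flops:.1e} FLOPs")             # ~8.4e+23 FLOPs
print(f"Below the 10^25 threshold: {flops < EU_AI_ACT_THRESHOLD}")  # True
```

Under this rough estimate, a model of that scale would fall one to two orders of magnitude below the threshold, which is why many of today's openly released models would not be captured by it.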
Policy on foundation models should support the open foundation model ecosystem, while providing resources to monitor risks and create safeguards to address harms. Open foundation models provide significant benefits to society by promoting competition, accelerating innovation, and distributing power. For example, small businesses hoping to build generative AI applications could choose among a variety of open foundation models that offer different capabilities and are often less expensive than closed alternatives. Further, open models are marked by greater transparency and, thereby, accountability. When a model is released with its training data, independent third parties can better assess the model’s capabilities and risks.
However, an emerging concern is whether open foundation models pose distinct risks to society. Unlike closed foundation model developers, open developers have limited ability to restrict the use of their models by malicious actors, who can easily remove safety guardrails. Recent studies claim that open foundation models make it easier to generate disinformation and spear-phishing emails and to assist in developing cyberweapons and bioweapons.
Correctly characterizing these distinct risks requires centering the marginal risk: To what extent do open foundation models increase risk relative to (a) closed foundation models or (b) pre-existing technologies like search engines? We find that, for many dimensions, the existing evidence about the marginal risk of open foundation models remains quite limited. In some instances, such as AI-generated child sexual abuse material (CSAM) and nonconsensual intimate imagery (NCII), harms stemming from open foundation models have been better documented. For these demonstrated harms, proposals to restrict the release of foundation models by licensing compute-intensive models are mismatched, because the text-to-image models used to cause these harms require far less compute to train than the models such licensing would target.
More broadly, several regulatory approaches under consideration are likely to have a disproportionate impact on open foundation models and their developers without meaningfully reducing risk. Even though these approaches do not differentiate between open and closed foundation model developers, they yield asymmetric compliance burdens. For example, legislation that holds developers liable for content generated using their models or derivatives of them would disproportionately harm open developers, since users can modify openly released models to generate illicit content. Policymakers should exercise caution to avoid unintended consequences and should ensure adequate consultation with open foundation model developers before taking action.