Foundation Models and Copyright Questions

This brief warns that fair use may not fully shield U.S. foundation models trained on copyrighted data and calls for combined legal and technical safeguards to protect creators.
Key Takeaways
Foundation models—AI models trained on broad data at scale for a wide range of tasks—are often trained on large volumes of copyrighted material. Deploying these models can pose legal and ethical risks related to copyright.
Our review of U.S. fair use doctrine concludes that fair use is not guaranteed for foundation models, as they can generate content that is not sufficiently “transformative” relative to the copyrighted material. However, amid still-evolving case law, the extent of copyright infringement risk and the potency of a fair use defense remain uncertain.
To mitigate copyright risks, policymakers should consider making clarifications to fair use doctrine as it applies to AI training data while also encouraging good-faith technical mitigation strategies that align foundation models with fair use standards. Together, these strategies can maximize the benefits of foundation models while minimizing the moral, ethical, and legal harms of copyright violations.
In parallel, policymakers should investigate other policy mechanisms to ensure that artists, authors, and creators receive fair compensation and credit, both those who create with the assistance of AI tools and those who do not use AI.
Executive Summary
Foundation models are often trained on large volumes of copyrighted material, including text on websites, images posted online, research papers, books, articles, and more. Deploying these models can pose legal and ethical risks. Under U.S. law, copyright for a piece of creative work is assigned “the moment it is created and fixed in a tangible form that is perceptible either directly or with the aid of a machine or device.” Most data used to train foundation models falls under this definition. For example, the Pile, a massive open-source language modeling dataset that has been used by Meta, Bloomberg, and others to train foundation models, contains a dataset of copyrighted, torrented e-books called Books3 that has become the focus of various ongoing lawsuits.
In the United States, AI researchers have long relied on the fair use doctrine to avoid copyright issues with training data. The fair use doctrine allows members of the public to use copyrighted materials in certain instances, notably when the use is “transformative.” However, existing fair use interpretations are increasingly being challenged: a class-action lawsuit has been filed against Microsoft, GitHub, and OpenAI for training systems on publicly published code without adequate credit; Getty Images is suing Stability AI, the maker of Stable Diffusion, for scraping its photos; and other significant AI-related legal actions are underway.
In our paper “Foundation Models and Fair Use,” we shed light on the urgency and uncertainty surrounding the copyright implications of foundation models. First, we review relevant aspects of U.S. case law on fair use to identify the potential risks of foundation models developed using copyrighted content. We highlight that fair use is not guaranteed and that the risk of copyright infringement is real, though its exact extent remains uncertain. Second, we discuss four technical strategies to help reduce the risk of potential copyright violations, while underscoring the need to develop more techniques that ensure foundation models behave in ways aligned with fair use.
We argue that the United States needs a two-pronged approach to addressing these copyright issues—a mix of legal and technical mitigations that would allow us to harness the positive impact of foundation models while reducing intellectual property harms to creators. Fair use is not a panacea. Machine learning researchers, lawmakers, and other stakeholders need to understand both U.S. copyright law and the technical mitigation measures that can help navigate the copyright questions of foundation models going forward.