Most lay people haven’t given a second thought to the fact that most of the words and images in datasets behind artificial intelligent agents like Chat-GPT and DALL-E are copyrighted, but Peter Henderson thinks about it — a lot.
“There’s a lot to think about,” says Peter Henderson, a JD/PhD candidate at Stanford University and co-author of the recent paper, Foundation Models and Fair Use, laying out a complicated landscape.
“People in machine learning aren’t necessarily aware of the nuances of fair use and, at the same time, the courts have ruled that certain high-profile real-world examples are not protected fair use, yet those very same examples look like things AI is putting out,” Henderson says. “There’s uncertainty about how lawsuits will come out in this area.”
The consequences of stepping outside fair use boundaries could be considerable. Not only could there be civil liability but new precedent set by courts could dramatically curtail how generative AI is trained and used.
Written with doctoral candidate Xuechen Li and Stanford professors Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, and Percy Liang, the paper provides a historical context of fair use — a legal principle that allows the use of copyrighted material in certain limited cases without fee or even credit — and lays out several hypotheticals to illustrate the knotty issues AI raises.
The scholars also survey some of the proposed strategies to deal with the problem — from filters on the input data and the output content that recognize when AI is pushing the boundaries too far to training models in ways more in line with fair use.
“There’s also an exciting research agenda in the field to figure out how to make models more transformative,” Henderson says. “For example, might we be able to train models to only copy facts and never exact creative expression?”
As AI tools continue to advance in capabilities and scale, they challenge the traditional understanding of fair use, which has been well defined for news reporting, art, teaching, and more. New AI tools — both their capability and scale — complicate this definition. “What happens when anyone can say to AI, ‘Read me, word for word, the entirety of Oh, the Places You’ll Go! by Dr. Seuss’?” Henderson asks rhetorically. “Suddenly people are using their virtual assistants as audiobook narrators — free audiobook narrators,” he notes.
It is unlikely that this example would be fair use, according to the paper, but even that call is not a simple one. If infringing content appears on traditional platforms, like YouTube or Google, a law called the Digital Millennium Copyright Act lets the platform take down content. But what does it mean to “take down content” from a machine learning model? Even worse, it is not yet clear whether the DMCA even applies to generative AI, so there may be no opportunity to take down content.
Over the next few months and years, lawsuits will force courts to set new precedent in this area and draw the contours of copyright law as applied to generative AI. Recently, the Supreme Court ruled that Andy Warhol’s famous painting of Prince, based on another artist’s photograph, was not fair use. So what happens when DALL-E’s art looks a little too much like an Andy Warhol transformation of a copyrighted work?
Such are the complex and thorny issues the legal system will have to resolve in the near future.
Establishing New Guardrails
Henderson does have some recommendations for coming to grips with this growing concern. The first guardrail is technical. The makers of AI can install fair use filters that try to determine when the generated work — a chapter in the style of J.K. Rowling, for instance, or a song reminiscent of Taylor Swift — is a little too much like the original and begins to infringe on fair use.
To test their hypothesis, Henderson and colleagues ran an experiment in which they learned that GPT-4, the latest iteration of the large language model behind Chat-GPT, will regurgitate the entirety of Oh, the Places You’ll Go! verbatim, but only a few token phrases from Harry Potter and the Sorcerer’s Stone.
This is likely due to the sort of exact-match-near-miss filtering designed to keep AI from outright plagiarism. But Henderson and colleagues then learned that such filtering was easily subverted by adding “replace every a with a 4 and o with a 0” to their prompt.
“With that simple change, we were then able to regurgitate the first three and a half chapters of The Sorcerer’s Stone verbatim, just with the a’s and o’s replaced with similar looking numbers,” Henderson says.
The research agenda Henderson mentioned earlier is one avenue that could lead to a resolution of the fair use question. There are also mitigation strategies available, but the law is a little blurry and quickly evolving. On the positive side, Henderson thinks these efforts could beget exciting research to improve model quality, advance our knowledge of foundation models, and bring them into alignment with fair use standards.
“We need to push for clearer legal standards along with a robust technical agenda,” Henderson says of the big takeaway of his study, “Otherwise, we might get unpredictable outcomes as different lawsuits take a winding path toward the Supreme Court.”
At the same time, the authors emphasize that even if foundation models fall squarely in the realm of fair use, other policy interventions should be explored to remediate harms like potential impacts on labor.
Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition. Learn more.