Claude at the Helm: Anthropic Found Guilty of Retaining Pirated Books, but Cleared on AI Training

TLDR: Anthropic, which used copyrighted works to train its AI model Claude, received a split ruling from a federal court in San Francisco: training on legally acquired books was deemed fair use, but retaining digitized copies of pirated books could make the company liable for copyright infringement. This influential decision could affect other ongoing disputes in the AI sector.

Last week, Judge William Alsup of the federal district court in San Francisco delivered a highly anticipated ruling in the case brought by three authors, Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, against Anthropic, the California unicorn behind the AI Claude. The order, which turns on the fair use doctrine, marks a turning point in the debate over the use of copyrighted works to train AI models.
Between 2021 and 2023, Anthropic downloaded more than 7 million pirated books from sites such as Books3, LibGen, and PiLiMi. After recognizing the legal risks of relying on pirated copies, the company legally purchased hundreds of thousands of these books starting in spring 2024, scanned them after removing their bindings, headers, and footers, and then destroyed the physical copies. The resulting files were kept in its internal library even after the company decided that some would not be used, or would no longer be used, to train its Claude models.
Bartz's novels, Graeber's essays, and Johnson's narratives appear among both the pirated books and the legally purchased ones, often bought second-hand. The three authors filed this class-action lawsuit against Anthropic for copyright infringement, alleging their works were used without consent or financial compensation.
Without resolving every question raised by the case, Judge Alsup clarified two essential points. On the one hand, he found that Anthropic's use of legally acquired, digitized books in its training corpus constituted fair use under American law. The judge compared the process to an author or researcher drawing on their reading to produce original work, emphasizing the transformative nature of the use. In his view, the authors' complaint "is no different than if they complained that training schoolchildren to write well would lead to an explosion of competing works".
On the other hand, he clearly distinguished this lawful use from the retention of the digitized copies. In his view, building an internal library from stolen books cannot be excused by a right to innovate or to conduct research. That part of the dispute goes to trial in December, where Anthropic could be held liable for willful copyright infringement.

The company could then face a class action of another scale if the judge approves the inclusion of thousands of authors in the case. If the class is certified, Anthropic could be ordered to pay each of them up to $150,000 per work...

This historic decision, if not overturned on a possible appeal, could set a precedent and influence other ongoing disputes in the AI sector.

To better understand

What is the concept of 'fair use' in US law and how does it apply to AI models?

Fair use is a doctrine in US copyright law that allows limited use of copyrighted works without permission, for purposes such as research or criticism. In the AI context, it has been invoked to justify training models on digitized works, provided the use is transformative.

What is the difference between fair use of digitized works and creating a digital library with pirated books?

Fair use can cover the use of digitized works for innovation, provided the use adds transformative value. Building a library from pirated books, however, remains unlawful because it violates copyright, regardless of any research or innovation purpose.