Meta CEO Mark Zuckerberg has sparked controversy by defending his company's use of a dataset containing copyrighted e-books to train its AI models, comparing the situation to YouTube, which hosts some pirated content but works to remove it. The defense appears in newly released snippets of a deposition Zuckerberg gave late last year in the AI copyright case Kadrey v. Meta.
The case is one of many in which authors and IP holders have sued AI companies, arguing that training on copyrighted content is not "fair use." In his deposition, Zuckerberg argued that YouTube, despite hosting some pirated content, tries to take it down and has licenses for the majority of its content. He implied that Meta's use of the dataset, known as LibGen, is comparable.
LibGen, a self-described "links aggregator," provides access to copyrighted works from major publishers, including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. The platform has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. Despite this, Meta allegedly used LibGen to train its Llama AI models, which compete with flagship models from other AI companies like OpenAI.
Court filings unsealed this week allege that Zuckerberg cleared the use of LibGen despite concerns from Meta's AI executives and research teams about the legal implications. The plaintiffs, including bestselling authors Sarah Silverman and Ta-Nehisi Coates, quoted Meta employees referring to LibGen as a "data set we know to be pirated" and flagging that its use "may undermine [Meta's] negotiating position with regulators."
During his deposition, Zuckerberg claimed he "hadn't really heard of" LibGen when questioned about it. He argued that a blanket prohibition on using a dataset like LibGen would be unreasonable, citing YouTube as an example of a platform that remains in use even though it hosts some pirated content. However, he did acknowledge that Meta should be "pretty careful about" training on copyrighted material, especially if it's intentionally provided to violate people's rights.
New allegations have emerged in the amended complaint filed by the plaintiffs, including that Meta cross-referenced pirated books in LibGen with copyrighted books available for license to determine whether to pursue a licensing agreement with a publisher. The complaint also alleges that Meta used LibGen to train its latest Llama 3 models and is using the dataset to train its next-gen Llama 4 models.
Furthermore, the amended filing claims that Meta researchers tried to hide the fact that Llama models were trained on copyrighted materials by inserting "supervised samples" into Llama's fine-tuning. Additionally, Meta allegedly downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024. Z-Library has been the subject of multiple legal actions, including domain seizures and takedowns, and its alleged maintainers were charged with copyright infringement, wire fraud, and money laundering in 2022.
The implications of this case are far-reaching, as it highlights the ongoing battle between AI companies and IP holders over the use of copyrighted content in AI training. The outcome of this lawsuit could have significant effects on the development of AI models and the way companies approach data collection and usage.
As the AI industry continues to evolve, the questions surrounding copyright and fair use will need to be resolved. The Kadrey v. Meta case is a reminder that AI development must be balanced against the protection of intellectual property rights, and the tech community will be watching closely as it unfolds.