Meta CEO Mark Zuckerberg Accused of Approving Use of Pirated eBooks and Articles for AI Training

In a significant development in the ongoing copyright lawsuit against Meta, newly unredacted documents filed with the U.S. District Court for the Northern District of California allege that Meta CEO Mark Zuckerberg approved the use of a dataset of pirated eBooks and articles for training the company's Llama AI models. The allegations, made by plaintiffs in the Kadrey v. Meta case, suggest that Zuckerberg gave the go-ahead despite concerns within Meta's AI executive team and others at the company.

The dataset in question, known as LibGen, is a links aggregator that provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. According to the filing, Meta employees referred to LibGen as a "data set we know to be pirated" and flagged that its use "may undermine [Meta's] negotiating position with regulators."

According to the filing, Zuckerberg cleared the use of LibGen to train at least one of Meta's Llama models, despite the concerns raised by Meta's AI team. A memo to Meta AI decision-makers notes that after "escalation to MZ," the AI team was approved to use LibGen.

The allegations are reminiscent of a report by The New York Times last April, which suggested that Meta cut corners to gather data for its AI. The company was reportedly hiring contractors in Africa to aggregate summaries of books and considering buying the publisher Simon & Schuster. However, Meta's executives determined that it would take too long to negotiate licenses and reasoned that fair use was a solid defense.

The latest filing contains new accusations, including that Meta might have tried to conceal its alleged infringement by stripping the LibGen data of attribution. According to plaintiffs' counsel, Meta engineer Nikolay Bashlykov wrote a script to remove copyright information, including the words "copyright" and "acknowledgments," from eBooks in LibGen. Separately, Meta allegedly stripped copyright markers from science journal articles and "source metadata" in the training data it used for Llama.

The filing suggests that the stripping was not done only for training purposes, but also to conceal the alleged copyright infringement. By stripping copyrighted works of attribution, the plaintiffs argue, Meta prevented Llama from outputting copyright information that might alert users and the public to the infringement.

Furthermore, the filing reveals that Meta torrented LibGen, a move that gave some Meta research engineers pause. Torrenting requires that torrenters simultaneously "seed," or upload, the files they're trying to obtain. Plaintiffs' counsel alleges that Meta effectively engaged in another form of copyright infringement by torrenting LibGen and thus helping to spread its contents. Meta also tried to conceal its activities, counsel alleges, by minimizing the number of files it uploaded.

According to the filing, Meta's head of generative AI, Ahmad Al-Dahle, "cleared the path" for torrenting LibGen, brushing aside Bashlykov's reservations that doing so "could be legally not OK." The plaintiffs' counsel argues that Meta's decision to bypass lawful methods of acquiring books and become a knowing participant in an illegal torrenting network serves as proof of copyright infringement.

The case against Meta is far from decided, and the court may well decide in Meta's favor if it's persuaded by the company's fair use argument. However, the allegations don't reflect well on Meta, as Judge Thomas Hixson noted in an order on Wednesday rejecting Meta's request to redact large portions of the filing. "It is clear that Meta's sealing request is not designed to protect against the disclosure of sensitive business information that competitors could use to their advantage," Hixson wrote. "Rather, it is designed to avoid negative publicity."

We've reached out to Meta for comment and will update this piece if we hear back. The developments in the Kadrey v. Meta case highlight the ongoing concerns surrounding the use of copyrighted materials in AI training and the need for tech giants to ensure that they're respecting the intellectual property rights of creators.
