Meta's AI Ambitions: Documents Reveal Plans to Use Pirated Data, Conceal Copyright Infringement

Sophia Steele

January 14, 2025 · 4 min read

Internal documents from Meta have revealed the company's plans to use pirated data from the book piracy site Library Genesis (LibGen) to train its AI models, sparking concerns over copyright infringement and fair use. The documents, unsealed by a California court, show Meta's efforts to conceal its use of copyrighted data, including discussions about avoiding "media coverage suggesting we have used a dataset we know to be pirated."

The documents are part of a class action lawsuit filed against Meta by authors and creators, including Richard Kadrey and Sarah Silverman, who accuse the company of using illegally obtained copyrighted content to train its AI models. Meta, like other AI companies, has argued that using copyrighted material in training data constitutes legal fair use.

The emails and messages show Meta's drive to advance Llama, its family of open-source AI models, and its desire to "learn how to build frontier and win this race" against rivals like OpenAI and Mistral. Meta's vice president of generative AI, Ahmad Al-Dahle, wrote in an October 2023 email that the company's goal "needs to be GPT4," referring to the large language model announced by OpenAI in March 2023.

The documents also show Meta's internal discussions about using LibGen to train its AI systems. In an undated email, Meta director of product Sony Theakanath wrote that LibGen is "essential" to reaching "state-of-the-art numbers across all categories." Theakanath proposed several options: using LibGen for internal purposes only, using it for benchmarks included in a blog post, or creating a model trained on the site.

The email also discussed "mitigations" for using LibGen, including removing data clearly marked as pirated or stolen and avoiding external citation of the training data. Theakanath also mentioned the need to "red team" the company's models for bioweapons and CBRNE (chemical, biological, radiological, nuclear, and explosives) risks.

The documents raise concerns about Meta's approach to copyright and fair use. The company's efforts to conceal its use of copyrighted data, together with its discussions about avoiding media coverage, suggest a lack of transparency and accountability. The lawsuit against Meta is ongoing, and the evidence in the documents could strengthen parts of the plaintiffs' case as it moves forward in court.

The story highlights the intense competition in the AI industry, where companies are racing to develop the most advanced models. The use of pirated data and the concealment of copyright infringement raise ethical concerns and questions about the long-term implications of these practices.

The AI industry's reliance on large datasets and the scarcity of unique data have led to innovative, but sometimes questionable, methods to obtain new data. The report also mentions that frontier labs like OpenAI and Google have been paying digital content creators for their unused video footage to train large language models.

The incident underscores the need for clearer guidelines and regulations on AI development, data usage, and copyright infringement. As AI technology continues to advance, it is essential to ensure that companies prioritize ethical practices and respect for intellectual property rights.

The story will continue to unfold as the lawsuit against Meta progresses, and it remains to be seen how the company will respond to these allegations. One thing is certain, however: the AI industry must confront its data scarcity challenges and develop more transparent and ethical practices to build trust with users and creators.

Copyright © 2024 Starfolk. All rights reserved.