DeepSeek Database Exposes Sensitive User Data, Chat Histories to Public
Chinese AI startup DeepSeek left a database containing user chat histories, API authentication keys, and system logs exposed to anyone, sparking security concerns.
Sophia Steele
OpenAI, the company behind ChatGPT, has been accused of training its AI on copyrighted content without permission. A new paper by the AI Disclosures Project, a nonprofit organization, claims that OpenAI's GPT-4o model was trained on paywalled books from the publisher O'Reilly Media without a licensing agreement.
The paper, co-authored by media mogul Tim O'Reilly, economist Ilan Strauss, and AI researcher Sruly Rosenblat, used a method called DE-COP (detection of copyrighted content in language models' training data) to look for copyrighted content in OpenAI's models. DE-COP is a type of "membership inference attack": it tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model might have prior knowledge of the text from its training data.
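To make the idea concrete, here is a minimal Python sketch of the multiple-choice test described above. It is an illustration under stated assumptions, not the authors' code: the names `build_quiz`, `guess_rate`, and `make_memorizing_chooser` are hypothetical, the sample sentences are invented, and the "model" is a stand-in function rather than a real API call.

```python
import random
from typing import Callable, Sequence

# Hypothetical type: a function that sees the shuffled options and returns
# the index of the one it believes is the verbatim, human-authored text.
Chooser = Callable[[Sequence[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> tuple[list[str], int]:
    """Shuffle the original in among its paraphrases.

    Returns the shuffled options and the position of the original.
    """
    candidates = [original, *paraphrases]
    order = list(range(len(candidates)))
    rng.shuffle(order)
    options = [candidates[i] for i in order]
    return options, order.index(0)  # candidate 0 is the original


def guess_rate(items: Sequence[tuple[str, Sequence[str]]],
               choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which the chooser picks the original."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            correct += 1
    return correct / len(items)


def make_memorizing_chooser(seen: set[str]) -> Chooser:
    """Stand-in for a model: recognizes only texts it has 'seen' before."""
    def choose(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no recognition: fall back to the first option
    return choose


# Invented sample data: (original excerpt, paraphrased versions).
ITEMS = [
    ("Indexes speed up reads at the cost of slower writes.",
     ["Reads get faster with indexes, but writes slow down.",
      "An index trades write speed for quicker lookups.",
      "Using indexes makes queries quick and inserts slower."]),
    ("A closure captures variables from its enclosing scope.",
     ["Closures hold on to variables from the surrounding scope.",
      "Variables in the outer scope are captured by a closure.",
      "The enclosing scope's variables are kept by the closure."]),
]

if __name__ == "__main__":
    seen_everything = make_memorizing_chooser({o for o, _ in ITEMS})
    seen_nothing = make_memorizing_chooser(set())
    print("seen in training:", guess_rate(ITEMS, seen_everything))
    print("never seen:", guess_rate(ITEMS, seen_nothing))
```

With four options per quiz, a chooser with no prior exposure should land near a 25% guess rate over many excerpts, so a rate well above that is the signal the method looks for.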
The study found that GPT-4o, the default model in ChatGPT, demonstrated stronger recognition of paywalled O'Reilly book content than OpenAI's earlier model GPT-3.5 Turbo. The authors probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates, using 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
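One common way to summarize a before-and-after-cutoff comparison like this is the area under the ROC curve (AUROC): the probability that a randomly chosen excerpt from a book published before the cutoff scores higher than a randomly chosen excerpt from a book published after it. The sketch below assumes that framing; this article does not state which statistic the paper reports, the function name `auroc` is hypothetical, and the scores are invented for illustration.

```python
from typing import Sequence


def auroc(pre_cutoff: Sequence[float], post_cutoff: Sequence[float]) -> float:
    """Probability that a pre-cutoff score beats a post-cutoff score.

    Ties count as half a win. A value of 0.5 means the scores do not
    separate the two groups at all; 1.0 means perfect separation.
    """
    if not pre_cutoff or not post_cutoff:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in pre_cutoff:
        for q in post_cutoff:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pre_cutoff) * len(post_cutoff))


# Invented per-excerpt recognition scores (e.g. quiz guess rates).
pre_scores = [0.71, 0.64, 0.80, 0.55]   # books the model could have seen
post_scores = [0.27, 0.31, 0.58, 0.22]  # books published after the cutoff

if __name__ == "__main__":
    print("AUROC:", auroc(pre_scores, post_scores))
```

Books published after the cutoff serve as the control group here: the model cannot have trained on them, so their scores show what recognition looks like without prior exposure.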
The results of the paper suggest that GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo, even after accounting for potential confounding factors. The authors conclude that GPT-4o likely recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date.
While the authors acknowledge that their experimental method isn't foolproof, and that OpenAI might have collected the paywalled book excerpts from users copying and pasting them into ChatGPT, the findings raise concerns over OpenAI's training data practices. The company is currently battling several lawsuits in U.S. courts over those practices and its treatment of copyright law.
OpenAI has advocated for looser restrictions around developing models using copyrighted data, and has gone so far as to hire journalists to help fine-tune its models' outputs. However, the company also pays for at least some of its training data, and offers opt-out mechanisms that allow copyright owners to flag content they'd prefer the company not use for training purposes.
The implications of the paper's findings are significant, as they raise questions about the ownership and use of copyrighted content in AI model training. As AI companies continue to push the boundaries of what is possible with language models, it is essential that they do so in a way that respects the intellectual property rights of creators and publishers.
OpenAI did not respond to a request for comment on the paper's findings. The company's silence on the matter only adds to the growing concerns over its training data practices and its commitment to respecting copyright law.
In conclusion, the paper's allegations against OpenAI are serious and warrant further investigation. As the AI industry continues to evolve, it is essential that companies prioritize transparency, accountability, and respect for intellectual property rights. The future of AI development depends on it.
Copyright © 2024 Starfolk. All rights reserved.