Training artificial intelligence (AI) models on data generated by other AI models may sound far-fetched, but the idea is gaining traction across the tech industry. With real-world data growing increasingly scarce, companies like Anthropic, Meta, and OpenAI are turning to synthetic data to train their models. But can AI-generated data really replace human-annotated data, and what are the implications of this shift?
At the heart of the issue is the importance of annotations in AI training. Annotations, which are labels that provide context to the data, serve as guideposts for AI models to learn patterns and make predictions. The market for annotation services has ballooned to an estimated $838.2 million, with millions of people employed to create labels for AI training sets. However, the process is not only time-consuming and expensive but also prone to human biases and errors.
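To make the role of annotations concrete, here is a minimal sketch in Python showing the kind of human-labeled examples a sentiment classifier might be trained on. The label names and example texts are purely illustrative, not drawn from any real dataset.

```python
# A tiny, hand-annotated sentiment dataset: each example pairs raw text
# (the data) with a human-supplied label (the annotation). Models learn
# the mapping from text to label from many such pairs.
annotated_examples = [
    {"text": "The battery lasts all day, fantastic.", "label": "positive"},
    {"text": "Screen cracked within a week.", "label": "negative"},
    {"text": "It arrived on Tuesday.", "label": "neutral"},
]

# Human annotation is the bottleneck: every label above took a person's
# time and judgment, and annotator disagreement introduces noise.
for example in annotated_examples:
    print(f"{example['label']:>8}: {example['text']}")
```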
Enter synthetic data, which promises to solve these problems. By generating data artificially, companies can create an unlimited supply of training data without the need for human annotation. Writer, an enterprise-focused generative AI company, has already debuted a model trained almost entirely on synthetic data, claiming it cost just $700,000 to develop. Microsoft, Google, and Nvidia are also exploring the use of synthetic data in their models.
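To see how synthetic generation sidesteps annotation, consider a toy sketch. Here a simple template sampler stands in for what would, in practice, be a call to a large generative "teacher" model; the labels, templates, and product names are illustrative assumptions, not details from any company's pipeline.

```python
import random

random.seed(0)

# Hypothetical stand-in for a call to a large "teacher" model; a real
# pipeline would query a generative model here, not a template sampler.
def generate_text(label: str) -> str:
    templates = {
        "positive": ["Absolutely love the {}.", "The {} exceeded my expectations."],
        "negative": ["The {} broke within a week.", "Deeply disappointed by the {}."],
    }
    product = random.choice(["camera", "keyboard", "headphones"])
    return random.choice(templates[label]).format(product)

# Synthetic annotation inverts the usual pipeline: choose the label first,
# then generate text to match it, so no human labeler is involved.
synthetic_dataset = [
    {"text": generate_text(label), "label": label}
    for label in ("positive", "negative")
    for _ in range(3)
]

for example in synthetic_dataset:
    print(f"{example['label']:>8}: {example['text']}")
```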
However, experts warn that synthetic data is not a panacea. The "garbage in, garbage out" problem still applies: biases and limitations in the data used to generate synthetic examples carry through to the output. Moreover, over-reliance on synthetic data can produce models whose quality or diversity progressively degrades, and sampling bias, where the synthetic data fails to represent the real world's full variety, compounds that loss of diversity over successive generations of training.
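This degradation, often called model collapse in the research literature, can be illustrated with a toy simulation (a standard illustration, not drawn from any of the systems mentioned above): repeatedly fit a simple distribution to data, then replace the data with samples from the fit, just as a model trained on a previous model's output would be.

```python
import numpy as np

rng = np.random.default_rng(42)

# Start with "real" data: 50 samples from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

# Each generation, fit a Gaussian to the current data (the "model"), then
# replace the data with samples from that fit: every round trains on the
# previous model's output, with no fresh real-world data mixed in.
for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: fitted std = {sigma:.3f}")

# The fitted standard deviation tends to drift toward zero: rare values
# in the tails get sampled less and less often, and diversity erodes.
```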
Os Keyes, a PhD candidate at the University of Washington, notes that complex models like OpenAI's o1 can produce harder-to-spot hallucinations in their synthetic data, which can reduce the accuracy of models trained on the data. "Complex models hallucinate; data produced by complex models contain hallucinations," Keyes said. "And with a model like o1, the developers themselves can't necessarily explain why artefacts appear."
Luca Soldaini, a senior research scientist at the Allen Institute for AI, emphasizes the need for thorough review, curation, and filtering of synthetic data to avoid training forgetful chatbots and homogeneous image generators. "Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points," Soldaini said.
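What such a safeguard might look like is sketched below. The specific filters and the repetition-based quality score are hypothetical stand-ins for the heavier classifiers and human review a real curation pipeline would use.

```python
from typing import Callable

# A minimal sketch of the kind of safeguard Soldaini describes: pass
# synthetic examples through cheap filters before they reach training.
def curate(examples: list[str], quality_score: Callable[[str], float],
           min_score: float = 0.5, min_words: int = 4) -> list[str]:
    seen: set[str] = set()
    kept = []
    for text in examples:
        normalized = " ".join(text.lower().split())
        if normalized in seen:                     # drop exact duplicates
            continue
        seen.add(normalized)
        if len(normalized.split()) < min_words:    # drop degenerate outputs
            continue
        if quality_score(text) < min_score:        # drop low-quality points
            continue
        kept.append(text)
    return kept

# Illustrative scorer: penalize heavy word repetition, a common failure
# mode in machine-generated text.
def repetition_score(text: str) -> float:
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

raw = [
    "The model performs well on unseen inputs.",
    "The model performs well on unseen inputs.",  # duplicate
    "good good good good good",                   # degenerate repetition
    "ok",                                         # too short
]
print(curate(raw, repetition_score))
```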
While synthetic data may seem like a convenient answer to the data scarcity problem, it is clearly no silver bullet. For the foreseeable future, humans will still be needed in the loop to ensure that AI models are trained accurately and without bias. As OpenAI CEO Sam Altman once argued, AI may someday produce synthetic data good enough to effectively train itself, but that technology is still in its infancy.
The shift toward synthetic data in AI training is a complex issue that demands careful weighing of its benefits against its risks. As the tech industry continues to explore this frontier, transparency, accountability, and human oversight will be essential to ensuring that AI models are developed with integrity and accuracy.