OpenAI has announced the release of new transcription and voice-generating AI models for its API, which the company claims offer significant improvements over its previous releases. These models are part of OpenAI's broader vision of building automated "agentic" systems that can independently accomplish tasks on behalf of users.
According to OpenAI Head of Product Olivier Godement, these models will enable the creation of more sophisticated chatbots that can interact with customers on behalf of businesses. Godement predicts that more such "agents" will emerge in the coming months, and OpenAI's goal is to provide developers with the tools to build accurate, available, and useful agents.
The new text-to-speech model, dubbed "gpt-4o-mini-tts," boasts more nuanced and realistic-sounding speech, as well as increased "steerability." This means developers can instruct the model to adopt specific voices, tones, and emotions, such as a "mad scientist" or a "serene mindfulness teacher." OpenAI has demonstrated the model's capabilities with samples of a "true crime-style" weathered voice and a female "professional" voice.
Jeff Harris, a member of the product staff at OpenAI, explained that the goal is to give developers control over both the voice "experience" and "context." This could include tailoring the voice to convey emotions like apology or empathy, depending on the situation. Harris emphasized that developers and users want to control not just what is spoken but also how it is spoken.
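In practice, that steerability surfaces as a natural-language instruction passed alongside the text to be spoken. Here is a minimal sketch using OpenAI's Python SDK; the voice name, file name, and instruction text are illustrative choices rather than official examples, so check OpenAI's documentation for the currently supported voices and parameters:

```python
# A sketch of steering gpt-4o-mini-tts through OpenAI's Python SDK.
# The voice, instruction text, and output file name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream the synthesized speech straight to a file on disk.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="I'm so sorry about the mix-up with your order. Let me fix that right away.",
    instructions="Speak like an apologetic, empathetic customer service agent.",
) as response:
    response.stream_to_file("apology.mp3")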
In addition to the text-to-speech model, OpenAI has also released new speech-to-text models, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe." These models replace the company's Whisper transcription model and are trained on diverse, high-quality audio datasets. As a result, they can better capture accented and varied speech, even in chaotic environments. Moreover, they are less prone to "hallucinations," where the model fabricates words or passages that were not present in the original conversation.
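On the developer side, the new models slot into the same transcription endpoint that Whisper used, with only the model name changing. A minimal sketch with OpenAI's Python SDK, using a made-up audio file name:

```python
# A sketch of calling the new speech-to-text models via OpenAI's Python SDK,
# on the same audio transcriptions endpoint used for Whisper.
from openai import OpenAI

client = OpenAI()

with open("support_call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```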
However, OpenAI's internal benchmarks reveal that the models still have limitations, particularly with languages like Tamil, Telugu, Malayalam, and Kannada, where the word error rate approaches 30%. In other words, roughly three out of every 10 words the model produces in these languages differ from a human transcription of the same audio.
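For context, word error rate is a standard metric: the number of word substitutions, deletions, and insertions needed to turn the model's output into a human reference transcript, divided by the number of words in that reference. A quick sketch using the open-source jiwer package, with invented sentences rather than OpenAI benchmark data:

```python
# Computing word error rate (WER) with the open-source jiwer package
# (pip install jiwer). The sentences below are invented examples.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference.
# Two substitutions out of nine reference words gives roughly 0.22, i.e. 22%.
print(jiwer.wer(reference, hypothesis))
```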
In a departure from its practice with Whisper, which it released under an open-source license, OpenAI does not plan to make the new transcription models openly available. According to Harris, the new models are much larger than Whisper, making them poor candidates for an open release. Instead, OpenAI says it wants to reserve open releases for models tailored to specific use cases, such as running on end-user devices.
The introduction of these advanced AI models marks a significant step forward in OpenAI's "agentic" vision, enabling developers to build more sophisticated and accurate automated systems. As the company continues to push the boundaries of AI capabilities, it will be interesting to see how these models are adopted and integrated into various applications and industries.