Sesame Open-Sources AI Model Behind Realistic Voice Assistant Maya

Taylor Brooks

March 13, 2025 · 3 min read

Sesame, the AI company behind the strikingly realistic voice assistant Maya, has open-sourced the base AI model that powers it. The model, called CSM-1B, has 1 billion parameters and is available under the Apache 2.0 license, which allows commercial use with few restrictions.

CSM-1B generates "RVQ audio codes" from text and audio inputs. RVQ stands for residual vector quantization, a technique that represents audio as a stack of discrete codes, with each code capturing what the previous ones missed. The same technique appears in other recent AI audio technologies, such as Google's SoundStream and Meta's EnCodec. The model pairs a backbone from Meta's Llama family with an audio "decoder" component, and a fine-tuned variant of CSM powers Maya.
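To make the idea of residual vector quantization concrete, here is a minimal sketch in plain Python. It is not Sesame's code, nor the codec used by SoundStream or EnCodec: the codebooks are random toy values rather than learned ones, and every name in it (`rvq_encode`, `rvq_decode`, `make_toy_codebooks`) is hypothetical. Each stage picks the codebook entry nearest to whatever the earlier stages failed to capture, so a frame of audio features becomes a short list of integer codes.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next stage sees only what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_toy_codebooks(dim, stages, entries, seed=0):
    """Random stand-in codebooks (assumption: real codecs learn these from data).

    Entry 0 of every stage is the zero vector, a toy choice that guarantees
    adding a stage never makes the reconstruction worse.
    """
    rng = random.Random(seed)
    books = []
    for s in range(stages):
        scale = 1.0 / (2 ** s)  # later stages model finer detail
        book = [[0.0] * dim]
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(entries - 1)]
        books.append(book)
    return books

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

codebooks = make_toy_codebooks(dim=4, stages=3, entries=8)
frame = [0.3, -0.5, 0.1, 0.7]  # stand-in for one frame of audio features
codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)
print("codes:", codes)
print("squared error:", squared_error(frame, recon))
```

In this toy setup the three integers in `codes` are the analogue of the "RVQ audio codes" described above: a compact, discrete stand-in for the frame that a decoder can turn back into an approximation of the original.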

According to Sesame, the open-sourced model is a base generation model that can produce a variety of voices but has not been fine-tuned on any specific voice. The model also has some capacity for non-English languages because of contamination in the training data, but it likely won't perform well in them. Sesame hasn't disclosed what data it used to train CSM-1B.

One notable aspect of the open-sourced model is the lack of real safeguards. Sesame is relying on an "honor system," urging developers and users not to use the model to mimic a person's voice without their consent, create misleading content like fake news, or engage in "harmful" or "malicious" activities. This raises concerns about the potential misuse of the technology, especially given its capabilities.

I tried the demo on Hugging Face, and cloning my voice took less than a minute. From there, it was easy to generate speech on various topics, including controversial ones like the election and Russian propaganda. This demonstrates the model's impressive capabilities, but also highlights the need for responsible use and development of this technology.

Sesame, co-founded by Oculus co-creator Brendan Iribe, gained widespread attention in late February for its assistant tech, which comes close to clearing uncanny valley territory. Maya and Sesame's other assistant, Miles, take breaths, speak with disfluencies, and can be interrupted mid-speech, much like OpenAI's Voice Mode. The company has raised an undisclosed amount of capital from investors including Andreessen Horowitz, Spark Capital, and Matrix Partners.

Beyond its voice assistant tech, Sesame is also prototyping AI glasses "designed to be worn all day" that'll be equipped with its custom models. This move marks a significant expansion of the company's ambitions in the AI space, and the open-sourcing of CSM-1B could have far-reaching implications for the development of AI audio technologies.

The release of CSM-1B under the Apache 2.0 license is a significant step towards democratizing access to AI audio technology. While it raises concerns about the potential misuse of the technology, it also opens up opportunities for developers and researchers to build upon and improve the model. As the AI landscape continues to evolve, it will be important to monitor the development and use of CSM-1B and other AI audio technologies.

Copyright © 2024 Starfolk. All rights reserved.