Voicebox AI operates on a model similar to OpenAI’s ChatGPT and DALL-E, but instead of generating text or images, it generates spoken language. The system was trained on a diverse dataset of more than 50,000 hours of unfiltered audio: publicly available audiobooks and their transcriptions in English, French, Spanish, German, Polish, and Portuguese.
The richness of this dataset allows Voicebox AI to produce “more conversational speech,” in Meta’s words, and to generate natural-sounding audio in any of the six languages it was trained on.
Meta’s researchers have reported that speech recognition models trained on synthetic speech generated by Voicebox perform nearly as well as models trained on real speech. This capability represents a significant leap in the development of voice assistants and conversational AI.
In fact, Voicebox outperforms Microsoft’s VALL-E in text-to-speech conversion, excelling in both intelligibility (a 1.9% word error rate compared to VALL-E’s 5.9%) and audio similarity (a score of 0.681 compared to VALL-E’s 0.580). Notably, Voicebox is also up to 20 times faster than VALL-E.
Voicebox offers an array of useful features, including audio editing, noise removal, and the correction of mispronunciations. Users can identify a segment of speech corrupted by noise, crop it out, and instruct the model to regenerate that segment from the surrounding audio, making it a valuable tool for enhancing audio quality.
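The editing workflow described above — mask the corrupted span, then regenerate it from the surrounding context — can be sketched in outline. Voicebox itself is not publicly available, so this minimal example treats the waveform as a NumPy array and uses simple linear interpolation as a stand-in for the model’s in-filling step; the `mask_segment` and `infill` functions are purely illustrative, not part of any released API.

```python
import numpy as np

def mask_segment(audio: np.ndarray, start: int, end: int) -> np.ndarray:
    """Zero out the noise-corrupted span, mimicking the 'crop it out' step."""
    edited = audio.copy()
    edited[start:end] = 0.0
    return edited

def infill(audio: np.ndarray, start: int, end: int) -> np.ndarray:
    """Stand-in for the regeneration step: Voicebox would synthesize new
    speech here, conditioned on the surrounding audio and the transcript.
    We simply interpolate between the edges of the gap."""
    edited = audio.copy()
    edited[start:end] = np.linspace(audio[start - 1], audio[end], end - start)
    return edited

# A 1-second, 440 Hz tone at 16 kHz, with a burst of noise in the middle.
sr = 16_000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean.copy()
noisy[6000:7000] += np.random.default_rng(0).normal(0.0, 1.0, 1000)

masked = mask_segment(noisy, 6000, 7000)   # step 1: remove the bad span
repaired = infill(masked, 6000, 7000)      # step 2: fill it back in

# The repaired span stays within the clean signal's amplitude range,
# while the noisy original exceeded it.
print(np.abs(repaired[6000:7000]).max() <= 1.0)
```

In the real system, the in-filling step is a learned generative model that produces speech matching the speaker and content on either side of the gap; the interpolation here only stands in for that behavior.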
One potential application for this technology is in prosthetics for individuals with damaged vocal cords, where it could enable more natural and expressive speech. Additionally, it could revolutionize gaming non-player characters (NPCs) and digital assistants, making interactions with AI more engaging and lifelike.
While Meta has shared a research paper and audio examples demonstrating Voicebox’s capabilities, the company has not released the Voicebox program or its source code to the public. Meta cites concerns about the potential misuse of the technology as the reason for this decision.
This move aligns with Meta’s broader approach to AI, where it has previously released certain AI models, such as the LLaMA language model, as open-source packages for the AI community. However, this openness has sometimes led to concerns about misuse, as exemplified by the unauthorized distribution of Meta’s models on various platforms.
In addition to Voicebox AI, Meta has also developed SAM (Segment Anything Model), an AI image-segmentation model that responds to user prompts to identify specific objects in images or videos. For its Animated Drawings project, Meta has released open-source code and a dataset of around 180,000 drawings, further highlighting its commitment to advancing AI technologies.
Meta’s unveiling of Voicebox AI marks another significant step toward enhancing voice assistants and conversational AI, promising more natural, efficient, and versatile interactions with AI-powered systems. However, the decision to withhold the program and source code reflects the company’s cautious approach to mitigate potential risks associated with AI misuse.