TLDR: Startup Resemble AI unveiled Chatterbox, an open-source speech synthesis tool that can mimic a voice from a few seconds of audio, control speech emotion, and generate audio faster than real time. In a listening test against the proprietary model from ElevenLabs, listeners preferred Chatterbox in 63.75% of cases, positioning it as an attractive alternative in the market.
The Canadian startup Resemble AI recently introduced Chatterbox, its first open-source TTS (Text-to-Speech) model. Distributed under the MIT license, this voice cloning model positions itself as a credible alternative to proprietary solutions on the market, while introducing features not previously seen in an open-source model.
Chatterbox is based on a 0.5-billion-parameter architecture trained on 500,000 hours of cleaned data.
Key model features:
- Zero-Shot Voice Cloning: With just a few seconds of reference audio, Chatterbox can mimic any voice without requiring additional training;
- Emotion Control: Unlike other speech synthesis models, Chatterbox lets users adjust the emotional intensity of speech, from monotone to dramatically expressive, according to their needs;
- Real-Time Speech Synthesis: Thanks to alignment-based generation, the model runs inference faster than real time, making it well suited to voice assistants, video games, and interactive applications;
- Security Watermark: Every generated audio file includes a perceptual watermark (PerTh Watermarker), ensuring transparency and traceability of the generated content.
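"Faster than real time" is usually expressed as a real-time factor (RTF): the time spent synthesizing divided by the duration of the audio produced, with values below 1.0 meaning the model outpaces playback. The sketch below shows one way to compute it. The function names are hypothetical, and the numbers are illustrative only, not measurements of Chatterbox.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], Sequence[float]],
                text: str, sample_rate: int) -> float:
    """Time a synthesize(text) callable that returns a sequence of audio samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Illustrative numbers only: 2 s of compute for 5 s of audio.
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=5.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.40
```

Any TTS call that returns raw samples can be passed to `measure_rtf` as the `synthesize` argument, provided the matching sample rate is supplied.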
Using Chatterbox is straightforward thanks to a dedicated Python library (chatterbox-tts) that is compatible with CUDA. The model can be initialized locally or from pre-trained models. Developers can also provide custom voice samples (audio prompts) to adjust the style or target voice.
Resemble AI compared Chatterbox to proprietary models on the market.
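As a rough sketch of how the chatterbox-tts library might be used for synthesis with an optional audio prompt: the article does not show the API, so the names `ChatterboxTTS.from_pretrained`, `generate`, `audio_prompt_path`, `exaggeration`, and `model.sr` below are assumptions based on the project's published usage and should be checked against the installed version. The two helper functions are hypothetical.

```python
from typing import Optional


def build_generate_kwargs(audio_prompt_path: Optional[str] = None,
                          exaggeration: float = 0.5) -> dict:
    """Collect optional synthesis settings (hypothetical helper).

    audio_prompt_path: reference clip for zero-shot voice cloning.
    exaggeration: emotional intensity (assumed parameter name and default).
    """
    if exaggeration < 0:
        raise ValueError("exaggeration must be non-negative")
    kwargs = {"exaggeration": exaggeration}
    if audio_prompt_path is not None:
        kwargs["audio_prompt_path"] = audio_prompt_path
    return kwargs


def synthesize_to_file(text: str, out_path: str, device: str = "cuda",
                       **options) -> str:
    """Generate speech for `text` and save it as a WAV file (hypothetical helper)."""
    # Imported lazily so the sketch loads even without the packages installed.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS  # assumed import path

    model = ChatterboxTTS.from_pretrained(device=device)  # assumed loader
    wav = model.generate(text, **build_generate_kwargs(**options))
    torchaudio.save(out_path, wav, model.sr)  # model.sr: assumed sample-rate attribute
    return out_path
```

With the package installed and a CUDA device available, a call such as `synthesize_to_file("Hello there.", "out.wav", audio_prompt_path="reference.wav", exaggeration=0.7)` would clone the voice in `reference.wav` with somewhat heightened expressiveness, assuming the parameter names above hold.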
Chatterbox vs Competition
Feature | Chatterbox | ElevenLabs | Google TTS | Azure TTS |
---|---|---|---|---|
License | MIT (Free) | Proprietary | Proprietary | Proprietary |
Emotion Control | Yes | ? | ? | ? |
Latency | <200 ms | ~300 ms | ~400 ms | ~500 ms |
User Preference | 63.75% | 36.25% | N/A | N/A |
Watermarking | Yes | ? | ? | ? |
Voice Cloning | Yes | ? | ? | ? |
In a comparative test conducted by Podonos, listeners preferred Chatterbox in 63.75% of cases over the proprietary model from ElevenLabs, which is considered one of the market leaders.
Resemble AI provides a demonstration interface via Hugging Face (Gradio), allowing users to test the model without local installation. For more intensive or critical uses, the company offers a commercial version of the TTS engine with latency below 200 ms.
Translated from Chatterbox : une percée open source dans la synthèse vocale