Chatterbox: An Open Source Breakthrough in Speech Synthesis

TL;DR: Startup Resemble AI has unveiled Chatterbox, an open-source speech synthesis tool that can mimic a voice from a few seconds of audio, control the emotion of the speech, and generate audio in real time. In a comparative test against ElevenLabs' proprietary model, listeners preferred Chatterbox in 63.75% of cases, positioning it as an attractive alternative in the market.

The Canadian startup Resemble AI recently introduced Chatterbox, its first open-source text-to-speech (TTS) model. Distributed under the MIT license, this voice cloning model positions itself as a credible alternative to the proprietary solutions on the market, while introducing features not previously seen in an open-source model.
Chatterbox is based on a 0.5 billion parameter architecture, trained on 500,000 hours of cleaned data. 
Key model features:
  • Zero-Shot Voice Cloning: With just a few seconds of reference audio, Chatterbox can mimic any voice without requiring additional training;
  • Emotion Control: Unlike other speech synthesis models, Chatterbox lets users adjust the emotional intensity of the speech, from a monotone delivery to dramatic expressiveness;
  • Real-Time Speech Synthesis: Thanks to alignment-based generation, the model runs inference faster than real time, making it well suited to voice assistants, video games, and interactive applications;
  • Security Watermark: Every generated audio file includes a perceptual watermark (PerTh Watermarker), ensuring transparency and traceability of the generated content.
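The "faster than real time" claim in the list above is usually expressed as a real-time factor (RTF): the time spent synthesizing divided by the duration of the audio produced. An RTF below 1 means the model generates speech faster than it takes to play it back. The following is a minimal sketch of how one might measure it; the function names and the `synthesize` callback are hypothetical and are not part of Chatterbox.

```python
import time
from typing import Callable, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], Tuple[int, int]], text: str) -> float:
    """Time one synthesis call and return its real-time factor.

    `synthesize` is a hypothetical callback that returns
    (number_of_samples, sample_rate) for the generated waveform.
    """
    start = time.perf_counter()
    num_samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, num_samples / sample_rate)
```

For example, a model that needs 0.5 seconds to produce 2 seconds of speech has an RTF of 0.25, comfortably inside the real-time budget.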
Chatterbox is easy to use through a dedicated Python library (chatterbox-tts) that supports CUDA. The model can be loaded from local files or from pre-trained weights. Developers can also supply their own voice samples (audio prompts) to adjust the style or the target voice.
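As an illustration, the sketch below shows what a basic call to the library might look like. It assumes the `chatterbox-tts` and `torchaudio` packages are installed, and that the library exposes `ChatterboxTTS.from_pretrained`, a `generate` method accepting `audio_prompt_path` and `exaggeration`, and a sample rate attribute `sr`, as described in the project's README at the time of writing. The helper names `build_generate_kwargs` and `synthesize` are hypothetical and only serve the example.

```python
from pathlib import Path
from typing import Optional


def build_generate_kwargs(audio_prompt: Optional[str] = None,
                          exaggeration: float = 0.5) -> dict:
    """Assemble keyword arguments for the model's generate() call.

    The parameter names are assumptions taken from the project README.
    """
    kwargs = {"exaggeration": float(exaggeration)}
    if audio_prompt is not None:
        # Fail early if the reference voice sample is missing.
        if not Path(audio_prompt).is_file():
            raise FileNotFoundError(f"audio prompt not found: {audio_prompt}")
        kwargs["audio_prompt_path"] = audio_prompt
    return kwargs


def synthesize(text: str, out_path: str,
               audio_prompt: Optional[str] = None,
               exaggeration: float = 0.5,
               device: str = "cuda") -> str:
    """Generate speech for `text` and write it to `out_path` as a WAV file."""
    # Heavy imports are deferred so the helper above works without a GPU stack.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    # Downloads (or reuses cached) pre-trained weights.
    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **build_generate_kwargs(audio_prompt, exaggeration))
    torchaudio.save(out_path, wav, model.sr)
    return out_path
```

Calling `synthesize("Hello from Chatterbox.", "out.wav", audio_prompt="reference.wav", exaggeration=0.7)` would, under these assumptions, clone the voice in `reference.wav` with somewhat heightened expressiveness. Omitting `audio_prompt` falls back to the model's default voice.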
Resemble AI compared Chatterbox to proprietary market models.


Chatterbox vs Competition

Feature         | Chatterbox    | ElevenLabs  | Google TTS  | Azure TTS
----------------|---------------|-------------|-------------|------------
License         | MIT (Free)    | Proprietary | Proprietary | Proprietary
Emotion Control | ✅ Advanced   | ✅ Basic    | ❌          | ❌
Latency         | <200 ms       | ~300 ms     | ~400 ms     | ~500 ms
User Preference | 63.75%        | 36.25%      | N/A         | N/A
Watermarking    | ✅ Integrated | ❌          | ❌          | ❌
Voice Cloning   | ✅ Yes        | ✅ Yes      | ❌          | ✅ Limited

In a comparative test conducted by Podonos, listeners preferred Chatterbox in 63.75% of cases over the proprietary model from ElevenLabs, which is considered one of the market leaders.
Resemble AI provides a demonstration interface via Hugging Face (Gradio), allowing users to test the model without local installation. For more intensive or critical uses, the company offers a commercial version of the TTS engine with latency below 200 ms.