Chatterbox: An Open Source Breakthrough in Speech Synthesis

The Canadian startup Resemble AI recently introduced Chatterbox, its first open-source text-to-speech (TTS) model. Distributed under the MIT license, this voice cloning model positions itself as a credible alternative to proprietary solutions on the market, while introducing features not previously seen in an open-source model.

Chatterbox is based on a 0.5-billion-parameter architecture trained on 500,000 hours of cleaned data.

Key model features:

Zero-Shot Voice Cloning: With just a few seconds of reference audio, Chatterbox can mimic any voice without additional training.
Emotion Control: Unlike other speech synthesis models, Chatterbox lets users adjust the emotional intensity of speech, from monotone to dramatically expressive.
Real-Time Speech Synthesis: Thanks to alignment-based generation, the model produces speech faster than real time, which suits voice assistants, video games, and interactive applications.
Security Watermark: Every generated audio file includes a perceptual watermark (PerTh Watermarker), ensuring the transparency and traceability of the generated content.
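
The "faster than real time" claim in the list above is commonly expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, with values below 1 meaning faster than real time. The following is a minimal Python sketch of that arithmetic. The function names are hypothetical and the timings are illustrative, not measurements of Chatterbox.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the audio is produced faster than it takes to play back."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


if __name__ == "__main__":
    # Illustrative numbers only: 2 s of compute for a 5 s clip.
    rtf = real_time_factor(2.0, 5.0)
    print(f"RTF = {rtf:.2f}, faster than real time: {is_faster_than_real_time(2.0, 5.0)}")
```

An RTF is only meaningful alongside the hardware and text length it was measured on, so any comparison between models should hold those fixed.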

Chatterbox is used through a dedicated Python library (chatterbox-tts) that supports CUDA. The model can be loaded from local files or from the published pre-trained weights. Developers can also supply their own voice samples (audio prompts) to adjust the style or the target voice.
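
As a rough illustration of that workflow, the sketch below wraps the library in a small function. It assumes the interface shown in the project's README at the time of writing (ChatterboxTTS.from_pretrained, generate with audio_prompt_path and exaggeration arguments, and a model.sr sample rate), which may change between releases. The helper names build_generate_kwargs and synthesize are hypothetical and not part of the library.

```python
from typing import Optional


def build_generate_kwargs(audio_prompt_path: Optional[str] = None,
                          exaggeration: float = 0.5) -> dict:
    """Hypothetical helper: collect the optional arguments passed to generate()."""
    if exaggeration < 0:
        raise ValueError("exaggeration must be non-negative")
    kwargs = {"exaggeration": exaggeration}
    if audio_prompt_path is not None:
        # A short reference clip of the voice to clone (zero-shot).
        kwargs["audio_prompt_path"] = audio_prompt_path
    return kwargs


def synthesize(text: str, out_path: str,
               audio_prompt_path: Optional[str] = None,
               exaggeration: float = 0.5,
               device: str = "cuda") -> str:
    """Generate speech for `text` and save it as a WAV file at `out_path`."""
    # Imported lazily so the helper above works without the library installed.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS  # assumed import path, per the README

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **build_generate_kwargs(audio_prompt_path, exaggeration))
    torchaudio.save(out_path, wav, model.sr)
    return out_path


if __name__ == "__main__":
    # "reference.wav" is a placeholder for the user's own voice sample.
    synthesize("Hello from Chatterbox.", "output.wav",
               audio_prompt_path="reference.wav", exaggeration=0.7)
```

Running the example downloads the pre-trained weights on first use. Passing device="cpu" should work on machines without a CUDA GPU, at the cost of slower synthesis.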

Resemble AI has compared Chatterbox with proprietary models on the market.

Chatterbox vs Competition

Feature	Chatterbox	ElevenLabs	Google TTS	Azure TTS
License	MIT (Free)	Proprietary	Proprietary	Proprietary
Emotion Control	Advanced	Basic	Not stated	Not stated
Latency	<200 ms	~300 ms	~400 ms	~500 ms
User Preference	63.75%	36.25%	N/A	N/A
Watermarking	Integrated	Not stated	Not stated	Not stated
Voice Cloning	Yes	Yes	Not stated	Limited

In a comparative test conducted by Podonos, listeners preferred Chatterbox in 63.75% of cases over the proprietary model from ElevenLabs, which is considered one of the market leaders.

Resemble AI provides a demonstration interface via Hugging Face (Gradio), allowing users to test the model without local installation. For more intensive or critical uses, the company offers a commercial version of the TTS engine with latency below 200 ms.

Stephane Nachez

ActuIA editorial team — news, data and analysis on artificial intelligence for decision-makers.
