Chatterbox Streaming: Open Source TTS and Voice Conversion
Chatterbox TTS Streaming
Chatterbox is an open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations. Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out. This fork adds a streaming implementation that achieves a real-time factor (RTF) of 0.499 (target < 1) on a 4090 GPU, with a latency to first chunk of around 0.472s.
Key Details
- SoTA zero-shot TTS
- 0.5B Llama backbone
- Unique exaggeration/intensity control
- Ultra-stable with alignment-informed inference
- Trained on 0.5M hours of cleaned data
- Watermarked outputs
- Easy voice conversion script
- Real-time streaming generation
- Outperforms ElevenLabs
Tips
- General Use (TTS and Voice Agents):
  - The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts.
  - If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.
- Expressive or Dramatic Speech:
  - Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.
  - Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.
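The tips above can be written down as a small lookup. This is only an illustrative sketch: suggest_settings and its style / fast_reference arguments are hypothetical names, not part of the Chatterbox API, and the values are just the starting points quoted in the tips.

```python
def suggest_settings(style="general", fast_reference=False):
    """Return starting-point keyword arguments based on the tips above.

    Hypothetical helper, not part of chatterbox. The numbers are the
    suggested starting values from the tips, not tuned constants.
    """
    if style == "general":
        settings = {"exaggeration": 0.5, "cfg_weight": 0.5}
        if fast_reference:
            # Fast-talking reference speaker: lower cfg_weight for better pacing.
            settings["cfg_weight"] = 0.3
    elif style == "expressive":
        # More dramatic delivery; the lower cfg_weight offsets the speed-up
        # that higher exaggeration tends to cause.
        settings = {"exaggeration": 0.7, "cfg_weight": 0.3}
    else:
        raise ValueError(f"unknown style: {style!r}")
    return settings


print(suggest_settings("general", fast_reference=True))
```

Since the streaming example in this README passes exaggeration and cfg_weight as keyword arguments, the returned dict could be unpacked into such a call, for example model.generate_stream(text, **suggest_settings("expressive")).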
Installation
python3.10 -m venv .venv
source .venv/bin/activate
pip install chatterbox-streaming
Build for development
git clone https://github.com/davidbrowne17/chatterbox-streaming.git
cd chatterbox-streaming
pip install -e .
Usage
Basic TTS Generation
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda")
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)
# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
Streaming TTS Generation
For real-time applications where you want to start playing audio as soon as it's available:
import torchaudio as ta
import torch
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda")
text = "Welcome to the world of streaming text-to-speech! This audio will be generated and played in real-time chunks."
# Basic streaming
audio_chunks = []
for audio_chunk, metrics in model.generate_stream(text):
    audio_chunks.append(audio_chunk)
    # You can play audio_chunk immediately here for real-time playback
    if metrics.rtf:
        print(f"Generated chunk {metrics.chunk_count}, RTF: {metrics.rtf:.3f}")
    else:
        print(f"Chunk {metrics.chunk_count}")
# Combine all chunks into final audio
final_audio = torch.cat(audio_chunks, dim=-1)
ta.save("streaming_output.wav", final_audio, model.sr)
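The comment in the loop above notes that each chunk can be played as soon as it arrives. One possible way to do that without letting playback stall generation is a producer/consumer queue, sketched below with the standard library only. The names play_while_generating, play_chunk, and max_buffered are hypothetical and not part of chatterbox; play_chunk stands in for whatever audio output call you use, and chunk_stream can be any iterable of (audio_chunk, metrics) pairs such as the generator returned by model.generate_stream(text).

```python
import queue
import threading


def play_while_generating(chunk_stream, play_chunk, max_buffered=8):
    """Consume chunk_stream in a background thread and play chunks in order.

    chunk_stream: iterable of (audio_chunk, metrics) pairs.
    play_chunk:   hypothetical callback that plays (or writes) one chunk.
    Returns the chunks in arrival order so they can still be concatenated.
    """
    buffer = queue.Queue(maxsize=max_buffered)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for audio_chunk, _metrics in chunk_stream:
                buffer.put(audio_chunk)
        finally:
            buffer.put(done)  # always unblock the consumer, even on error

    threading.Thread(target=producer, daemon=True).start()

    played = []
    while True:
        item = buffer.get()
        if item is done:
            break
        # Generation keeps running in the background (up to max_buffered
        # chunks ahead) while this call blocks on playback.
        play_chunk(item)
        played.append(item)
    return played


# Stand-in stream so the sketch runs without a model or audio device.
fake_stream = ((f"chunk-{i}", None) for i in range(3))
play_while_generating(fake_stream, print)
```

An exception raised inside the generator ends the stream early in this sketch; a fuller version would pass the exception through the queue and re-raise it in the consumer.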
Streaming with Voice Cloning
import torchaudio as ta
import torch
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda")
text = "This streaming synthesis will use a custom voice from the reference audio file."
AUDIO_PROMPT_PATH = "reference_voice.wav"
audio_chunks = []
for audio_chunk, metrics in model.generate_stream(
    text,
    audio_prompt_path=AUDIO_PROMPT_PATH,
    exaggeration=0.7,
    cfg_weight=0.3,
    chunk_size=25,  # Smaller chunks for lower latency
):
    audio_chunks.append(audio_chunk)
    # Real-time metrics available
    if metrics.latency_to_first_chunk:
        print(f"First chunk latency: {metrics.latency_to_first_chunk:.3f}s")
# Save the complete streaming output
final_audio = torch.cat(audio_chunks, dim=-1)
ta.save("streaming_voice_clone.wav", final_audio, model.sr)
Streaming Parameters
- audio_prompt_path: Reference audio path for voice cloning
- chunk_size: Number of speech tokens per chunk (default: 50). Smaller values = lower latency but more overhead
- print_metrics: Enable automatic printing of latency and RTF metrics (default: True)
- exaggeration: Emotion intensity control (0.0-1.0+)
- cfg_weight: Classifier-free guidance weight (0.0-1.0)
- temperature: Sampling randomness (0.1-1.0)
See example_tts_stream.py for more examples.
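If you want to catch out-of-range values before starting a stream, the documented ranges can be checked up front. The sketch below is an assumption-laden illustration: StreamParams and as_kwargs are hypothetical names, not part of chatterbox, and the checks simply mirror the ranges listed above rather than anything the library is known to enforce.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StreamParams:
    """Hypothetical container mirroring the streaming parameters listed above."""

    audio_prompt_path: Optional[str] = None
    chunk_size: int = 50                 # documented default
    print_metrics: bool = True           # documented default
    exaggeration: float = 0.5            # default quoted in the Tips section
    cfg_weight: float = 0.5              # default quoted in the Tips section
    temperature: Optional[float] = None  # no default is documented; None = leave unset

    def __post_init__(self):
        if self.chunk_size < 1:
            raise ValueError("chunk_size must be a positive number of speech tokens")
        if self.exaggeration < 0.0:
            raise ValueError("exaggeration must be >= 0.0 (values above 1.0 are allowed)")
        if not 0.0 <= self.cfg_weight <= 1.0:
            raise ValueError("cfg_weight must be within 0.0-1.0")
        if self.temperature is not None and not 0.1 <= self.temperature <= 1.0:
            raise ValueError("temperature must be within 0.1-1.0")

    def as_kwargs(self):
        """Return the set parameters as a dict, dropping anything left as None."""
        return {key: value for key, value in asdict(self).items() if value is not None}


params = StreamParams(chunk_size=25, exaggeration=0.7, cfg_weight=0.3)
print(params.as_kwargs())
```

The resulting dict is meant to be unpacked into a streaming call, for example model.generate_stream(text, **params.as_kwargs()). Parameters left as None are omitted so that the library's own defaults apply.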
Example metrics
Example streaming metrics on a 4090 GPU running Linux:
- Latency to first chunk: 0.472s
- Received chunk 1, shape: torch.Size([1, 24000]), duration: 1.000s
- Audio playback started!
- Received chunk 2, shape: torch.Size([1, 24000]), duration: 1.000s
- Received chunk 3, shape: torch.Size([1, 24000]), duration: 1.000s
- Received chunk 4, shape: torch.Size([1, 24000]), duration: 1.000s
- Received chunk 5, shape: torch.Size([1, 24000]), duration: 1.000s
- Received chunk 6, shape: torch.Size([1, 20160]), duration: 0.840s
- Total generation time: 2.915s
- Total audio duration: 5.840s
- RTF (Real-Time Factor): 0.499 (target < 1)
- Total chunks yielded: 6
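The figures in this log can be cross-checked by hand. The real-time factor is the generation time divided by the duration of the audio produced, so a value below 1 means audio is generated faster than it plays back. The sketch below redoes that arithmetic from the logged numbers; the 24000 Hz sample rate is inferred from the log (a 24000-sample chunk lasts 1.000s) rather than read from the model.

```python
SAMPLE_RATE = 24000  # inferred from the log: 24000 samples <-> 1.000s

chunk_samples = [24000, 24000, 24000, 24000, 24000, 20160]  # chunk shapes from the log
generation_time_s = 2.915  # "Total generation time" from the log

chunk_durations = [n / SAMPLE_RATE for n in chunk_samples]
total_audio_s = sum(chunk_durations)
rtf = generation_time_s / total_audio_s

print(f"Last chunk duration: {chunk_durations[-1]:.3f}s")  # 0.840s
print(f"Total audio duration: {total_audio_s:.3f}s")       # 5.840s
print(f"RTF: {rtf:.3f}")                                   # 0.499
```

The results match the logged chunk duration, total audio duration, and RTF.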
Acknowledgements
Built-in PerTh Watermarking for Responsible AI
Every audio file generated by Chatterbox includes Resemble AI's PerTh (Perceptual Threshold) Watermarker: imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
Disclaimer
Don't use this model to do bad things. Prompts are sourced from freely available data on the internet.
Streaming Implementation Author
David Browne
Support me
Support this project on Ko-fi: https://ko-fi.com/davidbrowne17
File details
Details for the file chatterbox_streaming-0.1.2.tar.gz.
File metadata
- Download URL: chatterbox_streaming-0.1.2.tar.gz
- Upload date:
- Size: 77.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 040402b7736b308a3351cfcf065a6023aae2ee68f2d1d4a1a4c6f882445a759d |
| MD5 | 4384f3a1f80e24c3f8c431c882301f58 |
| BLAKE2b-256 | 0821f9c4f33e48845cd583ff1a32ccb7e06c555ff367de6c69df255c9cbf3a4d |
File details
Details for the file chatterbox_streaming-0.1.2-py3-none-any.whl.
File metadata
- Download URL: chatterbox_streaming-0.1.2-py3-none-any.whl
- Upload date:
- Size: 101.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 104d10123e5f5e4f2cee92e0239c71966406e171758858c7e88131a0b94ca930 |
| MD5 | a30b7d78d8ac652cdbd9dcea529611a0 |
| BLAKE2b-256 | adaa2e9328d2e338d1c029dded101ea2dd30cfb02f46ec3d563333b3e92d3e4a |