Streaming TTS wrapper for Chatterbox — 3-line API for chunk-by-chunk speech synthesis
chatterstream-tts
Streaming TTS wrapper for Chatterbox. Turns the monolithic generate-then-play API into an async chunk-by-chunk streaming pipeline with a 3-line interface.
What is this?
Chatterbox generates high-quality speech but blocks until the entire waveform is ready (1-3s for typical sentences). Chatterstream breaks this into a streaming pipeline that yields audio chunks as they're generated, targeting 300-500ms time-to-first-audio on Apple Silicon.
This is a personal project extracted from a Chatterbox fork. It wraps upstream internals (T3 token generation, S3Gen vocoding, HiFiGAN) into a SOLID component pipeline without modifying upstream code.
Architecture
Text Input
│
▼
╔═════════════════════════════════════════════════════════════╗
║ chatterstream-tts (streaming wrapper) ║
║ ║
║ ┌──────────────┐ ║
║ │TextProcessor │ Regex normalization + GPT-2 tokenization ║
║ └──────┬───────┘ ║
║ │ ║
║ ▼ ║
║ ┌──────────────────┐ ║
║ │ConditioningCache │ Cached voice embeddings (LRU) ║
║ └──────┬───────────┘ ║
║ │ ║
║ ▼ ║
║ ┌──────────────┐ ║
║ │TokenGenerator│ Yields token chunks as they generate ║
║ └──────┬───────┘ ║
║ │ adaptive chunking (25 tok first, 75 subsequent) ║
║ ▼ ║
║ ┌─────────────┐ ║
║ │ChunkVocoder │ Per-chunk vocoding with source caching ║
║ └──────┬──────┘ ║
║ │ ║
║ ▼ ║
║ ┌──────────────────┐ ║
║ │AudioPostProcessor│ Fade-in, normalize, PCM int16 ║
║ └──────┬───────────┘ ║
║ │ ║
╠═══ calls into ══════════════════════════════════════════════╣
║ ║
║ chatterbox-tts (upstream model by Resemble AI) ║
║ ║
║ T3 (350M) ─ Autoregressive text-to-token transformer ║
║ S3Gen ─ Flow-matching vocoder + HiFiGAN (257M) ║
║ VoiceEnc ─ Speaker embedding extraction (LSTM) ║
║ ║
╚══════════════════════════════════╤══════════════════════════╝
│
▼
AsyncIterator[AudioChunk] → 24 kHz mono int16
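The last stage in the diagram (fade-in, normalize, PCM int16) can be pictured with a small pure-Python sketch. This is an illustration only: the function name `postprocess`, the 5 ms fade length, and the 0.95 peak target are assumptions, not values taken from chatterstream.

```python
import struct

SAMPLE_RATE = 24_000  # output rate stated in the diagram


def postprocess(samples, fade_ms=5.0, peak=0.95, apply_fade=True):
    """Convert float samples in [-1, 1] to little-endian int16 PCM bytes.

    Hypothetical helper: fade_ms and peak are assumed values.
    """
    out = list(samples)
    if not out:
        return b""

    # Peak-normalize only when the signal would exceed the target peak.
    loudest = max(abs(s) for s in out)
    if loudest > peak:
        gain = peak / loudest
        out = [s * gain for s in out]

    # Linear fade-in over the first few milliseconds to avoid a click.
    if apply_fade:
        n_fade = min(len(out), int(SAMPLE_RATE * fade_ms / 1000))
        for i in range(n_fade):
            out[i] *= i / n_fade

    # Scale to the int16 range and clamp before packing.
    ints = [max(-32768, min(32767, round(s * 32767))) for s in out]
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = postprocess([1.0] * 240)  # 10 ms of full-scale signal
print(len(pcm))                 # 2 bytes per sample
```

In a streaming setting the fade would presumably apply only to the first chunk of an utterance, which is why the sketch exposes `apply_fade` as a switch.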
Dependencies
| Package | Version | Purpose |
|---|---|---|
| chatterbox-tts | 0.1.6 | Upstream TTS model (T3 + S3Gen + HiFiGAN) |
| torch | >=2.6.0 | Inference runtime |
| av | >=12.0.0 | Optional: Opus/HLS media encoding |
Python: 3.10 or 3.11 (3.12+ fails due to upstream numpy/distutils incompatibility)
Hardware: Runs on CPU, CUDA, or Apple Silicon (MPS).
FP16 (half-precision)
Neural networks store their weights and perform math using floating-point numbers. The default is FP32 (32-bit, "full precision"): each number uses 32 bits of memory. FP16 (16-bit, "half precision") uses half the memory and can run up to roughly 2x faster on GPUs with dedicated half-precision hardware, which includes both NVIDIA CUDA GPUs and Apple Silicon (MPS).
The tradeoff is reduced numerical range and precision, but for TTS inference (as opposed to training) the quality difference is generally inaudible. Chatterstream auto-enables FP16 on MPS and CUDA because the speed and memory savings are significant with no perceptible quality loss.
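A back-of-the-envelope check makes the tradeoff concrete. The sketch below uses only the standard library (the `e` format in `struct` is IEEE 754 half precision). The 350M parameter count comes from the architecture diagram; counting only weight storage and ignoring activations and caches is a simplifying assumption.

```python
import struct

T3_PARAMS = 350_000_000  # T3 size from the architecture diagram


def weight_bytes(n_params: int, bits: int) -> int:
    """Approximate weight storage, ignoring activations and caches."""
    return n_params * bits // 8


def to_fp16_and_back(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


fp32_gb = weight_bytes(T3_PARAMS, 32) / 1e9
fp16_gb = weight_bytes(T3_PARAMS, 16) / 1e9

# FP16 keeps roughly three significant decimal digits and its largest
# finite value is 65504, so small rounding errors are expected.
approx = to_fp16_and_back(0.1)
largest = to_fp16_and_back(65504.0)

print(f"FP32 weights: {fp32_gb:.1f} GB, FP16 weights: {fp16_gb:.1f} GB")
print(f"0.1 stored as FP16: {approx}")
print(f"largest finite FP16 value: {largest}")
```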
On CPU, FP16 is not enabled by default — most CPUs lack native half-precision support, so FP16 would actually be slower as the CPU emulates it in software. CPU inference uses FP32.
Note: Only the T3 text-to-token model runs in FP16. The S3Gen vocoder (HiFiGAN) stays in FP32 because its audio reconstruction is more sensitive to precision, particularly on MPS where dtype mismatches cause errors.
You can override the auto-detection:
# Force FP16 on (even on CPU — not recommended)
tts = StreamingTTS(fp16=True)
# Force FP16 off (even on GPU — full precision, slower)
tts = StreamingTTS(fp16=False)
# Auto-detect (default) — FP16 on MPS/CUDA, FP32 on CPU
tts = StreamingTTS()
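The three cases above amount to a small decision rule. A minimal sketch, assuming the rule is "explicit value wins, otherwise enable FP16 on CUDA and MPS"; the helper name `resolve_fp16` is hypothetical and is not part of the chatterstream API:

```python
from typing import Optional


def resolve_fp16(device: str, fp16: Optional[bool] = None) -> bool:
    """Return whether to run in half precision.

    Hypothetical helper mirroring the documented behavior:
    an explicit True/False wins; None means auto-detect.
    """
    if fp16 is not None:
        return fp16
    # Device strings may carry an index, e.g. "cuda:0".
    return device.split(":")[0] in {"cuda", "mps"}


print(resolve_fp16("mps"))          # auto-detect on Apple Silicon
print(resolve_fp16("cpu"))          # auto-detect on CPU
print(resolve_fp16("cuda:0", False))  # explicit override
```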
Installation
From PyPI
pip install chatterstream-tts
# Optional: media extras (Opus encoding, HLS segmenting)
pip install chatterstream-tts[media]
From source (development)
git clone https://github.com/devinmlowe/chatterstream-tts.git
cd chatterstream-tts
# Create venv (Python 3.11 recommended)
uv venv --python python3.11
source .venv/bin/activate # bash/zsh
# or: source .venv/bin/activate.fish # fish
# Install with dev dependencies
uv pip install -e ".[dev]"
# Optional: media extras
uv pip install -e ".[media]"
Hugging Face authentication
Model weights (~600MB) are hosted on Hugging Face at ResembleAI/chatterbox-turbo and downloaded automatically the first time you call .load(). You need a (free) Hugging Face account and token:
# The Hugging Face CLI is installed as a dependency of chatterbox-tts.
# Log in once; this saves your token locally
huggingface-cli login
You can also set the HF_TOKEN environment variable instead:
export HF_TOKEN=hf_your_token_here
After the first download, weights are cached locally and no network access is needed for subsequent runs.
Quick Start
# 1. Log in to Hugging Face (shell; first time only, needed to download model weights)
huggingface-cli login
# 2. Stream speech (Python; run inside an async function or an async-aware REPL/notebook)
from chatterstream import StreamingTTS
tts = StreamingTTS()
tts.load()  # downloads weights on first run (~600MB), then ~2-5s from cache
async for chunk in tts.synthesize("Hello world"):
    play(chunk.pcm_bytes)  # 24 kHz mono int16; play() is a placeholder for your audio sink
That's it. The pipeline handles tokenization, chunked generation, vocoding, and audio post-processing internally. Each AudioChunk contains raw PCM bytes (24 kHz, mono, int16) plus metadata (is_final, chunk_index, duration_ms).
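Since `play` in the Quick Start is a placeholder, one simple sink is a WAV file written with the standard-library `wave` module. The sketch below runs without the model: `FakeChunk` and `fake_synthesize` are hypothetical stand-ins for `AudioChunk` and `tts.synthesize(...)`, and only the `pcm_bytes` attribute described above is assumed.

```python
import asyncio
import io
import wave
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Stand-in for AudioChunk; only pcm_bytes is used here."""
    pcm_bytes: bytes


async def fake_synthesize(text: str):
    """Hypothetical replacement for tts.synthesize(): 3 x 100 ms of silence."""
    for _ in range(3):
        yield FakeChunk(pcm_bytes=b"\x00\x00" * 2400)


async def write_wav(chunks, fileobj) -> float:
    """Write streamed PCM into a WAV container; return the duration in ms."""
    total_bytes = 0
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # int16 = 2 bytes per sample
        wav.setframerate(24000)  # 24 kHz
        async for chunk in chunks:
            wav.writeframes(chunk.pcm_bytes)
            total_bytes += len(chunk.pcm_bytes)
    return total_bytes / 2 / 24000 * 1000


buffer = io.BytesIO()
duration_ms = asyncio.run(write_wav(fake_synthesize("Hello world"), buffer))
print(f"wrote {duration_ms:.0f} ms of audio")
```

With the real library, passing `tts.synthesize("Hello world")` in place of `fake_synthesize(...)` and a file path in place of `buffer` should be enough, assuming chunks arrive in playback order.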
Advanced Usage
Loading the model
Model weights (~600MB) must be loaded before synthesis. There are three ways to handle this, each suited to different scenarios:
Explicit load — Call .load() yourself before synthesizing. This is the recommended approach for servers and long-running processes. You control exactly when the 2-5 second load happens (e.g. at startup, not on the first user request), and is_loaded lets you gate readiness checks.
tts = StreamingTTS()
tts.load() # 2-5s, blocks until ready
# tts.is_loaded == True
Chained load — .load() returns self, so you can construct and load in one line. Convenient for scripts and notebooks where you don't need the intermediate unloaded state.
tts = StreamingTTS(device="mps", watermark=False).load()
Auto-load (lazy): If you call .synthesize() without loading first, the model loads automatically. A UserWarning is emitted so you know it happened. This is fine for quick experiments but not ideal in production, because the first synthesis call takes an extra 2-5 seconds, which can be surprising if the warning goes unnoticed.
tts = StreamingTTS()
# First call triggers auto-load (emits: "Model not loaded. Call .load()
# explicitly for faster first synthesis. Auto-loading now...")
async for chunk in tts.synthesize("Hello world"):
    play(chunk.pcm_bytes)
# Subsequent calls are fast because the model stays loaded
async for chunk in tts.synthesize("More text"):
    play(chunk.pcm_bytes)
When to use which:
| Approach | Best for | Tradeoff |
|---|---|---|
| Explicit .load() | Servers, APIs, anything with a startup phase | You manage the load timing |
| Chained .load() | Scripts, notebooks, one-off experiments | No access to unloaded state |
| Auto-load | Quick prototyping, REPL exploration | Surprising latency on first call |
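The three load styles follow a common lazy-initialization pattern. The sketch below is a hypothetical stand-in (`LazyModel` is not the real `StreamingTTS`, and `synthesize` is synchronous here for brevity) that shows how returning `self` enables chaining and how an auto-load can announce itself with a `UserWarning`.

```python
import warnings


class LazyModel:
    """Hypothetical stand-in illustrating explicit, chained, and lazy loading."""

    def __init__(self):
        self._model = None

    @property
    def is_loaded(self) -> bool:
        return self._model is not None

    def load(self) -> "LazyModel":
        if not self.is_loaded:      # loading twice is a no-op
            self._model = object()  # placeholder for the real weights
        return self                 # returning self enables chaining

    def synthesize(self, text: str) -> str:
        if not self.is_loaded:
            warnings.warn(
                "Model not loaded. Auto-loading now...",
                UserWarning,
                stacklevel=2,
            )
            self.load()
        return f"audio for {text!r}"


explicit = LazyModel()
explicit.load()               # explicit load
chained = LazyModel().load()  # chained load
lazy = LazyModel()            # auto-loads (with a warning) on first synthesize()
```

In a server, `is_loaded` is the natural thing to expose from a readiness endpoint so traffic only arrives after the load has finished.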
Custom voice
async for chunk in tts.synthesize("Hello", voice="/path/to/voice.wav"):
    play(chunk.pcm_bytes)
Voice files are cached by path with mtime-based invalidation: overwriting the file at the same path changes its modification time, which invalidates the cached entry automatically.
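Path-keyed caching with mtime invalidation can be sketched in a few lines. This is an illustration of the general technique, not chatterstream's ConditioningCache: the class name `MtimeLRUCache`, the `compute` callback, and the default capacity of 8 are all assumptions.

```python
import os
from collections import OrderedDict


class MtimeLRUCache:
    """Hypothetical LRU cache keyed by path, invalidated when mtime changes."""

    def __init__(self, compute, max_entries: int = 8):
        self._compute = compute        # e.g. the expensive voice-embedding step
        self._max_entries = max_entries
        self._entries = OrderedDict()  # path -> (mtime_ns, value)

    def get(self, path: str):
        mtime = os.stat(path).st_mtime_ns
        hit = self._entries.get(path)
        if hit is not None and hit[0] == mtime:
            self._entries.move_to_end(path)    # mark as most recently used
            return hit[1]
        value = self._compute(path)            # miss or stale entry: recompute
        self._entries[path] = (mtime, value)
        self._entries.move_to_end(path)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

One design consequence worth noting: a file rewritten within the filesystem's timestamp resolution could keep a stale entry, so hashing the file contents would be the stricter (and slower) alternative.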
Configuration
from chatterstream import StreamingTTS, StreamConfig
config = StreamConfig(
    first_chunk_tokens=25,       # tokens before first audio yield
    subsequent_chunk_tokens=75,  # tokens per subsequent chunk
    overlap_tokens=0,            # token overlap between chunks
    temperature=0.8,
    top_k=1000,
)
tts = StreamingTTS(device="mps", config=config, watermark=False)
tts.load()
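To see what first_chunk_tokens and subsequent_chunk_tokens do to a token stream, here is a minimal sketch of adaptive chunking: a small first chunk for low time-to-first-audio, then larger chunks. The function `adaptive_chunks` is hypothetical, and it ignores overlap_tokens (equivalent to the default of 0 shown above).

```python
from typing import Iterable, Iterator, List


def adaptive_chunks(
    tokens: Iterable[int], first: int = 25, subsequent: int = 75
) -> Iterator[List[int]]:
    """Group a token stream into a small first chunk, then larger ones."""
    buf: List[int] = []
    target = first
    for tok in tokens:
        buf.append(tok)
        if len(buf) >= target:
            yield buf
            buf = []
            target = subsequent  # every chunk after the first uses the larger size
    if buf:
        yield buf                # trailing partial chunk


sizes = [len(chunk) for chunk in adaptive_chunks(range(200))]
print(sizes)  # [25, 75, 75, 25]
```

Lowering the first value shortens the wait before audio starts, at the cost of giving the vocoder less context for the opening chunk.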
Media encoders
The core pipeline yields raw PCM audio (24 kHz, mono, int16). To deliver that audio to clients, you need an encoding/transport layer. Two optional encoders (requiring pip install chatterstream-tts[media]) handle this:
OpusEncoder: encodes PCM to OGG/Opus. Best for low-latency delivery over WebSockets or direct streaming where you control both ends. It compresses well at low bitrates and adds very little codec delay. The tradeoff: progressive playback of a live OGG/Opus stream over plain HTTP is not reliable across browsers, so in practice you deliver it over a WebSocket and decode it client-side with JavaScript (e.g. the Web Audio API).
from chatterstream.opus_encoder import OpusEncoder
encoder = OpusEncoder(input_sample_rate=24000, bitrate=64000)
async for chunk in tts.synthesize("Hello world"):
    ogg_bytes = encoder.encode(chunk.pcm_bytes)
    await websocket.send(ogg_bytes)  # stream over WebSocket (assumes an async send())
# Flush the encoder when done (emits final OGG pages)
final_bytes = encoder.finalize()
await websocket.send(final_bytes)
# Reset for the next utterance (reuses the encoder object)
encoder.reset()
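The compression gain is easy to quantify from the numbers already given: raw 24 kHz mono int16 PCM versus the 64 kbit/s Opus target used in the example. The arithmetic below ignores OGG container overhead, which is an assumption that makes the encoded figure slightly optimistic.

```python
SAMPLE_RATE = 24_000   # Hz
BITS_PER_SAMPLE = 16   # int16
CHANNELS = 1           # mono
OPUS_BITRATE = 64_000  # bits per second, as in the encoder example

pcm_bitrate = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS
ratio = pcm_bitrate / OPUS_BITRATE


def payload_bytes(seconds: float, bitrate: int) -> int:
    """Bytes needed to carry `seconds` of audio at `bitrate` bits per second."""
    return int(seconds * bitrate / 8)


print(f"raw PCM: {pcm_bitrate} bit/s, Opus: {OPUS_BITRATE} bit/s, ratio {ratio:.0f}x")
print(f"3 s utterance: {payload_bytes(3, pcm_bitrate)} B raw, "
      f"{payload_bytes(3, OPUS_BITRATE)} B encoded")
```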
HLSSegmenter: encodes PCM to MPEG-TS/AAC segments with an m3u8 playlist. HLS (HTTP Live Streaming) is very widely supported: Safari and most phones and smart TVs play an m3u8 URL natively from a plain <audio> tag with no JavaScript, while some desktop browsers have historically needed a small JavaScript player such as hls.js. Audio is split into small segments (~1-2s each) served over regular HTTP. The tradeoff: segment-based delivery adds inherent latency (the player must buffer at least one segment before playback starts).
from chatterstream.hls_segmenter import HLSSegmenter
segmenter = HLSSegmenter(sample_rate=24000, bitrate=96000)
async for chunk in tts.synthesize("Hello world"):
    seg_bytes = segmenter.add_segment(chunk.pcm_bytes)
    # seg_bytes is a self-contained MPEG-TS segment
    # serve it at /seg{index}.ts
# Flush encoder and mark the stream complete
segmenter.finalize()
# Generate the m3u8 playlist (references seg0.ts, seg1.ts, ...)
playlist = segmenter.playlist()
# serve playlist at /stream.m3u8
# Retrieve any segment by index
segment_0 = segmenter.get_segment(0)
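For readers unfamiliar with the m3u8 format, the sketch below builds a minimal media playlist from a list of segment durations. It illustrates the file format only and is not HLSSegmenter's implementation: `build_playlist` is a hypothetical helper, and the `seg{index}.ts` naming simply follows the comments in the example above.

```python
import math
from typing import Sequence


def build_playlist(durations_s: Sequence[float], finished: bool = True) -> str:
    """Render a minimal HLS media playlist for segments seg0.ts, seg1.ts, ..."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        # Target duration is an integer upper bound on segment length.
        f"#EXT-X-TARGETDURATION:{math.ceil(max(durations_s, default=1))}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for index, duration in enumerate(durations_s):
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(f"seg{index}.ts")
    if finished:
        # Without this tag, players treat the stream as live and keep polling.
        lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


print(build_playlist([1.04, 2.0, 0.5]))
```

The `finished` flag corresponds to the role finalize() plays in the example: until the stream is marked complete, a player keeps re-fetching the playlist to look for new segments.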
These are complementary, not alternatives. Use Opus for real-time applications (voice agents, WebSocket APIs) where you control the client. Use HLS when you need broad browser playback with little or no client-side code. You could even use both: Opus for a native app client and HLS for a web fallback.
See examples/streaming_server.py for a complete aiohttp server that uses HLS to serve audio playable in any browser.
Responsible AI: PerTh Watermarking
Chatterbox includes Resemble AI's PerTh (Perceptual Threshold) Watermarker — imperceptible neural watermarks embedded in generated audio that survive MP3 compression, editing, and common manipulations while maintaining nearly 100% detection accuracy. This is an important tool for responsible use of synthetic speech.
Chatterstream supports PerTh watermarking and enables it by default:
# Watermarking on (default)
tts = StreamingTTS(watermark=True)
# Watermarking off (for local development / testing only)
tts = StreamingTTS(watermark=False)
Each audio chunk is watermarked as it passes through the pipeline, so streaming delivery doesn't bypass the watermark the way a naive chunk-by-chunk approach might.
Note: The open-source resemble-perth package ships with PerthImplicitWatermarker disabled (set to None). Chatterstream detects this and falls back gracefully: watermarking is skipped and a log message records that it was skipped. If you have access to a full PerTh implementation, it activates automatically.
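The fallback described in the note can be sketched as follows. This is a guess at the shape of the logic, not chatterstream's code: `make_watermarker` and `maybe_watermark` are hypothetical names, the module is passed in as a parameter so the sketch runs without perth installed, and the `apply_watermark(samples, sample_rate=...)` call is assumed to match the PerTh watermarker interface.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("watermark_sketch")


def make_watermarker(perth_module):
    """Return a watermarker instance, or None when the class is unavailable."""
    cls = getattr(perth_module, "PerthImplicitWatermarker", None)
    if cls is None:
        logger.info("PerTh watermarker unavailable; audio will not be watermarked")
        return None
    return cls()


def maybe_watermark(watermarker, samples, sample_rate):
    """Watermark one chunk if possible, otherwise pass it through unchanged."""
    if watermarker is None:
        return samples
    # Assumed interface; check the perth package you have installed.
    return watermarker.apply_watermark(samples, sample_rate=sample_rate)


# Simulate the open-source package, where the class is set to None.
stub_module = SimpleNamespace(PerthImplicitWatermarker=None)
wm = make_watermarker(stub_module)
print(maybe_watermark(wm, [0.0, 0.1], 24000))  # passes through unchanged
```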
Extracting watermarks
You can verify whether audio was generated by Chatterbox/Chatterstream using the PerTh detector:
import perth
import librosa
watermarked_audio, sr = librosa.load("output.wav", sr=None)
watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Watermark: {watermark}")
# Output: 0.0 (no watermark) or 1.0 (watermarked)
Watermarking is an important part of the responsible deployment of speech synthesis technology. If you distribute audio generated by Chatterstream, you should leave watermarking enabled so that synthetic speech remains identifiable.
Running tests
# All unit + integration tests (no model weights needed)
pytest tests/ -v --tb=short
# Skip slow tests
pytest tests/ -v -m "not slow"
License
MIT — same as upstream Chatterbox.