Streaming TTS wrapper for Chatterbox — 3-line API for chunk-by-chunk speech synthesis

These details have not been verified by PyPI

Project links

Project description

chatterstream-tts

Streaming TTS wrapper for Chatterbox. Turns the monolithic generate-then-play API into an async chunk-by-chunk streaming pipeline with a 3-line interface.

What is this?

Chatterbox generates high-quality speech but blocks until the entire waveform is ready (1-3s for typical sentences). Chatterstream breaks this into a streaming pipeline that yields audio chunks as they're generated, targeting 300-500ms time-to-first-audio on Apple Silicon.

This is a personal project extracted from a Chatterbox fork. It wraps upstream internals (T3 token generation, S3Gen vocoding, HiFiGAN) into a SOLID component pipeline without modifying upstream code.

Architecture

Text Input
    │
    ▼
╔═════════════════════════════════════════════════════════════╗
║  chatterstream-tts (streaming wrapper)                      ║
║                                                             ║
║  ┌──────────────┐                                           ║
║  │TextProcessor │  Regex normalization + GPT-2 tokenization ║
║  └──────┬───────┘                                           ║
║         │                                                   ║
║         ▼                                                   ║
║  ┌──────────────────┐                                       ║
║  │ConditioningCache │  Cached voice embeddings (LRU)        ║
║  └──────┬───────────┘                                       ║
║         │                                                   ║
║         ▼                                                   ║
║  ┌──────────────┐                                           ║
║  │TokenGenerator│  Yields token chunks as they generate     ║
║  └──────┬───────┘                                           ║
║         │  adaptive chunking (25 tok first, 75 subsequent)  ║
║         ▼                                                   ║
║  ┌─────────────┐                                            ║
║  │ChunkVocoder │  Per-chunk vocoding with source caching    ║
║  └──────┬──────┘                                            ║
║         │                                                   ║
║         ▼                                                   ║
║  ┌──────────────────┐                                       ║
║  │AudioPostProcessor│  Fade-in, normalize, PCM int16        ║
║  └──────┬───────────┘                                       ║
║         │                                                   ║
╠═══ calls into ══════════════════════════════════════════════╣
║                                                             ║
║  chatterbox-tts (upstream model by Resemble AI)             ║
║                                                             ║
║    T3 (350M)  ─  Autoregressive text-to-token transformer   ║
║    S3Gen      ─  Flow-matching vocoder + HiFiGAN (257M)     ║
║    VoiceEnc   ─  Speaker embedding extraction (LSTM)        ║
║                                                             ║
╚══════════════════════════════════╤══════════════════════════╝
                                   │
                                   ▼
                  AsyncIterator[AudioChunk]  →  24 kHz mono int16

Dependencies

Package	Version	Purpose
`chatterbox-tts`	0.1.6	Upstream TTS model (T3 + S3Gen + HiFiGAN)
`torch`	>=2.6.0	Inference runtime
`av`	>=12.0.0	Optional — Opus/HLS media encoding

Python: 3.10 or 3.11 (3.12+ fails due to upstream numpy/distutils incompatibility)

Hardware: Runs on CPU, CUDA, or Apple Silicon (MPS).

FP16 (half-precision)

Neural networks store their weights and perform math using floating-point numbers. The default is FP32 (32-bit / "full precision") — each number uses 32 bits of memory. FP16 (16-bit / "half precision") uses half the memory and runs roughly 2x faster on GPUs that have dedicated half-precision hardware, which includes both NVIDIA CUDA GPUs and Apple Silicon (MPS).

The tradeoff is reduced numerical range, but for TTS inference (as opposed to training) the quality difference is inaudible. Chatterstream auto-enables FP16 on MPS and CUDA because the speed and memory savings are significant with no perceptible quality loss.

On CPU, FP16 is not enabled by default — most CPUs lack native half-precision support, so FP16 would actually be slower as the CPU emulates it in software. CPU inference uses FP32.

Note: Only the T3 text-to-token model runs in FP16. The S3Gen vocoder (HiFiGAN) stays in FP32 because its audio reconstruction is more sensitive to precision, particularly on MPS where dtype mismatches cause errors.

You can override the auto-detection:

# Force FP16 on (even on CPU — not recommended)
tts = StreamingTTS(fp16=True)

# Force FP16 off (even on GPU — full precision, slower)
tts = StreamingTTS(fp16=False)

# Auto-detect (default) — FP16 on MPS/CUDA, FP32 on CPU
tts = StreamingTTS()

Installation

From PyPI

pip install chatterstream-tts

# Optional: media extras (Opus encoding, HLS segmenting)
pip install chatterstream-tts[media]

From source (development)

git clone https://github.com/devinmlowe/chatterstream-tts.git
cd chatterstream-tts

# Create venv (Python 3.11 recommended)
uv venv --python python3.11
source .venv/bin/activate  # bash/zsh
# or: source .venv/bin/activate.fish  # fish

# Install with dev dependencies
uv pip install -e ".[dev]"

# Optional: media extras
uv pip install -e ".[media]"

Hugging Face authentication

Model weights (~600MB) are hosted on Hugging Face at ResembleAI/chatterbox-turbo and downloaded automatically the first time you call .load(). You need a (free) Hugging Face account and token:

# Install the Hugging Face CLI (included with chatterbox-tts)
# Then log in — this saves your token locally
huggingface-cli login

You can also set the HF_TOKEN environment variable instead:

export HF_TOKEN=hf_your_token_here

After the first download, weights are cached locally and no network access is needed for subsequent runs.

Quick Start

# 1. Log in to Hugging Face (first time only — needed to download model weights)
huggingface-cli login

# 2. Stream speech
from chatterstream import StreamingTTS

tts = StreamingTTS()
tts.load()  # downloads weights on first run (~600MB), then ~2-5s from cache

async for chunk in tts.synthesize("Hello world"):
    play(chunk.pcm_bytes)  # 24 kHz mono int16

That's it. The pipeline handles tokenization, chunked generation, vocoding, and audio post-processing internally. Each AudioChunk contains raw PCM bytes (24 kHz, mono, int16) plus metadata (is_final, chunk_index, duration_ms).

Advanced Usage

Loading the model

Model weights (~600MB) must be loaded before synthesis. There are three ways to handle this, each suited to different scenarios:

Explicit load — Call .load() yourself before synthesizing. This is the recommended approach for servers and long-running processes. You control exactly when the 2-5 second load happens (e.g. at startup, not on the first user request), and is_loaded lets you gate readiness checks.

tts = StreamingTTS()
tts.load()  # 2-5s, blocks until ready
# tts.is_loaded == True

Chained load — .load() returns self, so you can construct and load in one line. Convenient for scripts and notebooks where you don't need the intermediate unloaded state.

tts = StreamingTTS(device="mps", watermark=False).load()

Auto-load (lazy) — If you call .synthesize() without loading first, the model loads automatically. A UserWarning is emitted so you know it happened. This is fine for quick experiments but not ideal in production — the first synthesis call silently takes an extra 2-5 seconds, which can be confusing.

tts = StreamingTTS()

# First call triggers auto-load (emits: "Model not loaded. Call .load()
# explicitly for faster first synthesis. Auto-loading now...")
async for chunk in tts.synthesize("Hello world"):
    play(chunk.pcm_bytes)

# Subsequent calls are fast — model stays loaded
async for chunk in tts.synthesize("More text"):
    play(chunk.pcm_bytes)

When to use which:

Approach	Best for	Tradeoff
Explicit `.load()`	Servers, APIs, anything with a startup phase	You manage the load timing
Chained `.load()`	Scripts, notebooks, one-off experiments	No access to unloaded state
Auto-load	Quick prototyping, REPL exploration	Surprising latency on first call

Custom voice

async for chunk in tts.synthesize("Hello", voice="/path/to/voice.wav"):
    play(chunk.pcm_bytes)

Voice files are cached by path with mtime-based invalidation — re-encoding the same file at the same path busts the cache automatically.

Configuration

from chatterstream import StreamingTTS, StreamConfig

config = StreamConfig(
    first_chunk_tokens=25,       # tokens before first audio yield
    subsequent_chunk_tokens=75,  # tokens per subsequent chunk
    overlap_tokens=0,            # token overlap between chunks
    temperature=0.8,
    top_k=1000,
)

tts = StreamingTTS(device="mps", config=config, watermark=False)
tts.load()

Media encoders

The core pipeline yields raw PCM audio (24 kHz, mono, int16). To deliver that audio to clients, you need an encoding/transport layer. Two optional encoders (requiring pip install chatterstream-tts[media]) handle this:

OpusEncoder — Encodes PCM to OGG/Opus. Best for low-latency delivery over WebSockets or direct streaming where you control both ends. Superior compression at low bitrates, near-zero codec delay. The tradeoff: browsers can't play a raw OGG/Opus stream over plain HTTP — you need JavaScript (e.g. Web Audio API) or a WebSocket to decode it client-side.

from chatterstream.opus_encoder import OpusEncoder

encoder = OpusEncoder(input_sample_rate=24000, bitrate=64000)

async for chunk in tts.synthesize("Hello world"):
    ogg_bytes = encoder.encode(chunk.pcm_bytes)
    websocket.send(ogg_bytes)  # stream over WebSocket

# Flush the encoder when done (emits final OGG pages)
final_bytes = encoder.finalize()
websocket.send(final_bytes)

# Reset for the next utterance (reuses the encoder object)
encoder.reset()

HLSSegmenter — Encodes PCM to MPEG-TS/AAC segments with an m3u8 playlist. HLS (HTTP Live Streaming) is the standard used by every browser, phone, and smart TV — a plain <audio> tag pointed at the m3u8 URL just works, no JavaScript required. Audio is split into small segments (~1-2s each) served over regular HTTP. The tradeoff: segment-based delivery adds inherent latency (the player must buffer at least one segment before playback starts).

from chatterstream.hls_segmenter import HLSSegmenter

segmenter = HLSSegmenter(sample_rate=24000, bitrate=96000)

async for chunk in tts.synthesize("Hello world"):
    seg_bytes = segmenter.add_segment(chunk.pcm_bytes)
    # seg_bytes is a self-contained MPEG-TS segment
    # serve it at /seg{index}.ts

# Flush encoder and mark the stream complete
segmenter.finalize()

# Generate the m3u8 playlist (references seg0.ts, seg1.ts, ...)
playlist = segmenter.playlist()
# serve playlist at /stream.m3u8

# Retrieve any segment by index
segment_0 = segmenter.get_segment(0)

These are complementary, not alternatives. Use Opus for real-time applications (voice agents, WebSocket APIs) where you control the client. Use HLS when you need universal browser playback with zero client-side code. You could even use both — Opus for a native app client and HLS for a web fallback.

See examples/streaming_server.py for a complete aiohttp server that uses HLS to serve audio playable in any browser.

Responsible AI: PerTh Watermarking

Chatterbox includes Resemble AI's PerTh (Perceptual Threshold) Watermarker — imperceptible neural watermarks embedded in generated audio that survive MP3 compression, editing, and common manipulations while maintaining nearly 100% detection accuracy. This is an important tool for responsible use of synthetic speech.

Chatterstream supports PerTh watermarking and enables it by default:

# Watermarking on (default)
tts = StreamingTTS(watermark=True)

# Watermarking off (for local development / testing only)
tts = StreamingTTS(watermark=False)

Each audio chunk is watermarked as it passes through the pipeline, so streaming delivery doesn't bypass the watermark the way a naive chunk-by-chunk approach might.

Note: The open-source resemble-perth package ships with PerthImplicitWatermarker disabled (set to None). Chatterstream detects this and falls back gracefully — watermarking is silently skipped with a log message. If you have access to a full PerTh implementation, it activates automatically.

Extracting watermarks

You can verify whether audio was generated by Chatterbox/Chatterstream using the PerTh detector:

import perth
import librosa

watermarked_audio, sr = librosa.load("output.wav", sr=None)

watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Watermark: {watermark}")
# Output: 0.0 (no watermark) or 1.0 (watermarked)

Watermarking is an important part of the responsible deployment of speech synthesis technology. If you distribute audio generated by Chatterstream, you should leave watermarking enabled so that synthetic speech remains identifiable.

Running tests

# All unit + integration tests (no model weights needed)
pytest tests/ -v --tb=short

# Skip slow tests
pytest tests/ -v -m "not slow"

License

MIT — same as upstream Chatterbox.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.0.2

Feb 11, 2026

0.0.1

Feb 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chatterstream_tts-0.0.2.tar.gz (27.2 kB view details)

Uploaded Feb 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

chatterstream_tts-0.0.2-py3-none-any.whl (26.6 kB view details)

Uploaded Feb 11, 2026 Python 3

File details

Details for the file chatterstream_tts-0.0.2.tar.gz.

File metadata

Download URL: chatterstream_tts-0.0.2.tar.gz
Upload date: Feb 11, 2026
Size: 27.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for chatterstream_tts-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`92466336469feb3ca64c8b5dce2372ee2a149bc9c97d3938398d4f81365656dd`
MD5	`7bcf57ec34299d33b662c5f9f9a93956`
BLAKE2b-256	`a4b0f0367581786adab9a08a2cb36d82a6e1c59dc4a829671cf2acbc99dcc8ed`

See more details on using hashes here.

File details

Details for the file chatterstream_tts-0.0.2-py3-none-any.whl.

File metadata

Download URL: chatterstream_tts-0.0.2-py3-none-any.whl
Upload date: Feb 11, 2026
Size: 26.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for chatterstream_tts-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7afcbd4b5ff7880659aebe2b7d47e3e3fc1fd7aa020fa287dc22e7b50b5cc9aa`
MD5	`95c7e70ce4add12efa5dfc3c30b263b9`
BLAKE2b-256	`8ac91c8a49bcf7d6cd058c8eba23e988ff26246dda7e296487390ef6b2e606ed`

See more details on using hashes here.

chatterstream-tts 0.0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

chatterstream-tts

What is this?

Architecture

Dependencies

FP16 (half-precision)

Installation

From PyPI

From source (development)

Hugging Face authentication

Quick Start

Advanced Usage

Loading the model

Custom voice

Configuration

Media encoders

Responsible AI: PerTh Watermarking

Extracting watermarks

Running tests

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes