Chatterbox-Flash: prior-calibrated block-diffusion zero-shot TTS, extending Chatterbox-TTS with a parallel masked decoder.
Chatterbox Flash
Prior-calibrated block-diffusion zero-shot TTS, extending Chatterbox-TTS with a parallel masked decoder while preserving streaming generation.
Chatterbox-Flash replaces the autoregressive T3 decoder of Chatterbox-TTS with a Fast-dLLM v2-style block-diffusion decoder, adds two inference-time techniques described in our paper — prior-calibrated PMI scoring and early decoding via a time-shifted quantile schedule — and reuses the original S3Gen flow-matching vocoder, Metavoice voice encoder, and English tokenizer unchanged.
The released package is inference-only. Training code is intentionally omitted; all reproductions of the paper's numbers can be driven from the scripts in this repository given the released checkpoints.
Installation
Recommended: uv
First create and activate a virtual environment:
uv venv # creates .venv with a compatible Python
source .venv/bin/activate # Windows: .venv\Scripts\activate
chatterbox-tts==0.1.7 (our base dependency) pins torch==2.6.0, but most
modern CUDA wheels in the stack (torchvision, xformers, flash-attn,
flashinfer-python) only ship matched ABI binaries for torch 2.7.x. Our
pyproject.toml declares a [tool.uv] override-dependencies
section that tells uv to honour the 2.7.x pin and skip the upstream 2.6
constraint, so a single command produces a consistent environment:
# Core — uv reads tool.uv.override-dependencies automatically:
uv pip install chatterbox-flash
# Optional — high-throughput inference backend (FlashInfer, CUDA):
uv pip install "chatterbox-flash[flashinfer]"
# Optional — Apple Silicon native Metal backend (mlx + mlx-lm, macOS only):
uv pip install "chatterbox-flash[mlx]"
# Optional — full evaluation suite (SIM-o / WER / UTMOS via OmniVoice):
uv pip install "chatterbox-flash[eval]"
If uv ever fails on a torchvision ABI mismatch
(RuntimeError: operator torchvision::nms does not exist) or a
flash_attn_2_cuda undefined symbol, force a torch-2.7-matched torchvision:
uv pip install 'torchvision>=0.22,<0.23'
Alternative: plain pip
pip does not read [tool.uv], so the override is not applied. Install the
package first, then upgrade torch, torchaudio and torchvision to the 2.7 line
to undo the chatterbox-tts 2.6 pin manually:
pip install chatterbox-flash
pip install --upgrade 'torch>=2.7,<2.8' 'torchaudio>=2.7,<2.8' 'torchvision>=0.22,<0.23'
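After either install route, it can help to confirm that the torch stack really landed on the 2.7 line. The following is a minimal sketch, not part of the package: the helper names are hypothetical, and the expected prefixes are simply read off the pins shown above.

```python
from importlib import metadata

# Version prefixes taken from the pins above; adjust if the project changes them.
EXPECTED_PREFIXES = {"torch": "2.7.", "torchaudio": "2.7.", "torchvision": "0.22."}


def find_mismatches(installed, expected=EXPECTED_PREFIXES):
    """Return {package: version} for packages that are missing or off-pin."""
    bad = {}
    for name, prefix in expected.items():
        version = installed.get(name)
        if version is None or not version.startswith(prefix):
            bad[name] = version
    return bad


def installed_versions(names):
    """Look up installed versions without importing the packages themselves."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    report = find_mismatches(installed_versions(EXPECTED_PREFIXES))
    print(report or "torch stack matches the 2.7 pins")
```

An empty result means all three packages match; anything else lists the package and the version that was found (None if it is not installed).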
Local install from source (development)
Clone the repository and install in editable mode so source edits take effect without reinstalling. Pick the extras for your hardware:
git clone https://github.com/resemble-ai/chatterbox-flash.git
cd chatterbox-flash
# CUDA box (FlashInfer + eval):
uv pip install -e ".[flashinfer,eval]"
# Apple Silicon (Metal backend):
uv pip install -e ".[mlx]"
# Minimal (CPU / torch SDPA only):
uv pip install -e .
Engine selection at runtime
Pick the hardware path via --backend on synthesize.py — see the table
in Quick start. From the Python API, pass the lower-level
engine name (flashinfer / torch / mlx):
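from chatterbox_flash import FLASHINFER_AVAILABLE, MLX_AVAILABLE, ChatterboxFlashTTS
text = "Sometimes it's better to just let things slide, you know?"
tts = ChatterboxFlashTTS.from_pretrained("ResembleAI/chatterbox-flash", device="cuda")
# backend takes the lower-level engine name: "flashinfer", "torch" or "mlx"
wav = tts.generate(text, audio_prompt_path="ref.wav", backend="torch")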
CHATTERBOX_FLASH_ENGINE={flashinfer,torch,mlx} forces an engine per
process (handy in CI/CD).
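To make the precedence concrete, here is a rough sketch of how an environment override of this kind could be resolved. It is not the package's build_engine(); resolve_engine is a hypothetical helper, and the assumption that the variable wins over the per-call argument is inferred from the word "forces" above.

```python
import os

VALID_ENGINES = ("flashinfer", "torch", "mlx")  # engine names listed above
ENV_VAR = "CHATTERBOX_FLASH_ENGINE"


def resolve_engine(requested="torch", env=None):
    """Hypothetical helper: the environment variable wins over the argument."""
    env = os.environ if env is None else env
    forced = env.get(ENV_VAR, "").strip().lower()
    choice = forced or requested
    if choice not in VALID_ENGINES:
        raise ValueError(f"unknown engine {choice!r}; expected one of {VALID_ENGINES}")
    return choice
```

Passing env explicitly keeps the helper easy to test; in normal use it falls back to os.environ.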
Quick start
A single entry point, synthesize.py, covers all four hardware paths via
one --backend flag. It takes one reference voice and one or more texts (a
single text is just a batch of one).
| --backend | Engine | Device | dtype | CUDA graph | Notes |
|---|---|---|---|---|---|
| gpu (default) | torch SDPA | cuda | bf16 | on | No JIT cold start |
| flashinfer | FlashInfer | cuda | bf16 | on | Paged KV; warmup-amortised throughput |
| cpu | torch SDPA | cpu | fp16 | off | CPU-only validation / Docker (--dtype fp32 to fall back) |
| mlx | mlx (Metal) | cpu* | fp16 | off | Apple Silicon native ([mlx] extra) |
* MLX runs the LLaMA backbone on Metal; the PyTorch side stays on CPU.
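For scripts that wrap the CLI, the same defaults can be held as data. This is only an illustrative sketch: the values are transcribed from the table above, while BackendDefaults, BACKEND_DEFAULTS and describe are hypothetical names that do not come from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendDefaults:
    engine: str
    device: str
    dtype: str
    cuda_graph: bool


# Values transcribed from the table above; the names here are illustrative.
BACKEND_DEFAULTS = {
    "gpu": BackendDefaults("torch", "cuda", "bf16", True),
    "flashinfer": BackendDefaults("flashinfer", "cuda", "bf16", True),
    "cpu": BackendDefaults("torch", "cpu", "fp16", False),
    "mlx": BackendDefaults("mlx", "cpu", "fp16", False),
}


def describe(backend="gpu", dtype=None):
    """Return the defaults for a --backend value, optionally overriding dtype."""
    d = BACKEND_DEFAULTS[backend]
    return {
        "engine": d.engine,
        "device": d.device,
        "dtype": dtype or d.dtype,
        "cuda_graph": d.cuda_graph,
    }
```

For example, describe("cpu", dtype="fp32") mirrors running with --backend cpu --dtype fp32.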
Override the per-backend compute dtype with --dtype {bf16,fp16,fp32}. CPU
defaults to fp16 (PyTorch 2.x has CPU fp16 kernels); use --dtype fp32 if an
fp16 op is unsupported or slower on your hardware.
4-bit / 8-bit quantization (MLX only): --quantize_bits {4,8} quantizes the
T3 LLaMA backbone via mlx.nn.quantize (the S3Gen vocoder and voice encoder
stay in --dtype). Tune the group size with the
CHATTERBOX_FLASH_MLX_QUANT_GROUP env var (default 64). There is no CPU
quantization path — PyTorch CPU has no native 4-bit kernels; use bf16/fp16 on
CPU, or MLX quantization on Apple Silicon.
# Apple Silicon, 4-bit quantized backbone
python synthesize.py --audio_prompt reference.wav --text "..." \
--backend mlx --quantize_bits 4
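To give a feel for what the bit width and group size control, here is a pure-Python sketch of group-wise affine quantization. It is a teaching aid under stated assumptions, not MLX's implementation: each group of group_size weights gets its own scale and offset, and every function name below is illustrative.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group: return (codes, scale, minimum)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # constant groups fall back to scale 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def quantize(weights, bits=4, group_size=64):
    """Split a flat weight list into groups, each with its own scale and offset."""
    if len(weights) % group_size:
        raise ValueError("len(weights) must be a multiple of group_size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]
```

Smaller groups track local weight ranges more closely at the cost of storing more scales and offsets; the reconstruction error per weight is bounded by half of that group's scale.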
# Default — GPU + torch SDPA
python synthesize.py --audio_prompt reference.wav \
--text "Sometimes it's better to just let things slide, you know?"
# GPU + FlashInfer (paged KV + CUDA graph)
python synthesize.py --audio_prompt reference.wav --text "..." --backend flashinfer
# CPU only
python synthesize.py --audio_prompt reference.wav --text "..." --backend cpu
# Apple Silicon native Metal
python synthesize.py --audio_prompt reference.wav --text "..." --backend mlx
# Multiple sentences; 8 rows per batched forward
python synthesize.py --audio_prompt reference.wav \
--text "First sentence." "Second sentence." "Third sentence." \
--batch_size 8
# From a file (one sentence per line, '#' lines and blanks ignored)
python synthesize.py --audio_prompt reference.wav --text_file sentences.txt
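The --text_file rules and the --batch_size grouping can be mirrored in a few lines when preparing inputs from Python. This is a sketch under assumptions: load_sentences and batches are hypothetical helpers, and "'#' lines" is taken to mean lines whose first non-blank character is '#'.

```python
def load_sentences(lines):
    """Keep non-blank lines that do not start with '#', stripped of whitespace."""
    sentences = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            sentences.append(stripped)
    return sentences


def batches(items, batch_size=8):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

load_sentences accepts any iterable of strings, so an open file handle can be passed directly.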
Defaults reproduce the paper's best decoding setup. CFG, when on
(--cfg_scale > 0), is locked to the production combination: zero_text_batch +
zero_all null + null-text zeroed + null-speech duplicated + PMI combined via pmi_cfg ((1+w)·pmi_c − w·pmi_u). No other CFG mode is exposed. Other defaults: OmniVoice r_n schedule (omnivoice_schedule_t_shift=0.5), position temperature T=5, precomputed unconditional block prior, S3Gen meanflow vocoding at 2 CFM steps. Override per run via --num_steps, --temperature, --time_shift_tau, --cfg_scale, --n_cfm_timesteps, --position_temperature.
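The PMI-side combination can be written out in a couple of lines. This is a sketch under assumptions, not the code in cfg_guidance.py: it takes each PMI score to be a token's log-probability minus its log-prior, treats w as the CFG scale, and uses illustrative function names.

```python
import math


def pmi(logp_model, logp_prior):
    """Pointwise mutual information in nats: log p(token | context) - log prior(token)."""
    return logp_model - logp_prior


def pmi_cfg(pmi_c, pmi_u, w):
    """Guided score from the formula above: (1 + w) * pmi_c - w * pmi_u."""
    return (1.0 + w) * pmi_c - w * pmi_u


# Illustrative numbers only: a token with prior 0.25 that the conditional
# branch assigns 0.5 and the unconditional branch assigns 0.25.
conditional = pmi(math.log(0.5), math.log(0.25))
unconditional = pmi(math.log(0.25), math.log(0.25))
guided = pmi_cfg(conditional, unconditional, w=1.0)
```

With w = 0 the guided score reduces to the conditional PMI; larger w pushes the score further away from the unconditional branch.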
Python API
import torchaudio as ta
from chatterbox_flash import ChatterboxFlashTTS
model = ChatterboxFlashTTS.from_pretrained("ResembleAI/chatterbox-flash", device="cuda")
wav = model.generate(
"Sometimes it's better to just let things slide, you know?",
audio_prompt_path="reference.wav",
)
ta.save("out.wav", wav, model.sr)
See examples/ for more demos.
Reproducing the paper benchmarks
# Downloads the released checkpoint from the Hugging Face Hub + OmniVoice's
# eval datasets/models, generates wavs, then scores SIM-o / WER / UTMOS using
# the same code OmniVoice publishes.
bash scripts/run_eval.sh
Defaults reproduce the Table 1 configuration from our paper
(block size 16, max 10 steps per block, temperature 0.6, time-shift τ=0.1,
CFG scale 1.0, zero-text-batch + pmi_cfg + zero_all null prefix).
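As a back-of-the-envelope reading of the block size and step cap, the sketch below computes an upper bound on decoder forward passes. It assumes one forward pass per denoising step and ignores early decoding, which may finish a block in fewer steps; the function name is illustrative.

```python
import math


def max_decoder_passes(num_tokens, block_size=16, max_steps_per_block=10):
    """Upper bound on forward passes: every block uses its full step budget."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    blocks = math.ceil(num_tokens / block_size)
    return blocks * max_steps_per_block


# 400 speech tokens -> 25 blocks -> at most 250 passes,
# versus 400 passes for a decoder that emits one token per pass.
bound = max_decoder_passes(400)
```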
Project layout
synthesize.py ← single CLI entry point; one ref + many texts; --backend selects engine
chatterbox_flash/
├── model.py ChatterboxFlashT3: chatterbox T3 + MASK token + block-diffusion generate()
├── tts.py ChatterboxFlashTTS — user-facing pipeline (T3 + S3Gen + VE + tokenizer)
├── cfg_guidance.py Classifier-free guidance helpers (zero-text-batch + PMI-side combination)
├── calibration.py Prior-calibrated PMI scoring
├── engines/ Pluggable inference backends
│ ├── base.py InferenceEngine protocol
│ ├── flashinfer.py Paged KV + CUDA-graph FlashInfer engine (CUDA, preferred)
│ ├── torch_sdpa.py Pure-PyTorch SDPA + DynamicCache fallback (CPU/CUDA/MPS)
│ ├── mlx.py Apple-Silicon Metal engine via mlx + mlx-lm (experimental)
│ └── __init__.py build_engine() — picks the backend, honours $CHATTERBOX_FLASH_ENGINE
├── text_norm/ English text normalization (numbers, abbreviations, dates, times, phones)
└── eval/ OmniVoice JSONL generation + WER (Whisper, seedtts-eval style)
scripts/
└── run_eval.sh Full evaluation pipeline (download → generate → score)
Citation
@article{chatterboxflash2026,
title = {Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS},
author = {Resemble AI},
year = {2026},
}
License
MIT (see LICENSE).
The base architecture, S3Gen vocoder, voice encoder and tokenizer are provided by chatterbox-tts (also MIT-licensed).