Chatterbox-Flash: prior-calibrated block-diffusion zero-shot TTS, extending Chatterbox-TTS with a parallel masked decoder.

Chatterbox-Flash


Prior-calibrated block-diffusion zero-shot TTS, extending Chatterbox-TTS with a parallel masked decoder while preserving streaming generation.

Chatterbox-Flash replaces the autoregressive T3 decoder of Chatterbox-TTS with a Fast-dLLM v2-style block-diffusion decoder, adds two inference-time techniques described in our paper — prior-calibrated PMI scoring and early decoding via a time-shifted quantile schedule — and reuses the original S3Gen flow-matching vocoder, Metavoice voice encoder, and English tokenizer unchanged.
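To make the decoding pattern concrete, here is a toy sketch of block-wise masked decoding: blocks are produced left to right (which is what keeps generation streamable), while positions inside a block are filled in parallel over a few refinement steps. This is not the package's decoder. The names `MASK`, `unmask_counts`, `decode_block`, `stream_blocks`, and `toy_propose` are hypothetical, and the linear unmasking schedule is a stand-in for the paper's time-shifted quantile schedule.

```python
MASK = -1  # hypothetical placeholder id for a position that is not decoded yet


def unmask_counts(block_size, max_steps):
    """How many positions to commit at each step (linear schedule, an assumption)."""
    counts, done = [], 0
    for step in range(1, max_steps + 1):
        target = -(-block_size * step // max_steps)  # integer ceiling
        if target > done:
            counts.append(target - done)
            done = target
    return counts


def decode_block(propose, prefix, block_size=16, max_steps=10):
    """Fill one block by repeatedly committing the most confident masked positions."""
    block = [MASK] * block_size
    for n in unmask_counts(block_size, max_steps):
        # propose() returns {position: (token, confidence)} for masked positions only.
        proposals = propose(prefix, block)
        ranked = sorted(proposals, key=lambda pos: proposals[pos][1], reverse=True)
        for pos in ranked[:n]:
            block[pos] = proposals[pos][0]
    return block


def stream_blocks(propose, prompt, num_blocks, block_size=16, max_steps=10):
    """Yield finished blocks one at a time so a consumer can start early."""
    context = list(prompt)
    for _ in range(num_blocks):
        block = decode_block(propose, context, block_size, max_steps)
        context.extend(block)
        yield block


def toy_propose(prefix, block):
    """Stand-in for the network: token = absolute position, leftmost is most confident."""
    return {pos: (len(prefix) + pos, -pos) for pos, tok in enumerate(block) if tok == MASK}
```

With `block_size=16` and `max_steps=10`, `unmask_counts` commits one or two positions per step, so each block needs at most 10 forward passes instead of 16 sequential ones.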

The released package is inference-only. Training code is intentionally omitted, but the paper's numbers can be reproduced with the scripts in this repository and the released checkpoints.

Installation

Recommended: uv

First create and activate a virtual environment:

uv venv                       # creates .venv with a compatible Python
source .venv/bin/activate     # Windows: .venv\Scripts\activate

chatterbox-tts==0.1.7 (our base dependency) pins torch==2.6.0, but most modern CUDA wheels in the stack (torchvision, xformers, flash-attn, flashinfer-python) only ship matched ABI binaries for torch 2.7.x. Our pyproject.toml declares a [tool.uv] override-dependencies section that tells uv to honour the 2.7.x pin and skip the upstream 2.6 constraint, so a single command produces a consistent environment:

# Core — uv reads tool.uv.override-dependencies automatically:
uv pip install chatterbox-flash

# Optional — high-throughput inference backend (FlashInfer, CUDA):
uv pip install "chatterbox-flash[flashinfer]"

# Optional — Apple Silicon native Metal backend (mlx + mlx-lm, macOS only):
uv pip install "chatterbox-flash[mlx]"

# Optional — full evaluation suite (SIM-o / WER / UTMOS via OmniVoice):
uv pip install "chatterbox-flash[eval]"

If the installed environment ever fails with a torchvision ABI mismatch (RuntimeError: operator torchvision::nms does not exist) or a flash_attn_2_cuda undefined-symbol error, force a torch-2.7-matched torchvision:

uv pip install 'torchvision>=0.22,<0.23'

Alternative: plain pip

pip does not understand [tool.uv], so after the initial install you need to upgrade torch, torchaudio, and torchvision manually to undo the chatterbox-tts 2.6 pin:

pip install chatterbox-flash
pip install --upgrade 'torch>=2.7,<2.8' 'torchaudio>=2.7,<2.8' 'torchvision>=0.22,<0.23'
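After either install route, it can help to confirm that the resolved versions actually fall inside the ranges used in the pip command above. The following is a minimal sketch using only the standard library; the helper names (`major_minor`, `find_mismatches`, `installed_versions`) are illustrative and not part of chatterbox-flash.

```python
import re
from importlib import metadata

# Ranges copied from the pip command above: package -> (inclusive lower, exclusive upper).
EXPECTED = {
    "torch": ((2, 7), (2, 8)),
    "torchaudio": ((2, 7), (2, 8)),
    "torchvision": ((0, 22), (0, 23)),
}


def major_minor(version):
    """'2.7.1+cu126' -> (2, 7); anything after major.minor is ignored."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(installed):
    """Return {package: version} for every entry missing or outside its range."""
    bad = {}
    for name, (low, high) in EXPECTED.items():
        version = installed.get(name)
        if version is None or not (low <= major_minor(version) < high):
            bad[name] = version
    return bad


def installed_versions():
    """Look up installed versions; packages that are not installed map to None."""
    found = {}
    for name in EXPECTED:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print(find_mismatches(installed_versions()) or "torch stack matches the 2.7.x ranges")
```

An empty result means the three packages are mutually consistent; any entry in the returned dict names a package to reinstall.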

Local install from source (development)

Clone the repository and install in editable mode so source edits take effect without reinstalling. Pick the extras for your hardware:

git clone https://github.com/resemble-ai/chatterbox-flash.git
cd chatterbox-flash

# CUDA box (FlashInfer + eval):
uv pip install -e ".[flashinfer,eval]"

# Apple Silicon (Metal backend):
uv pip install -e ".[mlx]"

# Minimal (CPU / torch SDPA only):
uv pip install -e .

Engine selection at runtime

Pick the hardware path via --backend on synthesize.py — see the table in Quick start. From the Python API, pass the lower-level engine name (flashinfer / torch / mlx):

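from chatterbox_flash import FLASHINFER_AVAILABLE, MLX_AVAILABLE, ChatterboxFlashTTS

text = "Sometimes it's better to just let things slide, you know?"
tts = ChatterboxFlashTTS.from_pretrained("ResembleAI/chatterbox-flash", device="cuda")
# backend takes the engine name: "flashinfer", "torch", or "mlx"
wav = tts.generate(text, audio_prompt_path="ref.wav", backend="torch")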

CHATTERBOX_FLASH_ENGINE={flashinfer,torch,mlx} forces an engine per process (handy in CI/CD).
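The precedence between the environment variable, an explicit engine name, and automatic selection can be pictured with the sketch below. It is not the package's `build_engine()`; `resolve_engine` is a hypothetical helper, and two behaviours are assumptions: that the environment variable wins over an explicit argument, and that the automatic order is flashinfer, then mlx, then torch.

```python
import os

ENGINES = ("flashinfer", "torch", "mlx")


def resolve_engine(requested=None, *, flashinfer_available=False,
                   mlx_available=False, env=None):
    """Pick an engine name: env override, then explicit request, then auto."""
    env = os.environ if env is None else env
    # Assumption: CHATTERBOX_FLASH_ENGINE takes priority over the caller's choice.
    choice = env.get("CHATTERBOX_FLASH_ENGINE") or requested
    if choice is not None:
        if choice not in ENGINES:
            raise ValueError(f"unknown engine {choice!r}; expected one of {ENGINES}")
        return choice
    # Assumption: prefer the accelerated engines and fall back to torch SDPA.
    if flashinfer_available:
        return "flashinfer"
    if mlx_available:
        return "mlx"
    return "torch"
```

In a CI job, exporting `CHATTERBOX_FLASH_ENGINE=torch` before starting Python would then pin every call in that process to the torch engine, regardless of which optional extras are installed.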

Quick start

A single entry point, synthesize.py, covers all four hardware paths via one --backend flag. It takes one reference voice and one or more texts (a single text is just a batch of one).

`--backend`	Engine	Device	dtype	CUDA graph	Notes
`gpu` (default)	torch SDPA	cuda	bf16	on	No JIT cold start
`flashinfer`	FlashInfer	cuda	bf16	on	Paged KV; warmup-amortised throughput
`cpu`	torch SDPA	cpu	fp16	off	CPU-only validation / Docker (`--dtype fp32` to fall back)
`mlx`	mlx (Metal)	cpu*	fp16	off	Apple Silicon native (`[mlx]` extra)

* MLX runs the LLaMA backbone on Metal; the PyTorch side stays on CPU.

Override the per-backend compute dtype with --dtype {bf16,fp16,fp32}. CPU defaults to fp16 (PyTorch 2.x has CPU fp16 kernels); use --dtype fp32 if an fp16 op is unsupported or slower on your hardware.
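The table and the --dtype override can be restated as a small lookup, which also shows how the CLI backend names map onto the lower-level engine names used by the Python API. This is an illustrative sketch only: `BackendConfig`, `BACKENDS`, and `resolve_backend` are hypothetical names, and the values are copied from the table above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BackendConfig:
    engine: str       # lower-level engine name: "torch", "flashinfer", or "mlx"
    device: str       # device for the PyTorch side
    dtype: str        # default compute dtype
    cuda_graph: bool  # whether CUDA graphs are enabled


# Values transcribed from the --backend table.
BACKENDS = {
    "gpu": BackendConfig("torch", "cuda", "bf16", True),
    "flashinfer": BackendConfig("flashinfer", "cuda", "bf16", True),
    "cpu": BackendConfig("torch", "cpu", "fp16", False),
    "mlx": BackendConfig("mlx", "cpu", "fp16", False),
}
DTYPES = ("bf16", "fp16", "fp32")


def resolve_backend(backend="gpu", dtype=None):
    """Return the defaults for a backend, optionally overriding the dtype."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}")
    config = BACKENDS[backend]
    if dtype is None:
        return config
    if dtype not in DTYPES:
        raise ValueError(f"unknown dtype {dtype!r}; expected one of {DTYPES}")
    return replace(config, dtype=dtype)
```

For example, `resolve_backend("cpu", dtype="fp32")` describes the fallback mentioned above: the torch engine on CPU with fp32 compute and no CUDA graph.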

4-bit / 8-bit quantization (MLX only): --quantize_bits {4,8} quantizes the T3 LLaMA backbone via mlx.nn.quantize (the S3Gen vocoder and voice encoder stay in --dtype). Tune the group size with the CHATTERBOX_FLASH_MLX_QUANT_GROUP env var (default 64). There is no CPU quantization path — PyTorch CPU has no native 4-bit kernels; use bf16/fp16 on CPU, or MLX quantization on Apple Silicon.

# Apple Silicon, 4-bit quantized backbone
python synthesize.py --audio_prompt reference.wav --text "..." \
    --backend mlx --quantize_bits 4

# Default — GPU + torch SDPA
python synthesize.py --audio_prompt reference.wav \
    --text "Sometimes it's better to just let things slide, you know?"

# GPU + FlashInfer (paged KV + CUDA graph)
python synthesize.py --audio_prompt reference.wav --text "..." --backend flashinfer

# CPU only
python synthesize.py --audio_prompt reference.wav --text "..." --backend cpu

# Apple Silicon native Metal
python synthesize.py --audio_prompt reference.wav --text "..." --backend mlx

# Multiple sentences; 8 rows per batched forward
python synthesize.py --audio_prompt reference.wav \
    --text "First sentence." "Second sentence." "Third sentence." \
    --batch_size 8

# From a file (one sentence per line, '#' lines and blanks ignored)
python synthesize.py --audio_prompt reference.wav --text_file sentences.txt
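The file format and batching described in the last two commands can be sketched as follows. This is not the code in synthesize.py; `read_sentences` and `batches` are hypothetical helpers, and treating any line whose first non-blank character is '#' as a comment is an assumption about how '#' lines are detected.

```python
def read_sentences(lines):
    """Keep one sentence per line, skipping blank lines and '#' comment lines."""
    sentences = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        sentences.append(stripped)
    return sentences


def batches(items, batch_size):
    """Split items into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


if __name__ == "__main__":
    demo = ["# intro", "First sentence.", "", "Second sentence.", "Third sentence."]
    for group in batches(read_sentences(demo), batch_size=2):
        print(group)
```

Each group would correspond to one batched forward pass, so three sentences with a batch size of 2 need two passes, and a single sentence is simply a batch of one.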

Defaults reproduce the paper's best decoding setup. CFG, when on (--cfg_scale > 0), is locked to the production combination: zero_text_batch + zero_all null + null-text zeroed + null-speech duplicated + PMI combined via pmi_cfg ((1+w)·pmi_c − w·pmi_u). No other CFG mode is exposed.

Other defaults: OmniVoice r_n schedule (omnivoice_schedule_t_shift=0.5), position temperature T=5, precomputed unconditional block prior, S3Gen meanflow vocoding at 2 CFM steps. Override per run via --num_steps, --temperature, --time_shift_tau, --cfg_scale, --n_cfm_timesteps, --position_temperature.
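The pmi_cfg combination is easy to check numerically. The sketch below assumes that w is the CFG scale, that pmi_c and pmi_u are per-token scores from the conditional and unconditional passes, and that a PMI score is the model log-probability minus the log of a precomputed prior. Those readings are inferred from the notation here, not confirmed against calibration.py or cfg_guidance.py, and all function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def pmi_scores(logits, log_prior):
    """Assumed PMI definition: log p(token | context) - log p_prior(token)."""
    return [lp - lq for lp, lq in zip(log_softmax(logits), log_prior)]


def pmi_cfg(pmi_c, pmi_u, w):
    """Combine conditional and unconditional scores as (1 + w) * pmi_c - w * pmi_u."""
    return [(1.0 + w) * c - w * u for c, u in zip(pmi_c, pmi_u)]


if __name__ == "__main__":
    prior = [math.log(0.5), math.log(0.25), math.log(0.25)]
    cond = pmi_scores([2.0, 1.0, 0.0], prior)
    uncond = pmi_scores([1.0, 1.0, 1.0], prior)
    print(pmi_cfg(cond, uncond, w=1.0))
```

Two properties follow directly from the formula: with w = 0 the result is just pmi_c, and when the conditional and unconditional scores agree the guidance term cancels for any w.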

Python API

import torchaudio as ta
from chatterbox_flash import ChatterboxFlashTTS

model = ChatterboxFlashTTS.from_pretrained("ResembleAI/chatterbox-flash", device="cuda")
wav = model.generate(
    "Sometimes it's better to just let things slide, you know?",
    audio_prompt_path="reference.wav",
)
ta.save("out.wav", wav, model.sr)

See examples/ for more demos.

Reproducing the paper benchmarks

# Downloads the released checkpoint from the Hugging Face Hub + OmniVoice's
# eval datasets/models, generates wavs, then scores SIM-o / WER / UTMOS using
# the same code OmniVoice publishes.
bash scripts/run_eval.sh

Defaults reproduce the Table 1 configuration from our paper (block size 16, max 10 steps per block, temperature 0.6, time-shift τ=0.1, CFG scale 1.0, zero-text-batch + pmi_cfg + zero_all null prefix).
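For a one-off run outside the evaluation script, the same values could in principle be passed explicitly to synthesize.py. The sketch below only assembles a command line from flags named in the Quick start section. Mapping "max 10 steps per block" to --num_steps is an assumption, and block size is left out because no flag for it is documented here; `build_command` is a hypothetical helper.

```python
import shlex

# Table 1 values quoted above, keyed by flags listed in Quick start.
# Assumption: --num_steps corresponds to the per-block step budget.
TABLE1_FLAGS = {
    "--num_steps": "10",
    "--temperature": "0.6",
    "--time_shift_tau": "0.1",
    "--cfg_scale": "1.0",
}


def build_command(audio_prompt, text, overrides=None):
    """Return an argv list for synthesize.py with Table 1 values made explicit."""
    flags = dict(TABLE1_FLAGS)
    flags.update(overrides or {})
    argv = ["python", "synthesize.py", "--audio_prompt", audio_prompt, "--text", text]
    for flag, value in flags.items():
        argv.extend([flag, value])
    return argv


if __name__ == "__main__":
    print(shlex.join(build_command("reference.wav", "Hello there.")))
```

Passing `overrides={"--cfg_scale": "0"}` would produce the same command with CFG switched off, which is a convenient way to script ablations around the default configuration.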

Project layout

synthesize.py             ← single CLI entry point; one ref + many texts; --backend selects engine

chatterbox_flash/
├── model.py              ChatterboxFlashT3: chatterbox T3 + MASK token + block-diffusion generate()
├── tts.py                ChatterboxFlashTTS — user-facing pipeline (T3 + S3Gen + VE + tokenizer)
├── cfg_guidance.py       Classifier-free guidance helpers (zero-text-batch + PMI-side combination)
├── calibration.py        Prior-calibrated PMI scoring
├── engines/              Pluggable inference backends
│   ├── base.py             InferenceEngine protocol
│   ├── flashinfer.py       Paged KV + CUDA-graph FlashInfer engine (CUDA, preferred)
│   ├── torch_sdpa.py       Pure-PyTorch SDPA + DynamicCache fallback (CPU/CUDA/MPS)
│   ├── mlx.py              Apple-Silicon Metal engine via mlx + mlx-lm (experimental)
│   └── __init__.py         build_engine() — picks the backend, honours $CHATTERBOX_FLASH_ENGINE
├── text_norm/            English text normalization (numbers, abbreviations, dates, times, phones)
└── eval/                 OmniVoice JSONL generation + WER (Whisper, seedtts-eval style)

scripts/
└── run_eval.sh           Full evaluation pipeline (download → generate → score)

Citation

@article{chatterboxflash2026,
  title   = {Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS},
  author  = {Resemble AI},
  year    = {2026},
}

License

MIT (see LICENSE).

The base architecture, S3Gen vocoder, voice encoder and tokenizer are provided by chatterbox-tts (also MIT-licensed).


This version: 0.1.0, released May 28, 2026.

