Chatterbox MLX: Open Source TTS and Voice Conversion for MLX. Based on Chatterbox by Resemble AI
Project description
Chatterbox MLX - Apple Silicon Optimized TTS
An MLX-optimized fork of Resemble AI's Chatterbox TTS for Apple Silicon, delivering up to 2.4x faster inference.
Installation
```bash
pip install chatterbox-mlx
```
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.11+ (tested with 3.11.12 and 3.12.12)
- ~4GB disk space for model weights
Important: Python must be compiled with lzma support. If you're using pyenv:
```bash
# Install the xz library first (provides liblzma)
brew install xz

# Then install Python (or reinstall if already installed)
pyenv install 3.11.12  # or your preferred version
```
If you see `ModuleNotFoundError: No module named '_lzma'`, install xz and reinstall Python.
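You can confirm your interpreter was built with lzma support before installing anything; this quick check uses only the standard library:

```python
# Verify the running Python was compiled with lzma support.
# A failing build raises ModuleNotFoundError: No module named '_lzma'.
import lzma

data = lzma.compress(b"hello")
assert lzma.decompress(data) == b"hello"
print("lzma support OK")
```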
CLI Usage
Generate speech directly from the terminal:
```bash
# Generate English speech (auto-generated filename)
chatterbox "Artificial intelligence has made remarkable strides in recent years, particularly in the field of natural language processing."

# Generate Spanish speech
chatterbox "La inteligencia artificial ha logrado avances notables en los últimos años." --lang es

# Use the --voice flag to provide a reference audio file for voice cloning
chatterbox "Artificial intelligence has made remarkable strides in recent years, particularly in the field of natural language processing." --voice speaker.wav

# Run the multilingual benchmark (saves to benchmark_output/)
chatterbox --benchmark --languages en es
```
CLI Options
| Option | Description | Default |
|---|---|---|
| `-o, --output` | Output WAV file path | Auto-generated |
| `-l, --lang` | Language code (`en`, `es`, `fr`, `de`, `ja`, `zh`, etc.) | `en` |
| `-v, --voice` | Reference audio for voice cloning | None |
| `--exaggeration` | Emotion intensity (0.0-1.0) | 0.5 |
| `--cfg` | Classifier-free guidance weight | 0.5 |
| `--backend` | Backend: `hybrid-mlx`, `mlx`, `pytorch` | `hybrid-mlx` |
| `--benchmark` | Run multilingual benchmark | False |
| `--languages` | Languages to benchmark | `en es fr de ja zh` |
| `--no-save-audio` | Don't save benchmark audio files | False (saves) |
| `-q, --quiet` | Suppress progress messages | False |
Quick Start
```python
import torchaudio as ta
from chatterbox.tts_mlx import ChatterboxTTSMLX

# Load model (downloads weights automatically on first run).
# Default device is "cpu"; choose "hybrid-mlx" for best performance on Apple Silicon.
model = ChatterboxTTSMLX.from_pretrained(device="hybrid-mlx")

# Generate speech
text = "Hello! This is Chatterbox running with MLX optimization on Apple Silicon."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

# Voice cloning with reference audio
wav = model.generate(
    text,
    audio_prompt_path="reference_voice.wav",
    exaggeration=0.5,  # Emotion intensity (0.0-1.0)
    cfg_weight=0.5,    # Classifier-free guidance
)
```
Long-Form Audio Generation
For texts longer than ~50 words, use chunked generation:
```python
long_text = """
Your long text here. It can span multiple paragraphs and sentences.
The generate_long method will automatically split it at sentence boundaries,
generate each chunk separately, and crossfade them together seamlessly.
"""

wav = model.generate_long(
    long_text,
    audio_prompt_path="reference_voice.wav",
    chunk_size_words=50,
    overlap_duration=0.1,
)
ta.save("long_output.wav", wav, model.sr)
```
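The sentence-boundary chunking described above can be approximated in a few lines of plain Python. This is a simplified sketch, not the package's actual splitter, which may handle abbreviations and other edge cases differently:

```python
import re

def chunk_text(text: str, chunk_size_words: int = 50) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly
    chunk_size_words words, never splitting mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > chunk_size_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_text("First sentence. " * 20, chunk_size_words=10)
print(len(chunks))  # 4 chunks of 5 two-word sentences each
```

Each chunk is then synthesized independently and the boundaries are smoothed with a crossfade over `overlap_duration` seconds.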
Acknowledgements
This project is built on top of the excellent Chatterbox TTS by Resemble AI. I'm deeply grateful for their work in creating and open-sourcing a production-grade, multilingual text-to-speech system under the MIT license.
This fork focuses specifically on MLX optimizations for Apple Silicon. If you're looking for the original project with CUDA support and the full feature set, please visit the official Resemble AI repository.
What's Different in This Fork?
This package provides native MLX acceleration for Apple Silicon Macs, achieving significant performance improvements:
| Text Length | CPU Baseline | MLX Optimized | Speedup |
|---|---|---|---|
| Short (5 words) | 8.91s | 3.70s | 2.4x faster |
| Medium (31 words) | 57.51s | 24.40s | 2.4x faster |
| Long (94 words) | 137.92s | 62.66s | 2.2x faster |
Key Optimizations
- MLX-Native T3 Model: The 520M parameter Llama 3 backbone runs entirely on MLX
- Float16 KV Cache: Up to 5.8 GB memory savings with 32% faster generation
- Hybrid Architecture: Combines MLX speed with PyTorch quality controls
- Long-Form Generation: Intelligent chunking with crossfade for extended audio
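The crossfade used to join chunks can be illustrated with NumPy. This is an illustrative sketch of a linear crossfade, not this fork's exact implementation:

```python
import numpy as np

def crossfade(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Join two audio chunks, linearly fading a out and b in
    over `overlap` samples to avoid an audible seam."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    mixed = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

sr = 24_000
overlap = int(0.1 * sr)  # 0.1 s overlap, as with overlap_duration=0.1
a = np.ones(sr)          # 1 s of dummy audio
b = np.ones(sr)
out = crossfade(a, b, overlap)
print(out.shape)  # (45600,): 2 s of audio minus the 0.1 s overlap
```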
Benchmark Results
All benchmarks run on Apple M4 (32GB RAM), macOS 15.4, Python 3.11, PyTorch 2.8.0.
English TTS Performance
| Device | Text | Words | Time | RTF |
|---|---|---|---|---|
| Hybrid-MLX | short | 5 | 4.08s | 0.65x |
| Hybrid-MLX | medium | 31 | 25.24s | 0.73x |
| Hybrid-MLX | long | 94 | 62.66s | 0.74x |
| Pure MLX | short | 5 | 3.70s | 0.69x |
| Pure MLX | medium | 31 | 24.40s | 0.72x |
| Pure MLX | long | 94 | 68.82s | 0.71x |
| CPU | short | 5 | 8.91s | 0.27x |
| CPU | medium | 31 | 57.51s | 0.33x |
| CPU | long | 94 | 137.92s | 0.34x |
Key findings:
- Hybrid-MLX recommended for production (best quality/speed balance)
- Pure MLX fastest for short texts, but quality degrades on long texts
- 2.2-2.4x speedup vs CPU baseline across all text lengths
Multilingual Performance
| Device | Language | Time | RTF |
|---|---|---|---|
| Hybrid-MLX | English | 12.25s | 0.71x |
| Hybrid-MLX | Spanish | 14.74s | 0.76x |
| Pure MLX | English | 14.55s | 0.67x |
| Pure MLX | Spanish | 13.78s | 0.75x |
| MPS | English | 19.96s | 0.50x |
| MPS | Spanish | 21.06s | 0.51x |
| CPU | English | 25.64s | 0.32x |
| CPU | Spanish | 31.31s | 0.33x |
Visual Comparison
```
GENERATION TIME COMPARISON

Short (5 words)
  CPU         ████████████████████████████████████████ 8.91s
  Hybrid-MLX  ██████████████████ 4.08s (2.2x faster)
  Pure MLX    █████████████████ 3.70s (2.4x faster)

Medium (31 words)
  CPU         ████████████████████████████████████████ 57.51s
  Hybrid-MLX  ██████████████████ 25.24s (2.3x faster)
  Pure MLX    █████████████████ 24.40s (2.4x faster)

Long (94 words)
  CPU         ████████████████████████████████████████ 137.92s
  Hybrid-MLX  ██████████████████ 62.66s (2.2x faster) ← Best quality
  Pure MLX    ████████████████████ 68.82s (2.0x faster)
```
Backend Comparison
| Backend | Description | RTF | Memory | Recommendation |
|---|---|---|---|---|
| Hybrid-MLX | T3 (MLX) + S3Gen (PyTorch/MPS) | 0.74x | ~16GB | ✅ Production use |
| Pure MLX | Everything on MLX | 0.71x | ~14GB | Minimal dependencies |
| PyTorch MPS | Full PyTorch on MPS | 0.51x | ~14GB | Fallback |
| CPU | PyTorch on CPU | 0.34x | ~14GB | Baseline |
RTF = Real-Time Factor (audio_duration / generation_time). Higher is better.
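As a concrete check on the table's numbers, RTF is just a ratio; the audio duration below is back-computed from the Hybrid-MLX long-text row and is an assumption for illustration:

```python
# RTF = audio_duration / generation_time; higher is better.
# RTF > 1.0 would mean faster-than-real-time synthesis.
def rtf(audio_duration_s: float, generation_time_s: float) -> float:
    return audio_duration_s / generation_time_s

# Hybrid-MLX long text: 62.66 s to generate roughly 46.4 s of audio
print(round(rtf(46.4, 62.66), 2))  # 0.74
```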
Running Benchmarks
You can reproduce these benchmarks on your own hardware.
English TTS Benchmark
```bash
# Full benchmark (all backends)
python benchmark_mps.py --runs 3 --validate

# Quick test with Hybrid-MLX only
python benchmark_mps.py --hybrid-mlx-only --runs 1

# CPU baseline only
python benchmark_mps.py --cpu-only --runs 1

# With voice cloning
python benchmark_mps.py --audio-prompt speaker.wav --runs 3

# Enable memory debugging
DEBUG_MEMORY=1 python benchmark_mps.py --hybrid-mlx-only
```
Options:
| Flag | Description |
|---|---|
| `--warmup N` | Warmup runs before timing (default: 1) |
| `--runs N` | Number of timed benchmark runs (default: 3) |
| `--devices` | Backends to test: `mps`, `cpu`, `hybrid-mlx`, `mlx`, `mlx-q4` |
| `--audio-prompt FILE` | Reference audio for voice cloning |
| `--output-dir DIR` | Output directory (default: `benchmark_output/`) |
| `--validate` | Enable Whisper transcription validation (computes WER) |
| `--mps-only` | Only benchmark PyTorch MPS |
| `--cpu-only` | Only benchmark CPU |
| `--hybrid-mlx-only` | Only benchmark Hybrid-MLX |
| `--mlx-only` | Only benchmark Pure MLX |
| `--debug-memory` | Enable detailed memory logging |
Multilingual Benchmark
```bash
# Test specific languages
python benchmark_multilingual.py \
    --audio-prompt speaker.wav \
    --languages en es fr de ja zh \
    --runs 3

# Quick test with Hybrid-MLX
python benchmark_multilingual.py \
    --audio-prompt speaker.wav \
    --languages en es \
    --hybrid-mlx-only

# With validation
python benchmark_multilingual.py \
    --audio-prompt speaker.wav \
    --languages en es fr \
    --validate
```
Supported Languages:
en (English), es (Spanish), fr (French), de (German), it (Italian), pt (Portuguese), ru (Russian), ja (Japanese), zh (Chinese), ko (Korean), ar (Arabic), hi (Hindi), tr (Turkish), pl (Polish), nl (Dutch), sv (Swedish), da (Danish), no (Norwegian), fi (Finnish), el (Greek), he (Hebrew), ms (Malay), sw (Swahili)
Benchmark Output
Results are saved to:

- `benchmark_output/benchmark_results.json` - English TTS results
- `benchmark_multilingual_output/multilingual_results.json` - Multilingual results
- Generated audio files: `{device}_{category}.wav`
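The JSON results are easy to post-process. A minimal sketch follows; the field names below are illustrative assumptions, not the file's documented schema, so inspect your own `benchmark_results.json` for the actual keys:

```python
import json
from pathlib import Path

# Hypothetical results structure, written here only so the
# example is self-contained and runnable.
results = [
    {"device": "hybrid-mlx", "category": "long", "time_s": 62.66},
    {"device": "cpu", "category": "long", "time_s": 137.92},
]
path = Path("benchmark_results.json")
path.write_text(json.dumps(results))

# Load the results and find the fastest backend.
loaded = json.loads(path.read_text())
fastest = min(loaded, key=lambda r: r["time_s"])
print(fastest["device"])  # hybrid-mlx
```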
Architecture
Chatterbox is a two-stage TTS pipeline. This fork accelerates the most compute-intensive component (T3) with MLX:
```
┌─────────────────────────────────────────────────────────────────┐
│                     CHATTERBOX MLX PIPELINE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌─────────────────┐  │
│  │ VoiceEncoder │     │      T3      │     │      S3Gen      │  │
│  │  (PyTorch)   │────▶│    (MLX)     │────▶│  (PyTorch/MPS)  │  │
│  │  ~2M params  │     │ 520M params  │     │   ~80M params   │  │
│  └──────────────┘     └──────────────┘     └─────────────────┘  │
│                              ▲                                  │
│                    2.4x faster with MLX                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Supported Languages
All 23 languages from the original Chatterbox are supported:
Arabic • Danish • German • Greek • English • Spanish • Finnish • French • Hebrew • Hindi • Italian • Japanese • Korean • Malay • Dutch • Norwegian • Polish • Portuguese • Russian • Swedish • Swahili • Turkish • Chinese
```python
from chatterbox.mtl_tts_mlx import ChatterboxMultilingualTTSMLX

model = ChatterboxMultilingualTTSMLX.from_pretrained(device="mps")

# French
wav = model.generate("Bonjour, comment ça va?", language_id="fr")

# Japanese
wav = model.generate("こんにちは、元気ですか？", language_id="ja")
```
Tips for Best Results
General Use
- Default settings (`exaggeration=0.5`, `cfg_weight=0.5`) work well for most cases
- Ensure reference audio matches the target language to avoid accent transfer

Expressive Speech

- Lower `cfg_weight` (~0.3) + higher `exaggeration` (~0.7) for dramatic delivery
- Higher `exaggeration` speeds up speech; lower `cfg_weight` compensates
Memory Usage
Enable debug logging to monitor memory:
```bash
DEBUG_MEMORY=1 python your_script.py
```
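The flag follows the common environment-variable gating pattern. A minimal sketch of how such a check typically works (illustrative only, not this package's internal code):

```python
import os

def memory_debug_enabled() -> bool:
    """Return True when DEBUG_MEMORY is set to a truthy value."""
    return os.environ.get("DEBUG_MEMORY", "").lower() not in ("", "0", "false")

os.environ["DEBUG_MEMORY"] = "1"
print(memory_debug_enabled())  # True
```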
Differences from Original Chatterbox
| Feature | Original (Resemble AI) | This Fork |
|---|---|---|
| Target Hardware | NVIDIA CUDA | Apple Silicon |
| ML Framework | PyTorch | MLX + PyTorch hybrid |
| T3 Inference | PyTorch | MLX (2.4x faster) |
| KV Cache | Float32 | Float16 (32% faster) |
| Long-form Audio | Basic | Chunked with crossfade |
Credits & Links
- Original Project: Resemble AI's Chatterbox
- Resemble AI: resemble.ai - For creating and open-sourcing this incredible TTS system
- Demo: Hugging Face Space
- Evaluation: Outperforms ElevenLabs
Upstream Dependencies
License
MIT License - Same as the original Chatterbox project.
Citation
If you use this project, please cite the original Chatterbox:
```bibtex
@misc{chatterboxtts2025,
  author = {{Resemble AI}},
  title = {{Chatterbox-TTS}},
  year = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note = {GitHub repository}
}
```
Download files
Source Distribution
Built Distribution
File details
Details for the file chatterbox_mlx-1.0.4.tar.gz.
File metadata
- Download URL: chatterbox_mlx-1.0.4.tar.gz
- Upload date:
- Size: 196.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0214b0ba87d55d08efc635452d010b98dcd3791c4c6809ca74bcc911dc354f84` |
| MD5 | `290156e32fb2566a30114b1c23b76ea6` |
| BLAKE2b-256 | `22b47aaa56e9c6f6e6769136b4ca96a1ce659614436210146b263402ada6ce44` |
File details
Details for the file chatterbox_mlx-1.0.4-py3-none-any.whl.
File metadata
- Download URL: chatterbox_mlx-1.0.4-py3-none-any.whl
- Upload date:
- Size: 243.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `face394ba5467ac07c1596093b6bad5870c70c74b807ef90222e434eb78e223e` |
| MD5 | `76d80e185efa03a0b6f2052e4df1483d` |
| BLAKE2b-256 | `f31d0516f1d8b93fb61196afc53bf61d6ba02bc737f6598e3b1622e6840117c2` |