
Chatterbox MLX: Open Source TTS and Voice Conversion for MLX. Based on Chatterbox by Resemble AI

Chatterbox MLX - Apple Silicon Optimized TTS


An MLX-optimized fork of Resemble AI's Chatterbox TTS for Apple Silicon, delivering up to 2.4x faster inference.


Installation

pip install chatterbox-mlx

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.11+ (tested with 3.11.12, 3.12.12, and 3.13.2)
  • ~4GB disk space for model weights

Important: Python must be compiled with lzma support. If you're using pyenv:

# Install xz library first (provides liblzma)
brew install xz

# Then install Python (or reinstall if already installed)
pyenv install 3.11.12  # or your preferred version

If you see ModuleNotFoundError: No module named '_lzma', install xz and reinstall Python as shown above.
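To confirm your interpreter was built with lzma support before downloading model weights, a minimal check (the helper name has_lzma is just for illustration) looks like this:

```python
def has_lzma() -> bool:
    """Return True if this Python build includes the lzma module
    (backed by the _lzma C extension, which requires liblzma from xz)."""
    try:
        import lzma  # noqa: F401
        return True
    except ModuleNotFoundError:
        return False

if __name__ == "__main__":
    if has_lzma():
        print("lzma support: OK")
    else:
        print("lzma missing: brew install xz, then reinstall Python")
```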


CLI Usage

Generate speech directly from the terminal:

    # Generate English speech (auto-generated filename):
    chatterbox "Artificial intelligence has made remarkable strides in recent years, particularly in the field of natural language processing."

    # Generate Spanish speech:
    chatterbox "La inteligencia artificial ha logrado avances notables en los últimos años." --lang es

    # Use the --voice flag to provide a reference audio file for voice cloning:
    chatterbox "Artificial intelligence has made remarkable strides in recent years, particularly in the field of natural language processing." --voice speaker.wav

    # Run multilingual benchmark (saves to benchmark_output/)
    chatterbox --benchmark --languages en es

CLI Options

Option           Description                                   Default
-o, --output     Output WAV file path                          Auto-generated
-l, --lang       Language code (en, es, fr, de, ja, zh, etc.)  en
-v, --voice      Reference audio for voice cloning             None
--exaggeration   Emotion intensity (0.0-1.0)                   0.5
--cfg            Classifier-free guidance weight               0.5
--backend        Backend: hybrid-mlx, mlx, pytorch             hybrid-mlx
--benchmark      Run multilingual benchmark                    False
--languages      Languages to benchmark                        en es fr de ja zh
--no-save-audio  Don't save benchmark audio files              False (saves)
-q, --quiet      Suppress progress messages                    False

Quick Start

import torchaudio as ta
from chatterbox.tts_mlx import ChatterboxTTSMLX

# Load model (downloads weights automatically on first run).
# The default device is "cpu"; use "hybrid-mlx" for best performance on Apple Silicon.
model = ChatterboxTTSMLX.from_pretrained(device="hybrid-mlx")


# Generate speech
text = "Hello! This is Chatterbox running with MLX optimization on Apple Silicon."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

# Voice cloning with reference audio
wav = model.generate(
    text,
    audio_prompt_path="reference_voice.wav",
    exaggeration=0.5,  # Emotion intensity (0.0-1.0)
    cfg_weight=0.5,    # Classifier-free guidance
)

Long-Form Audio Generation

For texts longer than ~50 words, use chunked generation:

long_text = """
Your long text here. It can span multiple paragraphs and sentences.
The generate_long method will automatically split it at sentence boundaries,
generate each chunk separately, and crossfade them together seamlessly.
"""

wav = model.generate_long(
    long_text,
    audio_prompt_path="reference_voice.wav",
    chunk_size_words=50,
    overlap_duration=0.1,
)
ta.save("long_output.wav", wav, model.sr)
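To illustrate the crossfading step, here is a minimal, self-contained sketch of a linear crossfade between two chunks (pure Python over sample lists; the actual implementation inside generate_long may differ):

```python
def crossfade(a, b, overlap):
    """Join two sample lists, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using a linear ramp."""
    if overlap <= 0:
        return a + b
    blended = []
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)  # fade weight ramps from ~0 to ~1
        blended.append(a[len(a) - overlap + i] * (1.0 - t) + b[i] * t)
    return a[:-overlap] + blended + b[overlap:]

# Two constant-amplitude chunks join without a level dip or click:
chunk_a = [1.0] * 4
chunk_b = [1.0] * 4
print(crossfade(chunk_a, chunk_b, overlap=2))  # six samples, all 1.0
```

Because the fade-in and fade-out weights sum to 1 at every overlapping sample, a constant signal passes through the joint unchanged, which is what avoids audible seams between chunks.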

๐Ÿ™ Acknowledgements

This project is built on top of the excellent Chatterbox TTS by Resemble AI. I'm deeply grateful for their work in creating and open-sourcing a production-grade, multilingual text-to-speech system under the MIT license.

This fork focuses specifically on MLX optimizations for Apple Silicon. If you're looking for the original project with CUDA support and the full feature set, please visit the official Resemble AI repository.


What's Different in This Fork?

This package provides native MLX acceleration for Apple Silicon Macs, achieving significant performance improvements:

Text Length        CPU Baseline  MLX Optimized  Speedup
Short (5 words)    8.91s         3.70s          2.4x faster
Medium (31 words)  57.51s        24.40s         2.4x faster
Long (94 words)    137.92s       62.66s         2.2x faster
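The speedup column is simply the CPU baseline time divided by the MLX time. Taking the measurements from the table above:

```python
# (CPU seconds, MLX seconds) per text length, from the table above
timings = {
    "short":  (8.91, 3.70),
    "medium": (57.51, 24.40),
    "long":   (137.92, 62.66),
}

for name, (cpu_s, mlx_s) in timings.items():
    print(f"{name}: {cpu_s / mlx_s:.1f}x faster")
# short: 2.4x faster
# medium: 2.4x faster
# long: 2.2x faster
```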

Key Optimizations

  • MLX-Native T3 Model: The 520M parameter Llama 3 backbone runs entirely on MLX
  • Float16 KV Cache: Up to 5.8 GB memory savings with 32% faster generation
  • Hybrid Architecture: Combines MLX speed with PyTorch quality controls
  • Long-Form Generation: Intelligent chunking with crossfade for extended audio
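As a back-of-the-envelope illustration of why a float16 KV cache matters: the cache grows as 2 (keys and values) × layers × KV heads × head dim × sequence length × bytes per element, so halving the element size halves the cache. The layer and head counts below are hypothetical placeholders, not the actual T3 configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # 2 tensors (keys and values) cached per transformer layer
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical Llama-style dimensions for a ~0.5B-parameter model:
layers, kv_heads, head_dim, seq_len = 30, 16, 64, 2048
fp32 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 4)
fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 2)
print(f"fp32: {fp32 / 2**20:.0f} MiB, fp16: {fp16 / 2**20:.0f} MiB "
      f"(saves {(fp32 - fp16) / 2**20:.0f} MiB)")
# fp32: 480 MiB, fp16: 240 MiB (saves 240 MiB)
```

At the longer sequence lengths and batch settings used during generation, the same halving is what produces the multi-gigabyte savings reported above.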

Benchmark Results

All benchmarks run on Apple M4 (32GB RAM), macOS 15.4, Python 3.11, PyTorch 2.8.0.

English TTS Performance

Device      Text    Words  Time     RTF
Hybrid-MLX  short   5      4.08s    0.65x
Hybrid-MLX  medium  31     25.24s   0.73x
Hybrid-MLX  long    94     62.66s   0.74x
Pure MLX    short   5      3.70s    0.69x
Pure MLX    medium  31     24.40s   0.72x
Pure MLX    long    94     68.82s   0.71x
CPU         short   5      8.91s    0.27x
CPU         medium  31     57.51s   0.33x
CPU         long    94     137.92s  0.34x

Key findings:

  • Hybrid-MLX recommended for production (best quality/speed balance)
  • Pure MLX fastest for short texts, but quality degrades on long texts
  • 2.2-2.4x speedup vs CPU baseline across all text lengths

Multilingual Performance

Device      Language  Time    RTF
Hybrid-MLX  English   12.25s  0.71x
Hybrid-MLX  Spanish   14.74s  0.76x
Pure MLX    English   14.55s  0.67x
Pure MLX    Spanish   13.78s  0.75x
MPS         English   19.96s  0.50x
MPS         Spanish   21.06s  0.51x
CPU         English   25.64s  0.32x
CPU         Spanish   31.31s  0.33x

Visual Comparison

                    GENERATION TIME COMPARISON

     Short (5 words)
     ├─ CPU        ████████████████████████████████████ 8.91s
     ├─ Hybrid-MLX ████████████████ 4.08s (2.2x faster)
     └─ Pure MLX   ███████████████ 3.70s (2.4x faster)

     Medium (31 words)
     ├─ CPU        ████████████████████████████████████████ 57.51s
     ├─ Hybrid-MLX ██████████████████ 25.24s (2.3x faster)
     └─ Pure MLX   █████████████████ 24.40s (2.4x faster)

     Long (94 words)
     ├─ CPU        ████████████████████████████████████████ 137.92s
     ├─ Hybrid-MLX ██████████████████ 62.66s (2.2x faster)  ✓ Best quality
     └─ Pure MLX   ████████████████████ 68.82s (2.0x faster)

Backend Comparison

Backend      Description                     RTF    Memory  Recommendation
Hybrid-MLX   T3 (MLX) + S3Gen (PyTorch/MPS)  0.74x  ~16GB   ✅ Production use
Pure MLX     Everything on MLX               0.71x  ~14GB   Minimal dependencies
PyTorch MPS  Full PyTorch on MPS             0.51x  ~14GB   Fallback
CPU          PyTorch on CPU                  0.34x  ~14GB   Baseline

RTF = Real-Time Factor (audio_duration / generation_time). Higher is better.
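RTF is straightforward to compute yourself. For example, the long Hybrid-MLX run above took 62.66s; assuming it produced roughly 46.4s of audio (a value inferred here purely for illustration), the 0.74x figure falls out directly:

```python
def rtf(audio_seconds, generation_seconds):
    """Real-Time Factor: values above 1.0 mean faster than real time."""
    return audio_seconds / generation_seconds

# Assumed ~46.4s of output audio generated in 62.66s:
print(f"RTF: {rtf(46.4, 62.66):.2f}x")  # RTF: 0.74x
```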


Running Benchmarks

You can reproduce these benchmarks on your own hardware.

English TTS Benchmark

# Full benchmark (all backends)
python benchmark_mps.py --runs 3 --validate

# Quick test with Hybrid-MLX only
python benchmark_mps.py --hybrid-mlx-only --runs 1

# CPU baseline only
python benchmark_mps.py --cpu-only --runs 1

# With voice cloning
python benchmark_mps.py --audio-prompt speaker.wav --runs 3

# Enable memory debugging
DEBUG_MEMORY=1 python benchmark_mps.py --hybrid-mlx-only

Options:

Flag                 Description
--warmup N           Warmup runs before timing (default: 1)
--runs N             Number of timed benchmark runs (default: 3)
--devices            Backends to test: mps, cpu, hybrid-mlx, mlx, mlx-q4
--audio-prompt FILE  Reference audio for voice cloning
--output-dir DIR     Output directory (default: benchmark_output/)
--validate           Enable Whisper transcription validation (computes WER)
--mps-only           Only benchmark PyTorch MPS
--cpu-only           Only benchmark CPU
--hybrid-mlx-only    Only benchmark Hybrid-MLX
--mlx-only           Only benchmark Pure MLX
--debug-memory       Enable detailed memory logging

Multilingual Benchmark

# Test specific languages
python benchmark_multilingual.py \
    --audio-prompt speaker.wav \
    --languages en es fr de ja zh \
    --runs 3

# Quick test with Hybrid-MLX
python benchmark_multilingual.py \
    --audio-prompt speaker.wav \
    --languages en es \
    --hybrid-mlx-only

# With validation
python benchmark_multilingual.py \
    --audio-prompt speaker.wav \
    --languages en es fr \
    --validate

Supported Languages: en (English), es (Spanish), fr (French), de (German), it (Italian), pt (Portuguese), ru (Russian), ja (Japanese), zh (Chinese), ko (Korean), ar (Arabic), hi (Hindi), tr (Turkish), pl (Polish), nl (Dutch), sv (Swedish), da (Danish), no (Norwegian), fi (Finnish), el (Greek), he (Hebrew), ms (Malay), sw (Swahili)

Benchmark Output

Results are saved to:

  • benchmark_output/benchmark_results.json - English TTS results
  • benchmark_multilingual_output/multilingual_results.json - Multilingual results
  • Generated audio files: {device}_{category}.wav

Architecture

Chatterbox is a two-stage TTS pipeline. This fork accelerates the most compute-intensive component (T3) with MLX:

┌─────────────────────────────────────────────────────────────────────┐
│                     CHATTERBOX MLX PIPELINE                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │ VoiceEncoder │    │     T3       │    │       S3Gen          │   │
│  │  (PyTorch)   │───▶│    (MLX)     │───▶│   (PyTorch/MPS)      │   │
│  │   ~2M params │    │  520M params │    │     ~80M params      │   │
│  └──────────────┘    └──────────────┘    └──────────────────────┘   │
│                            ▲                                        │
│                            │                                        │
│                    2.4x faster with MLX                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Supported Languages

All 23 languages from the original Chatterbox are supported:

Arabic • Danish • German • Greek • English • Spanish • Finnish • French • Hebrew • Hindi • Italian • Japanese • Korean • Malay • Dutch • Norwegian • Polish • Portuguese • Russian • Swedish • Swahili • Turkish • Chinese

from chatterbox.mtl_tts_mlx import ChatterboxMultilingualTTSMLX

model = ChatterboxMultilingualTTSMLX.from_pretrained(device="mps")

# French
wav = model.generate("Bonjour, comment ça va?", language_id="fr")

# Japanese
wav = model.generate("ใ“ใ‚“ใซใกใฏใ€ๅ…ƒๆฐ—ใงใ™ใ‹๏ผŸ", language_id="ja")

Tips for Best Results

General Use

  • Default settings (exaggeration=0.5, cfg_weight=0.5) work well for most cases
  • Ensure reference audio matches target language to avoid accent transfer

Expressive Speech

  • Lower cfg_weight (~0.3) + higher exaggeration (~0.7) for dramatic delivery
  • Higher exaggeration speeds up speech; lower cfg_weight compensates

Memory Usage

Enable debug logging to monitor memory:

DEBUG_MEMORY=1 python your_script.py

Differences from Original Chatterbox

Feature          Original (Resemble AI)  This Fork
Target Hardware  NVIDIA CUDA             Apple Silicon
ML Framework     PyTorch                 MLX + PyTorch hybrid
T3 Inference     PyTorch                 MLX (2.4x faster)
KV Cache         Float32                 Float16 (32% faster)
Long-form Audio  Basic                   Chunked with crossfade

Credits & Links

Upstream Dependencies


License

MIT License - Same as the original Chatterbox project.


Citation

If you use this project, please cite the original Chatterbox:

@misc{chatterboxtts2025,
  author       = {{Resemble AI}},
  title        = {{Chatterbox-TTS}},
  year         = {2025},
  howpublished = {\url{https://github.com/resemble-ai/chatterbox}},
  note         = {GitHub repository}
}
