Audiobook TTS
Local audiobook generation system using MLX-Audio for Apple Silicon Macs.
Features
- High-quality narration using Kokoro (20 English preset voices)
- Voice cloning with emotion control using Chatterbox
- Multi-speaker dialogue using Dia with [S1]/[S2] tags
- Layered configuration — sensible defaults + project-specific overrides
- Project scaffolding: audiobook-init sets up config in seconds
- ACX-compliant audio with automatic normalization
- Progress tracking with resume capability
- FastAPI server for integration with other tools
Requirements
- macOS 14.0+ (Sonoma or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- uv (required — see note below)
- ffmpeg
Why uv? Several dependencies (misaki, transformers) have Python version metadata that pip enforces too strictly on Python 3.13+. uv handles this correctly. All install commands below use uv.
Installation
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install ffmpeg
brew install ffmpeg
# Install audiobook-tts
uv add audiobook-tts
Verify installation:
audiobook-generate --list-voices
audiobook-voice-ref --list
Quick Start
1. Initialize project config
audiobook-init
This creates .audiobook/ with template configuration files:
.audiobook/
  voices.yaml     # Voice profiles (edit to add characters)
  series.yaml     # Per-book POV-to-voice mappings
  voice-refs/     # Kokoro reference clips for Chatterbox
2. Generate reference clips (for Chatterbox voice cloning)
audiobook-voice-ref --all
Generates WAV clips for all 20 English Kokoro presets in .audiobook/voice-refs/.
3. Edit configuration
Edit .audiobook/voices.yaml to define character voices and .audiobook/series.yaml to map books to POV characters. See Configuration below.
4. Generate audiobook
# With series config (automatic POV-to-voice mapping)
audiobook-generate \
--input compiled/my-manuscript.md \
--series-config .audiobook/series.yaml
# With specific voice
audiobook-generate --input manuscript.md --voice narrator_female_us
# Specific chapters
audiobook-generate --input manuscript.md --chapters 1-5
# As MP3
audiobook-generate --input manuscript.md --format mp3
CLI Commands
| Command | Purpose |
|---|---|
| audiobook-generate | Generate audiobook from manuscript |
| audiobook-init | Scaffold .audiobook/ project config |
| audiobook-voice-ref | Generate Kokoro reference clips for Chatterbox |
| audiobook-server | Start FastAPI TTS server |
audiobook-generate
audiobook-generate --input manuscript.md [options]
Options:
  --input, -i        Path to compiled manuscript or chapter directory
  --output, -o       Output directory (default: ./output)
  --chapters, -c     Chapter range (e.g., "1-5" or "1,3,5-10")
  --voice, -v        Voice profile name (default: narrator_female_us)
  --format, -f       Output format: wav or mp3 (default: wav)
  --config           Path to voices.yaml config file
  --series-config    Path to series.yaml for automatic book/voice detection
  --resume, -r       Resume from last checkpoint
  --no-normalize     Skip ACX normalization
  --list-voices      List available voice profiles
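The --chapters option accepts specs such as "1-5" or "1,3,5-10". As a rough illustration of how such a spec could be expanded, here is a minimal sketch; parse_chapters is a hypothetical helper written for this example, not the parser the CLI actually uses.

```python
def parse_chapters(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-10" into a sorted list of chapter numbers."""
    chapters: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid chapter range: {part!r}")
            chapters.update(range(start, end + 1))  # ranges are inclusive
        else:
            chapters.add(int(part))
    return sorted(chapters)


print(parse_chapters("1-5"))
print(parse_chapters("1,3,5-10"))
```

Collecting into a set means overlapping entries such as "1-3,2-4" yield each chapter once.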
audiobook-init
audiobook-init [--dir .audiobook] [--force]
audiobook-voice-ref
audiobook-voice-ref --all # Generate all 20 English presets
audiobook-voice-ref --preset af_bella # Generate single preset
audiobook-voice-ref --list # List available presets
audiobook-voice-ref --all --output-dir DIR # Custom output directory
audiobook-voice-ref --preset af_bella --text "Custom text"
Configuration
How config loading works
Configuration is layered:
- Built-in defaults: 9 Kokoro narrator voices + 1 Dia dialogue voice, audio settings, processing settings
- Project overrides (.audiobook/voices.yaml): your character voices and Chatterbox profiles merge on top
When you run audiobook-generate --series-config .audiobook/series.yaml, the CLI automatically loads .audiobook/voices.yaml from the same directory if it exists.
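To make the layering concrete, the sketch below merges a project dictionary on top of a defaults dictionary. It assumes a recursive key-by-key merge. The function name merge_layers and the sample profile values (including the af_heart preset for narrator_female_us) are assumptions for illustration and may not match what config.py does.

```python
from copy import deepcopy
from typing import Any


def merge_layers(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with `overrides` merged on top of `defaults`.

    Nested dicts are merged key by key; any other override value
    replaces the default outright. Neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents, as if already parsed from the two YAML layers.
builtin = {
    "voices": {
        "narrator_female_us": {"model": "kokoro", "voice_preset": "af_heart", "speed": 1.0},
    },
}
project = {
    "voices": {
        "voice_protagonist": {"model": "kokoro", "voice_preset": "am_liam", "speed": 1.0},
    },
}

config = merge_layers(builtin, project)
print(sorted(config["voices"]))
```

With this approach a project file only needs to list the profiles it adds or changes; everything else falls through from the defaults.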
voices.yaml — Voice profiles
voices:
  # Kokoro voice (fast, preset-based)
  voice_protagonist:
    model: kokoro
    voice_preset: am_liam
    speed: 1.0
    lang_code: a  # 'a' = American, 'b' = British
    description: "Young male - gentle, thoughtful"

  # Chatterbox voice (voice-cloned with emotion control)
  chatterbox_protagonist:
    model: chatterbox
    voice_preset: am_liam
    ref_audio: voice-refs/am_liam.wav  # Relative to this file's directory
    exaggeration: 0.5  # 0.0 = neutral, 1.0 = max emotion
    cfg_weight: 0.5
    temperature: 0.8
    description: "Young male - thoughtful [Chatterbox]"
ref_audio paths are resolved relative to the config file's directory, not CWD.
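The resolution rule can be sketched with pathlib. This is an illustration of the behavior described above, not the package's own code; resolve_ref_audio is a hypothetical name, and passing absolute paths through unchanged is an assumption.

```python
from pathlib import Path


def resolve_ref_audio(config_path: str, ref_audio: str) -> Path:
    """Resolve `ref_audio` against the directory that contains the config file."""
    ref = Path(ref_audio)
    if ref.is_absolute():
        return ref  # assumption: absolute paths are used as given
    return Path(config_path).parent / ref


print(resolve_ref_audio(".audiobook/voices.yaml", "voice-refs/am_liam.wav"))
```

Because the base is the config file's directory, the same voices.yaml keeps working no matter where audiobook-generate is launched from.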
series.yaml — Per-book POV mappings
series:
  name: "My Series"
  default_voice: narrator_female_us

books:
  1-my-first-book:
    title: "My First Book"
    default_voice: narrator_female_us
    pov_voices:
      Alice: chatterbox_alice
      Bob: chatterbox_bob

chapter_announcement:
  enabled: true
  format_string: "Chapter {number}. {title}"
  pause_after_ms: 1000
Book identifiers are matched against manuscript filenames (e.g., 1-my-first-book matches compiled/1-my-first-book-manuscript.md).
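One plausible way to implement this lookup is a substring test against the manuscript's file name, preferring the longest identifier when several match. The exact rule used by series_config.py is not documented here, so treat find_book_id and its matching strategy as assumptions.

```python
from pathlib import Path
from typing import Optional


def find_book_id(manuscript_path: str, book_ids: list[str]) -> Optional[str]:
    """Return the longest book identifier contained in the manuscript's file name."""
    stem = Path(manuscript_path).stem  # file name without directory or extension
    matches = [book_id for book_id in book_ids if book_id in stem]
    return max(matches, key=len) if matches else None


books = ["1-my-first-book", "2-my-second-book"]
print(find_book_id("compiled/1-my-first-book-manuscript.md", books))
print(find_book_id("compiled/notes.md", books))
```

When nothing matches, the sketch returns None, at which point a caller could fall back to the series-level default_voice.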
Voice Profiles
Kokoro Voices (Narration)
| Category | Voice IDs |
|---|---|
| American Female | af_heart, af_bella, af_nova, af_sky, af_nicole, af_sarah |
| American Male | am_adam, am_echo, am_eric, am_liam, am_michael, am_onyx |
| British Female | bf_alice, bf_emma, bf_isabella, bf_lily |
| British Male | bm_daniel, bm_fable, bm_george, bm_lewis |
Chatterbox (Voice Cloning)
Chatterbox clones any Kokoro voice from a reference clip, adding emotion control:
| Parameter | Default | Range | Effect |
|---|---|---|---|
| exaggeration | 0.5 | 0.0-1.0 | Emotion intensity (0 = flat, 1 = maximum) |
| cfg_weight | 0.5 | 0.0-1.0 | Text adherence (higher = more faithful) |
| temperature | 0.8 | 0.0-1.0 | Sampling randomness (lower = more consistent) |
Guidelines:
- Controlled characters (military, strategists): exaggeration: 0.2-0.3
- Moderate characters (narrators, scholars): exaggeration: 0.4-0.5
- Emotional characters (protagonists, passionate): exaggeration: 0.6-0.7
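Before a long Chatterbox run, it can help to check that a profile's values sit inside the ranges in the parameter table above. The following is a small sketch of such a check; out_of_range is a hypothetical helper, and whether the package itself validates these values is not stated here.

```python
# Ranges copied from the parameter table above.
CHATTERBOX_RANGES = {
    "exaggeration": (0.0, 1.0),
    "cfg_weight": (0.0, 1.0),
    "temperature": (0.0, 1.0),
}


def out_of_range(profile: dict) -> list[str]:
    """Return the names of Chatterbox parameters outside their documented range."""
    return [
        name
        for name, (low, high) in CHATTERBOX_RANGES.items()
        if name in profile and not low <= profile[name] <= high
    ]


print(out_of_range({"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8}))
print(out_of_range({"exaggeration": 1.4, "cfg_weight": 0.5}))
```

Parameters missing from a profile are skipped, on the assumption that the defaults from the table apply.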
Dia (Multi-Speaker Dialogue)
[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.
Supported non-verbal sounds: (laughs), (sighs), (gasps), (coughs), (clears throat), (screams), (whispers), (singing), (humming), (whistles)
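A script in this format can be checked before generation by splitting it into speaker turns. The sketch below assumes one tagged turn per line, as in the example above; parse_dialogue is a hypothetical helper, not part of the package.

```python
import re

# Matches a leading speaker tag such as [S1] or [S2], then the spoken text.
SPEAKER_TAG = re.compile(r"^\[(S\d+)\]\s*(.*)$")


def parse_dialogue(script: str) -> list[tuple[str, str]]:
    """Split a tagged script into (speaker, text) pairs; untagged lines are skipped."""
    turns = []
    for line in script.splitlines():
        match = SPEAKER_TAG.match(line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


script = """[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me."""

for speaker, text in parse_dialogue(script):
    print(speaker, "->", text)
```

Non-verbal cues such as (gasps) stay inside the text, since they are part of what the model reads.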
API Server
audiobook-server --host 0.0.0.0 --port 8000
| Endpoint | Method | Description |
|---|---|---|
| /v1/health | GET | Health check |
| /v1/voices | GET | List voices |
| /v1/generate | POST | Generate narration |
| /v1/generate/dialogue | POST | Generate dialogue |
| /v1/generate/chapter | POST | Generate chapter (background) |
| /v1/status/{job_id} | GET | Check job status |
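A client can reach these endpoints with nothing beyond the standard library. The sketch below only builds a request for /v1/generate without sending it. The JSON field names text and voice are assumptions, since the request schema is not documented here; check api/schemas.py or the server's interactive docs for the real fields.

```python
import json
import urllib.request


def build_generate_request(base_url: str, text: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/generate endpoint."""
    # Assumed payload shape: the real schema may use different field names.
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("http://localhost:8000", "Hello.", "narrator_female_us")
print(request.get_method(), request.full_url)
# With the server running, send it via: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect and adjust once the actual schema is known.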
ACX/Audible Compliance
Generated audio meets ACX requirements:
- Loudness: -20 dB LUFS (-23 to -18 dB acceptable)
- Peak: <= -3 dB true peak
- Noise floor: <= -60 dB
- Room tone: 0.5s silence at start/end
- MP3: 192 kbps CBR
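For a rough self-check of the level targets, the sketch below computes RMS and sample peak in dBFS from float samples. It is a simplification under stated assumptions: plain RMS is not LUFS (which applies perceptual weighting), sample peak is not true peak (which requires oversampling), and the noise floor is not measured. The function names are hypothetical.

```python
import math


def dbfs(value: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20 * math.log10(value) if value > 0 else float("-inf")


def acx_levels(samples: list[float]) -> dict[str, float]:
    """Return RMS and sample-peak levels in dBFS for samples in [-1.0, 1.0]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return {"rms_db": dbfs(rms), "peak_db": dbfs(peak)}


def within_acx(levels: dict[str, float]) -> bool:
    """Check the -23 to -18 dB level window and the -3 dB peak ceiling."""
    return -23.0 <= levels["rms_db"] <= -18.0 and levels["peak_db"] <= -3.0


# One second of a 440 Hz tone at amplitude 0.14, sampled at 24 kHz (illustrative values).
samples = [0.14 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
levels = acx_levels(samples)
print(levels, within_acx(levels))
```

A dedicated loudness meter remains the right tool for final verification before submitting to ACX.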
Performance
On Apple Silicon M4 Max:
| Model | Speed | Memory |
|---|---|---|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Chatterbox (fp16) | ~3-5x real-time | ~4-6 GB |
| Dia-1.6B | ~5-8x real-time | ~4-6 GB |
A 100,000-word novel (~11 hours audio):
- Kokoro: ~25-30 minutes
- Chatterbox: ~2-4 hours
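These estimates follow from dividing the audio length by the real-time factor. The short sketch below reproduces the arithmetic; the 150 words-per-minute narration pace is an assumption chosen because it yields roughly 11 hours for 100,000 words.

```python
def audio_hours(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimated narration length in hours (150 wpm is an assumed pace)."""
    return word_count / words_per_minute / 60


def generation_minutes(hours_of_audio: float, realtime_factor: float) -> float:
    """Minutes of compute needed to synthesize the audio at a given real-time factor."""
    return hours_of_audio * 60 / realtime_factor


print(round(audio_hours(100_000), 1))             # length of the novel in hours
print(round(generation_minutes(11, 25), 1))       # Kokoro at ~25x, in minutes
print(round(generation_minutes(11, 5) / 60, 1))   # Chatterbox at ~5x, in hours
print(round(generation_minutes(11, 3) / 60, 1))   # Chatterbox at ~3x, in hours
```

The same two functions can be reused to estimate run time for other manuscript lengths or for the Dia speeds in the table.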
Project Structure
audiobook-tts/
├── src/audiobook_tts/
│ ├── cli.py # audiobook-generate CLI
│ ├── init_project.py # audiobook-init CLI
│ ├── voice_ref.py # audiobook-voice-ref CLI
│ ├── server.py # FastAPI server
│ ├── config.py # Layered config loading
│ ├── series_config.py # Series/book config
│ ├── compat.py # espeak/misaki compatibility shims
│ ├── defaults/
│ │ ├── voices.yaml # Built-in default voices
│ │ ├── voices.yaml.template # Project config template
│ │ └── series.yaml.template # Series config template
│ ├── models/
│ │ ├── tts_engine.py # Kokoro, Dia, CSM, Chatterbox engines
│ │ └── model_manager.py
│ ├── processing/
│ │ ├── text_processor.py
│ │ ├── audio_processor.py
│ │ └── manuscript.py
│ └── api/
│ ├── routes.py
│ └── schemas.py
├── tests/
├── pyproject.toml
└── README.md
Troubleshooting
"Model not found" error
Models download automatically from HuggingFace on first use. Ensure you have internet connectivity.
"ffmpeg not found" error
brew install ffmpeg
Out of memory
Close other memory-intensive applications. For dialogue, use the 4-bit Dia model.
Slow generation
- Ensure you're running natively on Apple Silicon (not Rosetta)
- Use Kokoro instead of Chatterbox for faster generation
- Reduce chunk size via config:
max_chunk_chars: 300
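To show what a character limit like max_chunk_chars implies, here is a minimal sketch that packs whole sentences into chunks up to the limit. The real splitting logic in text_processor.py may differ; chunk_text and its sentence-boundary rule are assumptions for illustration.

```python
import re


def chunk_text(text: str, max_chunk_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chunk_chars`.

    A single sentence longer than the limit is kept as its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chunk_chars=9))
```

Smaller chunks mean more, shorter synthesis calls, which is the trade-off being tuned by this setting.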
License
MIT License
The TTS models (Kokoro, Dia, Chatterbox) are released under the Apache 2.0 license and are suitable for commercial use.