Audiobook TTS
Local audiobook generation system using MLX-Audio for Apple Silicon Macs.
Features
- High-quality narration using Kokoro (54 preset voices)
- Voice cloning with emotion control using Chatterbox (clones Kokoro voices with per-character emotion exaggeration)
- Multi-speaker dialogue using Dia with [S1]/[S2] tags
- ACX-compliant audio with automatic normalization
- Progress tracking with resume capability
- FastAPI server for integration with other tools
- CLI tool for batch processing
Requirements
- macOS 14.0+ (Sonoma or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- uv - Fast Python package manager
- ffmpeg
Installation
1. Install System Dependencies
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install ffmpeg
brew install ffmpeg
2. Install Package with uv
cd tools/audiobook-tts
# Create venv and install with dev dependencies
uv venv .venv
uv pip install -e ".[dev]"
This installs:
- mlx-audio - MLX-optimized TTS models
- pydub, soundfile - Audio processing
- ffmpeg-normalize - ACX-compliant normalization
- fastapi, uvicorn - API server
- rich - Beautiful CLI output
- pytest, pytest-cov - Testing (dev)
3. Verify Installation
# Test Kokoro model
uv run python -c "from mlx_audio.tts.utils import load_model; m = load_model('mlx-community/Kokoro-82M-bf16'); print('Kokoro loaded!')"
# Test CLI
uv run audiobook-generate --list-voices
# Run tests
uv run pytest tests/ -v
Usage
Command-Line Interface
Generate Full Audiobook
# From compiled manuscript
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md
# With specific voice
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --voice narrator_male_uk
Generate Specific Chapters
# Generate chapters 1-5
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1-5
# Generate specific chapters
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1,3,5-10
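Chapter selections like 1,3,5-10 expand to a set of chapter numbers. A minimal sketch of how such a spec can be parsed (parse_chapters is a hypothetical helper, not part of the CLI):

```python
def parse_chapters(spec: str) -> set[int]:
    # Expand a spec like "1,3,5-10" into {1, 3, 5, 6, 7, 8, 9, 10}
    chapters = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            chapters.update(range(int(lo), int(hi) + 1))
        else:
            chapters.add(int(part))
    return chapters
```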
Resume Interrupted Generation
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --resume
List Available Voices
audiobook-generate --list-voices
API Server
Start Server
audiobook-server --host 0.0.0.0 --port 8000
API Endpoints
# Health check
curl http://localhost:8000/v1/health
# List voices
curl http://localhost:8000/v1/voices
# Generate narration
curl -X POST http://localhost:8000/v1/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, world!", "voice": "narrator_female_us"}'
# Generate dialogue
curl -X POST http://localhost:8000/v1/generate/dialogue \
-H "Content-Type: application/json" \
-d '{"text": "[S1] Hello! [S2] Hi there. (laughs)"}'
# Generate chapter (background task)
curl -X POST http://localhost:8000/v1/generate/chapter \
-H "Content-Type: application/json" \
-d '{"text": "Chapter text here...", "chapter_number": 1}'
# Check job status
curl http://localhost:8000/v1/status/{job_id}
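The same endpoints can be driven from Python with only the standard library. The sketch below is not part of the package: it assumes the server is running locally and that the chapter endpoint returns a JSON body containing a job_id field.

```python
import json
import urllib.request

BASE = "http://localhost:8000/v1"

def post_json(path, payload):
    # POST a JSON payload to the API and decode the JSON response
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def job_status(job_id):
    # Poll the status endpoint for a background chapter job
    with urllib.request.urlopen(f"{BASE}/status/{job_id}") as resp:
        return json.load(resp)
```

A chapter job would then be submitted with post_json("/generate/chapter", {...}) and polled with job_status(...) until it reports completion.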
Python Library
from audiobook_tts.models import KokoroEngine, ModelManager
from audiobook_tts.processing import TextProcessor, AudioProcessor
from audiobook_tts.config import Config
# Initialize
config = Config.default()
model_manager = ModelManager()
text_processor = TextProcessor(max_chunk_chars=400)
audio_processor = AudioProcessor()
# Load manuscript
from pathlib import Path
text = Path("../../compiled/1-resonance-and-reason-manuscript.md").read_text()
# Process chapter
chunks = text_processor.chunk_text(text[:5000]) # First 5000 chars
# Generate audio
kokoro = model_manager.get_kokoro()
audio_segments = []
for chunk in chunks:
    for segment in kokoro.generate(chunk, voice="af_bella"):
        audio_segments.append(segment.audio)
# Combine and save
combined = audio_processor.concatenate_segments(audio_segments)
combined = audio_processor.add_silence_padding(combined)
audio_processor.save_audio(combined, Path("./output/test.wav"), normalize=True)
Voice Profiles
Kokoro Voices (Narration)
| Category | Voice IDs |
|---|---|
| American Female | af_heart, af_bella, af_nova, af_sky, af_nicole, af_sarah |
| American Male | am_adam, am_echo, am_eric, am_liam, am_michael, am_onyx |
| British Female | bf_alice, bf_emma, bf_isabella, bf_lily |
| British Male | bm_daniel, bm_fable, bm_george, bm_lewis |
Dia (Multi-Speaker Dialogue)
Use speaker tags to indicate different speakers:
[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.
Supported non-verbal sounds:
(laughs), (sighs), (gasps), (coughs), (clears throat), (screams), (whispers), (singing), (humming), (whistles)
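If you need to inspect or validate dialogue before sending it to Dia, the speaker tags are easy to split on. A minimal sketch (split_speakers is a hypothetical helper, not part of the package):

```python
import re

def split_speakers(text):
    # Split Dia-style dialogue into (tag, line) pairs
    parts = re.split(r"(\[S[12]\])", text)
    pairs, tag = [], None
    for part in parts:
        part = part.strip()
        if re.fullmatch(r"\[S[12]\]", part):
            tag = part
        elif part and tag:
            pairs.append((tag, part))
    return pairs
```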
Chatterbox Voice Cloning
Chatterbox TTS provides voice cloning with per-character emotion control. It clones Kokoro voices from reference audio clips, then adds adjustable emotion exaggeration.
Step 1: Generate Reference Clips
Generate ~15 second Kokoro reference clips for each character voice:
cd tools/audiobook-tts
uv run python scripts/generate_voice_refs.py
This creates WAV files in audiobooks/voice-refs/ (one per Kokoro preset used in the series).
Step 2: Use Chatterbox Profiles
Chatterbox profiles are pre-configured in config/voices.yaml (prefixed chatterbox_*) and mapped in config/series.yaml. Generate as usual:
uv run audiobook-generate \
--input ../../compiled/1-resonance-and-reason-manuscript.md \
--series-config config/series.yaml \
--output ./output/book1/
Tuning Chatterbox Parameters
Each Chatterbox voice profile has three tunable parameters in config/voices.yaml:
| Parameter | Default | Range | Effect |
|---|---|---|---|
| exaggeration | 0.5 | 0.0-1.0 | Emotion intensity. 0.0 = flat/neutral, 1.0 = maximum emotion |
| cfg_weight | 0.5 | 0.0-1.0 | Classifier-free guidance. Higher = more adherence to text |
| temperature | 0.8 | 0.0-1.0 | Sampling randomness. Lower = more consistent, higher = more varied |
Character-appropriate defaults:
- Controlled characters (Cassieth, Aurelius, Decimus, Basileon): exaggeration: 0.2-0.3
- Moderate characters (Dessa, Kael, Jorin, Vara): exaggeration: 0.4-0.5
- Emotional characters (Mirael, Lysa): exaggeration: 0.6-0.7
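Putting it together, a Chatterbox profile entry in config/voices.yaml might look like the fragment below. The field names beyond the three parameters above are assumptions; check the shipped voices.yaml for the real schema.

```yaml
voices:
  chatterbox_mirael:
    model: chatterbox
    reference_audio: audiobooks/voice-refs/af_heart.wav  # assumed field name and path
    exaggeration: 0.7    # emotional character
    cfg_weight: 0.5
    temperature: 0.8
```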
Switching Between Engines
To switch a character back to Kokoro, change the profile name in config/series.yaml:
# Chatterbox (voice-cloned with emotion)
Mirael: chatterbox_mirael
# Kokoro (preset voice, faster)
Mirael: voice_mirael
Both Kokoro and Chatterbox profiles are defined in config/voices.yaml — the original Kokoro profiles remain available as fallback.
Performance
Chatterbox is approximately 5-10x slower than Kokoro due to the voice cloning process:
| Model | Speed | Memory |
|---|---|---|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Chatterbox (fp16) | ~3-5x real-time | ~4-6 GB |
A 100,000-word novel takes approximately 2-4 hours with Chatterbox (vs ~30 minutes with Kokoro).
Configuration
Custom Voice Configuration
Create a config/voices.yaml file:
voices:
my_narrator:
model: kokoro
voice_preset: af_bella
speed: 0.95 # Slightly slower
lang_code: a
description: "Custom narrator voice"
audio:
sample_rate: 24000
output_format: wav
target_lufs: -20.0
true_peak: -3.0
processing:
max_chunk_chars: 400
crossfade_ms: 50
silence_padding_ms: 500
Use with CLI:
audiobook-generate --input manuscript.md --config config/voices.yaml
ACX/Audible Compliance
Generated audio meets ACX requirements:
- Sample rate: 44.1 kHz (upsampled from 24kHz)
- Bit rate: 192 kbps CBR (MP3)
- Loudness: -20 dB LUFS (-23 to -18 dB acceptable)
- Peak: ≤ -3 dB true peak
- Noise floor: ≤ -60 dB
- Room tone: 0.5s silence at start/end
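The peak limit is simple to sanity-check yourself. A minimal sketch using sample peak in dBFS (the real pipeline relies on ffmpeg-normalize, and true-peak measurement is more involved than sample peak):

```python
import math

def peak_dbfs(samples):
    # Peak level of float samples (range -1.0..1.0) in dBFS
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def meets_acx_peak(samples, limit_db=-3.0):
    # ACX requires peaks at or below -3 dB
    return peak_dbfs(samples) <= limit_db
```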
Performance
On Apple Silicon M4 Max:
| Model | Speed | Memory |
|---|---|---|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Dia-1.6B | ~5-8x real-time | ~4-6 GB |
| Dia-1.6B-4bit | ~8-12x real-time | ~2-3 GB |
A 100,000-word novel (~11 hours audio) takes approximately:
- Kokoro: ~25-30 minutes to generate
- With normalization: Add ~5-10 minutes
Troubleshooting
"Model not found" Error
Models are downloaded automatically from HuggingFace on first use. Ensure you have internet connectivity.
"ffmpeg not found" Error
Install ffmpeg:
brew install ffmpeg
Out of Memory
Use the 4-bit Dia model for dialogue:
from audiobook_tts.models import DiaEngine
dia = DiaEngine(use_4bit=True)
Slow Generation
- Ensure you're using Apple Silicon (not Rosetta)
- Close other memory-intensive applications
- Use smaller chunk sizes: pass --config with max_chunk_chars: 300
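For reference, sentence-boundary chunking under a max_chunk_chars budget can be sketched as follows (illustrative only; the real TextProcessor may differ):

```python
import re

def chunk_text(text, max_chunk_chars=400):
    # Split on sentence-ending punctuation, then pack sentences
    # into chunks no longer than max_chunk_chars
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```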
Directory Structure
tools/audiobook-tts/
├── src/audiobook_tts/
│ ├── __init__.py
│ ├── server.py # FastAPI server
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration
│ ├── models/
│ │ ├── tts_engine.py # TTS model wrappers (Kokoro, Dia, CSM, Chatterbox)
│ │ └── model_manager.py
│ ├── processing/
│ │ ├── text_processor.py # Text chunking
│ │ ├── audio_processor.py # Audio concatenation
│ │ └── manuscript.py # Manuscript handling
│ └── api/
│ ├── routes.py # API endpoints
│ └── schemas.py # Pydantic models
├── scripts/
│ └── generate_voice_refs.py # Generate Kokoro reference clips for Chatterbox
├── config/
│ ├── voices.yaml # Voice configuration (Kokoro + Chatterbox profiles)
│ └── series.yaml # Per-book POV-to-voice mappings
├── output/ # Generated audio files
├── cache/ # Progress tracking
├── pyproject.toml
└── README.md
Next Steps: Offline Audiobook Generation
Step 1: Verify Prerequisites
cd tools/audiobook-tts
# Check all dependencies
uv run python -c "from audiobook_tts.utils import get_dependency_status; print(get_dependency_status())"
# Or manually check:
ffmpeg -version # Should show version info
ffmpeg-normalize --help # Should show help
Step 2: Test with Sample Text
Before generating a full book, test with a short sample:
# Generate 30-second test
uv run python -c "
from audiobook_tts.models import KokoroEngine
from audiobook_tts.processing import AudioProcessor
from pathlib import Path
engine = KokoroEngine()
processor = AudioProcessor()
# Test generation
segments = list(engine.generate('This is a test of the audiobook generation system. The quick brown fox jumps over the lazy dog.', voice='af_bella'))
audio = processor.concatenate_segments([s.audio for s in segments])
processor.save_audio_raw(audio, Path('./output/test-sample.wav'))
print('Test audio saved to output/test-sample.wav')
"
Step 3: Generate Audiobook for a Single Chapter
# Generate Chapter 1 of Book 1
uv run audiobook-generate \
--input ../../compiled/1-resonance-and-reason-manuscript.md \
--chapters 1 \
--voice af_bella \
--format mp3 \
--output ./output/book1/
Step 4: Generate Full Book with Series Configuration
The series configuration maps POV characters to specific voices:
# Generate Book 1 with POV-aware voice selection
uv run audiobook-generate \
--input ../../compiled/1-resonance-and-reason-manuscript.md \
--series-config ./config/series.yaml \
--format mp3 \
--output ./output/book1/
# Generate Book 2
uv run audiobook-generate \
--input ../../compiled/2-vessels-and-vestments-manuscript.md \
--series-config ./config/series.yaml \
--format mp3 \
--output ./output/book2/
Step 5: Resume Interrupted Generation
Generation progress is saved automatically. To resume:
uv run audiobook-generate \
--input ../../compiled/1-resonance-and-reason-manuscript.md \
--series-config ./config/series.yaml \
--resume
Estimated Generation Times
| Book | Word Count | Estimated Audio | Generation Time* |
|---|---|---|---|
| Book 1: Resonance and Reason | 138,249 | ~15 hours | ~35 min |
| Book 2: Vessels and Vestments | 221,498 | ~24 hours | ~55 min |
| Book 4: Canon and Council | ~200,000 | ~22 hours | ~50 min |
*On Apple Silicon M4 Max with Kokoro. Add ~10-15 min for MP3 normalization.
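These estimates follow from two rough numbers: an audiobook narration rate of about 9,200 words per hour of finished audio, and Kokoro's ~25x real-time generation speed. A sketch of the arithmetic (both constants are approximations):

```python
def estimate_generation_minutes(word_count, words_per_hour=9200, realtime_factor=25):
    # Hours of finished audio, divided by the real-time generation factor
    audio_hours = word_count / words_per_hour
    return audio_hours * 60 / realtime_factor

# Book 1: 138,249 words -> ~15 hours of audio -> ~36 minutes to generate
```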
Series Configuration (config/series.yaml)
Edit config/series.yaml to customize voice assignments:
name: "A Testament of Stone"
default_voice: "af_bella" # Default narrator
books:
resonance-and-reason:
title: "Resonance and Reason"
pov_voices:
Kael: "am_michael" # Male, intense
Basileon: "bm_george" # British male, imperial
Tiberus: "am_adam" # Male, military
Vara: "bf_emma" # British female, scholarly
Jorin: "am_echo" # Male, gentle
Dessa: "af_bella" # Female, warm
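Resolving a POV character to a voice is then a lookup with a fallback to default_voice. A hypothetical sketch of that logic (the package's actual implementation may differ):

```python
def resolve_voice(series_cfg, book, pov):
    # Look up the POV character's voice, falling back to the series default
    book_cfg = series_cfg.get("books", {}).get(book, {})
    return book_cfg.get("pov_voices", {}).get(pov, series_cfg.get("default_voice"))
```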
Output Structure
Generated files are organized as:
output/
├── book1/
│ ├── chapter-01.mp3
│ ├── chapter-02.mp3
│ ├── ...
│ └── progress.json # Resume tracking
├── book2/
│ └── ...
└── test-sample.wav # Test files
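Resume works by consulting progress.json before dispatching chapters. A hypothetical sketch of that pattern (the real file schema used by audiobook-tts may differ):

```python
import json
from pathlib import Path

def pending_chapters(output_dir, all_chapters):
    # Return chapters not yet recorded as completed in progress.json
    progress_file = Path(output_dir) / "progress.json"
    done = set()
    if progress_file.exists():
        done = set(json.loads(progress_file.read_text()).get("completed", []))
    return [c for c in all_chapters if c not in done]

def mark_done(output_dir, chapter):
    # Record a completed chapter in progress.json
    progress_file = Path(output_dir) / "progress.json"
    data = json.loads(progress_file.read_text()) if progress_file.exists() else {}
    data.setdefault("completed", []).append(chapter)
    progress_file.write_text(json.dumps(data))
```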
PDF Generation (MacTeX)
MacTeX is installed but needs PATH configuration:
# Add to ~/.zshrc or ~/.bash_profile
export PATH="/Library/TeX/texbin:$PATH"
# Reload shell
source ~/.zshrc
# Verify
pdflatex --version
License
MIT License - see LICENSE file.
The TTS models (Kokoro, Dia) are released under Apache 2.0 license and are suitable for commercial use.