
Audiobook TTS

Local audiobook generation system using MLX-Audio for Apple Silicon Macs.

Features

  • High-quality narration using Kokoro (54 preset voices)
  • Voice cloning with per-character emotion control using Chatterbox (clones Kokoro reference voices with adjustable emotion exaggeration)
  • Multi-speaker dialogue using Dia with [S1]/[S2] tags
  • ACX-compliant audio with automatic normalization
  • Progress tracking with resume capability
  • FastAPI server for integration with other tools
  • CLI tool for batch processing

Requirements

  • macOS 14.0+ (Sonoma or later)
  • Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • uv - Fast Python package manager
  • ffmpeg

Installation

1. Install System Dependencies

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg
brew install ffmpeg

2. Install Package with uv

cd tools/audiobook-tts

# Create venv and install with dev dependencies
uv venv .venv
uv pip install -e ".[dev]"

This installs:

  • mlx-audio - MLX-optimized TTS models
  • pydub, soundfile - Audio processing
  • ffmpeg-normalize - ACX-compliant normalization
  • fastapi, uvicorn - API server
  • rich - Beautiful CLI output
  • pytest, pytest-cov - Testing (dev)

3. Verify Installation

# Test Kokoro model
uv run python -c "from mlx_audio.tts.utils import load_model; m = load_model('mlx-community/Kokoro-82M-bf16'); print('Kokoro loaded!')"

# Test CLI
uv run audiobook-generate --list-voices

# Run tests
uv run pytest tests/ -v

Usage

Command-Line Interface

Generate Full Audiobook

# From compiled manuscript
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md

# With specific voice
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --voice narrator_male_uk

Generate Specific Chapters

# Generate chapters 1-5
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1-5

# Generate specific chapters
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1,3,5-10

Resume Interrupted Generation

audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --resume

List Available Voices

audiobook-generate --list-voices

API Server

Start Server

audiobook-server --host 0.0.0.0 --port 8000

API Endpoints

# Health check
curl http://localhost:8000/v1/health

# List voices
curl http://localhost:8000/v1/voices

# Generate narration
curl -X POST http://localhost:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!", "voice": "narrator_female_us"}'

# Generate dialogue
curl -X POST http://localhost:8000/v1/generate/dialogue \
  -H "Content-Type: application/json" \
  -d '{"text": "[S1] Hello! [S2] Hi there. (laughs)"}'

# Generate chapter (background task)
curl -X POST http://localhost:8000/v1/generate/chapter \
  -H "Content-Type: application/json" \
  -d '{"text": "Chapter text here...", "chapter_number": 1}'

# Check job status
curl http://localhost:8000/v1/status/{job_id}

Python Library

from audiobook_tts.models import KokoroEngine, ModelManager
from audiobook_tts.processing import TextProcessor, AudioProcessor
from audiobook_tts.config import Config

# Initialize
config = Config.default()
model_manager = ModelManager()
text_processor = TextProcessor(max_chunk_chars=400)
audio_processor = AudioProcessor()

# Load manuscript
from pathlib import Path
text = Path("../../compiled/1-resonance-and-reason-manuscript.md").read_text()

# Process chapter
chunks = text_processor.chunk_text(text[:5000])  # First 5000 chars

# Generate audio
kokoro = model_manager.get_kokoro()
audio_segments = []

for chunk in chunks:
    for segment in kokoro.generate(chunk, voice="af_bella"):
        audio_segments.append(segment.audio)

# Combine and save
combined = audio_processor.concatenate_segments(audio_segments)
combined = audio_processor.add_silence_padding(combined)
audio_processor.save_audio(combined, Path("./output/test.wav"), normalize=True)

Voice Profiles

Kokoro Voices (Narration)

Category          Voice IDs
American Female   af_heart, af_bella, af_nova, af_sky, af_nicole, af_sarah
American Male     am_adam, am_echo, am_eric, am_liam, am_michael, am_onyx
British Female    bf_alice, bf_emma, bf_isabella, bf_lily
British Male      bm_daniel, bm_fable, bm_george, bm_lewis

Dia (Multi-Speaker Dialogue)

Use speaker tags to indicate different speakers:

[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.

Supported non-verbal sounds:

  • (laughs), (sighs), (gasps), (coughs)
  • (clears throat), (screams), (whispers)
  • (singing), (humming), (whistles)
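To see how the tag format decomposes, here is a small illustrative parser (not part of the package) that splits tagged dialogue into per-speaker lines:

```python
import re


def split_speakers(text: str) -> list[tuple[str, str]]:
    """Split Dia-style dialogue into (speaker, line) pairs."""
    # re.split with a capturing group yields ["", "S1", " line", "S2", " line", ...]
    parts = re.split(r"\[(S\d)\]", text)
    return [
        (tag, line.strip())
        for tag, line in zip(parts[1::2], parts[2::2])
    ]
```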

Chatterbox Voice Cloning

Chatterbox TTS provides voice cloning with per-character emotion control. It clones Kokoro voices from reference audio clips, then adds adjustable emotion exaggeration.

Step 1: Generate Reference Clips

Generate ~15 second Kokoro reference clips for each character voice:

cd tools/audiobook-tts
uv run python scripts/generate_voice_refs.py

This creates WAV files in audiobooks/voice-refs/ (one per Kokoro preset used in the series).

Step 2: Use Chatterbox Profiles

Chatterbox profiles are pre-configured in config/voices.yaml (prefixed chatterbox_*) and mapped in config/series.yaml. Generate as usual:

uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config config/series.yaml \
  --output ./output/book1/

Tuning Chatterbox Parameters

Each Chatterbox voice profile has three tunable parameters in config/voices.yaml:

Parameter      Default   Range     Effect
exaggeration   0.5       0.0-1.0   Emotion intensity. 0.0 = flat/neutral, 1.0 = maximum emotion
cfg_weight     0.5       0.0-1.0   Classifier-free guidance. Higher = more adherence to text
temperature    0.8       0.0-1.0   Sampling randomness. Lower = more consistent, higher = more varied

Character-appropriate defaults:

  • Controlled characters (Cassieth, Aurelius, Decimus, Basileon): exaggeration: 0.2-0.3
  • Moderate characters (Dessa, Kael, Jorin, Vara): exaggeration: 0.4-0.5
  • Emotional characters (Mirael, Lysa): exaggeration: 0.6-0.7
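As an illustrative sketch (not the package's authoritative schema), a Chatterbox profile for an emotional character might combine these parameters like so. The model/description fields mirror the Kokoro profile example later in this README; the ref_audio key and file name are assumptions:

```yaml
voices:
  chatterbox_mirael:
    model: chatterbox
    ref_audio: audiobooks/voice-refs/af_heart.wav  # assumed key; clip from generate_voice_refs.py
    exaggeration: 0.65   # emotional character range (0.6-0.7)
    cfg_weight: 0.5
    temperature: 0.8
    description: "Mirael - expressive, voice-cloned"
```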

Switching Between Engines

To switch a character back to Kokoro, change the profile name in config/series.yaml:

# Chatterbox (voice-cloned with emotion)
Mirael: chatterbox_mirael

# Kokoro (preset voice, faster)
Mirael: voice_mirael

Both Kokoro and Chatterbox profiles are defined in config/voices.yaml — the original Kokoro profiles remain available as fallback.

Performance

Chatterbox is approximately 5-8x slower than Kokoro due to the voice cloning process:

Model               Speed             Memory
Kokoro-82M          ~25x real-time    ~2-3 GB
Chatterbox (fp16)   ~3-5x real-time   ~4-6 GB

A 100,000-word novel takes approximately 2-4 hours with Chatterbox (vs ~30 minutes with Kokoro).

Configuration

Custom Voice Configuration

Create a config/voices.yaml file:

voices:
  my_narrator:
    model: kokoro
    voice_preset: af_bella
    speed: 0.95  # Slightly slower
    lang_code: a
    description: "Custom narrator voice"

audio:
  sample_rate: 24000
  output_format: wav
  target_lufs: -20.0
  true_peak: -3.0

processing:
  max_chunk_chars: 400
  crossfade_ms: 50
  silence_padding_ms: 500

Use with CLI:

audiobook-generate --input manuscript.md --config config/voices.yaml
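To illustrate what max_chunk_chars controls, here is a hedged sketch of greedy sentence-boundary chunking in the spirit of TextProcessor; the actual implementation in audiobook_tts may differ:

```python
import re


def chunk_text(text: str, max_chunk_chars: int = 400) -> list[str]:
    """Pack whole sentences into chunks no longer than max_chunk_chars.

    A single sentence longer than the limit becomes its own chunk rather
    than being split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```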

ACX/Audible Compliance

Generated audio meets ACX requirements:

  • Sample rate: 44.1 kHz (upsampled from 24 kHz)
  • Bit rate: 192 kbps CBR (MP3)
  • Loudness: -20 dB LUFS (-23 to -18 dB acceptable)
  • Peak: ≤ -3 dB true peak
  • Noise floor: ≤ -60 dB
  • Room tone: 0.5s silence at start/end
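These targets map onto ffmpeg-normalize flags roughly as follows. This is a sketch assuming the ffmpeg-normalize CLI is on PATH; the exact invocation inside audiobook-tts may differ:

```python
import subprocess
from pathlib import Path


def acx_normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg-normalize command line for the ACX targets above."""
    return [
        "ffmpeg-normalize", str(src),
        "-o", str(dst),
        "-t", "-20",            # target loudness: -20 LUFS
        "-tp", "-3",            # true-peak ceiling: -3 dBTP
        "-ar", "44100",         # resample 24 kHz -> 44.1 kHz
        "-c:a", "libmp3lame",   # MP3 output
        "-b:a", "192k",         # 192 kbps bitrate
    ]


if __name__ == "__main__":
    subprocess.run(acx_normalize_cmd(Path("in.wav"), Path("out.mp3")), check=True)
```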

Performance

On Apple Silicon M4 Max:

Model           Speed              Memory
Kokoro-82M      ~25x real-time     ~2-3 GB
Dia-1.6B        ~5-8x real-time    ~4-6 GB
Dia-1.6B-4bit   ~8-12x real-time   ~2-3 GB

A 100,000-word novel (~11 hours audio) takes approximately:

  • Kokoro: ~25-30 minutes to generate
  • With normalization: Add ~5-10 minutes

Troubleshooting

"Model not found" Error

Models are downloaded automatically from HuggingFace on first use. Ensure you have internet connectivity.

"ffmpeg not found" Error

Install ffmpeg:

brew install ffmpeg

Out of Memory

Use the 4-bit Dia model for dialogue:

from audiobook_tts.models import DiaEngine
dia = DiaEngine(use_4bit=True)

Slow Generation

  • Ensure you're using Apple Silicon (not Rosetta)
  • Close other memory-intensive applications
  • Use smaller chunks: set max_chunk_chars: 300 in a custom --config file

Directory Structure

tools/audiobook-tts/
├── src/audiobook_tts/
│   ├── __init__.py
│   ├── server.py          # FastAPI server
│   ├── cli.py             # Command-line interface
│   ├── config.py          # Configuration
│   ├── models/
│   │   ├── tts_engine.py  # TTS model wrappers (Kokoro, Dia, CSM, Chatterbox)
│   │   └── model_manager.py
│   ├── processing/
│   │   ├── text_processor.py   # Text chunking
│   │   ├── audio_processor.py  # Audio concatenation
│   │   └── manuscript.py       # Manuscript handling
│   └── api/
│       ├── routes.py      # API endpoints
│       └── schemas.py     # Pydantic models
├── scripts/
│   └── generate_voice_refs.py  # Generate Kokoro reference clips for Chatterbox
├── config/
│   ├── voices.yaml        # Voice configuration (Kokoro + Chatterbox profiles)
│   └── series.yaml        # Per-book POV-to-voice mappings
├── output/                # Generated audio files
├── cache/                 # Progress tracking
├── pyproject.toml
└── README.md

Next Steps: Offline Audiobook Generation

Step 1: Verify Prerequisites

cd tools/audiobook-tts

# Check all dependencies
uv run python -c "from audiobook_tts.utils import get_dependency_status; print(get_dependency_status())"

# Or manually check:
ffmpeg -version          # Should show version info
ffmpeg-normalize --help  # Should show help

Step 2: Test with Sample Text

Before generating a full book, test with a short sample:

# Generate 30-second test
uv run python -c "
from audiobook_tts.models import KokoroEngine
from audiobook_tts.processing import AudioProcessor
from pathlib import Path

engine = KokoroEngine()
processor = AudioProcessor()

# Test generation
segments = list(engine.generate('This is a test of the audiobook generation system. The quick brown fox jumps over the lazy dog.', voice='af_bella'))
audio = processor.concatenate_segments([s.audio for s in segments])
processor.save_audio_raw(audio, Path('./output/test-sample.wav'))
print('Test audio saved to output/test-sample.wav')
"

Step 3: Generate Audiobook for a Single Chapter

# Generate Chapter 1 of Book 1
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --chapters 1 \
  --voice af_bella \
  --format mp3 \
  --output ./output/book1/

Step 4: Generate Full Book with Series Configuration

The series configuration maps POV characters to specific voices:

# Generate Book 1 with POV-aware voice selection
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config ./config/series.yaml \
  --format mp3 \
  --output ./output/book1/

# Generate Book 2
uv run audiobook-generate \
  --input ../../compiled/2-vessels-and-vestments-manuscript.md \
  --series-config ./config/series.yaml \
  --format mp3 \
  --output ./output/book2/

Step 5: Resume Interrupted Generation

Generation progress is saved automatically. To resume:

uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config ./config/series.yaml \
  --resume

Estimated Generation Times

Book                            Word Count   Estimated Audio   Generation Time*
Book 1: Resonance and Reason    138,249      ~15 hours         ~35 min
Book 2: Vessels and Vestments   221,498      ~24 hours         ~55 min
Book 4: Canon and Council       ~200,000     ~22 hours         ~50 min

*On Apple Silicon M4 Max with Kokoro. Add ~10-15 min for MP3 normalization.
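The table's arithmetic can be reproduced with a narration throughput of roughly 9,300 words per finished audio hour and Kokoro's ~25x real-time speed. Both figures are inferred from this README's numbers, so treat them as assumptions:

```python
def estimate(word_count: int, words_per_hour: float = 9300.0,
             speedup: float = 25.0) -> tuple[float, float]:
    """Estimate (audio hours, generation minutes) for a manuscript."""
    audio_hours = word_count / words_per_hour
    generation_minutes = audio_hours * 60.0 / speedup
    return audio_hours, generation_minutes
```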

Series Configuration (config/series.yaml)

Edit config/series.yaml to customize voice assignments:

name: "A Testament of Stone"
default_voice: "af_bella"  # Default narrator

books:
  resonance-and-reason:
    title: "Resonance and Reason"
    pov_voices:
      Kael: "am_michael"      # Male, intense
      Basileon: "bm_george"   # British male, imperial
      Tiberus: "am_adam"      # Male, military
      Vara: "bf_emma"         # British female, scholarly
      Jorin: "am_echo"        # Male, gentle
      Dessa: "af_bella"       # Female, warm

Output Structure

Generated files are organized as:

output/
├── book1/
│   ├── chapter-01.mp3
│   ├── chapter-02.mp3
│   ├── ...
│   └── progress.json      # Resume tracking
├── book2/
│   └── ...
└── test-sample.wav        # Test files

PDF Generation (MacTeX)

MacTeX is installed but needs PATH configuration:

# Add to ~/.zshrc or ~/.bash_profile
export PATH="/Library/TeX/texbin:$PATH"

# Reload shell
source ~/.zshrc

# Verify
pdflatex --version

License

MIT License - see LICENSE file.

The TTS models (Kokoro, Dia) are released under Apache 2.0 license and are suitable for commercial use.
