Audiobook TTS

Local audiobook generation system using MLX-Audio for Apple Silicon Macs.

Features

  • High-quality narration using Kokoro (20 English preset voices)
  • Voice cloning with emotion control using Chatterbox
  • Multi-speaker dialogue using Dia with [S1]/[S2] tags
  • Layered configuration: sensible defaults plus project-specific overrides
  • Project scaffolding: audiobook-init sets up config in seconds
  • ACX-compliant audio with automatic normalization
  • Progress tracking with resume capability
  • FastAPI server for integration with other tools

Requirements

  • macOS 14.0+ (Sonoma or later)
  • Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • uv (required — see note below)
  • ffmpeg

Why uv? Several dependencies (misaki, transformers) have Python version metadata that pip enforces too strictly on Python 3.13+. uv handles this correctly. All install commands below use uv.

Installation

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg
brew install ffmpeg

# Install audiobook-tts
uv add audiobook-tts

Verify installation:

audiobook-generate --list-voices
audiobook-voice-ref --list

Quick Start

1. Initialize project config

audiobook-init

This creates .audiobook/ with template configuration files:

.audiobook/
  voices.yaml       # Voice profiles (edit to add characters)
  series.yaml       # Per-book POV-to-voice mappings
  voice-refs/       # Kokoro reference clips for Chatterbox

2. Generate reference clips (for Chatterbox voice cloning)

audiobook-voice-ref --all

Generates WAV clips for all 20 English Kokoro presets in .audiobook/voice-refs/.

3. Edit configuration

Edit .audiobook/voices.yaml to define character voices and .audiobook/series.yaml to map books to POV characters. See Configuration below.

4. Generate audiobook

# With series config (automatic POV-to-voice mapping)
audiobook-generate \
  --input compiled/my-manuscript.md \
  --series-config .audiobook/series.yaml

# With specific voice
audiobook-generate --input manuscript.md --voice narrator_female_us

# Specific chapters
audiobook-generate --input manuscript.md --chapters 1-5

# As MP3
audiobook-generate --input manuscript.md --format mp3

CLI Commands

| Command | Purpose |
|---|---|
| audiobook-generate | Generate audiobook from manuscript |
| audiobook-init | Scaffold .audiobook/ project config |
| audiobook-voice-ref | Generate Kokoro reference clips for Chatterbox |
| audiobook-server | Start FastAPI TTS server |

audiobook-generate

audiobook-generate --input manuscript.md [options]

Options:
  --input, -i        Path to compiled manuscript or chapter directory
  --output, -o       Output directory (default: ./output)
  --chapters, -c     Chapter range (e.g., "1-5" or "1,3,5-10")
  --voice, -v        Voice profile name (default: narrator_female_us)
  --format, -f       Output format: wav or mp3 (default: wav)
  --config           Path to voices.yaml config file
  --series-config    Path to series.yaml for automatic book/voice detection
  --resume, -r       Resume from last checkpoint
  --no-normalize     Skip ACX normalization
  --list-voices      List available voice profiles
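
The --chapters syntax ("1-5" or "1,3,5-10") can be illustrated with a small parser. This is only a sketch of one way to expand such a spec, not the CLI's actual implementation; the function name parse_chapter_range is hypothetical, and rejecting descending ranges is an assumption.

```python
def parse_chapter_range(spec: str) -> list[int]:
    """Expand a spec such as "1-5" or "1,3,5-10" into sorted chapter numbers."""
    chapters: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                # Assumption: a descending range is treated as a user error.
                raise ValueError(f"descending range: {part!r}")
            chapters.update(range(start, end + 1))
        else:
            chapters.add(int(part))
    return sorted(chapters)


print(parse_chapter_range("1-5"))       # [1, 2, 3, 4, 5]
print(parse_chapter_range("1,3,5-10"))  # [1, 3, 5, 6, 7, 8, 9, 10]
```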

audiobook-init

audiobook-init [--dir .audiobook] [--force]

audiobook-voice-ref

audiobook-voice-ref --all                    # Generate all 20 English presets
audiobook-voice-ref --preset af_bella        # Generate single preset
audiobook-voice-ref --list                   # List available presets
audiobook-voice-ref --all --output-dir DIR   # Custom output directory
audiobook-voice-ref --preset af_bella --text "Custom text"

Configuration

How config loading works

Configuration is layered:

  1. Built-in defaults — 9 Kokoro narrator voices + 1 Dia dialogue voice, audio settings, processing settings
  2. Project overrides (.audiobook/voices.yaml) — your character voices and Chatterbox profiles merge on top

When you run audiobook-generate --series-config .audiobook/series.yaml, the CLI automatically loads .audiobook/voices.yaml from the same directory if it exists.
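
To make the layering concrete, here is a minimal sketch of a recursive dictionary merge in which project values win over built-in defaults. The README does not show how config.py merges the two layers, so the merge_layers helper and the sample dictionaries are assumptions for illustration only.

```python
from copy import deepcopy


def merge_layers(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides merged on top, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for the built-in defaults and a project voices.yaml.
builtin = {"voices": {"narrator_female_us": {"model": "kokoro", "speed": 1.0}}}
project = {"voices": {"voice_protagonist": {"model": "kokoro", "voice_preset": "am_liam"}}}

config = merge_layers(builtin, project)
print(sorted(config["voices"]))  # ['narrator_female_us', 'voice_protagonist']
```

With this approach a project file only needs to list the voices it adds or changes; everything else falls through to the defaults.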

voices.yaml — Voice profiles

voices:
  # Kokoro voice (fast, preset-based)
  voice_protagonist:
    model: kokoro
    voice_preset: am_liam
    speed: 1.0
    lang_code: a    # 'a' = American, 'b' = British
    description: "Young male - gentle, thoughtful"

  # Chatterbox voice (voice-cloned with emotion control)
  chatterbox_protagonist:
    model: chatterbox
    voice_preset: am_liam
    ref_audio: voice-refs/am_liam.wav   # Relative to this file's directory
    exaggeration: 0.5                    # 0.0 = neutral, 1.0 = max emotion
    cfg_weight: 0.5
    temperature: 0.8
    description: "Young male - thoughtful [Chatterbox]"

ref_audio paths are resolved relative to the config file's directory, not CWD.
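
The path rule can be sketched with pathlib. The helper name resolve_ref_audio is hypothetical, and passing absolute paths through unchanged is an assumption rather than documented behavior.

```python
from pathlib import Path


def resolve_ref_audio(config_path: str, ref_audio: str) -> Path:
    """Resolve ref_audio against the config file's directory, not the CWD."""
    ref = Path(ref_audio)
    if ref.is_absolute():
        return ref  # assumption: absolute paths are used as given
    return Path(config_path).parent / ref


resolved = resolve_ref_audio(".audiobook/voices.yaml", "voice-refs/am_liam.wav")
print(resolved.as_posix())  # .audiobook/voice-refs/am_liam.wav
```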

series.yaml — Per-book POV mappings

series:
  name: "My Series"
  default_voice: narrator_female_us

books:
  1-my-first-book:
    title: "My First Book"
    default_voice: narrator_female_us
    pov_voices:
      Alice: chatterbox_alice
      Bob: chatterbox_bob
    chapter_announcement:
      enabled: true
      format_string: "Chapter {number}. {title}"
      pause_after_ms: 1000

Book identifiers are matched against manuscript filenames (e.g., 1-my-first-book matches compiled/1-my-first-book-manuscript.md).
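
As an illustration of how a series.yaml entry might be applied, the sketch below picks the book whose identifier prefixes the manuscript's file name and then fills in the announcement template. Both behaviors are assumptions: the README only says identifiers are "matched against" filenames, and filling format_string with str.format is a guess based on its placeholder syntax. The find_book helper is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Hypothetical in-memory form of the books section of series.yaml.
books = {
    "1-my-first-book": {
        "title": "My First Book",
        "chapter_announcement": {"format_string": "Chapter {number}. {title}"},
    },
}


def find_book(manuscript: str, books: dict) -> Optional[str]:
    """Return the longest book identifier that prefixes the manuscript's file name."""
    stem = Path(manuscript).stem
    matches = [book_id for book_id in books if stem.startswith(book_id)]
    return max(matches, key=len) if matches else None


book_id = find_book("compiled/1-my-first-book-manuscript.md", books)
template = books[book_id]["chapter_announcement"]["format_string"]
print(book_id)                                        # 1-my-first-book
print(template.format(number=3, title="The Door"))    # Chapter 3. The Door
```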

Voice Profiles

Kokoro Voices (Narration)

| Category | Voice IDs |
|---|---|
| American Female | af_heart, af_bella, af_nova, af_sky, af_nicole, af_sarah |
| American Male | am_adam, am_echo, am_eric, am_liam, am_michael, am_onyx |
| British Female | bf_alice, bf_emma, bf_isabella, bf_lily |
| British Male | bm_daniel, bm_fable, bm_george, bm_lewis |

Chatterbox (Voice Cloning)

Chatterbox clones any Kokoro voice from a reference clip, adding emotion control:

| Parameter | Default | Range | Effect |
|---|---|---|---|
| exaggeration | 0.5 | 0.0-1.0 | Emotion intensity (0 = flat, 1 = maximum) |
| cfg_weight | 0.5 | 0.0-1.0 | Text adherence (higher = more faithful) |
| temperature | 0.8 | 0.0-1.0 | Sampling randomness (lower = more consistent) |

Guidelines:

  • Controlled characters (military, strategists): exaggeration: 0.2-0.3
  • Moderate characters (narrators, scholars): exaggeration: 0.4-0.5
  • Emotional characters (protagonists, passionate): exaggeration: 0.6-0.7

Dia (Multi-Speaker Dialogue)

[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.

Supported non-verbal sounds: (laughs), (sighs), (gasps), (coughs), (clears throat), (screams), (whispers), (singing), (humming), (whistles)
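
Before handing a script to Dia, it can help to check that every line carries a speaker tag. The following is a small validation sketch, not the parser used by Dia or by this package; parse_dialogue and both regular expressions are hypothetical.

```python
import re

SPEAKER_LINE = re.compile(r"^\[(S\d+)\]\s*(.*)$")
NON_VERBAL = re.compile(r"\(([a-z ]+)\)")


def parse_dialogue(script: str) -> list[tuple[str, str, list[str]]]:
    """Split a tagged script into (speaker, text, non_verbal_cues) turns."""
    turns = []
    for line in script.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match is None:
            continue  # skip blank or untagged lines
        speaker, text = match.groups()
        turns.append((speaker, text, NON_VERBAL.findall(text)))
    return turns


script = """[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me."""

turns = parse_dialogue(script)
for turn in turns:
    print(turn)
```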

API Server

audiobook-server --host 0.0.0.0 --port 8000

| Endpoint | Method | Description |
|---|---|---|
| /v1/health | GET | Health check |
| /v1/voices | GET | List voices |
| /v1/generate | POST | Generate narration |
| /v1/generate/dialogue | POST | Generate dialogue |
| /v1/generate/chapter | POST | Generate chapter (background) |
| /v1/status/{job_id} | GET | Check job status |
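
A client for the two GET endpoints can be sketched with the standard library alone. The request bodies for the POST endpoints are not documented here, so they are left out. The sketch assumes the server is reachable on localhost:8000 and returns JSON; endpoint and get_json are hypothetical helper names.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumes the server was started on port 8000


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Build a full URL for one of the documented endpoint paths."""
    return urljoin(base_url + "/", path.lstrip("/"))


def get_json(path: str, timeout: float = 10.0):
    """GET an endpoint and decode the body, assuming it is JSON."""
    with urlopen(endpoint(path), timeout=timeout) as response:
        return json.load(response)


# Usage, with the server running:
#   print(get_json("/v1/health"))
#   print(get_json("/v1/voices"))
```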

ACX/Audible Compliance

Generated audio meets ACX requirements:

  • Loudness: -20 dB LUFS (-23 to -18 dB acceptable)
  • Peak: <= -3 dB true peak
  • Noise floor: <= -60 dB
  • Room tone: 0.5s silence at start/end
  • MP3: 192 kbps CBR
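
The numeric thresholds above can be expressed as a simple check over values measured with an external tool. This is a sketch only: the Measurement class and acx_problems function are hypothetical, and the code does not measure audio itself.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    loudness_db: float     # loudness as reported by your measurement tool
    true_peak_db: float
    noise_floor_db: float


def acx_problems(m: Measurement) -> list[str]:
    """Return the thresholds from the list above that a measurement violates."""
    problems = []
    if not -23.0 <= m.loudness_db <= -18.0:
        problems.append("loudness outside -23 to -18 dB")
    if m.true_peak_db > -3.0:
        problems.append("peak above -3 dB")
    if m.noise_floor_db > -60.0:
        problems.append("noise floor above -60 dB")
    return problems


print(acx_problems(Measurement(-20.0, -3.5, -65.0)))  # []
print(len(acx_problems(Measurement(-16.0, -1.0, -55.0))))  # 3
```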

Performance

On Apple Silicon M4 Max:

| Model | Speed | Memory |
|---|---|---|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Chatterbox (fp16) | ~3-5x real-time | ~4-6 GB |
| Dia-1.6B | ~5-8x real-time | ~4-6 GB |

A 100,000-word novel (~11 hours audio):

  • Kokoro: ~25-30 minutes
  • Chatterbox: ~2-4 hours
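
These estimates follow from dividing the audio duration by the real-time factor. A short worked calculation, using the approximate figures from the table (generation_minutes is a hypothetical helper):

```python
def generation_minutes(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock minutes needed to synthesize audio at a given speed-up."""
    return audio_hours * 60 / realtime_factor


audio_hours = 11  # ~100,000 words, per the estimate above

print(round(generation_minutes(audio_hours, 25), 1))      # Kokoro at ~25x: 26.4 minutes
print(round(generation_minutes(audio_hours, 5) / 60, 1))  # Chatterbox at ~5x: 2.2 hours
print(round(generation_minutes(audio_hours, 3) / 60, 1))  # Chatterbox at ~3x: 3.7 hours
```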

Project Structure

audiobook-tts/
├── src/audiobook_tts/
│   ├── cli.py              # audiobook-generate CLI
│   ├── init_project.py     # audiobook-init CLI
│   ├── voice_ref.py        # audiobook-voice-ref CLI
│   ├── server.py           # FastAPI server
│   ├── config.py           # Layered config loading
│   ├── series_config.py    # Series/book config
│   ├── compat.py           # espeak/misaki compatibility shims
│   ├── defaults/
│   │   ├── voices.yaml           # Built-in default voices
│   │   ├── voices.yaml.template  # Project config template
│   │   └── series.yaml.template  # Series config template
│   ├── models/
│   │   ├── tts_engine.py   # Kokoro, Dia, CSM, Chatterbox engines
│   │   └── model_manager.py
│   ├── processing/
│   │   ├── text_processor.py
│   │   ├── audio_processor.py
│   │   └── manuscript.py
│   └── api/
│       ├── routes.py
│       └── schemas.py
├── tests/
├── pyproject.toml
└── README.md

Troubleshooting

"Model not found" error

Models download automatically from HuggingFace on first use. Ensure you have internet connectivity.

"ffmpeg not found" error

brew install ffmpeg

Out of memory

Close other memory-intensive applications. For dialogue, use the 4-bit Dia model.

Slow generation

  • Ensure you're running natively on Apple Silicon (not Rosetta)
  • Use Kokoro instead of Chatterbox for faster generation
  • Reduce chunk size via config: max_chunk_chars: 300
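
To show what a character limit such as max_chunk_chars implies, here is a sketch of greedy sentence packing. The package's text_processor.py is not shown in this README, so this is an assumed strategy rather than its actual algorithm; chunk_text is a hypothetical name.

```python
import re


def chunk_text(text: str, max_chunk_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks up to max_chunk_chars.

    A single sentence longer than the limit is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chunk_chars=9))  # ['One. Two.', 'Three.']
```

Smaller chunks mean more, shorter synthesis calls, which is presumably why lowering the limit is suggested here.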

License

MIT License

The TTS models are released under permissive licenses (Kokoro and Dia under Apache 2.0, Chatterbox under MIT) and are suitable for commercial use.
