Audiobook TTS

Local audiobook generation system using MLX-Audio for Apple Silicon Macs.

Features

High-quality narration using Kokoro (20 English preset voices)
Voice cloning with emotion control using Chatterbox
Multi-speaker dialogue using Dia with [S1]/[S2] tags
Layered configuration — sensible defaults + project-specific overrides
Project scaffolding — audiobook-init sets up config in seconds
ACX-compliant audio with automatic normalization
Progress tracking with resume capability
FastAPI server for integration with other tools

Requirements

macOS 14.0+ (Sonoma or later)
Apple Silicon (M1/M2/M3/M4)
Python 3.10+
uv (required — see note below)
ffmpeg

Why uv? Several dependencies (misaki, transformers) have Python version metadata that pip enforces too strictly on Python 3.13+. uv handles this correctly. All install commands below use uv.

Installation

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg
brew install ffmpeg

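# Install audiobook-tts into the current uv project
# (uv add needs a pyproject.toml; run "uv init" first if there is none)
uv add audiobook-tts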

Verify installation (prefix the commands with uv run if the project's virtual environment is not activated):

audiobook-generate --list-voices
audiobook-voice-ref --list

Quick Start

1. Initialize project config

audiobook-init

This creates .audiobook/ with template configuration files:

.audiobook/
  voices.yaml       # Voice profiles (edit to add characters)
  series.yaml       # Per-book POV-to-voice mappings
  voice-refs/       # Kokoro reference clips for Chatterbox

2. Generate reference clips (for Chatterbox voice cloning)

audiobook-voice-ref --all

Generates WAV clips for all 20 English Kokoro presets in .audiobook/voice-refs/.

3. Edit configuration

Edit .audiobook/voices.yaml to define character voices and .audiobook/series.yaml to map books to POV characters. See Configuration below.

4. Generate audiobook

# With series config (automatic POV-to-voice mapping)
audiobook-generate \
  --input compiled/my-manuscript.md \
  --series-config .audiobook/series.yaml

# With specific voice
audiobook-generate --input manuscript.md --voice narrator_female_us

# Specific chapters
audiobook-generate --input manuscript.md --chapters 1-5

# As MP3
audiobook-generate --input manuscript.md --format mp3

CLI Commands

Command	Purpose
`audiobook-generate`	Generate audiobook from manuscript
`audiobook-init`	Scaffold `.audiobook/` project config
`audiobook-voice-ref`	Generate Kokoro reference clips for Chatterbox
`audiobook-server`	Start FastAPI TTS server

audiobook-generate

audiobook-generate --input manuscript.md [options]

Options:
  --input, -i        Path to compiled manuscript or chapter directory
  --output, -o       Output directory (default: ./output)
  --chapters, -c     Chapter range (e.g., "1-5" or "1,3,5-10")
  --voice, -v        Voice profile name (default: narrator_female_us)
  --format, -f       Output format: wav or mp3 (default: wav)
  --config           Path to voices.yaml config file
  --series-config    Path to series.yaml for automatic book/voice detection
  --resume, -r       Resume from last checkpoint
  --no-normalize     Skip ACX normalization
  --list-voices      List available voice profiles
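
The --chapters option accepts specs such as "1-5" or "1,3,5-10". As a rough illustration of how such a spec can be expanded into chapter numbers, here is a minimal sketch. The function name parse_chapter_range is hypothetical and is not taken from the package; the real CLI may handle edge cases differently.

```python
def parse_chapter_range(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-10" into a sorted list of chapter numbers."""
    chapters: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            chapters.update(range(start, end + 1))
        else:
            chapters.add(int(part))
    return sorted(chapters)


print(parse_chapter_range("1-5"))       # [1, 2, 3, 4, 5]
print(parse_chapter_range("1,3,5-10"))  # [1, 3, 5, 6, 7, 8, 9, 10]
```

Using a set means overlapping parts such as "1-3,2-4" collapse to each chapter once.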

audiobook-init

audiobook-init [--dir .audiobook] [--force]

audiobook-voice-ref

audiobook-voice-ref --all                    # Generate all 20 English presets
audiobook-voice-ref --preset af_bella        # Generate single preset
audiobook-voice-ref --list                   # List available presets
audiobook-voice-ref --all --output-dir DIR   # Custom output directory
audiobook-voice-ref --preset af_bella --text "Custom text"

Configuration

How config loading works

Configuration is layered:

Built-in defaults — 9 Kokoro narrator voices + 1 Dia dialogue voice, audio settings, processing settings
Project overrides (.audiobook/voices.yaml) — your character voices and Chatterbox profiles merge on top

When you run audiobook-generate --series-config .audiobook/series.yaml, the CLI automatically loads .audiobook/voices.yaml from the same directory if it exists.
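
The layering described above can be pictured as a recursive dictionary merge. The sketch below is an assumption about the general technique, not the package's actual config.py. The function name merge_config, the "processing" key, and the default value 400 are illustrative only.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict with `overrides` layered on top of `defaults`."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested mappings
        else:
            merged[key] = deepcopy(value)  # scalars and lists replace the default
    return merged


# Illustrative data only; the real built-in defaults are not reproduced here.
defaults = {
    "voices": {"narrator_female_us": {"model": "kokoro", "speed": 1.0}},
    "processing": {"max_chunk_chars": 400},
}
project = {
    "voices": {"voice_protagonist": {"model": "kokoro", "voice_preset": "am_liam"}},
    "processing": {"max_chunk_chars": 300},
}
config = merge_config(defaults, project)
print(sorted(config["voices"]))                 # both voices are present
print(config["processing"]["max_chunk_chars"])  # the project value wins
```

With this approach a project file only needs to list what differs from the defaults, and the defaults themselves are never mutated.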

voices.yaml — Voice profiles

voices:
  # Kokoro voice (fast, preset-based)
  voice_protagonist:
    model: kokoro
    voice_preset: am_liam
    speed: 1.0
    lang_code: a    # 'a' = American, 'b' = British
    description: "Young male - gentle, thoughtful"

  # Chatterbox voice (voice-cloned with emotion control)
  chatterbox_protagonist:
    model: chatterbox
    voice_preset: am_liam
    ref_audio: voice-refs/am_liam.wav   # Relative to this file's directory
    exaggeration: 0.5                    # 0.0 = neutral, 1.0 = max emotion
    cfg_weight: 0.5
    temperature: 0.8
    description: "Young male - thoughtful [Chatterbox]"

ref_audio paths are resolved relative to the config file's directory, not CWD.
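
To make the path rule concrete, the following sketch resolves a ref_audio value against the config file's directory using pathlib. The helper name resolve_ref_audio is hypothetical, and leaving absolute paths untouched is an assumption rather than documented behavior.

```python
from pathlib import Path


def resolve_ref_audio(config_path: str, ref_audio: str) -> Path:
    """Resolve `ref_audio` against the directory containing the config file."""
    ref = Path(ref_audio)
    if ref.is_absolute():
        return ref  # assumption: absolute paths are used as given
    return Path(config_path).parent / ref


resolved = resolve_ref_audio(".audiobook/voices.yaml", "voice-refs/am_liam.wav")
print(resolved)
```

Because the result depends only on the config file's location, the same voices.yaml works no matter which directory audiobook-generate is started from.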

series.yaml — Per-book POV mappings

series:
  name: "My Series"
  default_voice: narrator_female_us

books:
  1-my-first-book:
    title: "My First Book"
    default_voice: narrator_female_us
    pov_voices:
      Alice: chatterbox_alice
      Bob: chatterbox_bob
    chapter_announcement:
      enabled: true
      format_string: "Chapter {number}. {title}"
      pause_after_ms: 1000

Book identifiers are matched against manuscript filenames (e.g., 1-my-first-book matches compiled/1-my-first-book-manuscript.md).

Voice Profiles

Kokoro Voices (Narration)

Category	Voice IDs
American Female	`af_heart`, `af_bella`, `af_nova`, `af_sky`, `af_nicole`, `af_sarah`
American Male	`am_adam`, `am_echo`, `am_eric`, `am_liam`, `am_michael`, `am_onyx`
British Female	`bf_alice`, `bf_emma`, `bf_isabella`, `bf_lily`
British Male	`bm_daniel`, `bm_fable`, `bm_george`, `bm_lewis`

Chatterbox (Voice Cloning)

Chatterbox clones any Kokoro voice from a reference clip, adding emotion control:

Parameter	Default	Range	Effect
`exaggeration`	0.5	0.0-1.0	Emotion intensity (0 = flat, 1 = maximum)
`cfg_weight`	0.5	0.0-1.0	Text adherence (higher = more faithful)
`temperature`	0.8	0.0-1.0	Sampling randomness (lower = more consistent)

Guidelines:

Controlled characters (military, strategists): exaggeration: 0.2-0.3
Moderate characters (narrators, scholars): exaggeration: 0.4-0.5
Emotional characters (protagonists, passionate): exaggeration: 0.6-0.7
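
The parameter ranges and guideline bands above can be captured in a small validating container. This is only a sketch: the ChatterboxParams class and the preset labels are hypothetical and do not come from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatterboxParams:
    """Hypothetical container mirroring the three parameters in the table above."""
    exaggeration: float = 0.5
    cfg_weight: float = 0.5
    temperature: float = 0.8

    def __post_init__(self) -> None:
        # Reject values outside the documented 0.0-1.0 range.
        for name in ("exaggeration", "cfg_weight", "temperature"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")


# Midpoints of the guideline bands above, keyed by hypothetical labels.
EXAGGERATION_PRESETS = {"controlled": 0.25, "moderate": 0.45, "emotional": 0.65}

stoic_general = ChatterboxParams(exaggeration=EXAGGERATION_PRESETS["controlled"])
print(stoic_general)
```

Validating at construction time surfaces a mistyped value such as exaggeration: 5 before any audio is generated.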

Dia (Multi-Speaker Dialogue)

[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.

Supported non-verbal sounds: (laughs), (sighs), (gasps), (coughs), (clears throat), (screams), (whispers), (singing), (humming), (whistles)
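
When preparing dialogue scripts, it can help to check the tags before sending text to the model. The following sketch splits a tagged script into speaker turns and lists any parenthesized cues. The parse_dialogue helper and its regular expressions are illustrative assumptions, not how Dia or this package processes input.

```python
import re

SPEAKER_LINE = re.compile(r"^\[(S\d+)\]\s*(.*)$")
NON_VERBAL = re.compile(r"\(([a-z ]+)\)")


def parse_dialogue(script: str) -> list[tuple[str, str, list[str]]]:
    """Split a tagged script into (speaker, text, non_verbal_cues) turns."""
    turns = []
    for line in script.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if not match:
            continue  # skip blank or untagged lines
        speaker, text = match.groups()
        turns.append((speaker, text, NON_VERBAL.findall(text)))
    return turns


script = """[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me."""
for turn in parse_dialogue(script):
    print(turn)
```

The extracted cues could then be compared against the supported list above to catch typos such as (gasp).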

API Server

audiobook-server --host 0.0.0.0 --port 8000

Endpoint	Method	Description
`/v1/health`	GET	Health check
`/v1/voices`	GET	List voices
`/v1/generate`	POST	Generate narration
`/v1/generate/dialogue`	POST	Generate dialogue
`/v1/generate/chapter`	POST	Generate chapter (background)
`/v1/status/{job_id}`	GET	Check job status
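
A client can reach these endpoints with the Python standard library alone. The sketch below builds a POST request for /v1/generate and defines a health check. The JSON field names "text" and "voice" are assumptions, since the request schema is not documented here; check the server's own schema before relying on them.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # matches the --port value shown above


def build_generate_request(text: str, voice: str) -> Request:
    """Build (but do not send) a POST request for /v1/generate."""
    # The "text" and "voice" field names are assumptions, not the documented schema.
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return Request(
        f"{BASE_URL}/v1/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_health(timeout: float = 5.0) -> int:
    """Return the HTTP status of GET /v1/health (requires a running server)."""
    with urlopen(f"{BASE_URL}/v1/health", timeout=timeout) as response:
        return response.status


request = build_generate_request("Hello, listener.", "narrator_female_us")
print(request.get_method(), request.full_url)
```

Passing the request to urlopen would send it. The response format is likewise not documented here, so the sketch stops at building the request.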

ACX/Audible Compliance

Generated audio meets ACX requirements:

Loudness: -20 LUFS target (ACX accepts -23 to -18 dB RMS)
Peak: <= -3 dB true peak
Noise floor: <= -60 dB
Room tone: 0.5s silence at start/end
MP3: 192 kbps CBR
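
The numeric thresholds above lend themselves to a simple pass/fail check. The sketch below assumes the loudness, peak, and noise-floor values have already been measured by some external tool. The AudioStats class and the acx_violations function are hypothetical and are not part of the package.

```python
from dataclasses import dataclass


@dataclass
class AudioStats:
    """Measurements assumed to come from an external analysis step."""
    loudness_db: float
    peak_db: float
    noise_floor_db: float


def acx_violations(stats: AudioStats) -> list[str]:
    """Return the thresholds from the list above that `stats` fails."""
    problems = []
    if not -23.0 <= stats.loudness_db <= -18.0:
        problems.append("loudness outside -23 to -18 dB")
    if stats.peak_db > -3.0:
        problems.append("peak above -3 dB")
    if stats.noise_floor_db > -60.0:
        problems.append("noise floor above -60 dB")
    return problems


print(acx_violations(AudioStats(-20.0, -3.5, -65.0)))  # []
print(acx_violations(AudioStats(-16.0, -1.0, -65.0)))
```

Returning a list of failures rather than a single boolean makes it clear which threshold needs attention.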

Performance

On Apple Silicon M4 Max:

Model	Speed	Memory
Kokoro-82M	~25x real-time	~2-3 GB
Chatterbox (fp16)	~3-5x real-time	~4-6 GB
Dia-1.6B	~5-8x real-time	~4-6 GB

A 100,000-word novel (~11 hours audio):

Kokoro: ~25-30 minutes
Chatterbox: ~2-4 hours
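
These estimates follow from dividing the audio length by the real-time factor. The sketch below reproduces the arithmetic, assuming a narration pace of about 150 words per minute, which is the pace implied by 100,000 words taking roughly 11 hours. The function name is illustrative.

```python
def estimate_generation_minutes(words: int, realtime_factor: float,
                                words_per_minute: float = 150.0) -> float:
    """Estimate wall-clock minutes needed to synthesize `words` of text."""
    audio_minutes = words / words_per_minute
    return audio_minutes / realtime_factor


audio_hours = 100_000 / 150.0 / 60
print(f"audio length: {audio_hours:.1f} h")                                        # 11.1 h
print(f"Kokoro at 25x: {estimate_generation_minutes(100_000, 25):.0f} min")        # 27 min
print(f"Chatterbox at 3x: {estimate_generation_minutes(100_000, 3) / 60:.1f} h")   # 3.7 h
print(f"Chatterbox at 5x: {estimate_generation_minutes(100_000, 5) / 60:.1f} h")   # 2.2 h
```

The same function can be used to budget time for other manuscript lengths or hardware, given a measured real-time factor.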

Project Structure

audiobook-tts/
├── src/audiobook_tts/
│   ├── cli.py              # audiobook-generate CLI
│   ├── init_project.py     # audiobook-init CLI
│   ├── voice_ref.py        # audiobook-voice-ref CLI
│   ├── server.py           # FastAPI server
│   ├── config.py           # Layered config loading
│   ├── series_config.py    # Series/book config
│   ├── compat.py           # espeak/misaki compatibility shims
│   ├── defaults/
│   │   ├── voices.yaml           # Built-in default voices
│   │   ├── voices.yaml.template  # Project config template
│   │   └── series.yaml.template  # Series config template
│   ├── models/
│   │   ├── tts_engine.py   # Kokoro, Dia, CSM, Chatterbox engines
│   │   └── model_manager.py
│   ├── processing/
│   │   ├── text_processor.py
│   │   ├── audio_processor.py
│   │   └── manuscript.py
│   └── api/
│       ├── routes.py
│       └── schemas.py
├── tests/
├── pyproject.toml
└── README.md

Troubleshooting

"Model not found" error

Models download automatically from HuggingFace on first use. Ensure you have internet connectivity.

"ffmpeg not found" error

brew install ffmpeg

Out of memory

Close other memory-intensive applications. For dialogue, use the 4-bit Dia model.

Slow generation

Ensure you're running natively on Apple Silicon (not Rosetta)
Use Kokoro instead of Chatterbox for faster generation
Reduce chunk size via config: max_chunk_chars: 300
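
To show what a max_chunk_chars limit means in practice, here is a minimal sketch of sentence-based chunking. It is an assumption about the general technique, not the logic in text_processor.py; the chunk_text function and its sentence-splitting regex are illustrative.

```python
import re


def chunk_text(text: str, max_chunk_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chunk_chars`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_chars or not current:
            current = candidate  # an oversized single sentence stays whole
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chunk_chars=9))  # ['One. Two.', 'Three.']
```

Smaller chunks mean more, shorter model calls; splitting on sentence boundaries keeps pauses from landing mid-sentence.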

License

MIT License

The TTS models (Kokoro, Dia, Chatterbox) are released under the Apache 2.0 license and are suitable for commercial use.

Release history

Version	Date
1.1.0	Mar 12, 2026
1.0.2 (this version)	Mar 2, 2026
1.0.1	Mar 2, 2026
1.0.0	Mar 2, 2026