
Audiobook TTS

Local audiobook generation system using MLX-Audio for Apple Silicon Macs.

Features

  • High-quality narration using Kokoro (54 preset voices)
  • Voice cloning with per-character emotion control using Chatterbox (clones Kokoro reference voices with adjustable emotion exaggeration)
  • Multi-speaker dialogue using Dia with [S1]/[S2] tags
  • ACX-compliant audio with automatic normalization
  • Progress tracking with resume capability
  • FastAPI server for integration with other tools
  • CLI tool for batch processing

Requirements

  • macOS 14.0+ (Sonoma or later)
  • Apple Silicon (M1/M2/M3/M4)
  • Python 3.10+
  • uv - Fast Python package manager
  • ffmpeg

Installation

1. Install System Dependencies

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg
brew install ffmpeg

2. Install Package with uv

cd tools/audiobook-tts

# Create venv and install with dev dependencies
uv venv .venv
uv pip install -e ".[dev]"

This installs:

  • mlx-audio - MLX-optimized TTS models
  • pydub, soundfile - Audio processing
  • ffmpeg-normalize - ACX-compliant normalization
  • fastapi, uvicorn - API server
  • rich - Beautiful CLI output
  • pytest, pytest-cov - Testing (dev)

3. Verify Installation

# Test Kokoro model
uv run python -c "from mlx_audio.tts.utils import load_model; m = load_model('mlx-community/Kokoro-82M-bf16'); print('Kokoro loaded!')"

# Test CLI
uv run audiobook-generate --list-voices

# Run tests
uv run pytest tests/ -v

Usage

Command-Line Interface

Generate Full Audiobook

# From compiled manuscript
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md

# With specific voice
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --voice narrator_male_uk

Generate Specific Chapters

# Generate chapters 1-5
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1-5

# Generate specific chapters
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1,3,5-10

Resume Interrupted Generation

audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --resume

List Available Voices

audiobook-generate --list-voices

API Server

Start Server

audiobook-server --host 0.0.0.0 --port 8000

API Endpoints

# Health check
curl http://localhost:8000/v1/health

# List voices
curl http://localhost:8000/v1/voices

# Generate narration
curl -X POST http://localhost:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!", "voice": "narrator_female_us"}'

# Generate dialogue
curl -X POST http://localhost:8000/v1/generate/dialogue \
  -H "Content-Type: application/json" \
  -d '{"text": "[S1] Hello! [S2] Hi there. (laughs)"}'

# Generate chapter (background task)
curl -X POST http://localhost:8000/v1/generate/chapter \
  -H "Content-Type: application/json" \
  -d '{"text": "Chapter text here...", "chapter_number": 1}'

# Check job status
curl http://localhost:8000/v1/status/{job_id}

Python Library

from audiobook_tts.models import KokoroEngine, ModelManager
from audiobook_tts.processing import TextProcessor, AudioProcessor
from audiobook_tts.config import Config

# Initialize
config = Config.default()
model_manager = ModelManager()
text_processor = TextProcessor(max_chunk_chars=400)
audio_processor = AudioProcessor()

# Load manuscript
from pathlib import Path
text = Path("../../compiled/1-resonance-and-reason-manuscript.md").read_text()

# Process chapter
chunks = text_processor.chunk_text(text[:5000])  # First 5000 chars

# Generate audio
kokoro = model_manager.get_kokoro()
audio_segments = []

for chunk in chunks:
    for segment in kokoro.generate(chunk, voice="af_bella"):
        audio_segments.append(segment.audio)

# Combine and save
combined = audio_processor.concatenate_segments(audio_segments)
combined = audio_processor.add_silence_padding(combined)
audio_processor.save_audio(combined, Path("./output/test.wav"), normalize=True)

Voice Profiles

Kokoro Voices (Narration)

Category          Voice IDs
American Female   af_heart, af_bella, af_nova, af_sky, af_nicole, af_sarah
American Male     am_adam, am_echo, am_eric, am_liam, am_michael, am_onyx
British Female    bf_alice, bf_emma, bf_isabella, bf_lily
British Male      bm_daniel, bm_fable, bm_george, bm_lewis

Dia (Multi-Speaker Dialogue)

Use speaker tags to indicate different speakers:

[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.

Supported non-verbal sounds:

  • (laughs), (sighs), (gasps), (coughs)
  • (clears throat), (screams), (whispers)
  • (singing), (humming), (whistles)
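To see how the tag format decomposes, here is a small illustrative parser (not part of the package) that splits tagged dialogue into per-speaker lines:

```python
import re


def split_speakers(text: str) -> list[tuple[str, str]]:
    """Split Dia-style dialogue into (speaker, line) pairs."""
    # re.split with a capturing group yields ["", "S1", " line", "S2", " line", ...]
    parts = re.split(r"\[(S\d)\]", text)
    return [
        (tag, line.strip())
        for tag, line in zip(parts[1::2], parts[2::2])
    ]
```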

Chatterbox Voice Cloning

Chatterbox TTS provides voice cloning with per-character emotion control. It clones Kokoro voices from reference audio clips, then adds adjustable emotion exaggeration.

Step 1: Generate Reference Clips

Generate ~15 second Kokoro reference clips for each character voice:

cd tools/audiobook-tts
uv run python scripts/generate_voice_refs.py

This creates WAV files in audiobooks/voice-refs/ (one per Kokoro preset used in the series).

Step 2: Use Chatterbox Profiles

Chatterbox profiles are pre-configured in config/voices.yaml (prefixed chatterbox_*) and mapped in config/series.yaml. Generate as usual:

uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config config/series.yaml \
  --output ./output/book1/

Tuning Chatterbox Parameters

Each Chatterbox voice profile has three tunable parameters in config/voices.yaml:

Parameter      Default   Range     Effect
exaggeration   0.5       0.0-1.0   Emotion intensity. 0.0 = flat/neutral, 1.0 = maximum emotion
cfg_weight     0.5       0.0-1.0   Classifier-free guidance. Higher = more adherence to text
temperature    0.8       0.0-1.0   Sampling randomness. Lower = more consistent, higher = more varied

Character-appropriate defaults:

  • Controlled characters (Cassieth, Aurelius, Decimus, Basileon): exaggeration: 0.2-0.3
  • Moderate characters (Dessa, Kael, Jorin, Vara): exaggeration: 0.4-0.5
  • Emotional characters (Mirael, Lysa): exaggeration: 0.6-0.7
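As an illustrative sketch (not the package's authoritative schema), a Chatterbox profile for an emotional character might combine these parameters like so. The model/description fields mirror the Kokoro profile example later in this README; the ref_audio key and file name are assumptions:

```yaml
voices:
  chatterbox_mirael:
    model: chatterbox
    ref_audio: audiobooks/voice-refs/af_heart.wav  # assumed key; clip from generate_voice_refs.py
    exaggeration: 0.65   # emotional character range (0.6-0.7)
    cfg_weight: 0.5
    temperature: 0.8
    description: "Mirael - expressive, voice-cloned"
```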

Switching Between Engines

To switch a character back to Kokoro, change the profile name in config/series.yaml:

# Chatterbox (voice-cloned with emotion)
Mirael: chatterbox_mirael

# Kokoro (preset voice, faster)
Mirael: voice_mirael

Both Kokoro and Chatterbox profiles are defined in config/voices.yaml — the original Kokoro profiles remain available as fallback.

Performance

Chatterbox is approximately 5-8x slower than Kokoro due to the voice cloning process:

Model               Speed             Memory
Kokoro-82M          ~25x real-time    ~2-3 GB
Chatterbox (fp16)   ~3-5x real-time   ~4-6 GB

A 100,000-word novel takes approximately 2-4 hours with Chatterbox (vs ~30 minutes with Kokoro).

Configuration

Custom Voice Configuration

Create a config/voices.yaml file:

voices:
  my_narrator:
    model: kokoro
    voice_preset: af_bella
    speed: 0.95  # Slightly slower
    lang_code: a
    description: "Custom narrator voice"

audio:
  sample_rate: 24000
  output_format: wav
  target_lufs: -20.0
  true_peak: -3.0

processing:
  max_chunk_chars: 400
  crossfade_ms: 50
  silence_padding_ms: 500

Use with CLI:

audiobook-generate --input manuscript.md --config config/voices.yaml
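To illustrate what max_chunk_chars controls, here is a hedged sketch of greedy sentence-boundary chunking in the spirit of TextProcessor; the actual implementation in audiobook_tts may differ:

```python
import re


def chunk_text(text: str, max_chunk_chars: int = 400) -> list[str]:
    """Pack whole sentences into chunks no longer than max_chunk_chars.

    A single sentence longer than the limit becomes its own chunk rather
    than being split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```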

ACX/Audible Compliance

Generated audio meets ACX requirements:

  • Sample rate: 44.1 kHz (upsampled from 24 kHz)
  • Bit rate: 192 kbps CBR (MP3)
  • Loudness: -20 dB LUFS (-23 to -18 dB acceptable)
  • Peak: ≤ -3 dB true peak
  • Noise floor: ≤ -60 dB
  • Room tone: 0.5s silence at start/end
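These targets map onto ffmpeg-normalize flags roughly as follows. This is a sketch assuming the ffmpeg-normalize CLI is on PATH; the exact invocation inside audiobook-tts may differ:

```python
import subprocess
from pathlib import Path


def acx_normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg-normalize command line for the ACX targets above."""
    return [
        "ffmpeg-normalize", str(src),
        "-o", str(dst),
        "-t", "-20",            # target loudness: -20 LUFS
        "-tp", "-3",            # true-peak ceiling: -3 dBTP
        "-ar", "44100",         # resample 24 kHz -> 44.1 kHz
        "-c:a", "libmp3lame",   # MP3 output
        "-b:a", "192k",         # 192 kbps bitrate
    ]


if __name__ == "__main__":
    subprocess.run(acx_normalize_cmd(Path("in.wav"), Path("out.mp3")), check=True)
```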

Performance

On Apple Silicon M4 Max:

Model           Speed              Memory
Kokoro-82M      ~25x real-time     ~2-3 GB
Dia-1.6B        ~5-8x real-time    ~4-6 GB
Dia-1.6B-4bit   ~8-12x real-time   ~2-3 GB

A 100,000-word novel (~11 hours audio) takes approximately:

  • Kokoro: ~25-30 minutes to generate
  • With normalization: Add ~5-10 minutes

Troubleshooting

"Model not found" Error

Models are downloaded automatically from HuggingFace on first use. Ensure you have internet connectivity.

"ffmpeg not found" Error

Install ffmpeg:

brew install ffmpeg

Out of Memory

Use the 4-bit Dia model for dialogue:

from audiobook_tts.models import DiaEngine
dia = DiaEngine(use_4bit=True)

Slow Generation

  • Ensure you're using Apple Silicon (not Rosetta)
  • Close other memory-intensive applications
  • Use smaller chunks: set max_chunk_chars: 300 in a custom --config file

Directory Structure

tools/audiobook-tts/
├── src/audiobook_tts/
│   ├── __init__.py
│   ├── server.py          # FastAPI server
│   ├── cli.py             # Command-line interface
│   ├── config.py          # Configuration
│   ├── models/
│   │   ├── tts_engine.py  # TTS model wrappers (Kokoro, Dia, CSM, Chatterbox)
│   │   └── model_manager.py
│   ├── processing/
│   │   ├── text_processor.py   # Text chunking
│   │   ├── audio_processor.py  # Audio concatenation
│   │   └── manuscript.py       # Manuscript handling
│   └── api/
│       ├── routes.py      # API endpoints
│       └── schemas.py     # Pydantic models
├── scripts/
│   └── generate_voice_refs.py  # Generate Kokoro reference clips for Chatterbox
├── config/
│   ├── voices.yaml        # Voice configuration (Kokoro + Chatterbox profiles)
│   └── series.yaml        # Per-book POV-to-voice mappings
├── output/                # Generated audio files
├── cache/                 # Progress tracking
├── pyproject.toml
└── README.md

Next Steps: Offline Audiobook Generation

Step 1: Verify Prerequisites

cd tools/audiobook-tts

# Check all dependencies
uv run python -c "from audiobook_tts.utils import get_dependency_status; print(get_dependency_status())"

# Or manually check:
ffmpeg -version          # Should show version info
ffmpeg-normalize --help  # Should show help

Step 2: Test with Sample Text

Before generating a full book, test with a short sample:

# Generate 30-second test
uv run python -c "
from audiobook_tts.models import KokoroEngine
from audiobook_tts.processing import AudioProcessor
from pathlib import Path

engine = KokoroEngine()
processor = AudioProcessor()

# Test generation
segments = list(engine.generate('This is a test of the audiobook generation system. The quick brown fox jumps over the lazy dog.', voice='af_bella'))
audio = processor.concatenate_segments([s.audio for s in segments])
processor.save_audio_raw(audio, Path('./output/test-sample.wav'))
print('Test audio saved to output/test-sample.wav')
"

Step 3: Generate Audiobook for a Single Chapter

# Generate Chapter 1 of Book 1
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --chapters 1 \
  --voice af_bella \
  --format mp3 \
  --output ./output/book1/

Step 4: Generate Full Book with Series Configuration

The series configuration maps POV characters to specific voices:

# Generate Book 1 with POV-aware voice selection
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config ./config/series.yaml \
  --format mp3 \
  --output ./output/book1/

# Generate Book 2
uv run audiobook-generate \
  --input ../../compiled/2-vessels-and-vestments-manuscript.md \
  --series-config ./config/series.yaml \
  --format mp3 \
  --output ./output/book2/

Step 5: Resume Interrupted Generation

Generation progress is saved automatically. To resume:

uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config ./config/series.yaml \
  --resume

Estimated Generation Times

Book                            Word Count   Estimated Audio   Generation Time*
Book 1: Resonance and Reason    138,249      ~15 hours         ~35 min
Book 2: Vessels and Vestments   221,498      ~24 hours         ~55 min
Book 4: Canon and Council       ~200,000     ~22 hours         ~50 min

*On Apple Silicon M4 Max with Kokoro. Add ~10-15 min for MP3 normalization.
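The table's arithmetic can be reproduced with a narration throughput of roughly 9,300 words per finished audio hour and Kokoro's ~25x real-time speed. Both figures are inferred from this README's numbers, so treat them as assumptions:

```python
def estimate(word_count: int, words_per_hour: float = 9300.0,
             speedup: float = 25.0) -> tuple[float, float]:
    """Estimate (audio hours, generation minutes) for a manuscript."""
    audio_hours = word_count / words_per_hour
    generation_minutes = audio_hours * 60.0 / speedup
    return audio_hours, generation_minutes
```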

Series Configuration (config/series.yaml)

Edit config/series.yaml to customize voice assignments:

name: "A Testament of Stone"
default_voice: "af_bella"  # Default narrator

books:
  resonance-and-reason:
    title: "Resonance and Reason"
    pov_voices:
      Kael: "am_michael"      # Male, intense
      Basileon: "bm_george"   # British male, imperial
      Tiberus: "am_adam"      # Male, military
      Vara: "bf_emma"         # British female, scholarly
      Jorin: "am_echo"        # Male, gentle
      Dessa: "af_bella"       # Female, warm

Output Structure

Generated files are organized as:

output/
├── book1/
│   ├── chapter-01.mp3
│   ├── chapter-02.mp3
│   ├── ...
│   └── progress.json      # Resume tracking
├── book2/
│   └── ...
└── test-sample.wav        # Test files

PDF Generation (MacTeX)

MacTeX is installed but needs PATH configuration:

# Add to ~/.zshrc or ~/.bash_profile
export PATH="/Library/TeX/texbin:$PATH"

# Reload shell
source ~/.zshrc

# Verify
pdflatex --version

License

MIT License - see LICENSE file.

The TTS models (Kokoro, Dia) are released under Apache 2.0 license and are suitable for commercial use.
