Audify
Convert ebooks and PDFs to audiobooks using AI text-to-speech and translation services.
Audify is a pipeline and REST API that transforms written content into high-quality audio using:
- Multiple TTS Providers - Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
- Ollama + LiteLLM for intelligent translation
- LLM-powered audiobook generation for engaging audio content
Features
- Multiple Formats: Convert EPUB ebooks, PDF documents, TXT, and MD files
- Directory Processing: Create audiobooks from multiple files in a directory
- Audiobook Creation: Generate audiobook-style content from books using an LLM
- Flexible Task System: Transform content into audiobooks, podcasts, summaries, meditations, or custom styles
- REST API: HTTP API for programmatic synthesis and audiobook creation
- Multiple TTS Providers: Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
- Multi-language Support: Translate content between languages during conversion
- High-Quality TTS: Natural-sounding speech with multiple provider options
- Flexible Configuration: Environment-based settings and .keys file support
Prerequisites
Core Requirements
- Python 3.10-3.13
- UV package manager (installation guide)
For Local TTS Providers (Optional)
Kokoro TTS
- Docker & Docker Compose (for API services)
- CUDA-capable GPU (recommended for optimal performance)
Qwen-TTS
- Qwen-TTS API Server running on port 8890 (see Qwen3-TTS)
- CUDA-capable GPU (recommended for optimal performance)
For Cloud TTS Providers (Optional)
- OpenAI TTS: OpenAI API key (get one here)
- AWS Polly: AWS account with access keys (AWS setup)
- Google Cloud TTS: Google Cloud project with credentials (GCP setup)
Installation as a command-line tool
You can install Audify as a standalone command-line tool using pip or uv:
pip install audify-cli
Or using uv (recommended):
uv pip install audify-cli
This will install the audify command with subcommands:
- audify run: Basic TTS conversion of EPUB/PDF files
- audify audiobook: LLM-powered audiobook generation
Alternatively, you can use the direct commands:
- audify-run: Alias for audify run
- audify-audiobook: Alias for audify audiobook
After installation, you can run audify --help to see available options.
Quick Start with Docker (For Kokoro TTS)
Note: Docker is only required if you want to use the local Kokoro TTS provider. For Qwen-TTS, you'll need to run the Qwen-TTS API separately (see Qwen-TTS Setup below). You can skip to "Quick Start with Cloud TTS" if you prefer using OpenAI, AWS Polly, or Google Cloud TTS.
1. Clone and Setup
git clone https://github.com/garciadias/audify.git
cd audify
2. Start API Services
# Start Kokoro TTS and Ollama services
docker compose up -d
# Wait for services to be ready (~2-3 minutes)
# Check status: docker compose ps
3. Install Python Dependencies
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv sync
4. Setup Ollama Models
# Pull required models for translation and audiobook generation
docker compose exec ollama ollama pull qwen3:30b
# Or use lighter models for testing:
# docker compose exec ollama ollama pull llama3.2:3b
5. Convert Your First Book
# Convert EPUB to audiobook (using Kokoro TTS)
task run path/to/your/book.epub
# Convert PDF to audiobook
task run path/to/your/document.pdf
# Create audiobook from EPUB
task audiobook path/to/your/book.epub
Quick Start with Qwen-TTS (Local)
Qwen-TTS is a high-quality, free, and privacy-friendly local TTS solution with excellent multilingual support.
1. Setup Qwen-TTS API
First, set up the Qwen-TTS API server (requires GPU):
# Clone Qwen-TTS API repository
git clone https://github.com/QwenLM/Qwen3-TTS
cd Qwen3-TTS
# Start with Docker (recommended)
make up
# The API will be available at http://localhost:8890
For detailed setup instructions, see the Qwen3-TTS documentation.
2. Install Audify
git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv sync
3. Configure Qwen-TTS
Create a .keys file:
TTS_PROVIDER=qwen
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian
4. Convert Your First Book
# Convert using Qwen-TTS
task run path/to/your/book.epub
# Or specify provider explicitly
task --tts-provider qwen run path/to/your/book.epub
Quick Start with Cloud TTS
If you prefer to use cloud TTS providers without Docker:
1. Clone and Install
git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv sync
2. Configure Your TTS Provider
Create a .keys file with your credentials:
cp .keys.example .keys
# Edit .keys and add your provider credentials
# See Configuration section for details
3. Convert Books with Cloud TTS
# Using OpenAI TTS
task --tts-provider openai run "book.epub"
# Using AWS Polly
task --tts-provider aws run "book.epub"
# Using Google Cloud TTS
task --tts-provider google run "book.epub"
Usage Examples
Basic Audiobook Conversion
# English EPUB to audiobook
task run "book.epub"
# PDF with specific language
task --language pt run "document.pdf"
# With translation (English to Spanish)
task --language en --translate es run "book.epub"
Audiobook Generation
# Create audiobook from EPUB
task audiobook "book.epub"
# Limit to first 5 chapters
task audiobook "book.epub" --max-chapters 5
# Custom voice and language
task audiobook "book.epub" --voice af_bella --language en
# With translation
task audiobook "book.epub" --translate pt
Task System (Audiobook Styles)
Choose different transformation styles using the --task option or provide custom prompts:
# Podcast-style narration
task audiobook "book.epub" --task podcast
# Concise summary
task audiobook "book.epub" --task summary
# Guided meditation
task audiobook "book.epub" --task meditation
# Classroom lecture
task audiobook "book.epub" --task lecture
# Custom prompt file
task audiobook "book.epub" --prompt-file my-prompt.txt
# List available tasks
audify list-tasks
# Validate a custom prompt file
audify validate-prompt my-prompt.txt
See Tasks Guide for details on creating custom prompts.
Using Commercial APIs (DeepSeek, Claude, GPT-4, Gemini)
Instead of local Ollama models, you can use commercial APIs for better quality or faster processing:
# Using DeepSeek (cost-effective)
task audiobook "book.epub" -m "api:deepseek/deepseek-chat"
# Using Claude 3.5 Sonnet (high quality)
task audiobook "book.epub" -m "api:anthropic/claude-3-5-sonnet-20240620"
# Using GPT-4 (reliable)
task audiobook "book.epub" -m "api:openai/gpt-4-turbo-preview"
# Using Gemini Pro
task audiobook "book.epub" -m "api:gemini/gemini-1.5-pro"
Setup Required: Create a .keys file with your API keys for the provider(s) you intend to use. See Commercial APIs Guide for detailed instructions.
# Copy example file and add your keys
cp .keys.example .keys
# Edit .keys and add keys for your chosen provider(s):
# DEEPSEEK=your-deepseek-api-key-here
# ANTHROPIC=your-anthropic-api-key-here
# OPENAI=your-openai-api-key-here
# GEMINI=your-google-api-key-here
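The model strings above follow an "api:provider/model" convention, while bare names like qwen3:30b refer to local Ollama models. A minimal sketch of parsing that convention, assuming this is how the -m flag is interpreted (the real CLI parsing may differ):

```python
# Hypothetical parser for the "-m" model string shown in the examples
# above, e.g. "api:deepseek/deepseek-chat". The (backend, provider,
# model) split is an illustrative assumption about the format.
def parse_model_spec(spec: str) -> tuple[str, str, str]:
    """Split 'api:provider/model' into (backend, provider, model).

    Bare names such as 'qwen3:30b' are treated as local Ollama models.
    """
    if spec.startswith("api:"):
        provider, _, model = spec[len("api:"):].partition("/")
        return ("api", provider, model)
    return ("ollama", "local", spec)
```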
Directory Input (Multi-file Processing)
Process multiple files from a directory into a single audiobook:
# Create audiobook from directory of files
task audiobook "path/to/directory/"
# Process directory with translation
task --translate es audiobook "path/to/articles/"
# Directory with custom voice
task --voice af_bella --language en audiobook "path/to/papers/"
Supported file types in directory: EPUB, PDF, TXT, MD
The directory mode will:
- Process each file as a separate episode
- Use the filename as the episode title
- Combine all episodes into a single M4B audiobook with chapter markers
- Synthesize the title audio for each episode
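The chapter markers in the combined M4B come from FFmpeg metadata (the chapters.txt file in the output structure). A stdlib-only sketch of generating that metadata from episode titles and durations; the exact fields Audify writes may differ:

```python
# Sketch of FFmpeg FFMETADATA chapter generation from a list of
# (title, duration_ms) episodes. The precise chapters.txt content
# Audify produces is not documented here and may differ.
def ffmetadata_chapters(episodes: list[tuple[str, int]]) -> str:
    lines = [";FFMETADATA1"]
    start = 0
    for title, duration_ms in episodes:
        end = start + duration_ms
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",   # timestamps below are in milliseconds
            f"START={start}",
            f"END={end}",
            f"title={title}",
        ]
        start = end              # chapters are laid out back to back
    return "\n".join(lines) + "\n"
```

FFmpeg can then attach this file with `-i chapters.txt -map_metadata 1` when muxing the M4B.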
Advanced Options
# List available languages
task run --list-languages
# List available TTS models
task --list-models run
# Save extracted text
task --save-text run "book.epub"
# Skip confirmation prompts
task -y run "book.epub"
# Use different TTS provider
task --tts-provider openai run "book.epub" # OpenAI TTS
task --tts-provider aws run "book.epub" # AWS Polly
task --tts-provider google run "book.epub" # Google Cloud TTS
task --tts-provider qwen run "book.epub" # Qwen-TTS (local)
# List available TTS providers
task --list-tts-providers run
# List available tasks
audify list-tasks
# Validate a custom prompt file
audify validate-prompt my-prompt.txt
Configuration
TTS Provider Configuration
Audify supports multiple TTS providers. Configure your preferred provider using environment variables or a .keys file:
Option 1: Using .keys File (Recommended)
Create a .keys file in the project root:
cp .keys.example .keys
Edit .keys and add your credentials:
# OpenAI TTS
OPENAI_API_KEY=sk-your-openai-api-key
OPENAI_TTS_MODEL=tts-1-hd
OPENAI_TTS_VOICE=alloy
# AWS Polly
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_REGION=us-east-1
AWS_POLLY_VOICE=Joanna
AWS_POLLY_ENGINE=neural
# Google Cloud TTS
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GOOGLE_TTS_VOICE=en-US-Chirp-HD-F
GOOGLE_TTS_LANGUAGE_CODE=en-US
# Qwen-TTS (Local)
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian
# Default TTS Provider
TTS_PROVIDER=kokoro # Options: kokoro, qwen, openai, aws, google
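A .keys file is plain KEY=value lines, as shown above. The following is a minimal stdlib sketch of loading such a file into a dict; Audify's actual loader may handle quoting, inline comments, or precedence differently:

```python
# Minimal sketch of loading a KEY=value .keys file. Blank lines and
# lines starting with '#' are skipped. Audify's real loader may treat
# quoting and inline comments differently; this is illustrative only.
from pathlib import Path

def load_keys(path: str) -> dict[str, str]:
    keys: dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip()
    return keys
```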
Option 2: Environment Variables
# Kokoro TTS API (Local)
export KOKORO_API_URL="http://localhost:8887/v1/audio"
# OpenAI TTS
export OPENAI_API_KEY="sk-your-key"
export OPENAI_TTS_MODEL="tts-1-hd" # or "tts-1"
export OPENAI_TTS_VOICE="alloy" # alloy, echo, fable, onyx, nova, shimmer
# AWS Polly
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
export AWS_POLLY_VOICE="Joanna" # Neural voices recommended
export AWS_POLLY_ENGINE="neural" # "standard" or "neural"
# Google Cloud TTS
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export GOOGLE_TTS_VOICE="en-US-Chirp-HD-F"
export GOOGLE_TTS_LANGUAGE_CODE="en-US"
# Qwen-TTS (Local)
export QWEN_API_URL="http://localhost:8890"
export QWEN_TTS_VOICE="Vivian"
# Default Provider
export TTS_PROVIDER="kokoro" # Options: kokoro, qwen, openai, aws, google
# Ollama Configuration
export OLLAMA_API_BASE_URL="http://localhost:11434"
export OLLAMA_TRANSLATION_MODEL="qwen3:30b"
export OLLAMA_MODEL="magistral:24b"
Choosing a TTS Provider
| Provider | Pros | Cons | Best For |
|---|---|---|---|
| Kokoro (Local) | Free, privacy-friendly, GPU-accelerated | Requires local setup | Development, privacy-sensitive projects |
| Qwen-TTS (Local) | Free, privacy-friendly, GPU-accelerated, multilingual | Requires separate API setup | Multilingual projects, privacy-sensitive content |
| OpenAI | High quality, easy setup | Pay per character | Production, high-quality output |
| AWS Polly | Neural voices, scalable | AWS account required | Enterprise, AWS-integrated projects |
| Google Cloud TTS | Natural voices, many languages | GCP account required | Multi-language projects |
Docker Services
The docker-compose.yml configures (only needed for local/Kokoro TTS):
- Kokoro TTS: Port 8887 (GPU-accelerated speech synthesis, local)
- Ollama: Port 11434 (LLM for translation and audiobook generation, optional)
- Audify API: Port 8000 (REST API server, starts after Kokoro and Ollama are healthy)
The api service waits for Kokoro and Ollama to pass their healthchecks before starting, so services are always ready when the API accepts requests.
Note: Docker services are only required for Kokoro (local TTS). Commercial TTS providers (OpenAI, AWS, Google) and LLM APIs (DeepSeek, Claude, GPT-4, Gemini) work without Docker.
Output Structure
data/output/
├── [book_name]/
│   ├── chapters.txt           # Book metadata
│   ├── cover.jpg              # Book cover image
│   ├── chapters_001.mp3       # Individual chapter audio
│   ├── chapters_002.mp3
│   ├── chapters_003.mp3
│   ├── ...                    # More chapters
│   └── book_name.m4b          # Final audiobook
│
└── audiobooks/
    └── [book_name]/
        ├── episodes/
        │   ├── episode_001.mp3        # Audiobook episodes
        │   ├── episode_002.mp3
        │   └── ...
        ├── scripts/                   # Generated scripts
        │   ├── episode_001_script.txt
        │   ├── original_text_001.txt
        │   └── ...
        ├── chapters.txt               # FFmpeg metadata
        └── [book_name].m4b            # Final M4B audiobook
Directory audiobook output:
data/output/
└── [directory_name]/
    ├── episodes/
    │   ├── episode_001.mp3    # Episode from first file
    │   ├── episode_002.mp3    # Episode from second file
    │   └── ...
    ├── scripts/
    │   ├── episode_001_script.txt
    │   └── ...
    ├── chapters.txt           # Chapter metadata
    └── [directory_name].m4b   # Combined audiobook
Development
Available Tasks
task test # Run tests with coverage
task format # Format code with ruff
task run # Convert ebook to audiobook
task audiobook # Create audiobook from content
task up # Start Docker services
task api # Start REST API server (dev mode, port 8000)
You can also use the installed CLI commands directly:
- audify run (or audify-run) - equivalent to task run
- audify audiobook (or audify-audiobook) - equivalent to task audiobook
Local Development Setup
# Install development dependencies
uv sync --group dev
# Run tests
task test
# Format code
task format
# Type checking (included in pre_test)
mypy ./audify ./tests --ignore-missing-imports
REST API
Audify exposes a FastAPI HTTP server for programmatic access to synthesis and audiobook creation.
Starting the API
# Development mode (auto-reload)
task api
# Or via Docker (starts with Kokoro and Ollama)
docker compose up -d
The API runs on http://localhost:8000 by default.
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /providers | List available TTS providers |
| GET | /voices?provider=kokoro&language=en | List voices for a provider |
| POST | /synthesize | Convert EPUB or PDF to MP3 |
| POST | /audiobook | Convert EPUB or PDF to M4B audiobook |
Example: Synthesize an EPUB
curl -X POST http://localhost:8000/synthesize \
-F "file=@book.epub" \
-F "voice=af_bella" \
-F "language=en" \
--output book.mp3
Example: Create an M4B Audiobook
curl -X POST http://localhost:8000/audiobook \
-F "file=@book.epub" \
-F "voice=af_bella" \
-F "language=en" \
--output book.m4b
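The curl calls above send multipart/form-data. For programmatic use without extra dependencies, a stdlib-only Python sketch that builds the equivalent request is shown below; it only constructs the request (sending it requires a running API server), and the field names simply mirror the curl examples:

```python
# Stdlib-only sketch of the multipart/form-data request that the curl
# examples above send to /synthesize. This builds the request without
# sending it, since sending requires a running Audify API server.
import urllib.request
import uuid

def build_synthesize_request(epub_bytes: bytes, voice: str = "af_bella",
                             language: str = "en") -> urllib.request.Request:
    boundary = uuid.uuid4().hex
    crlf = "\r\n"
    parts = []
    # Plain form fields, mirroring -F "voice=..." and -F "language=..."
    for name, value in (("voice", voice), ("language", language)):
        parts.append(
            (f"--{boundary}{crlf}"
             f'Content-Disposition: form-data; name="{name}"{crlf}{crlf}'
             f"{value}{crlf}").encode()
        )
    # File field, mirroring -F "file=@book.epub"
    file_header = (f"--{boundary}{crlf}"
                   'Content-Disposition: form-data; name="file"; filename="book.epub"'
                   f"{crlf}Content-Type: application/epub+zip{crlf}{crlf}")
    parts.append(file_header.encode() + epub_bytes + crlf.encode())
    body = b"".join(parts) + f"--{boundary}--{crlf}".encode()
    return urllib.request.Request(
        "http://localhost:8000/synthesize",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With a server running, `urllib.request.urlopen(req)` would return the MP3 bytes, analogous to curl's `--output book.mp3`.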
API Reference
Interactive docs are available at http://localhost:8000/docs (Swagger UI) once the server is running.
Architecture
Audify uses a flexible multi-provider architecture supporting both local and cloud services:
┌───────────────────────────────────┐
│  Audify REST API (port 8000)      │
│   • POST /synthesize              │
│   • POST /audiobook               │
│   • GET /voices, /providers       │
└───────────────┬───────────────────┘
                │
┌───────────────▼───────────────────┐
│  Audify CLI / Python API          │
│   • EPUB/PDF/TXT Reader           │
│   • LLM Script Generation         │
│   • Audio Combine & M4B Assembly  │
└───────┬───────────────────────────┘
        │
        ├─── TTS Providers ─────────
        │     ├─ Kokoro (local)
        │     ├─ Qwen-TTS (local)
        │     ├─ OpenAI TTS
        │     ├─ AWS Polly
        │     └─ Google Cloud TTS
        │
        └─── LLM APIs ──────────────
              ├─ Ollama (local)
              ├─ DeepSeek
              ├─ Claude
              ├─ GPT-4
              └─ Gemini
Key Components
- Text Extraction: EPUB/PDF parsing with chapter detection
- Translation: LiteLLM + Commercial/Local LLMs for high-quality translation
- Task System: Flexible prompt management for audiobook, podcast, summary, meditation, and lecture styles
- TTS: Multi-provider support (Kokoro, OpenAI, AWS Polly, Google Cloud TTS)
- Audiobook Generation: LLM-powered script creation with commercial API support
- Audio Processing: Pydub for format conversion and combining
- API Management: Unified API key management via .keys file or environment variables
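TTS providers generally cap the amount of text accepted per request, so a synthesis pipeline must split extracted chapter text before sending it. The following sentence-boundary chunker is a hypothetical sketch of that step; Audify's internal chunking logic is not documented here and may differ:

```python
# Hypothetical sentence-boundary chunker for TTS requests. The
# max_chars limit and splitting heuristic are illustrative
# assumptions, not Audify's actual implementation.
import re

def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack sentences into chunks no longer than max_chars.

    A single sentence longer than max_chars is kept whole rather
    than split mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```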
Supported Languages
Primary: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, Japanese, Hindi
Translation: Any language pair supported by your Ollama model
Troubleshooting
Common Issues
Services not responding (Docker/Kokoro):
# Check service status
docker compose ps
# Restart services
docker compose restart
# Check logs
docker compose logs kokoro
docker compose logs ollama
Commercial API errors:
# Verify API key configuration
cat .keys
# Test API connectivity
uv run audify translate test.txt --model api:deepseek-chat
# Check API key is loaded
# The system will show an error if the API key is missing or invalid
TTS Provider issues:
# List available TTS providers
uv run audify --list-tts-providers
# Test specific provider
uv run audify translate test.txt --tts-provider openai
# Check provider credentials in .keys file
# OpenAI: OPENAI_API_KEY
# AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# Google: GOOGLE_APPLICATION_CREDENTIALS (path to JSON file)
Ollama model not found:
# List available models
docker compose exec ollama ollama list
# Pull required model
docker compose exec ollama ollama pull qwen3:30b
GPU issues:
# Check GPU availability
docker compose exec kokoro nvidia-smi
# If no GPU, services will run on CPU (slower)
Performance Tips
- Use SSD storage for model caching
- Ensure adequate GPU memory (8GB+ recommended) for Kokoro
- Use lighter models for testing: llama3.2:3b instead of magistral:24b
- Commercial TTS providers (OpenAI, AWS, Google) are faster than local Kokoro
- Commercial LLM APIs often provide better latency than local Ollama
- Consider running local services on separate machines for large workloads
- Use cloud providers for production workloads requiring high reliability
Examples
Check the examples/ directory for sample usage patterns and configuration files.
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Workflow
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: task test
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Kokoro TTS for high-quality speech synthesis
- Kokoro-FastAPI for making Kokoro accessible via FastAPI
- Ollama for local LLM inference
- LiteLLM for unified LLM API interface
- OpenAI for GPT and TTS APIs
- Anthropic for Claude API
- DeepSeek for DeepSeek API
- Google for Gemini and Cloud TTS
- AWS for the Polly text-to-speech service