A python library for Kokoro TTS

These details have not been verified by PyPI

Project links

Homepage

Project description

PyPI - Python Version PyPI - Downloads

PyKokoro

A Python library for Kokoro TTS (Text-to-Speech) using ONNX runtime.

Features

ONNX-based TTS: Fast, efficient text-to-speech using the Kokoro-82M model
Multiple Languages: Support for English, Spanish, French, German, Italian, Portuguese, and more
Multiple Voices: 54+ built-in voices (or 103 voices with v1.1-zh model)
Voice Blending: Create custom voices by blending multiple voices
Multiple Model Sources: Download models from HuggingFace or GitHub (v1.0/v1.1-zh)
Model Quality Options: Choose from fp32, fp16, q8, q4, and uint8 quantization levels
GPU Acceleration: Optional CUDA, CoreML, or DirectML support
Phoneme Support: Advanced phoneme-based generation with kokorog2p
Hugging Face Integration: Automatic model downloading from Hugging Face Hub
Text Normalization: Automatic say-as support for numbers, dates, phone numbers, and more using SSMD markup

Installation

Basic Installation (CPU only)

pip install pykokoro

GPU and Accelerator Support

PyKokoro supports multiple hardware accelerators for faster inference:

NVIDIA CUDA GPU

pip install pykokoro[gpu]

Intel OpenVINO

Note: OpenVINO is currently incompatible with Kokoro models due to dynamic rank tensor requirements. The provider will automatically fall back to CPU if OpenVINO fails.

pip install pykokoro[openvino]

DirectML (Windows - AMD/Intel/NVIDIA GPUs)

pip install pykokoro[directml]

Apple CoreML (macOS)

pip install pykokoro[coreml]

All Accelerators

pip install pykokoro[all]

Performance Comparison

To find the best provider for your system, run the benchmark:

python examples/gpu_benchmark.py

Quick Start

import pykokoro
import soundfile as sf

# Initialize the TTS engine (auto-selects best provider)
tts = pykokoro.Kokoro(provider="auto")

# Generate speech
text = "Hello, world! This is Kokoro speaking."
audio, sample_rate = tts.create(text, voice="af_sarah", speed=1.0, lang="en-us")

# Save to file
sf.write("output.wav", audio, sample_rate)

Hardware Acceleration

Automatic Provider Selection (Recommended)

import pykokoro

# Auto-select best available provider (CUDA > CoreML > DirectML > CPU)
# Note: OpenVINO is attempted but will fall back to next priority if incompatible
tts = pykokoro.Kokoro(provider="auto")

Explicit Provider Selection

# Force specific provider
tts = pykokoro.Kokoro(provider="cuda")      # NVIDIA CUDA
tts = pykokoro.Kokoro(provider="openvino")  # Intel OpenVINO (currently incompatible, will raise error)
tts = pykokoro.Kokoro(provider="directml")  # Windows DirectML
tts = pykokoro.Kokoro(provider="coreml")    # Apple CoreML
tts = pykokoro.Kokoro(provider="cpu")       # CPU only

Check Available Providers

# See all available providers on your system
python examples/provider_info.py

# Benchmark all providers
python examples/gpu_benchmark.py

Environment Variable Override

# Force a specific provider via environment variable
export ONNX_PROVIDER="OpenVINOExecutionProvider"
python your_script.py

Usage Examples

Basic Text-to-Speech

import pykokoro

# Create TTS instance with GPU acceleration and fp16 model
tts = pykokoro.Kokoro(provider="cuda", model_quality="fp16")

# Generate audio
audio, sr = tts.create("Hello world", voice="af_nicole", lang="en-us")

Voice Blending

# Blend two voices (50% each)
blend = pykokoro.VoiceBlend.parse("af_nicole:50,am_michael:50")
audio, sr = tts.create("Mixed voice", voice=blend)

Streaming Generation

# Synchronous streaming
for chunk, sr, text_chunk in tts.create_stream_sync("Long text here...", voice="af_sarah"):
    # Process audio chunk in real-time
    play_audio(chunk, sr)

# Async streaming
async for chunk, sr, text_chunk in tts.create_stream("Long text here...", voice="af_sarah"):
    await process_audio(chunk, sr)

Phoneme-Based Generation

from pykokoro import Tokenizer

# Create tokenizer
tokenizer = Tokenizer()

# Convert text to phonemes
phonemes = tokenizer.phonemize("Hello world", lang="en-us")
print(phonemes)  # hə'loʊ wɜːld

# Generate from phonemes
audio, sr = tts.create_from_phonemes(phonemes, voice="af_sarah")

Pause Control

PyKokoro uses SSMD (Speech Synthesis Markdown) syntax for controlling pauses in generated speech:

1. SSMD Break Markers

Add explicit pauses using SSMD break syntax in your text:

# Use SSMD break markers in your text
text = "Chapter 5 ...p I'm Klaus. ...c Welcome to the show!"

# Breaks are processed automatically
audio, sr = tts.create(
    text,
    voice="am_michael"
)

SSMD Break Markers:

...n - No pause (0ms)
...w - Weak pause (150ms by default)
...c - Clause/comma pause (300ms by default)
...s - Sentence pause (600ms by default)
...p - Paragraph pause (1000ms by default)
...500ms - Custom pause (500 milliseconds)
...2s - Custom pause (2 seconds)

Note: Bare ... (ellipsis) is NOT treated as a pause and will be phonemized normally.

Custom Pause Durations:

audio, sr = tts.create(
    text,
    voice="am_michael",
    pause_clause=0.2,      # ...c = 200ms
    pause_sentence=0.5,    # ...s = 500ms
    pause_paragraph=1.5    # ...p = 1500ms
)

2. Automatic Natural Pauses

For more natural speech, enable automatic pause insertion at linguistic boundaries:

text = """
Artificial intelligence is transforming our world. Machine learning models
are becoming more sophisticated, efficient, and accessible.

Deep learning, a subset of AI, uses neural networks with many layers. These
networks can learn complex patterns from data, enabling breakthroughs in
computer vision, natural language processing, and speech recognition.
"""

# Automatic pauses at clause, sentence, and paragraph boundaries
audio, sr = tts.create(
    text,
    voice="af_sarah",
    split_mode="clause",      # Split on commas and sentences
    trim_silence=True,        # Enable automatic pause insertion
    pause_clause=0.25,        # Pause after clauses (commas)
    pause_sentence=0.5,       # Pause after sentences
    pause_paragraph=1.0,      # Pause after paragraphs
    pause_variance=0.05,      # Add natural variance (default)
    random_seed=42            # For reproducible results (optional)
)

Key Features:

Natural boundaries: Automatically detects clauses, sentences, and paragraphs
Variance: Gaussian variance prevents robotic timing (±100ms by default)
Reproducible: Use random_seed for consistent output
Composable: Works with SSMD break markers

Split Modes:

None (default) - Automatic phoneme-based splitting, no automatic pauses
"paragraph" - Split on double newlines
"sentence" - Split on sentence boundaries (requires spaCy)
"clause" - Split on sentences + commas (requires spaCy, recommended)

Pause Variance Options:

pause_variance=0.0 - No variance (exact pauses)
pause_variance=0.05 - Default (±100ms at 95% confidence)
pause_variance=0.1 - More variation (±200ms at 95% confidence)

Note: For split_mode="sentence" or split_mode="clause", install spaCy:

pip install spacy
python -m spacy download en_core_web_sm

Combining Both Approaches:

Use SSMD markers for special emphasis and automatic pauses for natural rhythm:

text = "Welcome! ...p Let's discuss AI, machine learning, and deep learning."

audio, sr = tts.create(
    text,
    voice="af_sarah",
    split_mode="clause",     # Automatic pauses at commas
    trim_silence=True,
    pause_variance=0.05
)

See examples/pauses_demo.py, examples/pauses_with_splitting.py, and examples/automatic_pauses_demo.py for complete examples.

Text Normalization (Say-As)

PyKokoro supports automatic text normalization using SSMD (Speech Synthesis Markdown) syntax. Convert numbers, dates, phone numbers, and more into speakable text:

# Cardinal numbers
text = "I have [123](as: cardinal) apples"
audio, sr = tts.create(text, voice="af_sarah")
# TTS says: "I have one hundred twenty-three apples"

# Ordinal numbers
text = "I came in [3](as: ordinal) place"
# TTS says: "I came in third place"

# Digits (spell out)
text = "My PIN is [1234](as: digits)"
# TTS says: "My PIN is one two three four"

# Telephone numbers
text = "Call [+1-555-0123](as: telephone)"
# TTS says: "Call plus one five five five oh one two three"

# Dates with custom formatting
text = "Today is [12/31/2024](as: date, format: mdy)"
# TTS says: "Today is December thirty-first, two thousand twenty-four"

# Time (12-hour or 24-hour)
text = "The time is [14:30](as: time)"
# TTS says: "The time is two thirty PM"

# Characters (spell out)
text = "The code is [ABC](as: characters)"
# TTS says: "The code is A B C"

# Fractions
text = "Add [1/2](as: fraction) cup of sugar"
# TTS says: "Add one half cup of sugar"

# Units
text = "The package weighs [5kg](as: unit)"
# TTS says: "The package weighs five kilograms"

Supported Say-As Types:

cardinal - Numbers as cardinals: "123" → "one hundred twenty-three"
ordinal - Numbers as ordinals: "3" → "third"
digits - Spell out digits: "123" → "one two three"
number - Alias for cardinal
fraction - Fractions: "1/2" → "one half"
characters - Spell out text: "ABC" → "A B C"
telephone - Phone numbers: "+1-555-0123" → "plus one five five five oh one two three"
date - Dates with format support (mdy, dmy, ymd, ym, my, md, dm, d, m, y)
time - Time in 12h or 24h format
unit - Units: "5kg" → "five kilograms"
expletive - Censors to "beep"

Multi-language Support:

Say-as works with multiple languages (English, French, German, Spanish, and more):

# French cardinal
text = "[123](as: cardinal)"
audio, sr = tts.create(text, voice="ff_siwis", lang="fr-fr")
# TTS says: "cent vingt-trois"

# German ordinal
text = "[3](as: ordinal)"
audio, sr = tts.create(text, voice="gf_maria", lang="de-de")
# TTS says: "dritte"

Combining with Other Features:

Say-as works seamlessly with all SSMD features:

# With prosody
text = "[100](as: cardinal) +loud+ dollars!"

# With pauses
text = "[First](as: ordinal) ...c [second](as: ordinal) ...c [third](as: ordinal)!"

# With emphasis
text = "The winner is *[1](as: ordinal)*!"

See examples/say_as_demo.py for comprehensive examples.

4. Automatic Short Sentence Handling

When processing text, very short sentences (like "Why?" or "Go!") can produce poor audio quality when processed individually (only 3-8 phonemes each). Pykokoro automatically handles this using a "repeat-and-cut" technique:

How It Works:

Short segments are detected based on phoneme length (default: <30 phonemes)
The sentence is repeated: "Why?" → "Why? Why? Why?"
TTS generates audio with more context (better prosody)
Audio is trimmed to extract only the first instance

This happens automatically during kokoro.create() - no configuration needed!

Customizing the Behavior:

You can customize the thresholds using ShortSentenceConfig:

from pykokoro import Kokoro
from pykokoro.short_sentence_handler import ShortSentenceConfig

# More aggressive short sentence handling
config = ShortSentenceConfig(
    min_phoneme_length=50,    # Treat segments <50 phonemes as short
    target_phoneme_length=150, # Repeat until ~150 phonemes
    max_repetitions=7,         # Allow up to 7 repetitions
)

tts = Kokoro(short_sentence_config=config)

Default Configuration:

min_phoneme_length=30: Segments below this use repeat-and-cut
target_phoneme_length=100: Target length for repeated text
max_repetitions=5: Maximum times to repeat

Disabling Short Sentence Handling:

config = ShortSentenceConfig(min_phoneme_length=0)  # No segment is "short"
tts = Kokoro(short_sentence_config=config)

See examples/optimal_phoneme_length_demo.py for a demonstration.

Available Voices

The library includes voices across different languages and accents. The number of available voices depends on the model source:

HuggingFace & GitHub v1.0 (54 voices)

American English: af_alloy, af_bella, af_sarah, am_adam, am_michael, etc.
British English: bf_alice, bf_emma, bm_george, bm_lewis
Spanish: ef_dora, em_alex
French: ff_siwis
Japanese: jf_alpha, jm_kumo
Chinese: zf_xiaobei, zm_yunxi
And many more...

GitHub v1.1-zh (103 voices)

Includes all voices from v1.0 plus additional Chinese voices:

English voices: af_maple, af_sol, bf_vale (confirmed working)
Chinese voices: zf_001 through zf_099, zm_009 through zm_100

Example - Using v1.1-zh with English:

tts = pykokoro.Kokoro(model_source="github", model_variant="v1.1-zh")
audio, sr = tts.create("Hello world!", voice="af_maple", lang="en-us")

List all available voices:

voices = tts.get_voices()
print(voices)

Model Sources

PyKokoro supports downloading models from multiple sources:

HuggingFace (Default)

The default source with 54 multi-language voices:

tts = pykokoro.Kokoro(
    model_source="huggingface",
    model_quality="fp32"  # fp32, fp16, q8, q8f16, q4, q4f16, uint8, uint8f16
)

GitHub v1.0

54 voices with additional fp16-gpu optimized quality:

tts = pykokoro.Kokoro(
    model_source="github",
    model_variant="v1.0",
    model_quality="fp16-gpu"  # fp32, fp16, fp16-gpu, q8
)

GitHub v1.1-zh (English + Chinese)

103 voices including English and Chinese speakers:

tts = pykokoro.Kokoro(
    model_source="github",
    model_variant="v1.1-zh",
    model_quality="fp32"  # Only fp32 available
)

# Use English voices from v1.1-zh
voices = tts.get_voices()  # Returns 103 voices
audio, sr = tts.create("Hello world", voice="af_maple", lang="en-us")

Note: Chinese text generation requires proper phonemization support (currently in development).

Model Quality Options

Available quality options vary by source:

HuggingFace Models:

fp32: Full precision (highest quality, largest size)
fp16: Half precision (good quality, smaller size)
q8: 8-bit quantized (fast, small)
q8f16: 8-bit with fp16 (balanced)
q4: 4-bit quantized (fastest, smallest)
q4f16: 4-bit with fp16 (compact)
uint8: Unsigned 8-bit (compatible)
uint8f16: Unsigned 8-bit with fp16

GitHub v1.0 Models:

fp32: Full precision
fp16: Half precision
fp16-gpu: GPU-optimized fp16
q8: 8-bit quantized

GitHub v1.1-zh Models:

fp32: Full precision only

# HuggingFace with q8
tts = pykokoro.Kokoro(model_source="huggingface", model_quality="q8")

# GitHub v1.0 with GPU-optimized fp16
tts = pykokoro.Kokoro(model_source="github", model_variant="v1.0", model_quality="fp16-gpu")

Configuration

Configuration is stored in a platform-specific directory:

Linux: ~/.config/pykokoro/config.json
macOS: ~/Library/Application Support/pykokoro/config.json
Windows: %APPDATA%\pykokoro\config.json

import pykokoro

# Load config
config = pykokoro.load_config()

# Modify config
config["model_quality"] = "fp16"
config["use_gpu"] = True

# Save config
pykokoro.save_config(config)

Advanced Features

Custom Phoneme Dictionary

from pykokoro import Tokenizer, TokenizerConfig

# Create config with custom phoneme dictionary
config = TokenizerConfig(
    phoneme_dictionary_path="my_pronunciations.json"
)

tokenizer = Tokenizer(config=config)

Mixed Language Support

from pykokoro import TokenizerConfig

config = TokenizerConfig(
    use_mixed_language=True,
    mixed_language_primary="en-us",
    mixed_language_allowed=["en-us", "de", "fr"]
)

tokenizer = Tokenizer(config=config)

Backend Configuration

Control which phonemization backend and dictionaries to use:

from pykokoro import TokenizerConfig

# Default: Full dictionaries with espeak fallback (best quality)
config = TokenizerConfig(
    backend="espeak",
    load_gold=True,
    load_silver=True,
    use_espeak_fallback=True
)

# Memory-optimized: Gold dictionary only
config = TokenizerConfig(
    backend="espeak",
    load_gold=True,
    load_silver=False,  # Saves ~22-31 MB
    use_espeak_fallback=True
)

# Fastest initialization: Pure espeak
config = TokenizerConfig(
    backend="espeak",
    load_gold=False,
    load_silver=False,
    use_espeak_fallback=True
)

# Alternative backend (requires pygoruut)
config = TokenizerConfig(
    backend="goruut"
)

tokenizer = Tokenizer(config=config)

Note: use_dictionary parameter is deprecated. Use load_gold and load_silver instead for finer control.

External G2P Libraries: You can also use external phonemization libraries like Misaki:

from misaki import en, espeak
import pykokoro

# Misaki G2P with espeak-ng fallback
fallback = espeak.EspeakFallback(british=False)
g2p = en.G2P(trf=False, british=False, fallback=fallback)
phonemes, _ = g2p("Hello, world!")

# Generate audio from phonemes
kokoro = pykokoro.Kokoro()
samples, sample_rate = kokoro.create(
    phonemes,
    voice="af_bella",
    is_phonemes=True
)

License

This library is licensed under the Apache License 2.0.

Credits

Kokoro Model: hexgrad/Kokoro-82M
ONNX Models: onnx-community/Kokoro-82M-v1.0-ONNX
Phonemizer: kokorog2p

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.6.5

Feb 11, 2026

0.6.4

Jan 31, 2026

0.6.3

Jan 28, 2026

0.6.2

Jan 27, 2026

0.6.1

Jan 25, 2026

0.6.0

Jan 25, 2026

0.5.1

Jan 23, 2026

0.5.0

Jan 23, 2026

This version

0.4.0

Jan 17, 2026

0.3.0

Jan 14, 2026

0.2.0

Jan 11, 2026

0.1.0

Jan 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pykokoro-0.4.0.tar.gz (251.5 kB view details)

Uploaded Jan 17, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pykokoro-0.4.0-py3-none-any.whl (99.3 kB view details)

Uploaded Jan 17, 2026 Python 3

File details

Details for the file pykokoro-0.4.0.tar.gz.

File metadata

Download URL: pykokoro-0.4.0.tar.gz
Upload date: Jan 17, 2026
Size: 251.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for pykokoro-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`9c20f38bb6851d54bf4c64dabe18eda40edce81351fef40264633dbbefd0e075`
MD5	`fc24206ffdb732140979e356c9730196`
BLAKE2b-256	`212e4b7e9d0704d28568af0e6c810f68175301ce0d745316fef7b989ea011d4e`

See more details on using hashes here.

File details

Details for the file pykokoro-0.4.0-py3-none-any.whl.

File metadata

Download URL: pykokoro-0.4.0-py3-none-any.whl
Upload date: Jan 17, 2026
Size: 99.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for pykokoro-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24b8aaec46cf928a67c64f6ee2e46938017f8da36cd7ad63f93d22385eb8e2d7`
MD5	`00ac4347885d81ef0c806900c16f0b82`
BLAKE2b-256	`fa2077491b2f1a445821f4df849a9267308e17f996b399cef012c0b4f6e19be6`

See more details on using hashes here.

pykokoro 0.4.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PyKokoro

Features

Installation

Basic Installation (CPU only)

GPU and Accelerator Support

NVIDIA CUDA GPU

Intel OpenVINO

DirectML (Windows - AMD/Intel/NVIDIA GPUs)

Apple CoreML (macOS)

All Accelerators

Performance Comparison

Quick Start

Hardware Acceleration

Automatic Provider Selection (Recommended)

Explicit Provider Selection

Check Available Providers

Environment Variable Override

Usage Examples

Basic Text-to-Speech

Voice Blending

Streaming Generation

Phoneme-Based Generation

Pause Control

1. SSMD Break Markers

2. Automatic Natural Pauses

Text Normalization (Say-As)

4. Automatic Short Sentence Handling

Available Voices

HuggingFace & GitHub v1.0 (54 voices)

GitHub v1.1-zh (103 voices)

Model Sources

HuggingFace (Default)

GitHub v1.0

GitHub v1.1-zh (English + Chinese)

Model Quality Options

Configuration

Advanced Features

Custom Phoneme Dictionary

Mixed Language Support

Backend Configuration

License

Credits

Contributing

Links

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes