streaming-tts
Lightweight streaming text-to-speech with Kokoro engine.
Features
- Streaming TTS: Real-time audio synthesis with callback support
- Voice Blending: Mix multiple voices with weighted formulas
- Pause Tags: Insert natural pauses with [pause:1.5s] syntax
- Text Normalization: Convert URLs, emails, numbers, money to spoken form
- Smart Chunking: Token-aware text splitting for optimal quality
- Multi-Format Output: Export to WAV, MP3, Opus, FLAC, AAC (with [audio] extra)
- 54 Voices: American, British English + 7 other languages
Installation
pip install streaming-tts
For development installation:
pip install -e .
Optional Extras
# Japanese language support
pip install streaming-tts[jp]
# Chinese language support
pip install streaming-tts[zh]
# Korean language support
pip install streaming-tts[ko]
# Audio export formats (MP3, Opus, FLAC, AAC)
pip install streaming-tts[audio]
# All features
pip install streaming-tts[all]
Note: Non-English languages require espeak-ng to be installed on your system.
Quick Start
from streaming_tts import TextToAudioStream, KokoroEngine
# Initialize the engine
engine = KokoroEngine(voice="af_heart")
# Create stream and play
stream = TextToAudioStream(engine)
stream.feed("Hello, world! This is a test of streaming text to speech.").play()
Pause Tags
Insert natural pauses in your text:
text = "Hello! [pause:1s] How are you? [pause:500ms] I hope you're well."
stream.feed(text).play()
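To illustrate how tags of this shape could be separated from the spoken text, here is a minimal sketch. It is not the library's parser: the `PAUSE_TAG` pattern and the `split_pauses` helper are hypothetical names, and the sketch assumes only the two units shown above (`s` and `ms`).

```python
import re
from typing import Iterator, Union

# Hypothetical pattern: matches tags such as [pause:1.5s] or [pause:500ms].
PAUSE_TAG = re.compile(r"\[pause:(\d+(?:\.\d+)?)(ms|s)\]")

def split_pauses(text: str) -> Iterator[Union[str, float]]:
    """Yield text segments (str) and pause durations in seconds (float)."""
    pos = 0
    for match in PAUSE_TAG.finditer(text):
        # Emit any text that precedes this tag.
        segment = text[pos:match.start()].strip()
        if segment:
            yield segment
        value, unit = float(match.group(1)), match.group(2)
        # Convert milliseconds to seconds; seconds pass through unchanged.
        yield value / 1000.0 if unit == "ms" else value
        pos = match.end()
    # Emit whatever text follows the last tag.
    tail = text[pos:].strip()
    if tail:
        yield tail

parts = list(split_pauses(
    "Hello! [pause:1s] How are you? [pause:500ms] I hope you're well."
))
```

Yielding floats for pauses and strings for speech keeps the consumer loop simple: it can sleep on a float and synthesize a string.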
Text Normalization
Automatically convert special content to spoken form:
from streaming_tts import normalize_text, NormalizationOptions
options = NormalizationOptions(normalize=True)
# URLs, emails, numbers, money, etc.
text = "Visit https://example.com or email user@test.com. Price: $42.50"
normalized = normalize_text(text, options)
# -> "Visit https example dot com or email user at test dot com. Price: forty-two dollars and fifty cents"
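The URL and email part of this conversion can be approximated with two regular expressions. The sketch below is an illustration only, not how `normalize_text` is implemented: `URL`, `EMAIL`, and `normalize_urls_and_emails` are hypothetical names, URL paths are ignored, and numbers and money are not handled.

```python
import re

# Hypothetical patterns covering only simple cases (no paths, ports, or queries).
URL = re.compile(r"\b(https?)://([\w-]+(?:\.[\w-]+)+)")
EMAIL = re.compile(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b")

def _spell_dots(value: str) -> str:
    """Replace each dot with the spoken word 'dot'."""
    return value.replace(".", " dot ")

def normalize_urls_and_emails(text: str) -> str:
    """Rewrite simple URLs and email addresses into a speakable form."""
    # URLs first: "https://example.com" becomes "https example dot com".
    text = URL.sub(lambda m: f"{m.group(1)} {_spell_dots(m.group(2))}", text)
    # Then emails: "user@test.com" becomes "user at test dot com".
    text = EMAIL.sub(
        lambda m: f"{_spell_dots(m.group(1))} at {_spell_dots(m.group(2))}",
        text,
    )
    return text

spoken = normalize_urls_and_emails(
    "Visit https://example.com or email user@test.com."
)
```

Both patterns require a word character after every dot, so the full stop that ends a sentence is left in place.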
Smart Chunking
Split long text into optimal chunks for TTS:
from streaming_tts import smart_split, process_text_with_pauses
import time
# Process text with pauses
for item in process_text_with_pauses(text, normalize=True):
    if isinstance(item, float):
        time.sleep(item)  # Pause
    else:
        stream.feed(item).play()  # Speak
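The general idea behind budget-aware splitting can be sketched as follows. This is not the `smart_split` implementation: `chunk_sentences` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for model tokens.

```python
import re
from typing import List

def chunk_sentences(text: str, max_words: int = 40) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_sentences(
    "One two three. Four five. Six seven eight nine.", max_words=5
)
```

Splitting only at sentence boundaries avoids cutting a phrase in the middle, which tends to matter more for prosody than hitting the budget exactly.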
Multi-Format Audio Export
from streaming_tts import StreamingAudioWriter
# Requires: pip install streaming-tts[audio]
writer = StreamingAudioWriter("mp3", sample_rate=24000)
for audio_chunk in audio_chunks:
    mp3_data = writer.write_chunk(audio_chunk)
    # Stream or save mp3_data
final_data = writer.write_chunk(finalize=True)
writer.close()
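For the WAV case, the same chunk-by-chunk pattern can be reproduced with the standard library alone. The sketch below does not use `StreamingAudioWriter`; `chunks_to_wav` is a hypothetical helper, and it assumes each chunk is raw 16-bit mono PCM bytes at 24000 Hz, which may not match the format the engine actually emits.

```python
import io
import wave
from typing import Iterable

def chunks_to_wav(pcm_chunks: Iterable[bytes], sample_rate: int = 24000) -> bytes:
    """Write raw PCM chunks into an in-memory WAV file and return its bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # Assumption: mono audio
        wav.setsampwidth(2)      # Assumption: 16-bit samples (2 bytes each)
        wav.setframerate(sample_rate)
        for chunk in pcm_chunks:
            # Frames are appended as they arrive; the header is patched on close.
            wav.writeframes(chunk)
    return buffer.getvalue()

# Two placeholder chunks: 100 and 50 frames of 16-bit samples.
wav_bytes = chunks_to_wav([b"\x00\x00" * 100, b"\x01\x00" * 50])
```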
Usage with Callbacks
from streaming_tts import TextToAudioStream, KokoroEngine
def on_audio_chunk(chunk):
    # Process audio chunk (e.g., send over websocket)
    pass
def on_stream_stop():
    print("Audio stream finished")
engine = KokoroEngine(voice="af_heart")
stream = TextToAudioStream(engine, on_audio_stream_stop=on_stream_stop)
stream.feed("Hello world").play(muted=True, on_audio_chunk=on_audio_chunk)
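One way to keep the chunk callback fast is to have it only enqueue data and let a separate thread do the slow work, such as network I/O. The sketch below shows that pattern with stand-in data. It does not call the library: the two callbacks are invoked by hand to simulate a stream, and the `None` sentinel and the `consume` helper are assumptions made for the example.

```python
import queue
import threading

chunk_queue: "queue.Queue" = queue.Queue()

def on_audio_chunk(chunk):
    # Keep the callback cheap: hand the chunk to the consumer thread.
    chunk_queue.put(chunk)

def on_stream_stop():
    # None is used here as an end-of-stream marker (an assumption of this sketch).
    chunk_queue.put(None)

def consume(sink):
    """Drain the queue until the end-of-stream marker arrives."""
    while True:
        chunk = chunk_queue.get()
        if chunk is None:
            break
        sink.append(chunk)  # Replace with a websocket send or a file write

received = []
worker = threading.Thread(target=consume, args=(received,))
worker.start()

# Simulate a stream by calling the callbacks directly.
for fake_chunk in (b"chunk-1", b"chunk-2"):
    on_audio_chunk(fake_chunk)
on_stream_stop()
worker.join()
```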
Available Voices
American English (lang_code='a')
- Female: af_heart, af_alloy, af_aoede, af_bella, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky
- Male: am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa
British English (lang_code='b')
- Female: bf_alice, bf_emma, bf_isabella, bf_lily
- Male: bm_daniel, bm_fable, bm_george, bm_lewis
Other Languages
- Japanese: jf_alpha, jf_gongitsune, jf_nezumi, jf_tebukuro, jm_kumo
- Chinese: zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zf_xiaoyi, zm_yunjian, zm_yunxi, zm_yunxia, zm_yunyang
- Spanish: ef_dora, em_alex, em_santa
- French: ff_siwis
- Hindi: hf_alpha, hf_beta, hm_omega, hm_psi
- Italian: if_sara, im_nicola
- Portuguese: pf_dora, pm_alex, pm_santa
Voice Blending
You can blend multiple voices using weighted formulas:
engine = KokoroEngine(voice="0.3*af_sarah + 0.7*am_adam")
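A formula of this shape can be read as a list of weight and voice pairs. The sketch below parses one into a dictionary of normalized weights. It is an illustration only: `parse_blend` and `TERM` are hypothetical names, and whether the engine itself normalizes weights to sum to 1 is an assumption, not something this README states.

```python
import re
from typing import Dict

# Hypothetical term pattern: "<weight>*<voice_name>", e.g. "0.3*af_sarah".
TERM = re.compile(r"^\s*(\d*\.?\d+)\s*\*\s*(\w+)\s*$")

def parse_blend(formula: str) -> Dict[str, float]:
    """Parse 'w1*voice1 + w2*voice2' into weights normalized to sum to 1."""
    weights: Dict[str, float] = {}
    for term in formula.split("+"):
        match = TERM.match(term)
        if match is None:
            raise ValueError(f"Unrecognized blend term: {term!r}")
        weight, voice = float(match.group(1)), match.group(2)
        # Repeated voices accumulate their weights.
        weights[voice] = weights.get(voice, 0.0) + weight
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Blend weights must sum to a positive value")
    return {voice: weight / total for voice, weight in weights.items()}

blend = parse_blend("0.3*af_sarah + 0.7*am_adam")
```

Normalizing means that `1*af_sarah + 3*am_adam` and `0.25*af_sarah + 0.75*am_adam` describe the same mix in this sketch.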
API Reference
KokoroEngine
KokoroEngine(
    voice="af_heart",    # Voice name or blend formula
    default_speed=1.0,   # Speech speed multiplier
    trim_silence=True,   # Remove silence from audio
    debug=False          # Enable debug output
)
TextToAudioStream
TextToAudioStream(
    engine,                      # KokoroEngine instance
    on_audio_stream_start=None,  # Callback when audio starts
    on_audio_stream_stop=None,   # Callback when audio stops
    on_audio_chunk=None,         # Callback for each audio chunk
    on_word=None,                # Callback for word timing
    muted=False                  # Disable speaker output
)
Requirements
- Python 3.9-3.12
- PyAudio (may require system dependencies)
- Torch
Windows Note
PyAudio on Windows may require Visual C++ Build Tools. If you encounter issues:
pip install pipwin
pipwin install pyaudio
License
MIT