Async-first TTS (Text-to-Speech) wrapper library for Python

SpeechFlow

A unified async-first Python TTS (Text-to-Speech) library with multiple engine support.

Features

  • Multiple TTS Engines: OpenAI, Google Gemini, FishAudio, ElevenLabs, Kokoro (local), Qwen3-TTS (local), Style-Bert-VITS2 (local)
  • Async-First Design: Native async/await API with sync wrappers for convenience
  • Streaming Support: Real-time audio streaming for supported engines
  • Decoupled Architecture: Engines, player, and writer are independent components
  • Optional Dependencies: Core requires only numpy; each engine is installable as an extra
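
The optional-dependency design means an engine's backing package only needs to be importable when its extra was actually installed. As an illustration of that pattern (not a SpeechFlow API; `has_module` is a hypothetical helper), you can probe for an optional module at runtime:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if an optional dependency is importable."""
    return importlib.util.find_spec(name) is not None

# Core only needs numpy; an engine's package is present only when
# the matching extra was installed.
print(has_module("json"))                # stdlib module, importable
print(has_module("no_such_engine_pkg"))  # missing optional dependency
```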

Installation

# Core only (no engines)
uv add speechflow

# Install with specific engine
uv add "speechflow[openai]"

# Install with audio playback
uv add "speechflow[openai,player]"

# Install everything
uv add "speechflow[all]"

Available Extras

Extra       Engine                               Type
openai      OpenAI TTS                           Cloud
gemini      Google Gemini TTS                    Cloud
fishaudio   FishAudio TTS                        Cloud
elevenlabs  ElevenLabs TTS                       Cloud
kokoro      Kokoro TTS (includes PyTorch)        Local
qwen3tts    Qwen3-TTS (includes PyTorch)         Local
stylebert   Style-Bert-VITS2 (includes PyTorch)  Local
player      Audio playback via sounddevice       Utility
all         All of the above                     -

Using pip instead of uv

pip install "speechflow[openai]"
pip install "speechflow[openai,player]"
pip install "speechflow[all]"

GPU Support (Kokoro / Qwen3-TTS / Style-Bert-VITS2)

Local engines pull in PyTorch as a dependency, and by default this resolves to the CPU-only build. For GPU acceleration, install a CUDA-enabled PyTorch before installing speechflow:

# uv
uv add torch torchvision torchaudio --index https://download.pytorch.org/whl/cu121
uv add "speechflow[kokoro]"

# pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "speechflow[kokoro]"

Replace cu121 with your CUDA version (e.g., cu118, cu124).

Quick Start

Async (Primary API)

import asyncio
from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter

async def main():
    engine = OpenAITTSEngine(api_key="your-api-key")
    player = AudioPlayer()
    writer = AudioWriter()

    # Generate audio
    audio = await engine.get("Hello, world!")

    # Play audio
    await player.play(audio)

    # Save to file
    await writer.save(audio, "output.wav")

asyncio.run(main())

Sync Wrappers

from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter

engine = OpenAITTSEngine(api_key="your-api-key")
player = AudioPlayer()
writer = AudioWriter()

audio = engine.get_sync("Hello, world!")
player.play_sync(audio)
writer.save_sync(audio, "output.wav")
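
Conceptually, each *_sync method just runs its async counterpart to completion. The sketch below shows that wrapper pattern with a toy stand-in class (FakeEngine is hypothetical; SpeechFlow's real implementation may differ):

```python
import asyncio

class FakeEngine:
    """Toy stand-in for a TTS engine, illustrating the sync-wrapper pattern."""

    async def get(self, text: str) -> str:
        await asyncio.sleep(0)  # stands in for network or inference work
        return f"audio<{text}>"

    def get_sync(self, text: str) -> str:
        # Drive the coroutine on a fresh event loop. This only works
        # when no loop is already running in the current thread.
        return asyncio.run(self.get(text))

print(FakeEngine().get_sync("Hello, world!"))  # audio<Hello, world!>
```

One consequence of this pattern: sync wrappers cannot be called from inside an already-running event loop (for example, inside an `async def`), where `asyncio.run` raises a RuntimeError.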

Streaming

import asyncio
from speechflow import OpenAITTSEngine, AudioPlayer

async def main():
    engine = OpenAITTSEngine(api_key="your-api-key")
    player = AudioPlayer()

    # Stream and play (returns combined AudioData)
    combined = await player.play_stream(engine.stream("This is a long text that will be streamed..."))

asyncio.run(main())

Streaming notes:

  • OpenAI: True streaming with multiple chunks.
  • Gemini: Returns complete audio in a single chunk (API limitation).
  • FishAudio: True streaming.
  • ElevenLabs: True streaming.
  • Kokoro / Style-Bert-VITS2 / Qwen3-TTS: Sentence-by-sentence streaming.
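
For the local engines, sentence-by-sentence streaming implies the input text is chunked into sentences before synthesis, so each sentence's audio can be yielded as soon as it is generated. SpeechFlow's actual splitter is internal; a naive illustrative version (`split_sentences` is hypothetical) could look like:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after sentence-ending punctuation, including the
    # Japanese full stop used by the local-engine examples.
    parts = re.split(r"(?<=[.!?。])\s*", text)
    return [p for p in parts if p]

print(split_sentences("First sentence. Second one! これは日本語です。"))
```

Each resulting sentence would then be synthesized and yielded as its own audio chunk.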

Engine-Specific Features

OpenAI TTS

engine = OpenAITTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    voice="alloy",           # ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
    model="gpt-4o-mini-tts", # tts-1, tts-1-hd
    speed=1.0,
    instructions="Speak in a cheerful tone",
)

# Streaming
async for chunk in engine.stream("Long text..."):
    pass

Google Gemini TTS

engine = GeminiTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    model="gemini-2.5-flash-preview-tts",  # gemini-2.5-pro-preview-tts
    voice="Leda",                           # Puck, Charon, Kore, Fenrir, Aoede, ...
)

FishAudio TTS

engine = FishAudioTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello world",
    model="s1",                  # s1-mini, speech-1.6, speech-1.5, agent-x0
    voice="your-voice-id",
    speed=1.0,                   # Speech speed
    volume=1.0,                  # Volume
)

# Streaming
async for chunk in engine.stream("Streaming text..."):
    pass

ElevenLabs TTS

engine = ElevenLabsTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    voice="21m00Tcm4TlvDq8ikWAM",  # Voice ID from ElevenLabs dashboard
    model="eleven_multilingual_v2", # eleven_turbo_v2_5, eleven_turbo_v2, eleven_monolingual_v1
    output_format="pcm_24000",      # pcm_16000, pcm_22050, pcm_44100
    stability=0.5,
    similarity_boost=0.75,
    speed=1.0,
)

# Streaming
async for chunk in engine.stream("Streaming text..."):
    pass

Qwen3-TTS

# CustomVoice model (default) — choose from built-in speakers
engine = Qwen3TTSEngine()
audio = await engine.get(
    "Hello, world!",
    speaker="Chelsie",  # Ethan, Chelsie, etc.
    language="en",
)

# Base model — voice cloning with reference audio
engine = Qwen3TTSEngine(model_id="Qwen/Qwen3-TTS-0.6B-Base")
engine.set_voice_profile(ref_audio=audio_bytes, ref_text="transcript")
audio = await engine.get("Clone this voice", language="en")

# Sentence-by-sentence streaming
async for chunk in engine.stream("Long text for streaming...", language="ja"):
    pass

Supported languages: Chinese (zh), English (en), Japanese (ja), Korean (ko), and more.

Kokoro TTS

# Default: American English
engine = KokoroTTSEngine()
audio = await engine.get(
    "Hello world",
    voice="af_heart",
    speed=1.0,
)

# Japanese (dictionary auto-downloads on first use)
engine = KokoroTTSEngine(lang_code="j")
audio = await engine.get("こんにちは、世界", voice="af_heart")

If the Japanese dictionary download fails, run it manually: python -m unidic download

Supported languages: American English (a), British English (b), Spanish (e), French (f), Hindi (h), Italian (i), Japanese (j), Brazilian Portuguese (p), Mandarin Chinese (z)

Style-Bert-VITS2

# Pre-trained model (auto-downloads on first use)
engine = StyleBertTTSEngine(model_name="jvnv-F1-jp")
audio = await engine.get(
    "こんにちは、世界",
    style="Happy",       # Neutral, Happy, Sad, Angry, Fear, Surprise, Disgust
    style_weight=5.0,    # Emotion strength (0.0-10.0)
    speed=1.0,
    pitch=0.0,           # Pitch shift in semitones
    speaker_id=0,
)

# Custom model
engine = StyleBertTTSEngine(model_path="/path/to/your/model")

# Sentence-by-sentence streaming
async for chunk in engine.stream("長い文章を文ごとに生成します。"):
    pass

Pre-trained models: jvnv-F1-jp, jvnv-F2-jp (female), jvnv-M1-jp, jvnv-M2-jp (male)

Optimized for Japanese. GPU recommended for best performance.

License

MIT

Download files

Source Distribution

speechflow-0.4.0.tar.gz (41.8 kB)

Uploaded Source

Built Distribution

speechflow-0.4.0-py3-none-any.whl (39.2 kB)

Uploaded Python 3

File details

Details for the file speechflow-0.4.0.tar.gz.

File metadata

  • Download URL: speechflow-0.4.0.tar.gz
  • Upload date:
  • Size: 41.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for speechflow-0.4.0.tar.gz

Algorithm     Hash digest
SHA256        0d87896c6aed6a37ef5ddfcd357dee7e0c3ca7d8e94665b0a7da83f82ed20cfc
MD5           13702615305f54bd27d34da07f23dba7
BLAKE2b-256   b761bbeddce85fafcfad9bffa034afb5f8031f24155fb6b04979a6108387434b

Provenance

The following attestation bundles were made for speechflow-0.4.0.tar.gz:

Publisher: publish.yml on sync-dev-org/speechflow


File details

Details for the file speechflow-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: speechflow-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 39.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for speechflow-0.4.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        b8a25dc5ce0c2130754398c42fc5aaf0b5c6549fdf0ea6abd99628d9c460b350
MD5           050f4c7fae39362ba7f9c4ee89c058fe
BLAKE2b-256   2b4acd3327454d3d08e40cf2011cfe894ac90d44c09d72fb4d6f89dda0496853

Provenance

The following attestation bundles were made for speechflow-0.4.0-py3-none-any.whl:

Publisher: publish.yml on sync-dev-org/speechflow

