Async-first TTS (Text-to-Speech) wrapper library for Python

SpeechFlow

A unified async-first Python TTS (Text-to-Speech) library with multiple engine support.

Features

  • Multiple TTS Engines: OpenAI, Google Gemini, FishAudio, ElevenLabs, Kokoro (local), Qwen3-TTS (local), Style-Bert-VITS2 (local)
  • Async-First Design: Native async/await API with sync wrappers for convenience
  • Streaming Support: Real-time audio streaming for supported engines
  • Decoupled Architecture: Engines, player, and writer are independent components
  • Optional Dependencies: Core requires only numpy; each engine is installable as an extra
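
The optional-dependency design means an engine's backing package only needs to be importable when its extra was actually installed. As an illustration of that pattern (not a SpeechFlow API; `has_module` is a hypothetical helper), you can probe for an optional module at runtime:

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if an optional dependency is importable."""
    return importlib.util.find_spec(name) is not None

# Core only needs numpy; an engine's package is present only when
# the matching extra was installed.
print(has_module("json"))                # stdlib module, importable
print(has_module("no_such_engine_pkg"))  # missing optional dependency
```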

Installation

# Core only (no engines)
uv add speechflow

# Install with specific engine
uv add "speechflow[openai]"

# Install with audio playback
uv add "speechflow[openai,player]"

# Install everything
uv add "speechflow[all]"

Available Extras

Extra       Engine                               Type
openai      OpenAI TTS                           Cloud
gemini      Google Gemini TTS                    Cloud
fishaudio   FishAudio TTS                        Cloud
elevenlabs  ElevenLabs TTS                       Cloud
kokoro      Kokoro TTS (includes PyTorch)        Local
qwen3tts    Qwen3-TTS (includes PyTorch)         Local
stylebert   Style-Bert-VITS2 (includes PyTorch)  Local
player      Audio playback via sounddevice       Utility
all         All of the above                     -

Using pip instead of uv

pip install "speechflow[openai]"
pip install "speechflow[openai,player]"
pip install "speechflow[all]"

GPU Support (Kokoro / Qwen3-TTS / Style-Bert-VITS2)

Local engines pull in PyTorch as a dependency, and by default this resolves to the CPU-only build. For GPU acceleration, install a CUDA-enabled PyTorch before installing speechflow:

# uv
uv add torch torchvision torchaudio --index https://download.pytorch.org/whl/cu121
uv add "speechflow[kokoro]"

# pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "speechflow[kokoro]"

Replace cu121 with your CUDA version (e.g., cu118, cu124).

Quick Start

Async (Primary API)

import asyncio
from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter

async def main():
    engine = OpenAITTSEngine(api_key="your-api-key")
    player = AudioPlayer()
    writer = AudioWriter()

    # Generate audio
    audio = await engine.get("Hello, world!")

    # Play audio
    await player.play(audio)

    # Save to file
    await writer.save(audio, "output.wav")

asyncio.run(main())

Sync Wrappers

from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter

engine = OpenAITTSEngine(api_key="your-api-key")
player = AudioPlayer()
writer = AudioWriter()

audio = engine.get_sync("Hello, world!")
player.play_sync(audio)
writer.save_sync(audio, "output.wav")
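
Conceptually, each *_sync method just runs its async counterpart to completion. The sketch below shows that wrapper pattern with a toy stand-in class (FakeEngine is hypothetical; SpeechFlow's real implementation may differ):

```python
import asyncio

class FakeEngine:
    """Toy stand-in for a TTS engine, illustrating the sync-wrapper pattern."""

    async def get(self, text: str) -> str:
        await asyncio.sleep(0)  # stands in for network or inference work
        return f"audio<{text}>"

    def get_sync(self, text: str) -> str:
        # Drive the coroutine on a fresh event loop. This only works
        # when no loop is already running in the current thread.
        return asyncio.run(self.get(text))

print(FakeEngine().get_sync("Hello, world!"))  # audio<Hello, world!>
```

One consequence of this pattern: sync wrappers cannot be called from inside an already-running event loop (for example, inside an `async def`), where `asyncio.run` raises a RuntimeError.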

Streaming

import asyncio
from speechflow import OpenAITTSEngine, AudioPlayer

async def main():
    engine = OpenAITTSEngine(api_key="your-api-key")
    player = AudioPlayer()

    # Stream and play (returns combined AudioData)
    combined = await player.play_stream(engine.stream("This is a long text that will be streamed..."))

asyncio.run(main())

Streaming notes:

  • OpenAI: True streaming with multiple chunks.
  • Gemini: Returns complete audio in a single chunk (API limitation).
  • FishAudio: True streaming.
  • ElevenLabs: True streaming.
  • Kokoro / Style-Bert-VITS2 / Qwen3-TTS: Sentence-by-sentence streaming.
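
For the local engines, sentence-by-sentence streaming implies the input text is chunked into sentences before synthesis, so each sentence's audio can be yielded as soon as it is generated. SpeechFlow's actual splitter is internal; a naive illustrative version (`split_sentences` is hypothetical) could look like:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after sentence-ending punctuation, including the
    # Japanese full stop used by the local-engine examples.
    parts = re.split(r"(?<=[.!?。])\s*", text)
    return [p for p in parts if p]

print(split_sentences("First sentence. Second one! これは日本語です。"))
```

Each resulting sentence would then be synthesized and yielded as its own audio chunk.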

Engine-Specific Features

OpenAI TTS

engine = OpenAITTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    voice="alloy",           # ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
    model="gpt-4o-mini-tts", # tts-1, tts-1-hd
    speed=1.0,
    instructions="Speak in a cheerful tone",
)

# Streaming
async for chunk in engine.stream("Long text..."):
    pass

Google Gemini TTS

engine = GeminiTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    model="gemini-2.5-flash-preview-tts",  # gemini-2.5-pro-preview-tts
    voice="Leda",                           # Puck, Charon, Kore, Fenrir, Aoede, ...
)

FishAudio TTS

engine = FishAudioTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello world",
    model="s1",                  # s1-mini, speech-1.6, speech-1.5, agent-x0
    voice="your-voice-id",
    speed=1.0,                   # Speech speed
    volume=1.0,                  # Volume
)

# Streaming
async for chunk in engine.stream("Streaming text..."):
    pass

ElevenLabs TTS

engine = ElevenLabsTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    voice="21m00Tcm4TlvDq8ikWAM",  # Voice ID from ElevenLabs dashboard
    model="eleven_multilingual_v2", # eleven_turbo_v2_5, eleven_turbo_v2, eleven_monolingual_v1
    output_format="pcm_24000",      # pcm_16000, pcm_22050, pcm_44100
    stability=0.5,
    similarity_boost=0.75,
    speed=1.0,
)

# Streaming
async for chunk in engine.stream("Streaming text..."):
    pass

Qwen3-TTS

# CustomVoice model (default) — choose from built-in speakers
engine = Qwen3TTSEngine()
audio = await engine.get(
    "Hello, world!",
    speaker="Chelsie",  # Ethan, Chelsie, etc.
    language="en",
)

# Base model — voice cloning with reference audio
engine = Qwen3TTSEngine(model_id="Qwen/Qwen3-TTS-0.6B-Base")
engine.set_voice_profile(ref_audio=audio_bytes, ref_text="transcript")
audio = await engine.get("Clone this voice", language="en")

# Sentence-by-sentence streaming
async for chunk in engine.stream("Long text for streaming...", language="ja"):
    pass

Supported languages: Chinese (zh), English (en), Japanese (ja), Korean (ko), and more.

Kokoro TTS

# Default: American English
engine = KokoroTTSEngine()
audio = await engine.get(
    "Hello world",
    voice="af_heart",
    speed=1.0,
)

# Japanese (dictionary auto-downloads on first use)
engine = KokoroTTSEngine(lang_code="j")
audio = await engine.get("こんにちは、世界", voice="af_heart")

If the Japanese dictionary download fails, run it manually: python -m unidic download

Supported languages: American English (a), British English (b), Spanish (e), French (f), Hindi (h), Italian (i), Japanese (j), Brazilian Portuguese (p), Mandarin Chinese (z)

Style-Bert-VITS2

# Pre-trained model (auto-downloads on first use)
engine = StyleBertTTSEngine(model_name="jvnv-F1-jp")
audio = await engine.get(
    "こんにちは、世界",
    style="Happy",       # Neutral, Happy, Sad, Angry, Fear, Surprise, Disgust
    style_weight=5.0,    # Emotion strength (0.0-10.0)
    speed=1.0,
    pitch=0.0,           # Pitch shift in semitones
    speaker_id=0,
)

# Custom model
engine = StyleBertTTSEngine(model_path="/path/to/your/model")

# Sentence-by-sentence streaming
async for chunk in engine.stream("長い文章を文ごとに生成します。"):
    pass

Pre-trained models: jvnv-F1-jp, jvnv-F2-jp (female), jvnv-M1-jp, jvnv-M2-jp (male)

Optimized for Japanese. GPU recommended for best performance.

License

MIT

Download files

Source Distribution

speechflow-0.4.0.tar.gz (41.8 kB)

Uploaded Source

Built Distribution

speechflow-0.4.0-py3-none-any.whl (39.2 kB)

Uploaded Python 3

File details

Details for the file speechflow-0.4.0.tar.gz.

File metadata

  • Download URL: speechflow-0.4.0.tar.gz
  • Upload date:
  • Size: 41.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for speechflow-0.4.0.tar.gz

Algorithm     Hash digest
SHA256        0d87896c6aed6a37ef5ddfcd357dee7e0c3ca7d8e94665b0a7da83f82ed20cfc
MD5           13702615305f54bd27d34da07f23dba7
BLAKE2b-256   b761bbeddce85fafcfad9bffa034afb5f8031f24155fb6b04979a6108387434b

Provenance

The following attestation bundles were made for speechflow-0.4.0.tar.gz:

Publisher: publish.yml on sync-dev-org/speechflow


File details

Details for the file speechflow-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: speechflow-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 39.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for speechflow-0.4.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        b8a25dc5ce0c2130754398c42fc5aaf0b5c6549fdf0ea6abd99628d9c460b350
MD5           050f4c7fae39362ba7f9c4ee89c058fe
BLAKE2b-256   2b4acd3327454d3d08e40cf2011cfe894ac90d44c09d72fb4d6f89dda0496853

Provenance

The following attestation bundles were made for speechflow-0.4.0-py3-none-any.whl:

Publisher: publish.yml on sync-dev-org/speechflow

