
Text-to-speech for AI agents and developers. Compiler → Graph → Engine architecture.


日本語 | 中文 | Español | Français | हिन्दी | Italiano | Português (BR)


Give your AI agents a voice that feels real.

Python 3.10+ · MIT License

Part of MCP Tool Shop — practical developer tools that stay out of your way.


Voice Soundboard is a text-to-speech engine built for developers who need more than just a .mp3 file.

Most TTS libraries force a choice: high-level APIs that hide everything, or low-level toolkits that demand audio-engineering knowledge. Voice Soundboard gives you both.

  • Simple High-Level API: Just call engine.speak("Hello") and get audio.
  • Powerful Internals: Under the hood, we use a Compiler/Graph/Engine architecture that separates what is said (intent, emotion) from how it's rendered (backend, audio format).
  • Zero-Cost Abstractions: Emotions, styles, and SSML are compiled into a control graph, so the runtime engine stays fast and lightweight.

Quick Start

pip install voice-soundboard

from voice_soundboard import VoiceEngine

# Easy text-to-speech
engine = VoiceEngine()
result = engine.speak("Hello world! This is my AI voice.")
print(f"Saved to: {result.audio_path}")

Architecture

compile_request("text", emotion="happy")
        |
    ControlGraph (pure data)
        |
    engine.synthesize(graph)
        |
    PCM audio (numpy array)

The compiler transforms intent (text + emotion + style) into a ControlGraph.

The engine transforms the graph into audio. It knows nothing about emotions or styles.

This separation means:

  • Features are "free" at runtime (already baked into the graph)
  • Engine is tiny, fast, testable
  • Backends are swappable without touching feature logic
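The split can be illustrated with a minimal, self-contained sketch. These are plain dataclasses standing in for the real types; the field names and emotion table here are assumptions for illustration, not the library's actual schema:

```python
from dataclasses import dataclass

# Illustrative stand-in for ControlGraph: pure data, prosody already resolved.
@dataclass
class Graph:
    text: str
    rate: float = 1.0
    pitch: float = 0.0

# Compiler side: the only place that knows what "happy" means.
EMOTION_PROSODY = {"happy": (1.1, 2.0), "calm": (0.9, -1.0)}

def compile_request(text, emotion=None):
    rate, pitch = EMOTION_PROSODY.get(emotion, (1.0, 0.0))
    return Graph(text=text, rate=rate, pitch=pitch)

# Engine side: reads graph fields only; the word "emotion" never appears here.
def synthesize(graph):
    return f"pcm:{graph.text}@rate={graph.rate},pitch={graph.pitch}"
```

Swapping backends then means swapping `synthesize` implementations; the compiler and all feature logic stay untouched.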

Usage

Basic

from voice_soundboard import VoiceEngine

engine = VoiceEngine()

# Simple
result = engine.speak("Hello world!")

# With voice
result = engine.speak("Cheerio!", voice="bm_george")

# With preset
result = engine.speak("Breaking news!", preset="announcer")

# With emotion
result = engine.speak("I'm so happy!", emotion="excited")

# With natural language style
result = engine.speak("Good morning!", style="warmly and cheerfully")

Advanced: Direct Graph Manipulation

from voice_soundboard.compiler import compile_request
from voice_soundboard.engine import load_backend

# Compile once
graph = compile_request(
    "Hello world!",
    voice="af_bella",
    emotion="happy",
)

# Synthesize many times (or with different backends)
backend = load_backend("kokoro")
audio = backend.synthesize(graph)

Streaming

Streaming operates at two levels:

  1. Graph streaming: compile_stream() yields ControlGraphs as sentence boundaries are detected
  2. Audio streaming: StreamingSynthesizer chunks audio for real-time playback

Note: This is sentence-level streaming, not word-by-word incremental synthesis. The compiler waits for sentence boundaries before yielding graphs. True incremental synthesis (speculative execution with rollback) is architecturally supported but not yet implemented.

from voice_soundboard.compiler import compile_stream
from voice_soundboard.engine import load_backend
from voice_soundboard.runtime import StreamingSynthesizer

# For LLM output or real-time text
def text_chunks():
    yield "Hello, "
    yield "how are "
    yield "you today?"

backend = load_backend()
streamer = StreamingSynthesizer(backend)

for graph in compile_stream(text_chunks()):
    for audio_chunk in streamer.stream(graph):
        play(audio_chunk)  # play() is your audio sink (e.g. a sounddevice stream)
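The sentence-level buffering that compile_stream() performs can be sketched in plain Python. This is an illustration of the behavior described above, not the library's implementation:

```python
import re

def sentence_stream(chunks):
    """Buffer incoming text chunks; yield a sentence at each boundary."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Emit every complete sentence ending in ., ! or ? seen so far.
        while (m := re.search(r"[.!?]", buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()
```

Feeding it the three chunks from the example above yields one graph-sized sentence ("Hello, how are you today?") only once the "?" arrives, which is exactly why this is sentence-level rather than word-by-word streaming.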

CLI

# Speak text
voice-soundboard speak "Hello world!"

# With options
voice-soundboard speak "Breaking news!" --preset announcer --speed 1.1

# List voices
voice-soundboard voices

# List presets
voice-soundboard presets

# List emotions
voice-soundboard emotions

Backends

| Backend    | Quality   | Speed          | Sample Rate | Install                                  |
|------------|-----------|----------------|-------------|------------------------------------------|
| Kokoro     | Excellent | Fast (GPU)     | 24000 Hz    | pip install voice-soundboard[kokoro]     |
| Piper      | Great     | Fast (CPU)     | 22050 Hz    | pip install voice-soundboard[piper]      |
| OpenAI     | Excellent | Cloud          | 24000 Hz    | pip install voice-soundboard[openai]     |
| Coqui      | Great     | Moderate (GPU) | 22050 Hz    | pip install voice-soundboard[coqui]      |
| ElevenLabs | Premium   | Cloud          | 44100 Hz    | pip install voice-soundboard[elevenlabs] |
| Azure      | Excellent | Cloud          | 24000 Hz    | pip install voice-soundboard[azure]      |
| Mock       | N/A       | Instant        | 24000 Hz    | (built-in, for testing)                  |

Kokoro Setup

pip install voice-soundboard[kokoro]

# Download models
mkdir models && cd models
curl -LO https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
curl -LO https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin

Piper Setup

pip install voice-soundboard[piper]

# Download a voice (example: en_US-lessac-medium)
python -m piper.download_voices en_US-lessac-medium

Piper features:

  • 30+ voices across multiple languages (English, German, French, Spanish)
  • Pure CPU - no GPU required
  • Speed control via length_scale (inverted: 0.8 = faster, 1.2 = slower)
  • Sample rate: 22050 Hz (backend-specific)
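Because length_scale is inverted relative to a speed multiplier, a small conversion helper makes the relationship explicit (a hypothetical helper for illustration, not part of the package):

```python
def speed_to_length_scale(speed: float) -> float:
    # length_scale stretches phoneme durations, so it is the reciprocal of
    # speed: speed 1.25 -> length_scale 0.8 (faster), speed 0.8 -> 1.25 (slower).
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed
```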

Voice mapping from Kokoro:

# These Kokoro voices have Piper equivalents
from voice_soundboard import VoiceEngine, Config  # import path for Config assumed

engine = VoiceEngine(Config(backend="piper"))
result = engine.speak("Hello!", voice="af_bella")  # maps to en_US-lessac-medium

Package Structure

voice_soundboard/
├── graph/          # ControlGraph, TokenEvent, SpeakerRef
├── compiler/       # Text -> Graph (all features live here)
│   ├── text.py     # Tokenization, normalization
│   ├── emotion.py  # Emotion -> prosody
│   ├── style.py    # Natural language style
│   └── compile.py  # Main entry point
├── engine/         # Graph -> PCM (no features, just synthesis)
│   └── backends/   # Kokoro, Piper, OpenAI, Coqui, ElevenLabs, Azure, Mock
├── runtime/        # Streaming, timeline, ducking
├── adapters/       # CLI, public API (thin wrappers)
├── streaming/      # Incremental word-by-word synthesis
├── conversation/   # Multi-speaker dialogue
├── cloning/        # Speaker embedding extraction
├── speakers/       # Speaker database
├── realtime/       # Low-latency streaming engine
├── plugins/        # Plugin architecture
├── quality/        # Voice quality metrics
├── formats/        # Audio format conversion, LUFS
├── debug/          # Graph visualization, profiler
├── testing/        # VoiceMock, AudioAssertions
└── accessibility/  # Screen reader integration, captions

Key invariant: engine/ never imports from compiler/.

Architecture Invariants

These rules are enforced in tests and must never be violated:

  1. Engine isolation: engine/ never imports from compiler/. The engine knows nothing about emotions, styles, or presets -- only ControlGraphs.

  2. Voice cloning boundary: Raw audio never reaches the engine. The compiler extracts speaker embeddings; the engine receives only embedding vectors via SpeakerRef.

  3. Graph stability: GRAPH_VERSION (currently 1) is bumped on breaking changes to ControlGraph. Backends can check this for compatibility.

from voice_soundboard.graph import GRAPH_VERSION, ControlGraph
assert GRAPH_VERSION == 1
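A test enforcing the engine-isolation invariant can be sketched with the standard library's ast module. This is illustrative; the package's actual enforcement test may differ:

```python
import ast
import pathlib

def forbidden_imports(pkg_dir: str, forbidden: str = "voice_soundboard.compiler"):
    """Return (file, module) pairs where code under pkg_dir imports compiler code."""
    hits = []
    for path in pathlib.Path(pkg_dir).rglob("*.py"):
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                modules = [node.module or ""]
            else:
                continue
            hits += [(path.name, m) for m in modules if m.startswith(forbidden)]
    return hits

# In a test suite:
# assert forbidden_imports("voice_soundboard/engine") == []
```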

Migration

The public API is unchanged across all major versions:

# This works in v1, v2, and v3
from voice_soundboard import VoiceEngine
engine = VoiceEngine()
result = engine.speak("Hello!", voice="af_bella", emotion="happy")

v2 → v3

v3 removes 11 speculative modules that shipped with zero test coverage (distributed, serverless, intelligence, analytics, monitoring, security, ambiance, scenes, spatial, mcp, v3-alpha). The public API is unchanged. If you imported removed internals, see CHANGELOG.md for details.

v1 → v2

| v1                | v2+                       |
|-------------------|---------------------------|
| engine.py         | adapters/api.py           |
| emotions.py       | compiler/emotion.py       |
| interpreter.py    | compiler/style.py         |
| engines/kokoro.py | engine/backends/kokoro.py |

Security & Data Scope

  • Data accessed: Reads text input for TTS synthesis. Processes audio through configured backends (Kokoro, Piper, or mock). Returns PCM audio as numpy arrays or WAV files.
  • Data NOT accessed: No network egress by default (backends are local). No telemetry, analytics, or tracking. No user data storage beyond transient audio buffers.
  • Permissions required: Read access to TTS model files. Optional write access for audio output.

See SECURITY.md for vulnerability reporting.

Scorecard

| Category            | Score |
|---------------------|-------|
| A. Security         | 10/10 |
| B. Error Handling   | 10/10 |
| C. Operator Docs    | 10/10 |
| D. Shipping Hygiene | 10/10 |
| E. Identity (soft)  | 10/10 |
| Overall             | 50/50 |

Evaluated with @mcptoolshop/shipcheck

License

MIT -- see LICENSE for details.


Built by MCP Tool Shop
