Text-to-speech for AI agents and developers. Compiler → Graph → Engine architecture.
Give your AI agents a voice that feels real.
Part of MCP Tool Shop — practical developer tools that stay out of your way.
Voice Soundboard is a text-to-speech engine built for developers who need more than just an MP3 file.
Most TTS libraries force a choice: easy APIs that hide everything, or complex low-level tools that demand audio-engineering knowledge. Voice Soundboard gives you the best of both worlds.
- Simple High-Level API: Just call `engine.speak("Hello")` and get audio.
- Powerful Internals: Under the hood, we use a Compiler/Graph/Engine architecture that separates what is said (intent, emotion) from how it's rendered (backend, audio format).
- Zero-Cost Abstractions: Emotions, styles, and SSML are compiled into a control graph, so the runtime engine stays fast and lightweight.
Quick Start
pip install voice-soundboard
from voice_soundboard import VoiceEngine
# Easy text-to-speech
engine = VoiceEngine()
result = engine.speak("Hello world! This is my AI voice.")
print(f"Saved to: {result.audio_path}")
Architecture
compile_request("text", emotion="happy")
        ↓
ControlGraph (pure data)
        ↓
engine.synthesize(graph)
        ↓
PCM audio (numpy array)
The compiler transforms intent (text + emotion + style) into a ControlGraph.
The engine transforms the graph into audio. It knows nothing about emotions or styles.
This separation means:
- Features are "free" at runtime (already baked into the graph)
- Engine is tiny, fast, testable
- Backends are swappable without touching feature logic
Usage
Basic
from voice_soundboard import VoiceEngine
engine = VoiceEngine()
# Simple
result = engine.speak("Hello world!")
# With voice
result = engine.speak("Cheerio!", voice="bm_george")
# With preset
result = engine.speak("Breaking news!", preset="announcer")
# With emotion
result = engine.speak("I'm so happy!", emotion="excited")
# With natural language style
result = engine.speak("Good morning!", style="warmly and cheerfully")
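Style strings like the one above have to be reduced to prosody somewhere in the compiler. Purely as an illustration (the keyword table below is invented here, not the actual `compiler/style.py` logic):

```python
# Hypothetical sketch of style-string parsing; the real compiler is
# likely more sophisticated than a keyword lookup.
STYLE_KEYWORDS = {
    "warmly":     {"pitch": +1.0, "rate": 0.95},
    "cheerfully": {"pitch": +2.0, "rate": 1.05},
    "gravely":    {"pitch": -2.0, "rate": 0.90},
}

def parse_style(style: str) -> dict:
    """Merge prosody deltas for every known keyword in the style string."""
    prosody = {"pitch": 0.0, "rate": 1.0}
    for word in style.replace(",", " ").split():
        delta = STYLE_KEYWORDS.get(word.lower())
        if delta:
            prosody["pitch"] += delta["pitch"]
            prosody["rate"] *= delta["rate"]
    return prosody

parse_style("warmly and cheerfully")  # pitch: 3.0, rate: 0.95 * 1.05
```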
Advanced: Direct Graph Manipulation
from voice_soundboard.compiler import compile_request
from voice_soundboard.engine import load_backend
# Compile once
graph = compile_request(
    "Hello world!",
    voice="af_bella",
    emotion="happy",
)
# Synthesize many times (or with different backends)
backend = load_backend("kokoro")
audio = backend.synthesize(graph)
Streaming
Streaming operates at two levels:
- Graph streaming: `compile_stream()` yields ControlGraphs as sentence boundaries are detected
- Audio streaming: `StreamingSynthesizer` chunks audio for real-time playback
Note: This is sentence-level streaming, not word-by-word incremental synthesis. The compiler waits for sentence boundaries before yielding graphs. True incremental synthesis (speculative execution with rollback) is architecturally supported but not yet implemented.
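The buffering this note describes can be sketched in plain Python: accumulate chunks, flush only at sentence boundaries. This uses a simplified punctuation rule, not the compiler's actual tokenizer:

```python
import re

_BOUNDARY = re.compile(r"(?<=[.!?])\s+")  # simplified sentence boundary

def sentence_stream(chunks):
    """Yield complete sentences as they arrive from a chunked text source.

    Sketch of the buffering compile_stream() needs: hold text until a
    sentence boundary appears, then flush the completed sentence(s).
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # Everything except the last part ends at a sentence boundary.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():  # flush the trailing fragment at end of input
        yield buffer

for sentence in sentence_stream(["Hello, ", "how are ", "you? I am ", "fine."]):
    print(sentence)
# prints:
#   Hello, how are you?
#   I am fine.
```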
from voice_soundboard.compiler import compile_stream
from voice_soundboard.engine import load_backend
from voice_soundboard.runtime import StreamingSynthesizer

# For LLM output or real-time text
def text_chunks():
    yield "Hello, "
    yield "how are "
    yield "you today?"

backend = load_backend()
streamer = StreamingSynthesizer(backend)

for graph in compile_stream(text_chunks()):
    for audio_chunk in streamer.stream(graph):
        play(audio_chunk)  # play() is your playback function (e.g. sounddevice)
CLI
# Speak text
voice-soundboard speak "Hello world!"
# With options
voice-soundboard speak "Breaking news!" --preset announcer --speed 1.1
# List voices
voice-soundboard voices
# List presets
voice-soundboard presets
# List emotions
voice-soundboard emotions
Backends
| Backend | Quality | Speed | Sample Rate | Install |
|---|---|---|---|---|
| Kokoro | Excellent | Fast (GPU) | 24000 Hz | pip install voice-soundboard[kokoro] |
| Piper | Great | Fast (CPU) | 22050 Hz | pip install voice-soundboard[piper] |
| OpenAI | Excellent | Cloud | 24000 Hz | pip install voice-soundboard[openai] |
| Coqui | Great | Moderate (GPU) | 22050 Hz | pip install voice-soundboard[coqui] |
| ElevenLabs | Premium | Cloud | 44100 Hz | pip install voice-soundboard[elevenlabs] |
| Azure | Excellent | Cloud | 24000 Hz | pip install voice-soundboard[azure] |
| Mock | N/A | Instant | 24000 Hz | (built-in, for testing) |
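If you want code that degrades gracefully across installs, one pattern is to probe for each backend's optional dependency and fall back to the mock. This helper is an illustration, not part of the library, and the module names in the preference list are assumptions:

```python
# Hypothetical helper -- not part of the library's API. Picks the first
# backend whose optional dependency is importable, falling back to mock.
import importlib.util

PREFERENCE = ["kokoro_onnx", "piper", "openai"]  # module names are assumptions

def pick_backend() -> str:
    for module in PREFERENCE:
        if importlib.util.find_spec(module) is not None:
            return module
    return "mock"

print(pick_backend())
```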
Kokoro Setup
pip install voice-soundboard[kokoro]
# Download models
mkdir models && cd models
curl -LO https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
curl -LO https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin
Piper Setup
pip install voice-soundboard[piper]
# Download a voice (example: en_US-lessac-medium)
python -m piper.download_voices en_US-lessac-medium
Piper features:
- 30+ voices across multiple languages (English, German, French, Spanish)
- Pure CPU - no GPU required
- Speed control via `length_scale` (inverted: 0.8 = faster, 1.2 = slower)
- Sample rate: 22050 Hz (backend-specific)
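Since `length_scale` is the inverse of perceived speed, a small conversion helper (illustrative, not part of the library) keeps the relationship explicit:

```python
def speed_to_length_scale(speed: float) -> float:
    """Convert a user-facing speed multiplier to Piper's length_scale.

    length_scale stretches phoneme durations, so it is the inverse of
    speed: speaking twice as fast means half the duration.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return 1.0 / speed

speed_to_length_scale(1.25)  # 0.8  -> faster speech
speed_to_length_scale(0.8)   # 1.25 -> slower speech
```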
Voice mapping from Kokoro:
# These Kokoro voices have Piper equivalents
engine = VoiceEngine(Config(backend="piper"))
result = engine.speak("Hello!", voice="af_bella")  # Maps to en_US-lessac-medium
Package Structure
voice_soundboard/
├── graph/ # ControlGraph, TokenEvent, SpeakerRef
├── compiler/ # Text -> Graph (all features live here)
│ ├── text.py # Tokenization, normalization
│ ├── emotion.py # Emotion -> prosody
│ ├── style.py # Natural language style
│ └── compile.py # Main entry point
├── engine/ # Graph -> PCM (no features, just synthesis)
│ └── backends/ # Kokoro, Piper, OpenAI, Coqui, ElevenLabs, Azure, Mock
├── runtime/ # Streaming, timeline, ducking
├── adapters/ # CLI, public API (thin wrappers)
├── streaming/ # Incremental word-by-word synthesis
├── conversation/ # Multi-speaker dialogue
├── cloning/ # Speaker embedding extraction
├── speakers/ # Speaker database
├── realtime/ # Low-latency streaming engine
├── plugins/ # Plugin architecture
├── quality/ # Voice quality metrics
├── formats/ # Audio format conversion, LUFS
├── debug/ # Graph visualization, profiler
├── testing/ # VoiceMock, AudioAssertions
└── accessibility/ # Screen reader integration, captions
Key invariant: engine/ never imports from compiler/.
Architecture Invariants
These rules are enforced in tests and must never be violated:
- Engine isolation: `engine/` never imports from `compiler/`. The engine knows nothing about emotions, styles, or presets -- only ControlGraphs.
- Voice cloning boundary: Raw audio never reaches the engine. The compiler extracts speaker embeddings; the engine receives only embedding vectors via `SpeakerRef`.
- Graph stability: `GRAPH_VERSION` (currently 1) is bumped on breaking changes to ControlGraph. Backends can check this for compatibility.
from voice_soundboard.graph import GRAPH_VERSION, ControlGraph
assert GRAPH_VERSION == 1
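The engine-isolation rule lends itself to a mechanical check. Here is a sketch of how such a test might scan module source with `ast` (illustrative only; the project's actual tests may differ):

```python
# Hypothetical sketch of an isolation test: flag any import in a module's
# source that touches the forbidden subpackage.
import ast

def imports_from(source: str, forbidden: str) -> bool:
    """True if any import in `source` touches the `forbidden` subpackage."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(forbidden in alias.name.split(".") for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module:
            if forbidden in node.module.split("."):
                return True
    return False

# engine code may import the graph, never the compiler
assert not imports_from("from voice_soundboard.graph import ControlGraph", "compiler")
assert imports_from("from voice_soundboard.compiler import compile_request", "compiler")
```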
Migration
The public API is unchanged across all major versions:
# This works in v1, v2, and v3
from voice_soundboard import VoiceEngine
engine = VoiceEngine()
result = engine.speak("Hello!", voice="af_bella", emotion="happy")
v2 → v3
v3 removes 11 speculative modules that shipped with zero test coverage (distributed, serverless, intelligence, analytics, monitoring, security, ambiance, scenes, spatial, mcp, v3-alpha). The public API is unchanged. If you imported removed internals, see CHANGELOG.md for details.
v1 → v2
| v1 | v2+ |
|---|---|
| engine.py | adapters/api.py |
| emotions.py | compiler/emotion.py |
| interpreter.py | compiler/style.py |
| engines/kokoro.py | engine/backends/kokoro.py |
Security & Data Scope
- Data accessed: Reads text input for TTS synthesis. Processes audio through configured backends (Kokoro, Piper, or mock). Returns PCM audio as numpy arrays or WAV files.
- Data NOT accessed: No network egress by default (backends are local). No telemetry, analytics, or tracking. No user data storage beyond transient audio buffers.
- Permissions required: Read access to TTS model files. Optional write access for audio output.
See SECURITY.md for vulnerability reporting.
Scorecard
| Category | Score |
|---|---|
| A. Security | 10/10 |
| B. Error Handling | 10/10 |
| C. Operator Docs | 10/10 |
| D. Shipping Hygiene | 10/10 |
| E. Identity (soft) | 10/10 |
| Overall | 50/50 |
Evaluated with `@mcptoolshop/shipcheck`.
License
MIT -- see LICENSE for details.
Built by MCP Tool Shop