Offline TTS library using Kokoro-82M

stackvox

Offline TTS using Kokoro-82M via kokoro-onnx. Apache 2.0 model, ~340MB, CPU real-time, plays straight to system audio. Designed to be importable as a Python library, drivable as a CLI, or poked via a unix socket for ~13ms speech requests from shell scripts.

🔊 Hear it: docs/demo.wav, five seconds of two voices speaking the tagline (af_heart, then bf_emma).

Install

From PyPI (recommended for most users):

pipx install stackvox       # `stackvox` CLI on PATH
# or
pip install stackvox        # use as a library

If you want the low-latency bash helper (stackvox-say) for shell scripts and hooks, install it on PATH after installing the package:

stackvox install-helper     # copies bash helper to ~/.local/bin
                            # use --prefix DIR to install elsewhere

This is a one-time step. The helper is shipped as package data rather than as an automatic install script: explicit beats magical, and it keeps stackvox compatible with modern build backends. Skip it if you only ever use the Python stackvox say client.

From git, if you want an unreleased commit:

pipx install git+https://github.com/StackOneHQ/stackvox.git
# upgrade later with: pipx install --force git+https://github.com/StackOneHQ/stackvox.git

Dev install from a clone:

python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Model + voice files auto-download to ~/.cache/stackvox/ on first use. Override with STACKVOX_CACHE_DIR.

CLI

stackvox "Hello world"              # synthesize and play in-process
stackvox speak "Hi" --voice bf_emma # same, explicit subcommand
stackvox speak "save" --out a.wav   # write wav instead of playing
stackvox welcome                    # multilingual welcome (6 languages)
stackvox voices                     # list all voice ids
echo "from a pipe" | stackvox       # piped stdin works for speak/say
stackvox speak --file message.txt   # read a whole file

Bash completion:

eval "$(stackvox completion bash)"  # current shell
# or persist:
stackvox completion bash > ~/.stackvox-completion.bash
echo 'source ~/.stackvox-completion.bash' >> ~/.bashrc

Configuration

stackvox reads per-user defaults from a TOML file, so you don't need to repeat --voice bf_emma --speed 1.1 on every invocation. Set values in ~/.config/stackvox/config.toml (or $XDG_CONFIG_HOME/stackvox/config.toml, or wherever STACKVOX_CONFIG points):

[defaults]
voice = "bf_emma"
speed = 1.1
lang = "en-gb"

CLI flags always win over config values, and config values always win over the built-in defaults. A missing file is fine: the built-ins apply. A malformed file logs a warning and is ignored.
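
The precedence rule above can be sketched as a small merge function. This is an illustration, not stackvox's actual code: the names BUILTIN_DEFAULTS and resolve_settings, and the built-in values shown, are assumptions. Only the [defaults] table and its three keys come from the example above. tomllib needs Python 3.11+; on 3.10 the sketch falls back to the third-party tomli package.

```python
import logging

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # Python 3.10: third-party backport
    import tomli as tomllib

# Hypothetical built-in values, for illustration only.
BUILTIN_DEFAULTS = {"voice": "af_heart", "speed": 1.0, "lang": "en-us"}


def resolve_settings(cli_flags, config_text=None):
    """Merge built-ins, then config [defaults], then CLI flags (last wins)."""
    settings = dict(BUILTIN_DEFAULTS)
    if config_text is not None:  # None models a missing config file
        try:
            settings.update(tomllib.loads(config_text).get("defaults", {}))
        except tomllib.TOMLDecodeError as exc:
            # A malformed file is ignored after a warning.
            logging.warning("ignoring malformed config: %s", exc)
    # Only flags the user actually passed (not None) override the rest.
    settings.update({k: v for k, v in cli_flags.items() if v is not None})
    return settings
```

Calling resolve_settings({"speed": 1.3}, config_text) with the config shown above would keep voice and lang from the file and take speed from the flag.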

Daemon mode

Keeps the model resident so each subsequent call is instant:

stackvox serve         # foreground; run with `nohup stackvox serve &` to background
stackvox status        # is the daemon up? also shows version + any pending PyPI update
stackvox say "Hello"   # send text to the daemon (fails if not running)
stackvox stop          # graceful shutdown

stackvox checks PyPI for newer versions, but only at two moments: when you run stackvox status and at daemon startup. The script-heavy paths (say, speak, stackvox-say, hooks, CI) never make a network call. To see notices on every invocation set STACKVOX_UPDATE_NOTICE=1. To disable the check entirely set STACKVOX_NO_UPDATE_CHECK=1. The check is auto-skipped when common CI env vars (CI, GITHUB_ACTIONS, etc.) are set so build logs stay clean.
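
Read as a decision function, the rules above might look like the sketch below. The function name, the command strings, and the order in which the environment variables are consulted are assumptions. The text names only CI and GITHUB_ACTIONS, so the full list of CI variables stackvox inspects is not reproduced here.

```python
import os

# Commands that may trigger the PyPI check, per the paragraph above.
CHECKING_COMMANDS = {"status", "serve"}
# Only these two CI variables are named in the text; the real list may be longer.
CI_ENV_VARS = ("CI", "GITHUB_ACTIONS")


def should_check_for_update(command, env=None):
    """Return True if this invocation would be allowed to contact PyPI."""
    env = os.environ if env is None else env
    if env.get("STACKVOX_NO_UPDATE_CHECK") == "1":
        return False  # explicit opt-out wins (assumed precedence)
    if any(env.get(name) for name in CI_ENV_VARS):
        return False  # auto-skip in CI so build logs stay clean
    if env.get("STACKVOX_UPDATE_NOTICE") == "1":
        return True  # opt in to notices on every invocation
    return command in CHECKING_COMMANDS
```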

stackvox-say (bash helper, ~13ms)

When you want minimum latency from shell scripts (hooks, CI steps, etc.), skip the Python client and use the bash helper, which talks directly to the daemon's unix socket via nc:

stackvox-say "back to you in 5"
stackvox-say --voice bf_emma --speed 1.1 "hello"
stackvox-say --fallback-say "text"     # shell out to macOS `say` if daemon is down

Exit codes: 0 ok, 2 daemon unreachable (unless --fallback-say was given).
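
If you call the helper from Python tooling rather than from a shell, the two documented exit codes can be mapped to a status as in this sketch. The function name announce and the returned strings are hypothetical; the run parameter exists only so the function can be exercised without the helper installed.

```python
import subprocess


def announce(text, run=subprocess.run):
    """Speak via stackvox-say and map its exit code to a short status."""
    result = run(["stackvox-say", text])
    if result.returncode == 0:
        return "ok"
    if result.returncode == 2:
        return "daemon unreachable"
    # Any other code is not documented above, so report it verbatim.
    return f"unexpected exit code {result.returncode}"
```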

Python library

from stackvox import Stackvox, speak, synthesize

# One-shot: the model loads on first call and is reused for subsequent calls.
speak("Hello world")

# Reusable engine.
tts = Stackvox(voice="af_bella")
tts.speak("First line")
tts.speak("Faster", speed=1.2)

# Non-blocking playback.
tts.speak("async", blocking=False)
tts.stop()

# Raw samples for custom processing.
samples, sr = tts.synthesize("give me the array")

# Gapless multi-line playback with concurrent synthesis.
tts.speak_sequence([
    {"text": "Hello", "voice": "af_heart", "lang": "en-us"},
    {"text": "Bonjour", "voice": "ff_siwis", "lang": "fr-fr"},
])
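
As one example of custom processing on the raw samples, the sketch below writes them to a 16-bit mono WAV file using only the standard library. It assumes synthesize returns a flat sequence of floats in the range -1.0 to 1.0 together with the sample rate, which the README does not state explicitly. The helper name write_wav is hypothetical.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples (assumed range -1.0..1.0) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip out-of-range values before scaling to the int16 range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(int(sample_rate))
        wav.writeframes(bytes(frames))
```

With the engine above this would be used as write_wav("out.wav", samples, sr). For the common case, stackvox speak --out already writes a WAV file without any extra code.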

Daemon client from Python

from stackvox import daemon

ok, resp = daemon.say("queue this via the running daemon")
if daemon.is_running():
    daemon.stop()

Voices

Kokoro ships voices across several languages. Voice prefix encodes gender + language:

| Prefix | Language | Example |
| --- | --- | --- |
| af_*, am_* | American English | af_heart, am_michael |
| bf_*, bm_* | British English | bf_emma, bm_fable |
| ff_* | French | ff_siwis |
| hf_*, hm_* | Hindi | hf_alpha, hm_omega |
| if_*, im_* | Italian | if_sara, im_nicola |
| pf_*, pm_* | Portuguese | pf_dora, pm_alex |
| ef_*, em_* | Spanish | ef_dora, em_alex |
| jf_*, jm_* | Japanese | jf_alpha |
| zf_*, zm_* | Mandarin Chinese | zf_xiaoxiao |

Run stackvox voices for the authoritative list.
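
The prefix scheme in the table can be decoded mechanically. The sketch below is not part of stackvox: describe_voice is a hypothetical helper, and reading the second letter as f = female, m = male is an assumption consistent with the table rather than something the README spells out.

```python
# First prefix letter -> language, taken from the table above.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "p": "Portuguese",
    "e": "Spanish",
    "j": "Japanese",
    "z": "Mandarin Chinese",
}
# Second prefix letter -> gender (assumed reading of f/m).
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id):
    """Split an id like 'bf_emma' into (language, gender, name)."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2:
        raise ValueError(f"not a prefixed voice id: {voice_id!r}")
    if prefix[0] not in LANGUAGES or prefix[1] not in GENDERS:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return LANGUAGES[prefix[0]], GENDERS[prefix[1]], name
```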

Architecture

┌────────────────────┐      unix socket           ┌─────────────────────────┐
│  stackvox-say      │ ───────────────────────▶   │  stackvox daemon        │
│  (bash, ~13ms)     │   JSON line per request    │  (Python, long-lived)   │
└────────────────────┘                            │                         │
┌────────────────────┐      ~500ms (Py startup)   │  preloaded Kokoro ONNX  │
│  stackvox say      │ ───────────────────────▶   │  worker thread playback │
│  (Python client)   │                            │  → sounddevice → audio  │
└────────────────────┘                            └─────────────────────────┘
┌────────────────────┐
│  stackvox speak    │   loads model in-process, plays, exits
│  (one-shot CLI)    │
└────────────────────┘

Socket lives at ~/.cache/stackvox/daemon.sock (override with STACKVOX_SOCKET for the client, STACKVOX_CACHE_DIR for the daemon). Protocol is one line of JSON per connection: {"text":"...", "voice":"...", "speed":1.0, "lang":"en-us"}; reply is ok / busy / err: <msg>. Plain text (no JSON) is accepted as a fallback and treated as {"text": line}.
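
A minimal client for the protocol just described might look like the following. It is a sketch under stated assumptions, not the bundled client: the function names are hypothetical, it assumes optional keys may be omitted from the JSON object, and it assumes the reply is a short line that ends with a newline or with the server closing the connection. parse_request shows the plain-text fallback from the receiving side.

```python
import json
import socket


def encode_request(text, voice=None, speed=None, lang=None):
    """Build one request line: a JSON object followed by a newline."""
    payload = {"text": text, "voice": voice, "speed": speed, "lang": lang}
    payload = {k: v for k, v in payload.items() if v is not None}
    return (json.dumps(payload) + "\n").encode("utf-8")


def parse_request(line):
    """Receiving side: use the JSON object if there is one, else plain text."""
    line = line.strip()
    try:
        request = json.loads(line)
    except json.JSONDecodeError:
        return {"text": line}
    # Bare JSON scalars such as 123 are still treated as plain text.
    return request if isinstance(request, dict) else {"text": line}


def send_request(socket_path, text, timeout=5.0, **options):
    """Open one connection, send one line, return the reply string."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        client.connect(socket_path)
        client.sendall(encode_request(text, **options))
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = client.recv(1024)
            if not chunk:  # server closed the connection
                break
            reply += chunk
    return reply.decode("utf-8").strip()
```

The returned string would then be compared against ok, busy, or the err: prefix.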

Queue depth is 2: rapid-fire requests beyond that get busy rather than piling up.
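
The busy behaviour is what a bounded queue with a non-blocking put gives you. The class below is a hypothetical sketch of that idea, not the daemon's code; in particular, the README does not say whether the utterance currently playing counts against the depth of 2.

```python
import queue

QUEUE_DEPTH = 2  # from the text above


class RequestQueue:
    """Bounded queue that rejects new work instead of letting it pile up."""

    def __init__(self, depth=QUEUE_DEPTH):
        self._pending = queue.Queue(maxsize=depth)

    def submit(self, request):
        """Return 'ok' if the request was queued, 'busy' if the queue is full."""
        try:
            self._pending.put_nowait(request)
        except queue.Full:
            return "busy"
        return "ok"

    def next_request(self):
        """For the playback worker: block until a request is available."""
        return self._pending.get()
```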

Before each utterance the daemon resets PortAudio so it picks up the current system default output device. Swap from speakers to Bluetooth headphones mid-session and the next say follows you; no daemon restart is needed. The refresh costs ~10–50ms per play, which is invisible next to synthesis time.

Requirements

  • Python 3.10+
  • macOS or Linux
  • nc (BSD netcat; the default on macOS, netcat-openbsd on Linux) for the bash helper

Security considerations

stackvox doesn't open any network port. The daemon binds a unix socket under ~/.cache/stackvox/ (default file-mode 0600, i.e. user-only per the OS defaults for files in $HOME). Any process running as the same local user can send text to the daemon; there's no per-message authentication on the socket itself. That's the trust boundary: stackvox assumes anything running as your UID is allowed to speak on your behalf.

If you're exposing stackvox through a different surface (HTTP server, shared system service, container), authentication and rate-limiting are your responsibility at that layer.

Model weights (kokoro-v1.0.onnx, ~340 MB) and voices are downloaded from the kokoro-onnx GitHub release assets on first use and cached under ~/.cache/stackvox/. If you operate in a restricted environment, pre-seed that directory offline.
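
Before going offline, a short check like the one below can confirm that the cache is seeded. It is a sketch with assumptions: the helper names are hypothetical, and it assumes the cached files keep the release asset names kokoro-v1.0.onnx and voices-v1.0.bin (the latter is named in the license section) directly under the cache directory.

```python
import os
from pathlib import Path

# Asset names from this README; the on-disk layout is an assumption.
REQUIRED_FILES = ("kokoro-v1.0.onnx", "voices-v1.0.bin")


def cache_dir(env=None):
    """Resolve the cache directory, honouring STACKVOX_CACHE_DIR."""
    env = os.environ if env is None else env
    override = env.get("STACKVOX_CACHE_DIR")
    return Path(override) if override else Path.home() / ".cache" / "stackvox"


def missing_files(directory):
    """List required files that are not yet present in the directory."""
    return [name for name in REQUIRED_FILES
            if not (Path(directory) / name).is_file()]
```

An empty list from missing_files(cache_dir()) suggests first use should not need the network.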

Security issues themselves should not be filed as public GitHub issues; see SECURITY.md for the disclosure process.

How does this compare to other TTS?

stackvox is a fairly opinionated narrow slice of the TTS space. Here's where it sits next to the obvious neighbours:

| Tool | Offline? | Quality | Latency (typical) | License | Best for |
| --- | --- | --- | --- | --- | --- |
| stackvox (Kokoro-82M) | ✅ | High (24kHz, 50+ voices, 9 languages) | ~300ms in-process · ~13ms via daemon helper | Apache 2.0 | Local apps, shell hooks, anything that wants natural voice without the cloud |
| macOS say | ✅ | OK | ~50ms | macOS only | macOS-only scripts, "good enough" voice |
| espeak-ng | ✅ | Robotic | ~10ms | GPL-3.0 | Accessibility, screen readers, embedded |
| Piper | ✅ | High | ~100ms | MIT | Similar use-case to stackvox; ONNX-based, more voices in some languages |
| Coqui TTS | ✅ | Very high (research models) | seconds | MPL-2.0 | Research, fine-tuning, voice cloning |
| OpenAI / ElevenLabs / etc. | ❌ | Highest | network-bound | Proprietary | Production apps that can pay per-call and accept network dependency |

Where stackvox tries to be different from Piper specifically: a resident daemon + bash helper path that gets you sub-15ms speech requests from shell scripts (CI hooks, terminal notifications, status announcements) without paying Python's startup cost on every call. That's basically the point: voice quality alone wouldn't be enough to switch off Piper, but the IPC story makes a difference for shell-driven workflows.

Pick stackvox if you want good voices, fully offline, with a fast shell-friendly API.

License & attributions

stackvox itself is licensed under the Apache License, Version 2.0; see LICENSE. Third-party attributions are collected in NOTICE; the summary below is informational.

Model. Speech is generated by Kokoro-82M (© hexgrad, Apache 2.0). The ONNX-converted weights (kokoro-v1.0.onnx) and voice pack (voices-v1.0.bin) are downloaded from the kokoro-onnx release assets on first use and cached under ~/.cache/stackvox/. stackvox does not modify or redistribute them.

Runtime dependencies. kokoro-onnx (MIT, © thewh1teagle), onnxruntime (MIT, © Microsoft), sounddevice (MIT, © Matthias Geier), soundfile (BSD-3, © Bastian Bechtold), numpy (BSD-3).

GPL note. kokoro-onnx pulls in phonemizer-fork as a transitive runtime dependency; it is licensed under GPL-3.0. stackvox does not bundle, modify, or statically link it; pip installs it alongside stackvox and the two communicate through phonemizer's published Python API at runtime. If you redistribute a combined work (e.g. a frozen binary, container image, or vendored wheel set) that includes phonemizer-fork, review GPL-3.0 obligations for that distribution.
