Local text-to-speech daemon and CLI for speech notifications from scripts, builds, cron jobs, and ML training runs, powered by Kokoro TTS with GPU offload and espeak fallback.

speakd

A local text-to-speech daemon and CLI for speech notifications from long-running jobs.

speakd is a Python TTS daemon powered by Kokoro (a fast, high-quality local text-to-speech model). Shell scripts, machine-learning training runs, builds, cron jobs, CI hooks, and Python programs can send fire-and-forget speech notifications over a Unix socket; the caller returns in about a millisecond while the daemon queues, synthesizes, and plays each line in order. If anything in the audio stack fails, the line degrades to espeak instead of disappearing.

It was built to narrate machine-learning training runs on a single-GPU workstation, which shaped its defining feature: the TTS model dynamically offloads itself from the GPU when narration goes quiet, so it never holds VRAM hostage from the workload it is narrating.

$ pip install speakd
$ speak "training started"          # daemon auto-spawns on first use
$ speak --interrupt "loss is NaN"   # cuts off whatever is playing, speaks NOW
$ make 2>&1 | tail -1 | speak       # pipe-friendly

Use cases

  • Add voice alerts to machine-learning training runs when epochs finish, checkpoints save, loss becomes NaN, or jobs crash.
  • Turn shell scripts, Makefiles, cron jobs, and CI hooks into spoken status updates.
  • Use Kokoro TTS locally without blocking the process that asked for speech.
  • Share one text-to-speech queue across multiple processes so messages do not overlap.
  • Release GPU VRAM after narration bursts with dynamic CPU/GPU offload.

Why a daemon?

Calling a TTS library inline is the obvious approach and the wrong one for narration: it blocks the caller for seconds per line, loads a model per process, and lets overlapping lines talk over each other. speakd inverts this:

  • ~1 ms per call. The client writes one line to a Unix socket and returns. Narration can sit inside hot loops and signal handlers.
  • One model, one queue. A single daemon owns the model and serialises playback. Ten processes can narrate concurrently without crosstalk.
  • Failure-proof by design. Daemon down? The client spawns it. Spawn fails? espeak fallback. No audio at all? The caller still never raises.

Architecture

 any process, any language                 speakd daemon (one per socket, flock-enforced)
┌──────────────────────┐            ┌───────────────────────────────────────────────┐
│  speak "epoch done"  │──┐         │   asyncio Unix-socket server                  │
└──────────────────────┘  │         │        │                                      │
┌──────────────────────┐  │  UTF-8  │        ├── volume msg ──▶ live volume         │
│  Python: speak(...)  │──┼─ line ─▶│        ├── interrupt ───▶ drain queue +       │
└──────────────────────┘  │  over   │        │                  kill playback       │
┌──────────────────────┐  │  socket │        ▼                                      │
│  CI job, cron, hook  │──┘         │   FIFO queue ──▶ worker (thread executor)     │
└──────────────────────┘            │                     │                         │
          ▲                         │                     ▼                         │
          │ "OK\n" ack              │   Kokoro TTS ──▶ wav ──▶ mpv ──▶ 🔊           │
          │ (blocking mode only)    │   CPU ⇄ GPU                                   │
          └─────────────────────────│   (offloads after idle keepalive)             │
                                    │                                               │
                                    │   any failure ──▶ espeak fallback             │
                                    └───────────────────────────────────────────────┘
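The queue-and-interrupt path in the diagram can be sketched with asyncio. This is a minimal illustration under stated assumptions, not speakd's actual code: `NarrationQueue`, `submit`, `interrupt`, and the `play` callback are hypothetical names, and playback is simulated by a coroutine that can be cancelled.

```python
import asyncio


class NarrationQueue:
    """Hypothetical FIFO narration queue with an interrupt path."""

    def __init__(self, play):
        self._play = play            # async callable that "speaks" one line
        self._queue = asyncio.Queue()
        self._current = None         # task playing the current line, if any
        self._cut = False            # True when interrupt() cancelled playback

    def submit(self, text):
        """Normal message: append to the FIFO queue."""
        self._queue.put_nowait(text)

    def interrupt(self, text):
        """Urgent message: drop pending lines, stop playback, speak this next."""
        while not self._queue.empty():
            self._queue.get_nowait()
        if self._current is not None:
            self._cut = True
            self._current.cancel()
        self._queue.put_nowait(text)

    async def worker(self):
        """Play queued lines one at a time, in arrival order."""
        while True:
            text = await self._queue.get()
            self._current = asyncio.create_task(self._play(text))
            try:
                await self._current
            except asyncio.CancelledError:
                if not self._cut:
                    raise            # the worker itself is being shut down
            finally:
                self._current = None
                self._cut = False


async def demo():
    spoken = []

    async def fake_play(text):
        await asyncio.sleep(0.05)    # stands in for synthesis + playback
        spoken.append(text)

    queue = NarrationQueue(fake_play)
    task = asyncio.create_task(queue.worker())
    queue.submit("epoch 1 done")
    queue.submit("epoch 2 done")
    await asyncio.sleep(0.01)        # "epoch 1 done" is now mid-playback
    queue.interrupt("loss is NaN")
    await asyncio.sleep(0.2)
    task.cancel()
    return spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the first line is cut off mid-playback and the second is dropped from the queue, so only the interrupting line finishes.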

Features

  • Fire-and-forget socket design — newline-terminated UTF-8 over a Unix domain socket; trivially scriptable from any language. Optional OK ack for blocking callers.
  • Dynamic GPU offload with keepalive — the model loads on CPU, hops onto the GPU for narration bursts, and releases its VRAM (~3 GB) after a configurable idle period. If the GPU is full (another job grabbed it), that request simply synthesizes on CPU instead of failing.
  • Interrupt protocol — an urgent line drains the pending queue, kills in-flight playback mid-word, and speaks immediately.
  • Live volume control — one socket message, applies from the next line; no restart.
  • Singleton via flock(2) — clients can race to auto-spawn the daemon; exactly one wins, the rest exit cleanly. Stale sockets are detected and removed on startup.
  • Graceful fallback — Kokoro import error, synthesis failure, playback failure, or daemon unreachable: the line is spoken by espeak and the event is logged. Narration degrades; it never silently vanishes.
  • One TOML file, env-var overrides, zero-config defaults — works out of the box on CPU with no config file at all.
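The flock(2) singleton described above can be sketched as follows. This is an illustration of the general technique rather than speakd's own code: `acquire_singleton_lock` is a hypothetical helper, and the lock file path used in the demo is made up.

```python
import fcntl
import os
import tempfile


def acquire_singleton_lock(lock_path):
    """Return an fd holding an exclusive flock, or None if someone else holds it.

    The lock lasts as long as the returned descriptor stays open; the kernel
    releases it automatically when the holding process exits.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Hypothetical lock path for the demo only.
    path = os.path.join(tempfile.mkdtemp(), "speakd.sock.lock")
    winner = acquire_singleton_lock(path)
    loser = acquire_singleton_lock(path)
    print("first caller got the lock:", winner is not None)
    print("second caller was refused:", loser is None)
```

Because the lock is tied to an open file description and not to the file's existence, a leftover lock file from a dead process does not block a new daemon.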

Requirements

  • Linux or macOS (Unix sockets + flock), Python ≥ 3.10
  • mpv for playback (apt install mpv) — or any player, via config
  • espeak for the fallback voice (apt install espeak) — optional but recommended
  • A CUDA-capable GPU is optional; everything works on CPU

Install

pip install speakd

This installs the kokoro TTS package (which pulls in PyTorch) and two console commands: speakd (the daemon) and speak (the client).

To install from source:

git clone https://github.com/I-Alpha/speakd && cd speakd
pip install .

Quickstart

# 1. Just speak — the daemon auto-spawns on first use:
speak "hello from speakd"

# 2. Or run the daemon in the foreground to watch it work:
speakd --device cpu --voice af_heart

# 3. Script it:
speak --blocking "waits until this has been spoken"
speak --interrupt "queue drained, this plays immediately"
speak --volume 60 "quieter from now on"
echo "pipes work too" | speak

From Python:

from speakd import speak, set_volume

speak("checkpoint saved")                        # ~1 ms, non-blocking
speak("eval finished", blocking=True)            # wait until spoken
speak("loss is NaN — stopping", interrupt=True)  # jump the queue
set_volume(85)

See examples/ for runnable demos of narration, interrupts, and volume control.

Configuration

Defaults work with no config at all. To customise, copy config.example.toml to ~/.config/speakd/config.toml (or point $SPEAKD_CONFIG at any path). Environment variables override the file; CLI flags override both.

TOML key                    Env override      Default                           Meaning
tts.voice                   SPEAKD_VOICE      af_heart                          Kokoro voice id (af_*, am_*, bf_*, bm_*, ...)
tts.speed                   SPEAKD_SPEED      1.0                               Speech-rate multiplier
tts.lang_code               SPEAKD_LANG       a                                 Kokoro language code (a = US English, b = UK English)
device.policy               SPEAKD_DEVICE     auto                              auto (dynamic offload), cpu, or gpu
device.keepalive_seconds    SPEAKD_KEEPALIVE  180                               Idle seconds before GPU→CPU offload
daemon.socket_path          SPEAKD_SOCKET     $XDG_RUNTIME_DIR/speakd.sock      Unix socket path
daemon.socket_mode          -                 "600"                             Octal permissions on the socket file
daemon.log_file             SPEAKD_LOG_FILE   ~/.local/state/speakd/daemon.log  Log target for auto-spawned daemons
audio.volume                SPEAKD_VOLUME     100                               Playback volume 0–130 (mpv scale)
audio.max_playback_seconds  -                 120                               Kill a single line's playback after this many seconds
audio.player                -                 mpv template                      Player argv; {file} and {volume} are substituted
fallback.command            -                 espeak template                   Fallback argv; {text} is substituted; [] disables
client.connect_timeout      -                 0.5                               Socket connect/send timeout (s)
client.ack_timeout          -                 300.0                             --blocking wait for the spoken-ack (s)
client.spawn_wait           -                 4.0                               Wait for an auto-spawned daemon (s)

speakd --print-config shows the fully-resolved effective configuration.
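The precedence described above (defaults, then the config file, then environment variables, then CLI flags) can be sketched in a few lines. This is a hedged illustration, not speakd's loader: `resolve`, `DEFAULTS`, and `ENV_VARS` are hypothetical names, only three keys from the table are covered, and parsing the numeric variables as integers is an assumption.

```python
import os

# A subset of the documented defaults, keyed by TOML path.
DEFAULTS = {
    "tts.voice": "af_heart",
    "device.keepalive_seconds": 180,
    "audio.volume": 100,
}

# Documented env overrides, with the type each value is assumed to parse as.
ENV_VARS = {
    "tts.voice": ("SPEAKD_VOICE", str),
    "device.keepalive_seconds": ("SPEAKD_KEEPALIVE", int),
    "audio.volume": ("SPEAKD_VOLUME", int),
}


def resolve(file_values, cli_values, environ=os.environ):
    """Merge settings so that defaults < config file < environment < CLI flags."""
    merged = dict(DEFAULTS)
    merged.update(file_values)
    for key, (var, cast) in ENV_VARS.items():
        if var in environ:
            merged[key] = cast(environ[var])
    merged.update(cli_values)
    return merged


if __name__ == "__main__":
    # Voice ids here are example values only.
    cfg = resolve(
        file_values={"tts.voice": "bf_emma", "audio.volume": 80},
        cli_values={"audio.volume": 60},
        environ={"SPEAKD_VOICE": "am_adam", "SPEAKD_KEEPALIVE": "30"},
    )
    print(cfg)
```

In this example the environment wins over the file for the voice, and the CLI value wins over both for the volume.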

Wire protocol

One newline-terminated UTF-8 line per connection — easy to speak from any language without a client library:

Message    Bytes                      Effect
Speak      <text>\n                   Queue the line; daemon replies OK\n when spoken
Interrupt  \x01INTERRUPT\x01<text>\n  Drain queue, kill playback, speak now
Volume     \x02VOLUME\x02<int>\n      Set live volume (0–130)

# speak from raw shell, no client needed:
printf 'hello from netcat\n' | nc -U "$XDG_RUNTIME_DIR/speakd.sock"

The control markers are the ASCII SOH (0x01) and STX (0x02) control characters, which do not appear in ordinary text, so no escaping is needed.
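The same messages can be built by hand in Python using only the standard library. The sketch below follows the byte layouts in the table above; the function names are hypothetical, the timeout defaults mirror client.connect_timeout and client.ack_timeout, and the text is assumed to be a single line with no embedded newline.

```python
import socket

SOH, STX = b"\x01", b"\x02"


def encode_speak(text):
    """Plain line: queued behind anything already waiting."""
    return text.encode("utf-8") + b"\n"


def encode_interrupt(text):
    """Urgent line: SOH INTERRUPT SOH, then the text."""
    return SOH + b"INTERRUPT" + SOH + text.encode("utf-8") + b"\n"


def encode_volume(level):
    """Volume change: STX VOLUME STX, then an integer."""
    return STX + b"VOLUME" + STX + str(int(level)).encode("ascii") + b"\n"


def send_message(socket_path, payload, wait_for_ack=False,
                 connect_timeout=0.5, ack_timeout=300.0):
    """Send one message on a fresh connection; optionally wait for the ack."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(connect_timeout)
        sock.connect(socket_path)
        sock.sendall(payload)
        if not wait_for_ack:
            return None              # fire-and-forget: close without reading
        sock.settimeout(ack_timeout)
        return sock.recv(16)         # expected to be b"OK\n" once spoken
```

Unlike the bundled client, this sketch does not spawn the daemon or fall back to espeak, and it raises if the socket is unreachable. A caller that must never fail would wrap `send_message` in its own error handling.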

GPU offload in detail

The auto policy exists for machines where the GPU has a day job:

  1. The model loads on CPU at first request.
  2. Each synthesis tries to move it to the GPU first (a few hundred ms, then synthesis is much faster). If CUDA is busy or OOM, that line synthesizes on CPU — no error, just slower.
  3. After keepalive_seconds (default 180 s) without a request, an idle timer moves the model back to CPU and calls torch.cuda.empty_cache(), releasing the VRAM.

The effect: during an active narration burst the voice is snappy and GPU-accelerated; once a silent stretch outlasts the keepalive (three minutes by default), your training job has its VRAM back. All device moves are serialised with synthesis under one lock, so the model can never be moved mid-utterance.
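The three steps above can be sketched without PyTorch by wrapping any object that exposes `.to(device)` and `.synthesize(text)`. Everything here is an assumption made for illustration: `OffloadingModel` and `FakeModel` are hypothetical names, a `RuntimeError` stands in for a CUDA out-of-memory failure, and the real daemon's internals may differ.

```python
import threading


class OffloadingModel:
    """Illustrative wrapper: try the GPU per request, offload after idle time."""

    def __init__(self, model, keepalive_seconds=180.0):
        self._model = model
        self._keepalive = keepalive_seconds
        self._lock = threading.Lock()   # serialises device moves and synthesis
        self._generation = 0            # bumped on every request
        self._timer = None
        self.device = "cpu"

    def synthesize(self, text):
        with self._lock:
            self._generation += 1
            if self._timer is not None:
                self._timer.cancel()
            if self.device != "cuda":
                try:
                    self._model.to("cuda")
                    self.device = "cuda"
                except RuntimeError:    # e.g. GPU full: this line runs on CPU
                    self._model.to("cpu")
                    self.device = "cpu"
            audio = self._model.synthesize(text)
            self._timer = threading.Timer(
                self._keepalive, self._offload, args=(self._generation,))
            self._timer.daemon = True
            self._timer.start()
            return audio

    def _offload(self, generation):
        with self._lock:
            # Skip if a newer request arrived while this timer was pending.
            if generation != self._generation or self.device != "cuda":
                return
            self._model.to("cpu")
            self.device = "cpu"
            # With PyTorch, torch.cuda.empty_cache() would be called here.


class FakeModel:
    """Stand-in for a TTS model so the sketch runs without a GPU."""

    def __init__(self, gpu_available=True):
        self.gpu_available = gpu_available
        self.device = "cpu"

    def to(self, device):
        if device == "cuda" and not self.gpu_available:
            raise RuntimeError("CUDA out of memory")
        self.device = device
        return self

    def synthesize(self, text):
        return f"<{len(text)} chars on {self.device}>"
```

The generation counter guards against a timer that fires just as a new request arrives: the stale timer sees a newer generation and leaves the model on the GPU.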

Troubleshooting

Symptom                          Likely cause / fix
speak says fallback engine used  Daemon failed to start; check ~/.local/state/speakd/daemon.log. Most common cause: kokoro is not installed in the Python that spawned it (set SPEAKD_DAEMON_CMD="/path/to/python -m speakd.daemon").
No audio, no errors              Check that mpv is installed and can play a wav from your terminal. Change audio.player if you use a different player.
First line is slow               Cold start: model weights load on first request (a few seconds). Subsequent lines are fast.
Robotic voice instead of Kokoro  That is the espeak fallback working as designed; see the first row.
Two daemons after a crash        They cannot coexist: the flock singleton makes the second exit immediately, and stale sockets are cleaned on startup. flock is released when the holding process dies, so a leftover <socket>.lock is almost never the problem; delete it only if a machine crash has somehow left it locked.
daemon already running (pid N)   Working as intended: the running daemon serves all clients.
GPU memory not released          The model offloads after device.keepalive_seconds without requests; lower it, or run with --device cpu.

License

MIT © 2026 Ibrahim Alfa
