
Universal local runtime for STT and TTS models


Vox

Local runtime for speech-to-text and text-to-speech models. Pull a model, start the server, hit an API.

What it does

Vox manages STT and TTS models through a REST API. Models are downloaded from HuggingFace, and each model family (Whisper, Kokoro, Parakeet, etc.) is handled by a plugin adapter that is installed automatically on first pull.

The Docker images intentionally start without any models or adapter packages installed. Pulling a model installs the matching adapter on demand.

vox pull kokoro:v1.0       # downloads model + installs adapter
vox pull whisper:large-v3   # same for STT
vox serve                   # starts REST API on :11435

Install

pip install vox-runtime
# or
uv pip install vox-runtime

Usage

Server

vox serve --port 11435 --device auto

Pull a model

vox pull kokoro:v1.0
vox pull parakeet:tdt-0.6b-v3
vox list

Transcribe (STT)

# CLI
vox run parakeet:tdt-0.6b-v3 recording.wav
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.mp3

# API
curl -F file=@recording.wav http://localhost:11435/api/transcribe

# OpenAI-compatible
curl -F file=@recording.wav http://localhost:11435/v1/audio/transcriptions
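The same upload can be scripted from Python with nothing but the standard library. This is a hypothetical helper, not part of vox; it assumes only what the curl examples above show (a POST of one multipart field named `file`):

```python
# Stdlib sketch of the multipart upload to /api/transcribe. The field
# name "file" matches the curl examples; everything else is generic
# multipart/form-data encoding.
import urllib.request
import uuid

def build_multipart(field, filename, payload):
    """Encode a single file field as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

def transcribe(path, url="http://localhost:11435/api/transcribe"):
    """POST an audio file and return the server's raw response body."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", path, f.read())
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": ctype}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```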

Synthesize (TTS)

# CLI
vox run kokoro:v1.0 "Hello, how are you?" -o output.wav
vox stream-synthesize kokoro:v1.0 "Hello, how are you?" -o output.wav

# API
curl -X POST http://localhost:11435/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro:v1.0","input":"Hello, how are you?"}' \
  -o output.wav

# OpenAI-compatible
curl -X POST http://localhost:11435/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro:v1.0","input":"Hello"}' \
  -o output.wav

Search available models

vox search
vox search --type tts
vox search --type stt

Other commands

vox list          # downloaded models
vox ps            # loaded models
vox show kokoro:v1.0
vox rm kokoro:v1.0
vox voices kokoro:v1.0

Streaming APIs

Use the unary HTTP endpoints for short bounded requests.

Use the WebSocket APIs for:

  • long recordings
  • browser or pipeline streaming
  • live uploads where the client stays connected until the final result arrives

These streaming sessions are intentionally short-lived:

  • no job store
  • no durable result retention
  • disconnect cancels the session

Long-form STT over WebSocket

Endpoint:

ws://localhost:11435/v1/audio/transcriptions/stream

Protocol:

  1. Client sends a JSON config message.
  2. Client sends binary audio chunks.
  3. Client sends {"type":"end"}.
  4. Server emits progress events and one final done event with the full transcript.

Example config:

{
  "type": "config",
  "model": "parakeet:tdt-0.6b-v3",
  "input_format": "pcm16",
  "sample_rate": 16000,
  "language": "en",
  "word_timestamps": true,
  "chunk_ms": 30000,
  "overlap_ms": 1000
}

Server events:

{"type":"ready","model":"parakeet:tdt-0.6b-v3","input_format":"pcm16","sample_rate":16000}
{"type":"progress","uploaded_ms":60000,"processed_ms":30000,"chunks_completed":1}
{"type":"done","text":"full transcript","duration_ms":120000,"processing_ms":8420,"segments":[]}

Notes:

  • pcm16 is the simplest long-form transport. The CLI helper uses it by default.
  • wav, flac, mp3, ogg, and webm are also accepted as input_format, but each binary frame must be a self-contained decodable blob, such as a MediaRecorder chunk. Arbitrary byte slices of one compressed file are not supported.
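The client side of steps 1-3 can be sketched as a generator that yields the message sequence in order. The one-second frame size (16000 samples * 2 bytes = 32000 bytes at 16 kHz mono) is an illustrative choice, not a protocol requirement:

```python
# Sketch of the client message sequence for the streaming STT socket:
# one JSON config, then binary pcm16 frames, then an end marker.
import json

def stt_messages(pcm16, model="parakeet:tdt-0.6b-v3", sample_rate=16000):
    yield json.dumps({
        "type": "config",
        "model": model,
        "input_format": "pcm16",
        "sample_rate": sample_rate,
    })
    frame = sample_rate * 2              # one second of 16-bit mono audio
    for i in range(0, len(pcm16), frame):
        yield pcm16[i:i + frame]         # binary audio chunk
    yield json.dumps({"type": "end"})
```

With a WebSocket client library you would send each yielded item in order, then read JSON events until the final `done` arrives.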

Long-form TTS over WebSocket

Endpoint:

ws://localhost:11435/v1/audio/speech/stream

Protocol:

  1. Client sends a JSON config message.
  2. Client sends one or more {"type":"text","text":"..."} messages.
  3. Client sends {"type":"end"}.
  4. Server emits:
    • ready
    • audio_start
    • progress
    • binary audio chunks
    • final done

Example config:

{
  "type": "config",
  "model": "kokoro:v1.0",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "pcm16"
}

Server events:

{"type":"ready","model":"kokoro:v1.0","response_format":"pcm16"}
{"type":"audio_start","sample_rate":24000,"response_format":"pcm16"}
{"type":"progress","completed_chars":120,"total_chars":480,"chunks_completed":1,"chunks_total":4}
{"type":"done","response_format":"pcm16","audio_duration_ms":2450,"processing_ms":891}

Binary frames between audio_start and done carry the synthesized audio payload. pcm16 and opus are currently supported for the raw stream; the CLI helper writes pcm16 into a WAV file.

Streaming CLI helpers

These commands sit on top of the WebSocket APIs:

vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.mp3
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.wav --json-output
vox stream-synthesize kokoro:v1.0 script.txt -o script.wav

vox stream-transcribe transcodes the local input to mono pcm16 on the client side and uploads it chunk by chunk over the WebSocket session. For compressed inputs this uses ffmpeg; install it if you want the helper to handle formats that soundfile cannot stream directly.
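One way such a client-side transcode can be done (a sketch, not the helper's actual code) is shelling out to ffmpeg and asking for raw 16-bit little-endian mono PCM on stdout:

```python
# ffmpeg decodes any supported input to raw pcm16 suitable for the
# streaming STT socket: -f s16le (raw 16-bit LE), -ac 1 (mono),
# -ar 16000 (16 kHz), "-" (write to stdout).
import subprocess

def ffmpeg_pcm16_cmd(path, sample_rate=16000):
    return ["ffmpeg", "-nostdin", "-i", path,
            "-f", "s16le", "-ac", "1", "-ar", str(sample_rate), "-"]

def transcode(path, sample_rate=16000):
    # Requires ffmpeg on PATH; raises CalledProcessError on failure.
    result = subprocess.run(ffmpeg_pcm16_cmd(path, sample_rate),
                            capture_output=True, check=True)
    return result.stdout
```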

Docker

# GPU (default)
docker compose up -d
vox pull kokoro:v1.0  # auto-installs adapter inside container

# CPU
docker compose --profile cpu up -d

Models and dynamically installed adapters persist in a Docker volume across container restarts. No image rebuild needed to add new models.

Spark ONNX GPU build

The default GPU multi-arch image is generic:

  • amd64 uses onnxruntime-gpu
  • arm64 uses CPU onnxruntime

# Local image
make build-local

# Multi-arch publish build
make build

Spark image

The default image stays generic. If you want a Spark-specific arm64 image built against an NVIDIA-provided ONNX Runtime wheel, use the dedicated Spark build:

# Local Spark build
make build-local-spark

# Published Spark build
make build-spark

Notes:

  • build-spark is linux/arm64 only.
  • By default, Dockerfile.spark uses the tested cp312 linux_aarch64 NVIDIA Jetson AI Lab wheel:
    • onnxruntime_gpu-1.23.0-cp312-cp312-linux_aarch64.whl
  • You can still override it with:
    • SPARK_ORT_WHEEL=/path/or/url/to/wheel
    • or SPARK_ORT_INDEX_URL / SPARK_ORT_EXTRA_INDEX_URL
  • The generic make build path is unchanged and still produces the normal multi-arch image.

Available models

| Model | Type | Description |
| --- | --- | --- |
| parakeet:tdt-0.6b | STT | NVIDIA Parakeet TDT 0.6B |
| parakeet:tdt-0.6b-v3 | STT | Parakeet TDT 0.6B v3, 25 languages |
| whisper:large-v3 | STT | OpenAI Whisper Large V3 via CTranslate2 |
| whisper:large-v3-turbo | STT | Whisper Large V3 Turbo |
| whisper:base.en | STT | Whisper Base English |
| kokoro:v1.0 | TTS | Kokoro 82M ONNX, preset voices |
| piper:en-us-lessac-medium | TTS | Piper English US Lessac |
| fish-speech:v1.4 | TTS | Fish Speech 1.4, multilingual, voice cloning |
| orpheus:3b | TTS | Orpheus 3B, emotional speech |
| dia:1.6b | TTS | Dia 1.6B, multi-speaker dialogue |
| sesame:csm-1b | TTS | Sesame CSM 1B, conversational speech |

More models at vox-registry. Add a model by submitting a PR with a JSON file.

API

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/transcribe | POST | Audio to text |
| /api/synthesize | POST | Text to audio |
| /api/pull | POST | Download a model |
| /api/list | GET | List downloaded models |
| /api/show | POST | Model details |
| /api/delete | DELETE | Remove a model |
| /api/ps | GET | Currently loaded models |
| /api/voices | GET | List voices for a TTS model |
| /api/health | GET | Health check |
| /v1/audio/transcriptions | POST | OpenAI-compatible STT |
| /v1/audio/speech | POST | OpenAI-compatible TTS |
| /v1/audio/transcriptions/stream | WS | Long-form streaming STT |
| /v1/audio/speech/stream | WS | Long-form streaming TTS |

Adding a model

Write an adapter package that implements STTAdapter or TTSAdapter:

from vox.core.adapter import TTSAdapter

class MyAdapter(TTSAdapter):
    def info(self): ...
    def load(self, model_path, device, **kwargs): ...
    def unload(self): ...
    @property
    def is_loaded(self): ...
    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        yield SynthesizeChunk(audio=audio_bytes, sample_rate=24000)

Register it via entry point:

[project.entry-points."vox.adapters"]
my-model = "my_package.adapter:MyAdapter"

Add a JSON file to vox-registry so vox pull can find it.
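To make the adapter shape concrete, here is a toy implementation written standalone (it does not import vox, so the sketch runs anywhere; a real adapter would subclass vox.core.adapter.TTSAdapter and use its SynthesizeChunk type instead of the stand-in dataclass). It "synthesizes" 100 ms of pcm16 silence per word:

```python
# Illustrative adapter with the same method surface as the skeleton
# above. SilenceAdapter and the SynthesizeChunk stand-in are hypothetical.
import asyncio
from dataclasses import dataclass

@dataclass
class SynthesizeChunk:          # stand-in for vox's chunk type
    audio: bytes
    sample_rate: int

class SilenceAdapter:
    def __init__(self):
        self._loaded = False

    def info(self):
        return {"name": "silence", "type": "tts"}

    def load(self, model_path, device, **kwargs):
        self._loaded = True     # a real adapter loads weights here

    def unload(self):
        self._loaded = False

    @property
    def is_loaded(self):
        return self._loaded

    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        # One 100 ms pcm16 silence chunk per word at 24 kHz:
        # 2400 samples * 2 bytes each.
        for _ in text.split():
            yield SynthesizeChunk(audio=b"\x00\x00" * 2400, sample_rate=24000)
```

`synthesize` is an async generator, matching the `yield SynthesizeChunk(...)` shape in the skeleton above, so the server can stream chunks as they are produced.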

Project structure

src/vox/
  core/          # types, adapter ABCs, scheduler, store, registry
  audio/         # codec, resampling, pipeline
  server/        # FastAPI routes
  cli.py         # Click CLI
adapters/
  vox-parakeet/  # NVIDIA Parakeet STT
  vox-kokoro/    # Kokoro TTS

License

Apache-2.0
