
Universal local runtime for STT and TTS models


Vox

Vox is a local runtime for speech models.

It gives speech-to-text and text-to-speech models one operational surface: pull a model, serve one API, and run local speech workloads without hand-wiring each model family yourself. Vox is built around speech-native concerns like streaming audio, voice selection, backend-specific adapters, and one consistent interface across STT and TTS models.

Why Vox

  • One runtime for both speech-to-text and text-to-speech
  • One CLI and one API surface across many model families
  • Pull-on-demand model and adapter installation
  • Multiple backends behind the same runtime: ONNX, Torch, NeMo, CTranslate2, and vLLM
  • Stored custom voices for clone-capable TTS models
  • REST, WebSocket, gRPC, and OpenAI-compatible endpoints
  • Local-first deployment with Docker images that start empty and install only what you use

Quickstart

pip install vox-runtime

vox pull kokoro-tts-onnx:v1.0
vox pull whisper-stt-ct2:large-v3
vox serve

Then hit the local API:

curl -X POST http://localhost:11435/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello from Vox"}' \
  -o output.wav
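The same request can be made from Python with only the standard library. This is a minimal sketch mirroring the curl example above; the helper names (build_payload, synthesize) are illustrative, not part of Vox:

```python
import json
import urllib.request

def build_payload(model: str, text: str) -> bytes:
    """JSON body for /api/synthesize, matching the curl example above."""
    return json.dumps({"model": model, "input": text}).encode()

def synthesize(text: str, model: str = "kokoro-tts-onnx:v1.0",
               base: str = "http://localhost:11435") -> bytes:
    """POST to the unary synthesis endpoint and return the audio bytes."""
    req = urllib.request.Request(
        f"{base}/api/synthesize",
        data=build_payload(model, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Write the returned bytes to output.wav to get the same result as the curl call.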

vox serve also starts the gRPC server, which listens on :9090 by default.

What it does

Vox manages STT and TTS models through a consistent runtime API. Models are downloaded from Hugging Face, and each model family is handled by an adapter that is installed automatically on first pull.

The Docker images intentionally start without any models or adapter packages installed. Pulling a model installs the matching adapter on demand.

vox pull kokoro-tts-onnx:v1.0
vox pull whisper-stt-ct2:large-v3
vox serve

Install

pip install vox-runtime
# or
uv pip install vox-runtime

Usage

Server

vox serve --port 11435 --device auto

Pull a model

vox pull kokoro-tts-onnx:v1.0
vox pull parakeet-stt-onnx:tdt-0.6b-v3
vox list

Transcribe (STT)

# CLI
vox run parakeet-stt-onnx:tdt-0.6b-v3 recording.wav
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.mp3

# API
curl -F file=@recording.wav http://localhost:11435/api/transcribe

# OpenAI-compatible
curl -F file=@recording.wav http://localhost:11435/v1/audio/transcriptions

Synthesize (TTS)

# CLI
vox run kokoro-tts-onnx:v1.0 "Hello, how are you?" -o output.wav
vox stream-synthesize kokoro-tts-onnx:v1.0 "Hello, how are you?" -o output.wav

# API
curl -X POST http://localhost:11435/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello, how are you?"}' \
  -o output.wav

# OpenAI-compatible
curl -X POST http://localhost:11435/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello"}' \
  -o output.wav

Create and use custom voices

Vox can store cloned voices and reuse them across HTTP and gRPC. This is only available for TTS adapters that declare voice-cloning support. Preset-only models still list their built-in voices, but they will reject stored cloned voices at synthesis time.

# create a cloned voice from a reference sample
curl -X POST http://localhost:11435/v1/audio/voices \
  -F audio_sample=@sample.wav \
  -F name="Roy" \
  -F language=en \
  -F reference_text="Hello there from my custom voice"

# list voices, including cloned voices for clone-capable models
curl "http://localhost:11435/api/voices?model=openvoice-tts-torch:v1"

# synthesize with the stored voice id returned at creation time
curl -X POST http://localhost:11435/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"model":"openvoice-tts-torch:v1","input":"Hello from Vox","voice":"voice1234"}' \
  -o output.wav

# delete a stored cloned voice
curl -X DELETE http://localhost:11435/v1/audio/voices/voice1234
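The list and delete operations map onto plain HTTP calls, so they need nothing beyond the standard library. A sketch under the assumption that the endpoints behave exactly as the curl examples above (the response schema is not documented here, so the JSON is returned as-is):

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:11435"

def voices_url(model: str, base: str = BASE) -> str:
    """URL for listing a model's voices, matching the curl example above."""
    return f"{base}/api/voices?" + urllib.parse.urlencode({"model": model})

def list_voices(model: str, base: str = BASE) -> dict:
    """GET the voice listing; inspect the returned JSON for the exact shape."""
    with urllib.request.urlopen(voices_url(model, base)) as r:
        return json.load(r)

def delete_voice(voice_id: str, base: str = BASE) -> None:
    """DELETE a stored cloned voice by the id returned at creation time."""
    req = urllib.request.Request(f"{base}/v1/audio/voices/{voice_id}",
                                 method="DELETE")
    urllib.request.urlopen(req).close()
```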

Search available models

vox search
vox search --type tts
vox search --type stt

Other commands

vox list          # downloaded models
vox ps            # loaded models
vox show kokoro-tts-onnx:v1.0
vox rm kokoro-tts-onnx:v1.0
vox voices kokoro-tts-onnx:v1.0

Streaming APIs

Use the unary HTTP endpoints for short bounded requests.

Use the WebSocket APIs for:

  • long recordings
  • browser or pipeline streaming
  • live uploads where the client stays connected until the final result arrives

These streaming sessions are intentionally short-lived:

  • no job store
  • no durable result retention
  • disconnect cancels the session

Long-form STT over WebSocket

Endpoint:

ws://localhost:11435/v1/audio/transcriptions/stream

Protocol:

  1. Client sends a JSON config message.
  2. Client sends binary audio chunks.
  3. Client sends {"type":"end"}.
  4. Server emits progress events and one final done event with the full transcript.

Example config:

{
  "type": "config",
  "model": "parakeet-stt-onnx:tdt-0.6b-v3",
  "input_format": "pcm16",
  "sample_rate": 16000,
  "language": "en",
  "word_timestamps": true,
  "chunk_ms": 30000,
  "overlap_ms": 1000
}

Server events:

{"type":"ready","model":"parakeet-stt-onnx:tdt-0.6b-v3","input_format":"pcm16","sample_rate":16000}
{"type":"progress","uploaded_ms":60000,"processed_ms":30000,"chunks_completed":1}
{"type":"done","text":"full transcript","duration_ms":120000,"processing_ms":8420,"segments":[]}

Notes:

  • pcm16 is the simplest long-form transport. The CLI helper uses it by default.
  • wav, flac, mp3, ogg, and webm are also accepted as input_format, but each binary frame must be a self-contained decodable blob, such as a MediaRecorder chunk. Arbitrary byte slices of one compressed file are not supported.
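A minimal client for the four-step protocol above can be sketched as follows. It assumes the third-party websockets package and raw mono pcm16 input; the function names and the 1-second upload frame size are illustrative choices, independent of the server-side chunk_ms setting:

```python
import asyncio
import json

URL = "ws://localhost:11435/v1/audio/transcriptions/stream"

def pcm16_chunks(raw: bytes, sample_rate: int, chunk_ms: int) -> list[bytes]:
    """Split raw mono pcm16 into chunk_ms-sized frames (2 bytes per sample)."""
    step = sample_rate * 2 * chunk_ms // 1000
    return [raw[i:i + step] for i in range(0, len(raw), step)]

async def transcribe_stream(raw_pcm16: bytes, model: str,
                            sample_rate: int = 16000) -> str:
    import websockets  # third-party: pip install websockets
    async with websockets.connect(URL) as ws:
        # 1. JSON config message
        await ws.send(json.dumps({
            "type": "config", "model": model,
            "input_format": "pcm16", "sample_rate": sample_rate,
        }))
        # 2. binary audio chunks (1 s per upload frame here)
        for chunk in pcm16_chunks(raw_pcm16, sample_rate, chunk_ms=1000):
            await ws.send(chunk)
        # 3. end marker
        await ws.send(json.dumps({"type": "end"}))
        # 4. consume ready/progress events until the final done event
        async for msg in ws:
            event = json.loads(msg)
            if event["type"] == "done":
                return event["text"]
```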

Long-form TTS over WebSocket

Endpoint:

ws://localhost:11435/v1/audio/speech/stream

Protocol:

  1. Client sends a JSON config message.
  2. Client sends one or more {"type":"text","text":"..."} messages.
  3. Client sends {"type":"end"}.
  4. Server emits:
    • ready
    • audio_start
    • progress
    • binary audio chunks
    • final done

Example config:

{
  "type": "config",
  "model": "kokoro-tts-onnx:v1.0",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "pcm16"
}

Server events:

{"type":"ready","model":"kokoro-tts-onnx:v1.0","response_format":"pcm16"}
{"type":"audio_start","sample_rate":24000,"response_format":"pcm16"}
{"type":"progress","completed_chars":120,"total_chars":480,"chunks_completed":1,"chunks_total":4}
{"type":"done","response_format":"pcm16","audio_duration_ms":2450,"processing_ms":891}

Binary frames between audio_start and done carry the synthesized audio payload. pcm16 and opus are currently supported for the raw stream; the CLI helper writes pcm16 into a WAV file.
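The pcm16-to-WAV step the CLI helper performs can be reproduced with the standard wave module. A sketch assuming a mono stream (the sample rate comes from the audio_start event; the function name is illustrative):

```python
import io
import wave

def pcm16_to_wav(frames: list[bytes], sample_rate: int) -> bytes:
    """Wrap raw pcm16 frames (the binary payload between audio_start
    and done) in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # assumes a mono stream
        w.setsampwidth(2)            # pcm16 = 2 bytes per sample
        w.setframerate(sample_rate)  # from the audio_start event
        for frame in frames:
            w.writeframes(frame)
    return buf.getvalue()
```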

Streaming CLI helpers

These commands sit on top of the WebSocket APIs:

vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.mp3
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.wav --json-output
vox stream-synthesize kokoro-tts-onnx:v1.0 script.txt -o script.wav

vox stream-transcribe transcodes the local input to streamed mono pcm16 on the client side, then uploads chunk-by-chunk over the WebSocket session. For compressed inputs this uses ffmpeg; install it if you want the helper to handle formats that soundfile cannot stream directly.

Docker

# GPU (default)
docker compose up -d
vox pull kokoro-tts-onnx:v1.0  # auto-installs adapter inside container

# CPU
docker compose --profile cpu up -d

Models and dynamically installed adapters persist in a Docker volume across container restarts. No image rebuild needed to add new models.

Spark ONNX GPU build

The default GPU multi-arch image is generic:

  • amd64 uses onnxruntime-gpu
  • arm64 uses CPU onnxruntime

# Local image
make build-local

# Multi-arch publish build
make build

Spark image

The default image stays generic. If you want a Spark-specific arm64 image with an NVIDIA-provided ONNX Runtime wheel, use the dedicated Spark build:

# Local Spark build
make build-local-spark

# Published Spark build
make build-spark

Notes:

  • build-spark is linux/arm64 only.
  • By default, Dockerfile.spark uses the tested cp312 linux_aarch64 NVIDIA Jetson AI Lab wheel:
    • onnxruntime_gpu-1.23.0-cp312-cp312-linux_aarch64.whl
  • You can still override it with:
    • SPARK_ORT_WHEEL=/path/or/url/to/wheel
    • or SPARK_ORT_INDEX_URL / SPARK_ORT_EXTRA_INDEX_URL
  • The generic make build path is unchanged and still produces the normal multi-arch image.

Representative models

Model                               Type  Description
parakeet-stt-onnx:tdt-0.6b-v3       STT   NVIDIA Parakeet TDT 0.6B v3 via ONNX
parakeet-stt-nemo:tdt-0.6b-v3       STT   NVIDIA Parakeet TDT 0.6B v3 via NeMo
whisper-stt-ct2:large-v3            STT   OpenAI Whisper Large V3 via CTranslate2
whisper-stt-ct2:base.en             STT   Whisper Base English
qwen3-stt-torch:0.6b                STT   Qwen3 ASR 0.6B
voxtral-stt-torch:mini-3b           STT   Voxtral Mini 3B speech-to-text
kokoro-tts-onnx:v1.0                TTS   Kokoro 82M ONNX with preset voices
kokoro-tts-torch:v1.0               TTS   Kokoro native runtime backend
qwen3-tts-torch:0.6b                TTS   Qwen3 TTS 0.6B
voxtral-tts-vllm:4b                 TTS   Voxtral 4B TTS via vLLM-Omni
openvoice-tts-torch:v1              TTS   OpenVoice voice-cloning backend
piper-tts-onnx:en-us-lessac-medium  TTS   Piper English US Lessac
dia-tts-torch:1.6b                  TTS   Dia 1.6B multi-speaker dialogue
sesame-tts-torch:csm-1b             TTS   Sesame CSM 1B conversational speech

More models at vox-registry. Add a model by submitting a PR with a JSON file.

API

Endpoint                         Method  Purpose
/api/transcribe                  POST    Audio to text
/api/synthesize                  POST    Text to audio
/api/voices                      POST    Create a stored cloned voice
/api/pull                        POST    Download a model
/api/list                        GET     List downloaded models
/api/show                        POST    Model details
/api/delete                      DELETE  Remove a model
/api/ps                          GET     Currently loaded models
/api/voices                      GET     List voices for a TTS model
/api/voices/{id}                 DELETE  Delete a stored cloned voice
/api/health                      GET     Health check
/v1/audio/transcriptions         POST    OpenAI-compatible STT
/v1/audio/speech                 POST    OpenAI-compatible TTS
/v1/audio/voices                 GET     OpenAI-style voice listing
/v1/audio/voices                 POST    OpenAI-style cloned voice creation
/v1/audio/voices/{id}            DELETE  OpenAI-style cloned voice deletion
/v1/audio/transcriptions/stream  WS      Long-form streaming STT
/v1/audio/speech/stream          WS      Long-form streaming TTS

gRPC

vox serve starts the gRPC server automatically unless you disable it with --grpc-port 0.

  • default gRPC port: 9090
  • health and model lifecycle:
    • HealthService.Health
    • HealthService.ListLoaded
    • ModelService.Pull
    • ModelService.List
    • ModelService.Show
    • ModelService.Delete
  • speech:
    • TranscriptionService.Transcribe
    • SynthesisService.Synthesize
    • SynthesisService.ListVoices
    • SynthesisService.CreateVoice
    • SynthesisService.DeleteVoice
    • StreamingService.StreamTranscribe

The gRPC voice APIs use the same stored voice data as HTTP. Creating or deleting a cloned voice over one transport is immediately visible through the other.

Adding a model

Write an adapter package that implements STTAdapter or TTSAdapter:

from vox.core.adapter import TTSAdapter

class MyAdapter(TTSAdapter):
    def info(self): ...                # model metadata (shown by vox show)
    def load(self, model_path, device, **kwargs): ...
    def unload(self): ...
    @property
    def is_loaded(self): ...
    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        # async generator: yield audio incrementally as it is produced
        yield SynthesizeChunk(audio=audio_bytes, sample_rate=24000)

Register it via entry point:

[project.entry-points."vox.adapters"]
my-model = "my_package.adapter:MyAdapter"

Add a JSON file to vox-registry so vox pull can find it.

Project structure

src/vox/
  core/          # types, adapter ABCs, scheduler, store, registry
  audio/         # codec, resampling, pipeline
  server/        # FastAPI routes
  cli.py         # Click CLI
adapters/
  vox-parakeet/  # NVIDIA Parakeet STT
  vox-kokoro/    # Kokoro TTS

License

Apache-2.0
