Universal local runtime for STT and TTS models
Vox
Vox is a local runtime for speech models.
It gives speech-to-text and text-to-speech models one operational surface: pull a model, serve one API, and run local speech workloads without hand-wiring each model family yourself. Vox is built around speech-native concerns like streaming audio, voice selection, backend-specific adapters, and one consistent interface across STT and TTS models.
Why Vox
- One runtime for both speech-to-text and text-to-speech
- One CLI and one API surface across many model families
- Pull-on-demand model and adapter installation
- Multiple backends behind the same runtime: ONNX, Torch, NeMo, CTranslate2, and vLLM
- Stored custom voices for clone-capable TTS models
- REST, WebSocket, gRPC, and OpenAI-compatible endpoints
- Local-first deployment with Docker images that start empty and install only what you use
Quickstart
pip install vox-runtime
vox pull kokoro-tts-onnx:v1.0
vox pull whisper-stt-ct2:large-v3
vox serve
Then hit the local API:
curl -X POST http://localhost:11435/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello from Vox"}' \
-o output.wav
`vox serve` also starts the gRPC server, which listens on :9090 by default.
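The same request can be made from the Python standard library. A minimal sketch that only builds the request object (the `build_speech_request` helper is illustrative, not part of Vox; it assumes the default port and the model pulled above):

```python
# Sketch: construct the POST request for the OpenAI-compatible
# /v1/audio/speech endpoint using only the stdlib. Nothing is sent here.
import json
import urllib.request

def build_speech_request(model, text, base_url="http://localhost:11435"):
    """Build (but do not send) the JSON POST for /v1/audio/speech."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speech_request("kokoro-tts-onnx:v1.0", "Hello from Vox")
# With `vox serve` running:
#   with urllib.request.urlopen(req) as resp, open("output.wav", "wb") as f:
#       f.write(resp.read())
```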
What it does
Vox manages STT and TTS models through a consistent runtime API. Models are downloaded from Hugging Face, and each model family is handled by an adapter that is installed automatically on first pull.
The Docker images intentionally start without any models or adapter packages installed. Pulling a model installs the matching adapter on demand.
vox pull kokoro-tts-onnx:v1.0
vox pull whisper-stt-ct2:large-v3
vox serve
Install
pip install vox-runtime
# or
uv pip install vox-runtime
Usage
Server
vox serve --port 11435 --device auto
Pull a model
vox pull kokoro-tts-onnx:v1.0
vox pull parakeet-stt-onnx:tdt-0.6b-v3
vox list
Transcribe (STT)
# CLI
vox run parakeet-stt-onnx:tdt-0.6b-v3 recording.wav
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.mp3
# OpenAI-compatible: thin response (just {"text": ...})
curl -F file=@recording.wav http://localhost:11435/v1/audio/transcriptions
# Rich response with segments, word timestamps, entities, topics
curl -F file=@recording.wav -F response_format=verbose_json \
http://localhost:11435/v1/audio/transcriptions
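`curl -F` sends a `multipart/form-data` upload; the equivalent body can be assembled with the standard library alone. A sketch (`build_multipart` is an illustrative helper, not a Vox API; POST the result to the endpoint above with the returned content type):

```python
# Sketch: build the multipart/form-data body that `curl -F file=@...`
# sends to /v1/audio/transcriptions. Stdlib only; nothing Vox-specific.
import uuid

def build_multipart(filename, audio, fields=None):
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (fields or {}).items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        "\r\n\r\n".encode() + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, ctype = build_multipart("recording.wav", b"RIFF...",
                              {"response_format": "verbose_json"})
```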
Synthesize (TTS)
# CLI
vox run kokoro-tts-onnx:v1.0 "Hello, how are you?" -o output.wav
vox stream-synthesize kokoro-tts-onnx:v1.0 "Hello, how are you?" -o output.wav
# OpenAI-compatible
curl -X POST http://localhost:11435/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello"}' \
-o output.wav
Create and use custom voices
Vox can store cloned voices and reuse them across HTTP and gRPC. This is only available for TTS adapters that declare voice-cloning support. Preset-only models still list their built-in voices, but they will reject stored cloned voices at synthesis time.
# create a cloned voice from a reference sample
curl -X POST http://localhost:11435/v1/audio/voices \
-F audio_sample=@sample.wav \
-F name="Roy" \
-F language=en \
-F reference_text="Hello there from my custom voice"
# list voices, including cloned voices for clone-capable models
curl "http://localhost:11435/v1/audio/voices?model=openvoice-tts-torch:v1"
# download the stored reference audio
curl -o reference.wav http://localhost:11435/v1/audio/voices/voice1234/reference
# synthesize with the stored voice id returned at creation time
curl -X POST http://localhost:11435/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"openvoice-tts-torch:v1","input":"Hello from Vox","voice":"voice1234"}' \
-o output.wav
# delete a stored cloned voice
curl -X DELETE http://localhost:11435/v1/audio/voices/voice1234
Search available models
vox search
vox search --type tts
vox search --type stt
Other commands
vox list # downloaded models
vox ps # loaded models
vox show kokoro-tts-onnx:v1.0
vox rm kokoro-tts-onnx:v1.0
vox voices kokoro-tts-onnx:v1.0
Streaming APIs
Use the unary HTTP endpoints for short bounded requests.
Use the WebSocket APIs for:
- long recordings
- browser or pipeline streaming
- live uploads where the client stays connected until the final result arrives
These streaming sessions are intentionally short-lived:
- no job store
- no durable result retention
- disconnect cancels the session
Long-form STT over WebSocket
Endpoint:
ws://localhost:11435/v1/audio/transcriptions/stream
Protocol:
- Client sends a JSON config message.
- Client sends binary audio chunks.
- Client sends `{"type":"end"}`.
- Server emits progress events and one final `done` event with the full transcript.
Example config:
{
"type": "config",
"model": "parakeet-stt-onnx:tdt-0.6b-v3",
"input_format": "pcm16",
"sample_rate": 16000,
"language": "en",
"word_timestamps": true,
"chunk_ms": 30000,
"overlap_ms": 1000
}
Server events:
{"type":"ready","model":"parakeet-stt-onnx:tdt-0.6b-v3","input_format":"pcm16","sample_rate":16000}
{"type":"progress","uploaded_ms":60000,"processed_ms":30000,"chunks_completed":1}
{"type":"done","text":"full transcript","duration_ms":120000,"processing_ms":8420,"segments":[]}
Notes:
`pcm16` is the simplest long-form transport, and the CLI helper uses it by default. `wav`, `flac`, `mp3`, `ogg`, and `webm` are also accepted as `input_format`, but each binary frame must be a self-contained decodable blob, such as a `MediaRecorder` chunk. Arbitrary byte slices of one compressed file are not supported.
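Putting the protocol together, the client-side message sequence can be sketched as a pure generator (`stt_messages` is illustrative, and the chunk size is an arbitrary choice; a real client would send each yielded item over the WebSocket in order):

```python
# Sketch: the client message sequence for /v1/audio/transcriptions/stream.
# JSON messages are yielded as str, raw pcm16 audio as bytes.
import json

def stt_messages(model, pcm16, sample_rate=16000, chunk_bytes=32000):
    # 1. JSON config message
    yield json.dumps({
        "type": "config",
        "model": model,
        "input_format": "pcm16",
        "sample_rate": sample_rate,
    })
    # 2. binary audio frames
    for i in range(0, len(pcm16), chunk_bytes):
        yield pcm16[i:i + chunk_bytes]
    # 3. end-of-stream marker
    yield json.dumps({"type": "end"})

# 48000 zero samples of 16-bit mono audio (3 s at 16 kHz)
msgs = list(stt_messages("parakeet-stt-onnx:tdt-0.6b-v3", b"\x00\x00" * 48000))
```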
Long-form TTS over WebSocket
Endpoint:
ws://localhost:11435/v1/audio/speech/stream
Protocol:
- Client sends a JSON config message.
- Client sends one or more `{"type":"text","text":"..."}` messages.
- Client sends `{"type":"end"}`.
- Server emits:
  - `ready`
  - `audio_start`
  - `progress`
  - binary audio chunks
  - a final `done`
Example config:
{
"type": "config",
"model": "kokoro-tts-onnx:v1.0",
"voice": "af_heart",
"speed": 1.0,
"response_format": "pcm16"
}
Server events:
{"type":"ready","model":"kokoro-tts-onnx:v1.0","response_format":"pcm16"}
{"type":"audio_start","sample_rate":24000,"response_format":"pcm16"}
{"type":"progress","completed_chars":120,"total_chars":480,"chunks_completed":1,"chunks_total":4}
{"type":"done","response_format":"pcm16","audio_duration_ms":2450,"processing_ms":891}
Binary frames between audio_start and done carry the synthesized audio payload. pcm16 and opus are currently supported for the raw stream; the CLI helper writes pcm16 into a WAV file.
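Turning the raw pcm16 frames into a playable file, as the CLI helper does, takes only the stdlib `wave` module. A sketch, using the sample rate reported by `audio_start` (`pcm16_to_wav` and `frames` are illustrative names):

```python
# Sketch: collect the binary pcm16 frames received between audio_start
# and done, then wrap them in a WAV container.
import io
import wave

def pcm16_to_wav(frames, sample_rate):
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"".join(frames))
    return buf.getvalue()

# 240 zero samples at the 24000 Hz rate from the audio_start example
wav_bytes = pcm16_to_wav([b"\x00\x00" * 240], 24000)
```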
Streaming CLI helpers
These commands sit on top of the WebSocket APIs:
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.mp3
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.wav --json-output
vox stream-synthesize kokoro-tts-onnx:v1.0 script.txt -o script.wav
vox stream-transcribe transcodes the local input to streamed mono pcm16 on the client side, then uploads chunk-by-chunk over the WebSocket session. For compressed inputs this uses ffmpeg; install it if you want the helper to handle formats that soundfile cannot stream directly.
Docker
# GPU (default)
docker compose up -d
vox pull kokoro-tts-onnx:v1.0 # auto-installs adapter inside container
# CPU
docker compose --profile cpu up -d
Models and dynamically installed adapters persist in a Docker volume across container restarts. No image rebuild needed to add new models.
Spark ONNX GPU build
The default GPU multi-arch image is generic:
- `amd64` uses `onnxruntime-gpu`
- `arm64` uses CPU `onnxruntime`
# Local image
make build-local
# Multi-arch publish build
make build
Spark image
The default image stays generic. If you want a Spark-specific arm64 image with an NVIDIA-provided ONNX Runtime source, use the dedicated Spark build:
# Local Spark build
make build-local-spark
# Published Spark build
make build-spark
Notes:
- `build-spark` is `linux/arm64` only.
- By default, `Dockerfile.spark` uses:
  - `nvidia/cuda:13.0.0-cudnn-runtime-ubuntu24.04`
  - `torch==2.9.0` / `torchaudio==2.9.0` from the NVIDIA Jetson AI Lab SBSA CUDA 13.0 index (https://pypi.jetson-ai-lab.io/sbsa/cu130/+simple)
  - the tested `cp312 linux_aarch64` NVIDIA Jetson AI Lab ONNX Runtime wheel: `onnxruntime_gpu-1.23.0-cp312-cp312-linux_aarch64.whl`
- `Dockerfile.spark` now refuses to publish if it would install a CPU-only `torch` build.
- Provide a CUDA-capable Torch source with either:
  - `SPARK_TORCH_WHEEL` and `SPARK_TORCHAUDIO_WHEEL`
  - or `SPARK_TORCH_INDEX_URL` / `SPARK_TORCH_EXTRA_INDEX_URL`
- You can still override the ONNX Runtime wheel with:
  - `SPARK_ORT_WHEEL=/path/or/url/to/wheel`
  - or `SPARK_ORT_INDEX_URL` / `SPARK_ORT_EXTRA_INDEX_URL`
- The generic `make build` path is unchanged and still produces the normal multi-arch image.
Representative models
| Model | Type | Description |
|---|---|---|
| `parakeet-stt-onnx:tdt-0.6b-v3` | STT | NVIDIA Parakeet TDT 0.6B v3 via ONNX |
| `parakeet-stt-nemo:tdt-0.6b-v3` | STT | NVIDIA Parakeet TDT 0.6B v3 via NeMo |
| `whisper-stt-ct2:large-v3` | STT | OpenAI Whisper Large V3 via CTranslate2 |
| `whisper-stt-ct2:base.en` | STT | Whisper Base English |
| `qwen3-stt-torch:0.6b` | STT | Qwen3 ASR 0.6B |
| `voxtral-stt-torch:mini-3b` | STT | Voxtral Mini 3B speech-to-text |
| `kokoro-tts-onnx:v1.0` | TTS | Kokoro 82M ONNX with preset voices |
| `kokoro-tts-torch:v1.0` | TTS | Kokoro native runtime backend |
| `qwen3-tts-torch:0.6b` | TTS | Qwen3 TTS 0.6B |
| `voxtral-tts-vllm:4b` | TTS | Voxtral 4B TTS via vLLM-Omni |
| `openvoice-tts-torch:v1` | TTS | OpenVoice voice-cloning backend |
| `piper-tts-onnx:en-us-lessac-medium` | TTS | Piper English US Lessac |
| `dia-tts-torch:1.6b` | TTS | Dia 1.6B multi-speaker dialogue |
| `sesame-tts-torch:csm-1b` | TTS | Sesame CSM 1B conversational speech |
More models at vox-registry. Add a model by submitting a PR with a JSON file.
API
All HTTP endpoints live under /v1/. STT/TTS endpoints are OpenAI-compatible by default; pass response_format=verbose_json on /v1/audio/transcriptions for the rich payload (segments, word timestamps, entities, topics).
| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/health` | GET | Health check |
| `/v1/models` | GET | List downloaded models |
| `/v1/models/{name}` | GET | Model details |
| `/v1/models/{name}` | DELETE | Remove a model |
| `/v1/models/pull` | POST | Download a model |
| `/v1/models/loaded` | GET | Currently loaded models |
| `/v1/audio/transcriptions` | POST | Transcribe audio (OpenAI-compatible; `verbose_json` for rich payload) |
| `/v1/audio/speech` | POST | Synthesize speech (OpenAI-compatible; supports `stream`) |
| `/v1/audio/voices` | GET | List voices for a TTS model |
| `/v1/audio/voices` | POST | Create a stored cloned voice |
| `/v1/audio/voices/{id}` | DELETE | Delete a stored cloned voice |
| `/v1/audio/voices/{id}/reference` | GET | Download the stored reference audio |
| `/v1/audio/transcriptions/stream` | WS | Long-form streaming STT |
| `/v1/audio/speech/stream` | WS | Long-form streaming TTS |
gRPC
vox serve starts the gRPC server automatically unless you disable it with --grpc-port 0.
- default gRPC port: `9090`
- health and model lifecycle: `HealthService.Health`, `HealthService.ListLoaded`, `ModelService.Pull`, `ModelService.List`, `ModelService.Show`, `ModelService.Delete`
- speech: `TranscriptionService.Transcribe`, `SynthesisService.Synthesize`, `SynthesisService.ListVoices`, `SynthesisService.CreateVoice`, `SynthesisService.DeleteVoice`, `StreamingService.StreamTranscribe`
The gRPC voice APIs use the same stored voice data as HTTP. Creating or deleting a cloned voice over one transport is immediately visible through the other.
Adding a model
Write an adapter package that implements STTAdapter or TTSAdapter:
from vox.core.adapter import TTSAdapter

class MyAdapter(TTSAdapter):
    def info(self): ...

    def load(self, model_path, device, **kwargs): ...

    def unload(self): ...

    @property
    def is_loaded(self): ...

    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        yield SynthesizeChunk(audio=audio_bytes, sample_rate=24000)
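The `synthesize` method is an async generator. A self-contained sketch of that contract, runnable without vox installed (`SynthesizeChunk` here is a hypothetical stand-in for the runtime's type, and a sine wave replaces a real model):

```python
# Sketch of the async-generator shape for TTSAdapter.synthesize.
# SynthesizeChunk below is a local stand-in, NOT the vox class.
import asyncio
import math
import struct
from dataclasses import dataclass

@dataclass
class SynthesizeChunk:
    audio: bytes        # raw pcm16 payload
    sample_rate: int

async def synthesize(text, *, voice=None, speed=1.0, **kwargs):
    sample_rate = 24000
    for _ in text.split("."):               # one chunk per "sentence"
        n = int(0.1 * sample_rate / speed)  # 100 ms of audio
        samples = (int(32767 * 0.2 * math.sin(2 * math.pi * 440 * t / sample_rate))
                   for t in range(n))
        audio = b"".join(struct.pack("<h", s) for s in samples)
        yield SynthesizeChunk(audio=audio, sample_rate=sample_rate)

async def collect():
    return [c async for c in synthesize("Hello. World.")]

chunks = asyncio.run(collect())
```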
Register it via entry point:
[project.entry-points."vox.adapters"]
my-model = "my_package.adapter:MyAdapter"
Add a JSON file to vox-registry so vox pull can find it.
Project structure
src/vox/
core/ # types, adapter ABCs, scheduler, store, registry
audio/ # codec, resampling, pipeline
server/ # FastAPI routes
cli.py # Click CLI
adapters/
vox-parakeet/ # NVIDIA Parakeet STT
vox-kokoro/ # Kokoro TTS
License
Apache-2.0
File details

Details for the file vox_runtime-0.2.12.tar.gz:
- Size: 356.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | ce3243d85319d19734081dc884d8d2b75174222d579f6a3182dbb40a2b154a66 |
| MD5 | 4b92cd9012105107decef7024d26f4e9 |
| BLAKE2b-256 | b36e9e5c30cd06dfe29afe1c0bb989c7dacb14f36acf0134baf886f9cc9aea1f |

Details for the file vox_runtime-0.2.12-py3-none-any.whl:
- Size: 114.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | 62a7467f74d6f8f268fe65997f46fc2ec5bb46a7915eb65ce8b7851b1811c4ed |
| MD5 | 7221f95d696effa632494c4f67b1c9cd |
| BLAKE2b-256 | 2ba511753df1d92712720c0fdefbdb42402a985a795de4721dab8887e8cb3682 |