# Vox

Local runtime for speech-to-text and text-to-speech models. Pull a model, start the server, hit an API.
## What it does

Vox manages STT and TTS models through a REST API. Models are downloaded from HuggingFace, and each model family (Whisper, Kokoro, Parakeet, etc.) is handled by a plugin adapter that is installed automatically on first pull.

The Docker images intentionally start without any models or adapter packages installed. Pulling a model installs the matching adapter on demand.

```bash
vox pull kokoro:v1.0        # downloads model + installs adapter
vox pull whisper:large-v3   # same for STT
vox serve                   # starts REST API on :11435
```
## Install

```bash
pip install vox-runtime
# or
uv pip install vox-runtime
```
## Usage

### Server

```bash
vox serve --port 11435 --device auto
```
### Pull a model

```bash
vox pull kokoro:v1.0
vox pull parakeet:tdt-0.6b-v3
vox list
```
### Transcribe (STT)

```bash
# CLI
vox run parakeet:tdt-0.6b-v3 recording.wav
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.mp3

# API
curl -F file=@recording.wav http://localhost:11435/api/transcribe

# OpenAI-compatible
curl -F file=@recording.wav http://localhost:11435/v1/audio/transcriptions
```
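The OpenAI-compatible endpoint can also be called from Python. A minimal sketch using the third-party `requests` package, assuming the response follows the OpenAI-style `{"text": ...}` JSON shape (this server's exact response fields are an assumption):

```python
import requests  # third-party: pip install requests

API_URL = "http://localhost:11435/v1/audio/transcriptions"

def transcribe(audio_path: str, url: str = API_URL) -> str:
    """Upload an audio file as multipart form data and return the transcript."""
    with open(audio_path, "rb") as f:
        resp = requests.post(url, files={"file": f})
    resp.raise_for_status()
    # Assumed OpenAI-style response body: {"text": "..."}
    return resp.json()["text"]
```

`transcribe("recording.wav")` mirrors the `curl -F file=@recording.wav` call above.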
### Synthesize (TTS)

```bash
# CLI
vox run kokoro:v1.0 "Hello, how are you?" -o output.wav
vox stream-synthesize kokoro:v1.0 "Hello, how are you?" -o output.wav

# API
curl -X POST http://localhost:11435/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro:v1.0","input":"Hello, how are you?"}' \
  -o output.wav

# OpenAI-compatible
curl -X POST http://localhost:11435/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro:v1.0","input":"Hello"}' \
  -o output.wav
```
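The same synthesis request can be made from Python with only the standard library. A minimal sketch that builds the JSON body shown in the curl examples and writes the returned audio bytes to disk:

```python
import json
import urllib.request

API_URL = "http://localhost:11435/api/synthesize"

def synth_payload(model: str, text: str) -> bytes:
    """JSON body matching the curl examples above."""
    return json.dumps({"model": model, "input": text}).encode()

def synthesize(model: str, text: str, out_path: str, url: str = API_URL) -> None:
    """POST a synthesis request and save the returned audio to out_path."""
    req = urllib.request.Request(
        url,
        data=synth_payload(model, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

`synthesize("kokoro:v1.0", "Hello, how are you?", "output.wav")` is equivalent to the first curl call above.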
### Search available models

```bash
vox search
vox search --type tts
vox search --type stt
```
### Other commands

```bash
vox list                 # downloaded models
vox ps                   # loaded models
vox show kokoro:v1.0
vox rm kokoro:v1.0
vox voices kokoro:v1.0
```
## Streaming APIs

Use the unary HTTP endpoints for short, bounded requests. Use the WebSocket APIs for:

- long recordings
- browser or pipeline streaming
- live uploads where the client stays connected until the final result arrives

These streaming sessions are intentionally short-lived:

- no job store
- no durable result retention
- disconnect cancels the session
### Long-form STT over WebSocket

Endpoint:

```
ws://localhost:11435/v1/audio/transcriptions/stream
```

Protocol:

- Client sends a JSON config message.
- Client sends binary audio chunks.
- Client sends `{"type":"end"}`.
- Server emits progress events and one final `done` event with the full transcript.
Example config:

```json
{
  "type": "config",
  "model": "parakeet:tdt-0.6b-v3",
  "input_format": "pcm16",
  "sample_rate": 16000,
  "language": "en",
  "word_timestamps": true,
  "chunk_ms": 30000,
  "overlap_ms": 1000
}
```
Server events:

```json
{"type":"ready","model":"parakeet:tdt-0.6b-v3","input_format":"pcm16","sample_rate":16000}
{"type":"progress","uploaded_ms":60000,"processed_ms":30000,"chunks_completed":1}
{"type":"done","text":"full transcript","duration_ms":120000,"processing_ms":8420,"segments":[]}
```
Notes:

`pcm16` is the simplest long-form transport; the CLI helper uses it by default. `wav`, `flac`, `mp3`, `ogg`, and `webm` are also accepted as `input_format`, but each binary frame must be a self-contained decodable blob, such as a `MediaRecorder` chunk. Arbitrary byte slices of one compressed file are not supported.
### Long-form TTS over WebSocket

Endpoint:

```
ws://localhost:11435/v1/audio/speech/stream
```

Protocol:

- Client sends a JSON config message.
- Client sends one or more `{"type":"text","text":"..."}` messages.
- Client sends `{"type":"end"}`.
- Server emits: `ready`, `audio_start`, `progress`, binary audio chunks, and a final `done`.
Example config:

```json
{
  "type": "config",
  "model": "kokoro:v1.0",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "pcm16"
}
```
Server events:

```json
{"type":"ready","model":"kokoro:v1.0","response_format":"pcm16"}
{"type":"audio_start","sample_rate":24000,"response_format":"pcm16"}
{"type":"progress","completed_chars":120,"total_chars":480,"chunks_completed":1,"chunks_total":4}
{"type":"done","response_format":"pcm16","audio_duration_ms":2450,"processing_ms":891}
```
Binary frames between `audio_start` and `done` carry the synthesized audio payload. `pcm16` and `opus` are currently supported for the raw stream; the CLI helper writes `pcm16` into a WAV file.
## Streaming CLI helpers

These commands sit on top of the WebSocket APIs:

```bash
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.mp3
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.wav --json-output
vox stream-synthesize kokoro:v1.0 script.txt -o script.wav
```

`vox stream-transcribe` transcodes the local input to streamed mono pcm16 on the client side, then uploads it chunk by chunk over the WebSocket session. For compressed inputs this uses ffmpeg; install it if you want the helper to handle formats that soundfile cannot stream directly.
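The client-side transcode step can be reproduced with a plain ffmpeg invocation. The flags vox actually passes are not documented here, but these are the standard ffmpeg options for decoding any input to raw mono pcm16:

```python
import subprocess

def ffmpeg_pcm16_cmd(input_path: str, sample_rate: int = 16000) -> list[str]:
    """ffmpeg command that decodes any input to raw mono pcm16 on stdout."""
    return [
        "ffmpeg", "-i", input_path,
        "-f", "s16le",              # raw signed 16-bit little-endian output
        "-acodec", "pcm_s16le",
        "-ac", "1",                 # downmix to mono
        "-ar", str(sample_rate),    # resample to the target rate
        "-",                        # write to stdout
    ]

def transcode_pcm16(input_path: str, sample_rate: int = 16000) -> bytes:
    """Run ffmpeg (must be on PATH) and return the raw pcm16 stream."""
    return subprocess.run(
        ffmpeg_pcm16_cmd(input_path, sample_rate),
        check=True, capture_output=True,
    ).stdout
```

The resulting bytes can be chunked and fed directly to the STT WebSocket endpoint.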
## Docker

```bash
# GPU (default)
docker compose up -d
vox pull kokoro:v1.0   # auto-installs adapter inside container

# CPU
docker compose --profile cpu up -d
```

Models and dynamically installed adapters persist in a Docker volume across container restarts. No image rebuild is needed to add new models.
## Spark ONNX GPU build

The default GPU multi-arch image is generic:

- `amd64` uses `onnxruntime-gpu`
- `arm64` uses CPU `onnxruntime`

```bash
# Local image
make build-local

# Multi-arch publish build
make build
```
### Spark image

The default image stays generic. If you want a Spark-specific arm64 image with an NVIDIA-provided ONNX Runtime, use the dedicated Spark build:

```bash
# Local Spark build
make build-local-spark

# Published Spark build
make build-spark
```

Notes:

- `build-spark` is `linux/arm64` only.
- By default, `Dockerfile.spark` uses the tested `cp312 linux_aarch64` NVIDIA Jetson AI Lab wheel: `onnxruntime_gpu-1.23.0-cp312-cp312-linux_aarch64.whl`
- You can override it with `SPARK_ORT_WHEEL=/path/or/url/to/wheel`, or with `SPARK_ORT_INDEX_URL` / `SPARK_ORT_EXTRA_INDEX_URL`.
- The generic `make build` path is unchanged and still produces the normal multi-arch image.
## Available models

| Model | Type | Description |
|---|---|---|
| `parakeet:tdt-0.6b` | STT | NVIDIA Parakeet TDT 0.6B |
| `parakeet:tdt-0.6b-v3` | STT | Parakeet TDT 0.6B v3, 25 languages |
| `whisper:large-v3` | STT | OpenAI Whisper Large V3 via CTranslate2 |
| `whisper:large-v3-turbo` | STT | Whisper Large V3 Turbo |
| `whisper:base.en` | STT | Whisper Base English |
| `kokoro:v1.0` | TTS | Kokoro 82M ONNX, preset voices |
| `piper:en-us-lessac-medium` | TTS | Piper English US Lessac |
| `fish-speech:v1.4` | TTS | Fish Speech 1.4, multilingual, voice cloning |
| `orpheus:3b` | TTS | Orpheus 3B, emotional speech |
| `dia:1.6b` | TTS | Dia 1.6B, multi-speaker dialogue |
| `sesame:csm-1b` | TTS | Sesame CSM 1B, conversational speech |

More models at vox-registry. Add a model by submitting a PR with a JSON file.
## API

| Endpoint | Method | Purpose |
|---|---|---|
| `/api/transcribe` | POST | Audio to text |
| `/api/synthesize` | POST | Text to audio |
| `/api/pull` | POST | Download a model |
| `/api/list` | GET | List downloaded models |
| `/api/show` | POST | Model details |
| `/api/delete` | DELETE | Remove a model |
| `/api/ps` | GET | Currently loaded models |
| `/api/voices` | GET | List voices for a TTS model |
| `/api/health` | GET | Health check |
| `/v1/audio/transcriptions` | POST | OpenAI-compatible STT |
| `/v1/audio/speech` | POST | OpenAI-compatible TTS |
| `/v1/audio/transcriptions/stream` | WS | Long-form streaming STT |
| `/v1/audio/speech/stream` | WS | Long-form streaming TTS |
## Adding a model

Write an adapter package that implements `STTAdapter` or `TTSAdapter`:

```python
from vox.core.adapter import TTSAdapter

class MyAdapter(TTSAdapter):
    def info(self): ...

    def load(self, model_path, device, **kwargs): ...

    def unload(self): ...

    @property
    def is_loaded(self): ...

    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        # Yield audio incrementally so the server can stream it.
        yield SynthesizeChunk(audio=audio_bytes, sample_rate=24000)
```

Register it via entry point:

```toml
[project.entry-points."vox.adapters"]
my-model = "my_package.adapter:MyAdapter"
```
Add a JSON file to vox-registry so vox pull can find it.
## Project structure

```
src/vox/
  core/      # types, adapter ABCs, scheduler, store, registry
  audio/     # codec, resampling, pipeline
  server/    # FastAPI routes
  cli.py     # Click CLI
adapters/
  vox-parakeet/   # NVIDIA Parakeet STT
  vox-kokoro/     # Kokoro TTS
```
## License

Apache-2.0