Universal local runtime for STT and TTS models
Vox
Vox is a local runtime for speech models.
It gives speech-to-text and text-to-speech models one operational surface: pull a model, serve one API, and run local speech workloads without hand-wiring each model family yourself. Vox is built around speech-native concerns like streaming audio, voice selection, backend-specific adapters, and one consistent interface across STT and TTS models.
Why Vox
- One runtime for both speech-to-text and text-to-speech
- One CLI and one API surface across many model families
- Pull-on-demand model and adapter installation
- Multiple backends behind the same runtime: ONNX, Torch, NeMo, CTranslate2, and vLLM
- Stored custom voices for clone-capable TTS models
- REST, WebSocket, gRPC, and OpenAI-compatible endpoints
- Local-first deployment with Docker images that start empty and install only what you use
Quickstart
pip install vox-runtime
vox pull kokoro-tts-onnx:v1.0
vox pull whisper-stt-ct2:large-v3
vox serve
Then hit the local API:
curl -X POST http://localhost:11435/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello from Vox"}' \
-o output.wav
`vox serve` also starts the gRPC server, which listens on :9090 by default.
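The same request can be made from the Python standard library. A minimal sketch that only builds the request object (the `build_speech_request` helper is illustrative, not part of Vox; it assumes the default port and the model pulled above):

```python
# Sketch: construct the POST request for the OpenAI-compatible
# /v1/audio/speech endpoint using only the stdlib. Nothing is sent here.
import json
import urllib.request

def build_speech_request(model, text, base_url="http://localhost:11435"):
    """Build (but do not send) the JSON POST for /v1/audio/speech."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speech_request("kokoro-tts-onnx:v1.0", "Hello from Vox")
# With `vox serve` running:
#   with urllib.request.urlopen(req) as resp, open("output.wav", "wb") as f:
#       f.write(resp.read())
```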
What it does
Vox manages STT and TTS models through a consistent runtime API. Models are downloaded from Hugging Face, and each model family is handled by an adapter that is installed automatically on first pull.
The Docker images intentionally start without any models or adapter packages installed. Pulling a model installs the matching adapter on demand.
vox pull kokoro-tts-onnx:v1.0
vox pull whisper-stt-ct2:large-v3
vox serve
Install
pip install vox-runtime
# or
uv pip install vox-runtime
Usage
Server
vox serve --port 11435 --device auto
Pull a model
vox pull kokoro-tts-onnx:v1.0
vox pull parakeet-stt-onnx:tdt-0.6b-v3
vox list
Transcribe (STT)
# CLI
vox run parakeet-stt-onnx:tdt-0.6b-v3 recording.wav
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.mp3
# OpenAI-compatible: thin response (just {"text": ...})
curl -F file=@recording.wav http://localhost:11435/v1/audio/transcriptions
# Rich response with segments, word timestamps, entities, topics
curl -F file=@recording.wav -F response_format=verbose_json \
http://localhost:11435/v1/audio/transcriptions
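`curl -F` sends a `multipart/form-data` upload; the equivalent body can be assembled with the standard library alone. A sketch (`build_multipart` is an illustrative helper, not a Vox API; POST the result to the endpoint above with the returned content type):

```python
# Sketch: build the multipart/form-data body that `curl -F file=@...`
# sends to /v1/audio/transcriptions. Stdlib only; nothing Vox-specific.
import uuid

def build_multipart(filename, audio, fields=None):
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (fields or {}).items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        "\r\n\r\n".encode() + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, ctype = build_multipart("recording.wav", b"RIFF...",
                              {"response_format": "verbose_json"})
```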
Synthesize (TTS)
# CLI
vox run kokoro-tts-onnx:v1.0 "Hello, how are you?" -o output.wav
vox stream-synthesize kokoro-tts-onnx:v1.0 "Hello, how are you?" -o output.wav
# OpenAI-compatible
curl -X POST http://localhost:11435/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"kokoro-tts-onnx:v1.0","input":"Hello"}' \
-o output.wav
Create and use custom voices
Vox can store cloned voices and reuse them across HTTP and gRPC. This is only available for TTS adapters that declare voice-cloning support. Preset-only models still list their built-in voices, but they will reject stored cloned voices at synthesis time.
# create a cloned voice from a reference sample
curl -X POST http://localhost:11435/v1/audio/voices \
-F audio_sample=@sample.wav \
-F name="Roy" \
-F language=en \
-F reference_text="Hello there from my custom voice"
# list voices, including cloned voices for clone-capable models
curl "http://localhost:11435/v1/audio/voices?model=openvoice-tts-torch:v1"
# download the stored reference audio
curl -o reference.wav http://localhost:11435/v1/audio/voices/voice1234/reference
# synthesize with the stored voice id returned at creation time
curl -X POST http://localhost:11435/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"openvoice-tts-torch:v1","input":"Hello from Vox","voice":"voice1234"}' \
-o output.wav
# delete a stored cloned voice
curl -X DELETE http://localhost:11435/v1/audio/voices/voice1234
Search available models
vox search
vox search --type tts
vox search --type stt
Other commands
vox list # downloaded models
vox ps # loaded models
vox show kokoro-tts-onnx:v1.0
vox rm kokoro-tts-onnx:v1.0
vox voices kokoro-tts-onnx:v1.0
Streaming APIs
Use the unary HTTP endpoints for short bounded requests.
Use the WebSocket APIs for:
- long recordings
- browser or pipeline streaming
- live uploads where the client stays connected until the final result arrives
These streaming sessions are intentionally short-lived:
- no job store
- no durable result retention
- disconnect cancels the session
Long-form STT over WebSocket
Endpoint:
ws://localhost:11435/v1/audio/transcriptions/stream
Protocol:
- Client sends a JSON config message.
- Client sends binary audio chunks.
- Client sends `{"type":"end"}`.
- Server emits progress events and one final `done` event with the full transcript.
Example config:
{
"type": "config",
"model": "parakeet-stt-onnx:tdt-0.6b-v3",
"input_format": "pcm16",
"sample_rate": 16000,
"language": "en",
"word_timestamps": true,
"chunk_ms": 30000,
"overlap_ms": 1000
}
Server events:
{"type":"ready","model":"parakeet-stt-onnx:tdt-0.6b-v3","input_format":"pcm16","sample_rate":16000}
{"type":"progress","uploaded_ms":60000,"processed_ms":30000,"chunks_completed":1}
{"type":"done","text":"full transcript","duration_ms":120000,"processing_ms":8420,"segments":[]}
Notes:
`pcm16` is the simplest long-form transport, and the CLI helper uses it by default. `wav`, `flac`, `mp3`, `ogg`, and `webm` are also accepted as `input_format`, but each binary frame must be a self-contained decodable blob, such as a `MediaRecorder` chunk. Arbitrary byte slices of one compressed file are not supported.
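Putting the protocol together, the client-side message sequence can be sketched as a pure generator (`stt_messages` is illustrative, and the chunk size is an arbitrary choice; a real client would send each yielded item over the WebSocket in order):

```python
# Sketch: the client message sequence for /v1/audio/transcriptions/stream.
# JSON messages are yielded as str, raw pcm16 audio as bytes.
import json

def stt_messages(model, pcm16, sample_rate=16000, chunk_bytes=32000):
    # 1. JSON config message
    yield json.dumps({
        "type": "config",
        "model": model,
        "input_format": "pcm16",
        "sample_rate": sample_rate,
    })
    # 2. binary audio frames
    for i in range(0, len(pcm16), chunk_bytes):
        yield pcm16[i:i + chunk_bytes]
    # 3. end-of-stream marker
    yield json.dumps({"type": "end"})

# 48000 zero samples of 16-bit mono audio (3 s at 16 kHz)
msgs = list(stt_messages("parakeet-stt-onnx:tdt-0.6b-v3", b"\x00\x00" * 48000))
```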
Long-form TTS over WebSocket
Endpoint:
ws://localhost:11435/v1/audio/speech/stream
Protocol:
- Client sends a JSON config message.
- Client sends one or more `{"type":"text","text":"..."}` messages.
- Client sends `{"type":"end"}`.
- Server emits:
  - `ready`
  - `audio_start`
  - `progress`
  - binary audio chunks
  - a final `done`
Example config:
{
"type": "config",
"model": "kokoro-tts-onnx:v1.0",
"voice": "af_heart",
"speed": 1.0,
"response_format": "pcm16"
}
Server events:
{"type":"ready","model":"kokoro-tts-onnx:v1.0","response_format":"pcm16"}
{"type":"audio_start","sample_rate":24000,"response_format":"pcm16"}
{"type":"progress","completed_chars":120,"total_chars":480,"chunks_completed":1,"chunks_total":4}
{"type":"done","response_format":"pcm16","audio_duration_ms":2450,"processing_ms":891}
Binary frames between audio_start and done carry the synthesized audio payload. pcm16 and opus are currently supported for the raw stream; the CLI helper writes pcm16 into a WAV file.
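Turning the raw pcm16 frames into a playable file, as the CLI helper does, takes only the stdlib `wave` module. A sketch, using the sample rate reported by `audio_start` (`pcm16_to_wav` and `frames` are illustrative names):

```python
# Sketch: collect the binary pcm16 frames received between audio_start
# and done, then wrap them in a WAV container.
import io
import wave

def pcm16_to_wav(frames, sample_rate):
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"".join(frames))
    return buf.getvalue()

# 240 zero samples at the 24000 Hz rate from the audio_start example
wav_bytes = pcm16_to_wav([b"\x00\x00" * 240], 24000)
```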
Streaming CLI helpers
These commands sit on top of the WebSocket APIs:
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.mp3
vox stream-transcribe parakeet-stt-onnx:tdt-0.6b-v3 meeting.wav --json-output
vox stream-synthesize kokoro-tts-onnx:v1.0 script.txt -o script.wav
vox stream-transcribe transcodes the local input to streamed mono pcm16 on the client side, then uploads chunk-by-chunk over the WebSocket session. For compressed inputs this uses ffmpeg; install it if you want the helper to handle formats that soundfile cannot stream directly.
Docker
# GPU (default)
docker compose up -d
vox pull kokoro-tts-onnx:v1.0 # auto-installs adapter inside container
# CPU
docker compose --profile cpu up -d
Models and dynamically installed adapters persist in a Docker volume across container restarts. No image rebuild needed to add new models.
Spark ONNX GPU build
The default GPU multi-arch image is generic:
- `amd64` uses `onnxruntime-gpu`
- `arm64` uses CPU `onnxruntime`
# Local image
make build-local
# Multi-arch publish build
make build
Spark image
The default image stays generic. If you want a Spark-specific arm64 image with an NVIDIA-provided ONNX Runtime source, use the dedicated Spark build:
# Local Spark build
make build-local-spark
# Published Spark build
make build-spark
Notes:
- `build-spark` is `linux/arm64` only.
- By default, `Dockerfile.spark` uses:
  - `nvidia/cuda:13.0.0-cudnn-runtime-ubuntu24.04`
  - `torch==2.9.0` / `torchaudio==2.9.0` from the NVIDIA Jetson AI Lab SBSA CUDA 13.0 index (https://pypi.jetson-ai-lab.io/sbsa/cu130/+simple)
  - the tested `cp312 linux_aarch64` NVIDIA Jetson AI Lab ONNX Runtime wheel: `onnxruntime_gpu-1.23.0-cp312-cp312-linux_aarch64.whl`
- `Dockerfile.spark` now refuses to publish if it would install a CPU-only `torch` build.
- Provide a CUDA-capable Torch source with either:
  - `SPARK_TORCH_WHEEL` and `SPARK_TORCHAUDIO_WHEEL`
  - or `SPARK_TORCH_INDEX_URL` / `SPARK_TORCH_EXTRA_INDEX_URL`
- You can still override the ONNX Runtime wheel with:
  - `SPARK_ORT_WHEEL=/path/or/url/to/wheel`
  - or `SPARK_ORT_INDEX_URL` / `SPARK_ORT_EXTRA_INDEX_URL`
- The generic `make build` path is unchanged and still produces the normal multi-arch image.
Representative models
| Model | Type | Description |
|---|---|---|
| `parakeet-stt-onnx:tdt-0.6b-v3` | STT | NVIDIA Parakeet TDT 0.6B v3 via ONNX |
| `parakeet-stt-nemo:tdt-0.6b-v3` | STT | NVIDIA Parakeet TDT 0.6B v3 via NeMo |
| `whisper-stt-ct2:large-v3` | STT | OpenAI Whisper Large V3 via CTranslate2 |
| `whisper-stt-ct2:base.en` | STT | Whisper Base English |
| `qwen3-stt-torch:0.6b` | STT | Qwen3 ASR 0.6B |
| `voxtral-stt-torch:mini-3b` | STT | Voxtral Mini 3B speech-to-text |
| `kokoro-tts-onnx:v1.0` | TTS | Kokoro 82M ONNX with preset voices |
| `kokoro-tts-torch:v1.0` | TTS | Kokoro native runtime backend |
| `qwen3-tts-torch:0.6b` | TTS | Qwen3 TTS 0.6B |
| `voxtral-tts-vllm:4b` | TTS | Voxtral 4B TTS via vLLM-Omni |
| `openvoice-tts-torch:v1` | TTS | OpenVoice voice-cloning backend |
| `piper-tts-onnx:en-us-lessac-medium` | TTS | Piper English US Lessac |
| `dia-tts-torch:1.6b` | TTS | Dia 1.6B multi-speaker dialogue |
| `sesame-tts-torch:csm-1b` | TTS | Sesame CSM 1B conversational speech |
More models at vox-registry. Add a model by submitting a PR with a JSON file.
API
All HTTP endpoints live under /v1/. STT/TTS endpoints are OpenAI-compatible by default; pass response_format=verbose_json on /v1/audio/transcriptions for the rich payload (segments, word timestamps, entities, topics).
| Endpoint | Method | Purpose |
|---|---|---|
| `/v1/health` | GET | Health check |
| `/v1/models` | GET | List downloaded models |
| `/v1/models/{name}` | GET | Model details |
| `/v1/models/{name}` | DELETE | Remove a model |
| `/v1/models/pull` | POST | Download a model |
| `/v1/models/loaded` | GET | Currently loaded models |
| `/v1/audio/transcriptions` | POST | Transcribe audio (OpenAI-compatible; `verbose_json` for rich payload) |
| `/v1/audio/speech` | POST | Synthesize speech (OpenAI-compatible; supports `stream`) |
| `/v1/audio/voices` | GET | List voices for a TTS model |
| `/v1/audio/voices` | POST | Create a stored cloned voice |
| `/v1/audio/voices/{id}` | DELETE | Delete a stored cloned voice |
| `/v1/audio/voices/{id}/reference` | GET | Download the stored reference audio |
| `/v1/audio/transcriptions/stream` | WS | Long-form streaming STT |
| `/v1/audio/speech/stream` | WS | Long-form streaming TTS |
gRPC
vox serve starts the gRPC server automatically unless you disable it with --grpc-port 0.
- default gRPC port: `9090`
- health and model lifecycle: `HealthService.Health`, `HealthService.ListLoaded`, `ModelService.Pull`, `ModelService.List`, `ModelService.Show`, `ModelService.Delete`
- speech: `TranscriptionService.Transcribe`, `SynthesisService.Synthesize`, `SynthesisService.ListVoices`, `SynthesisService.CreateVoice`, `SynthesisService.DeleteVoice`, `StreamingService.StreamTranscribe`
The gRPC voice APIs use the same stored voice data as HTTP. Creating or deleting a cloned voice over one transport is immediately visible through the other.
Adding a model
Write an adapter package that implements STTAdapter or TTSAdapter:
from vox.core.adapter import TTSAdapter

class MyAdapter(TTSAdapter):
    def info(self): ...

    def load(self, model_path, device, **kwargs): ...

    def unload(self): ...

    @property
    def is_loaded(self): ...

    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        yield SynthesizeChunk(audio=audio_bytes, sample_rate=24000)
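The `synthesize` method is an async generator. A self-contained sketch of that contract, runnable without vox installed (`SynthesizeChunk` here is a hypothetical stand-in for the runtime's type, and a sine wave replaces a real model):

```python
# Sketch of the async-generator shape for TTSAdapter.synthesize.
# SynthesizeChunk below is a local stand-in, NOT the vox class.
import asyncio
import math
import struct
from dataclasses import dataclass

@dataclass
class SynthesizeChunk:
    audio: bytes        # raw pcm16 payload
    sample_rate: int

async def synthesize(text, *, voice=None, speed=1.0, **kwargs):
    sample_rate = 24000
    for _ in text.split("."):               # one chunk per "sentence"
        n = int(0.1 * sample_rate / speed)  # 100 ms of audio
        samples = (int(32767 * 0.2 * math.sin(2 * math.pi * 440 * t / sample_rate))
                   for t in range(n))
        audio = b"".join(struct.pack("<h", s) for s in samples)
        yield SynthesizeChunk(audio=audio, sample_rate=sample_rate)

async def collect():
    return [c async for c in synthesize("Hello. World.")]

chunks = asyncio.run(collect())
```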
Register it via entry point:
[project.entry-points."vox.adapters"]
my-model = "my_package.adapter:MyAdapter"
Add a JSON file to vox-registry so vox pull can find it.
Project structure
src/vox/
core/ # types, adapter ABCs, scheduler, store, registry
audio/ # codec, resampling, pipeline
server/ # FastAPI routes
cli.py # Click CLI
adapters/
vox-parakeet/ # NVIDIA Parakeet STT
vox-kokoro/ # Kokoro TTS
License
Apache-2.0
File details

Details for the file vox_runtime-0.2.12.tar.gz:
- Size: 356.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | ce3243d85319d19734081dc884d8d2b75174222d579f6a3182dbb40a2b154a66 |
| MD5 | 4b92cd9012105107decef7024d26f4e9 |
| BLAKE2b-256 | b36e9e5c30cd06dfe29afe1c0bb989c7dacb14f36acf0134baf886f9cc9aea1f |

Details for the file vox_runtime-0.2.12-py3-none-any.whl:
- Size: 114.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.0

| Algorithm | Hash digest |
|---|---|
| SHA256 | 62a7467f74d6f8f268fe65997f46fc2ec5bb46a7915eb65ce8b7851b1811c4ed |
| MD5 | 7221f95d696effa632494c4f67b1c9cd |
| BLAKE2b-256 | 2ba511753df1d92712720c0fdefbdb42402a985a795de4721dab8887e8cb3682 |