
Universal local runtime for STT and TTS models


Vox

Local runtime for speech-to-text and text-to-speech models. Pull a model, start the server, hit an API.

What it does

Vox manages STT and TTS models through a REST API. Models are downloaded from HuggingFace, and each model family (Whisper, Kokoro, Parakeet, etc.) is handled by a plugin adapter that is installed automatically on first pull.

The Docker images intentionally start without any models or adapter packages installed. Pulling a model installs the matching adapter on demand.

vox pull kokoro:v1.0       # downloads model + installs adapter
vox pull whisper:large-v3   # same for STT
vox serve                   # starts REST API on :11435

Install

pip install vox-runtime
# or
uv pip install vox-runtime

Usage

Server

vox serve --port 11435 --device auto

Pull a model

vox pull kokoro:v1.0
vox pull parakeet:tdt-0.6b-v3
vox list

Transcribe (STT)

# CLI
vox run parakeet:tdt-0.6b-v3 recording.wav
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.mp3

# API
curl -F file=@recording.wav http://localhost:11435/api/transcribe

# OpenAI-compatible
curl -F file=@recording.wav http://localhost:11435/v1/audio/transcriptions
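The same upload can be scripted from Python with nothing but the standard library. This is a hypothetical helper, not part of vox; it assumes only what the curl examples above show (a POST of one multipart field named `file`):

```python
# Stdlib sketch of the multipart upload to /api/transcribe. The field
# name "file" matches the curl examples; everything else is generic
# multipart/form-data encoding.
import urllib.request
import uuid

def build_multipart(field, filename, payload):
    """Encode a single file field as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"

def transcribe(path, url="http://localhost:11435/api/transcribe"):
    """POST an audio file and return the server's raw response body."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", path, f.read())
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": ctype}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```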

Synthesize (TTS)

# CLI
vox run kokoro:v1.0 "Hello, how are you?" -o output.wav
vox stream-synthesize kokoro:v1.0 "Hello, how are you?" -o output.wav

# API
curl -X POST http://localhost:11435/api/synthesize \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro:v1.0","input":"Hello, how are you?"}' \
  -o output.wav

# OpenAI-compatible
curl -X POST http://localhost:11435/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro:v1.0","input":"Hello"}' \
  -o output.wav

Search available models

vox search
vox search --type tts
vox search --type stt

Other commands

vox list          # downloaded models
vox ps            # loaded models
vox show kokoro:v1.0
vox rm kokoro:v1.0
vox voices kokoro:v1.0

Streaming APIs

Use the unary HTTP endpoints for short bounded requests.

Use the WebSocket APIs for:

  • long recordings
  • browser or pipeline streaming
  • live uploads where the client stays connected until the final result arrives

These streaming sessions are intentionally short-lived:

  • no job store
  • no durable result retention
  • disconnect cancels the session

Long-form STT over WebSocket

Endpoint:

ws://localhost:11435/v1/audio/transcriptions/stream

Protocol:

  1. Client sends a JSON config message.
  2. Client sends binary audio chunks.
  3. Client sends {"type":"end"}.
  4. Server emits progress events and one final done event with the full transcript.

Example config:

{
  "type": "config",
  "model": "parakeet:tdt-0.6b-v3",
  "input_format": "pcm16",
  "sample_rate": 16000,
  "language": "en",
  "word_timestamps": true,
  "chunk_ms": 30000,
  "overlap_ms": 1000
}

Server events:

{"type":"ready","model":"parakeet:tdt-0.6b-v3","input_format":"pcm16","sample_rate":16000}
{"type":"progress","uploaded_ms":60000,"processed_ms":30000,"chunks_completed":1}
{"type":"done","text":"full transcript","duration_ms":120000,"processing_ms":8420,"segments":[]}

Notes:

  • pcm16 is the simplest long-form transport. The CLI helper uses it by default.
  • wav, flac, mp3, ogg, and webm are also accepted as input_format, but each binary frame must be a self-contained decodable blob, such as a MediaRecorder chunk. Arbitrary byte slices of one compressed file are not supported.
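The client side of steps 1-3 can be sketched as a generator that yields the message sequence in order. The one-second frame size (16000 samples * 2 bytes = 32000 bytes at 16 kHz mono) is an illustrative choice, not a protocol requirement:

```python
# Sketch of the client message sequence for the streaming STT socket:
# one JSON config, then binary pcm16 frames, then an end marker.
import json

def stt_messages(pcm16, model="parakeet:tdt-0.6b-v3", sample_rate=16000):
    yield json.dumps({
        "type": "config",
        "model": model,
        "input_format": "pcm16",
        "sample_rate": sample_rate,
    })
    frame = sample_rate * 2              # one second of 16-bit mono audio
    for i in range(0, len(pcm16), frame):
        yield pcm16[i:i + frame]         # binary audio chunk
    yield json.dumps({"type": "end"})
```

With a WebSocket client library you would send each yielded item in order, then read JSON events until the final `done` arrives.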

Long-form TTS over WebSocket

Endpoint:

ws://localhost:11435/v1/audio/speech/stream

Protocol:

  1. Client sends a JSON config message.
  2. Client sends one or more {"type":"text","text":"..."} messages.
  3. Client sends {"type":"end"}.
  4. Server emits:
    • ready
    • audio_start
    • progress
    • binary audio chunks
    • final done

Example config:

{
  "type": "config",
  "model": "kokoro:v1.0",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "pcm16"
}

Server events:

{"type":"ready","model":"kokoro:v1.0","response_format":"pcm16"}
{"type":"audio_start","sample_rate":24000,"response_format":"pcm16"}
{"type":"progress","completed_chars":120,"total_chars":480,"chunks_completed":1,"chunks_total":4}
{"type":"done","response_format":"pcm16","audio_duration_ms":2450,"processing_ms":891}

Binary frames between audio_start and done carry the synthesized audio payload. pcm16 and opus are currently supported for the raw stream; the CLI helper writes pcm16 into a WAV file.

Streaming CLI helpers

These commands sit on top of the WebSocket APIs:

vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.mp3
vox stream-transcribe parakeet:tdt-0.6b-v3 meeting.wav --json-output
vox stream-synthesize kokoro:v1.0 script.txt -o script.wav

vox stream-transcribe transcodes the local input to mono pcm16 on the client side and uploads it chunk by chunk over the WebSocket session. For compressed inputs this uses ffmpeg; install it if you want the helper to handle formats that soundfile cannot stream directly.
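One way such a client-side transcode can be done (a sketch, not the helper's actual code) is shelling out to ffmpeg and asking for raw 16-bit little-endian mono PCM on stdout:

```python
# ffmpeg decodes any supported input to raw pcm16 suitable for the
# streaming STT socket: -f s16le (raw 16-bit LE), -ac 1 (mono),
# -ar 16000 (16 kHz), "-" (write to stdout).
import subprocess

def ffmpeg_pcm16_cmd(path, sample_rate=16000):
    return ["ffmpeg", "-nostdin", "-i", path,
            "-f", "s16le", "-ac", "1", "-ar", str(sample_rate), "-"]

def transcode(path, sample_rate=16000):
    # Requires ffmpeg on PATH; raises CalledProcessError on failure.
    result = subprocess.run(ffmpeg_pcm16_cmd(path, sample_rate),
                            capture_output=True, check=True)
    return result.stdout
```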

Docker

# GPU (default)
docker compose up -d
vox pull kokoro:v1.0  # auto-installs adapter inside container

# CPU
docker compose --profile cpu up -d

Models and dynamically installed adapters persist in a Docker volume across container restarts. No image rebuild needed to add new models.

Spark ONNX GPU build

The default GPU multi-arch image is generic:

  • amd64 uses onnxruntime-gpu
  • arm64 uses CPU onnxruntime

# Local image
make build-local

# Multi-arch publish build
make build

Spark image

The default image stays generic. If you want a Spark-specific arm64 image built against an NVIDIA-provided ONNX Runtime wheel, use the dedicated Spark build:

# Local Spark build
make build-local-spark

# Published Spark build
make build-spark

Notes:

  • build-spark is linux/arm64 only.
  • By default, Dockerfile.spark uses the tested cp312 linux_aarch64 NVIDIA Jetson AI Lab wheel:
    • onnxruntime_gpu-1.23.0-cp312-cp312-linux_aarch64.whl
  • You can still override it with:
    • SPARK_ORT_WHEEL=/path/or/url/to/wheel
    • or SPARK_ORT_INDEX_URL / SPARK_ORT_EXTRA_INDEX_URL
  • The generic make build path is unchanged and still produces the normal multi-arch image.

Available models

| Model | Type | Description |
| --- | --- | --- |
| parakeet:tdt-0.6b | STT | NVIDIA Parakeet TDT 0.6B |
| parakeet:tdt-0.6b-v3 | STT | Parakeet TDT 0.6B v3, 25 languages |
| whisper:large-v3 | STT | OpenAI Whisper Large V3 via CTranslate2 |
| whisper:large-v3-turbo | STT | Whisper Large V3 Turbo |
| whisper:base.en | STT | Whisper Base English |
| kokoro:v1.0 | TTS | Kokoro 82M ONNX, preset voices |
| piper:en-us-lessac-medium | TTS | Piper English US Lessac |
| fish-speech:v1.4 | TTS | Fish Speech 1.4, multilingual, voice cloning |
| orpheus:3b | TTS | Orpheus 3B, emotional speech |
| dia:1.6b | TTS | Dia 1.6B, multi-speaker dialogue |
| sesame:csm-1b | TTS | Sesame CSM 1B, conversational speech |

More models at vox-registry. Add a model by submitting a PR with a JSON file.

API

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/transcribe | POST | Audio to text |
| /api/synthesize | POST | Text to audio |
| /api/pull | POST | Download a model |
| /api/list | GET | List downloaded models |
| /api/show | POST | Model details |
| /api/delete | DELETE | Remove a model |
| /api/ps | GET | Currently loaded models |
| /api/voices | GET | List voices for a TTS model |
| /api/health | GET | Health check |
| /v1/audio/transcriptions | POST | OpenAI-compatible STT |
| /v1/audio/speech | POST | OpenAI-compatible TTS |
| /v1/audio/transcriptions/stream | WS | Long-form streaming STT |
| /v1/audio/speech/stream | WS | Long-form streaming TTS |

Adding a model

Write an adapter package that implements STTAdapter or TTSAdapter:

from vox.core.adapter import TTSAdapter

class MyAdapter(TTSAdapter):
    def info(self): ...
    def load(self, model_path, device, **kwargs): ...
    def unload(self): ...
    @property
    def is_loaded(self): ...
    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        yield SynthesizeChunk(audio=audio_bytes, sample_rate=24000)

Register it via entry point:

[project.entry-points."vox.adapters"]
my-model = "my_package.adapter:MyAdapter"

Add a JSON file to vox-registry so vox pull can find it.
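To make the adapter shape concrete, here is a toy implementation written standalone (it does not import vox, so the sketch runs anywhere; a real adapter would subclass vox.core.adapter.TTSAdapter and use its SynthesizeChunk type instead of the stand-in dataclass). It "synthesizes" 100 ms of pcm16 silence per word:

```python
# Illustrative adapter with the same method surface as the skeleton
# above. SilenceAdapter and the SynthesizeChunk stand-in are hypothetical.
import asyncio
from dataclasses import dataclass

@dataclass
class SynthesizeChunk:          # stand-in for vox's chunk type
    audio: bytes
    sample_rate: int

class SilenceAdapter:
    def __init__(self):
        self._loaded = False

    def info(self):
        return {"name": "silence", "type": "tts"}

    def load(self, model_path, device, **kwargs):
        self._loaded = True     # a real adapter loads weights here

    def unload(self):
        self._loaded = False

    @property
    def is_loaded(self):
        return self._loaded

    async def synthesize(self, text, *, voice=None, speed=1.0, **kwargs):
        # One 100 ms pcm16 silence chunk per word at 24 kHz:
        # 2400 samples * 2 bytes each.
        for _ in text.split():
            yield SynthesizeChunk(audio=b"\x00\x00" * 2400, sample_rate=24000)
```

`synthesize` is an async generator, matching the `yield SynthesizeChunk(...)` shape in the skeleton above, so the server can stream chunks as they are produced.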

Project structure

src/vox/
  core/          # types, adapter ABCs, scheduler, store, registry
  audio/         # codec, resampling, pipeline
  server/        # FastAPI routes
  cli.py         # Click CLI
adapters/
  vox-parakeet/  # NVIDIA Parakeet STT
  vox-kokoro/    # Kokoro TTS

License

Apache-2.0
