# Vocal

**Ollama for Voice Models — Self-hosted Speech AI Platform**

Vocal manages STT (Speech-to-Text) and TTS (Text-to-Speech) models the way Ollama manages LLMs. It provides an OpenAI-compatible REST API, a Python SDK, and a CLI — with model download, caching, and multi-backend support built in.
## Quick Start

```bash
# Run without installing
uvx --from vocal-ai vocal serve

# Or install permanently
pip install vocal-ai
vocal serve
```

Interactive API docs are at `http://localhost:8000/docs`.

```bash
# Pull a model and transcribe
vocal models pull Systran/faster-whisper-tiny
vocal transcribe your_audio.wav

# Text-to-speech (built-in, no download)
vocal speak "Hello, world!"

# Real-time microphone transcription
vocal listen

# Full voice agent (STT → LLM → TTS)
vocal chat  # requires Ollama running locally
```
Optional backends (the base install already includes torch, faster-whisper, transformers, and silero-vad):

| Extra | What you get | Install |
|---|---|---|
| `kokoro` | Kokoro-82M neural TTS, #1 on TTS Arena | `pip install "vocal-ai[kokoro]"` |
| `piper` | Piper offline TTS, fast, multilingual | `pip install "vocal-ai[piper]"` |
| `qwen3-tts` | Qwen3-TTS voice cloning (CUDA required) | `pip install "vocal-ai[qwen3-tts]"` |
| `whisperx` | WhisperX — word-level timestamps + diarization | `pip install "vocal-ai[whisperx]"` |
| `nemo` | NVIDIA NeMo STT (Parakeet-TDT, Canary-Qwen) | `pip install "vocal-ai[nemo]"` |
| `chatterbox` | Chatterbox voice cloning TTS | `pip install "vocal-ai[chatterbox]"` |
Missing a backend? The error message will tell you exactly which command to run.
## Features

- OpenAI-compatible — `/v1/audio/transcriptions`, `/v1/audio/speech`, `/v1/realtime`
- Ollama-style model management — pull, list, and delete models from the CLI or API
- Auto-generated SDK — typed Python client generated from the live OpenAPI spec
- Streaming TTS — first audio bytes arrive before full synthesis completes
- WebSocket ASR — ~200 ms latency with server-side VAD
- Voice agent — full STT → LLM → TTS loop, OpenAI Realtime protocol compatible
- Voice selection — list and select voices per model
- Voice cloning — clone a voice from a 3–30 s reference recording
- Cross-platform — Windows, macOS, Linux (WSL supported)
- GPU acceleration — automatic CUDA detection with VRAM optimization
## Documentation

| Guide | Contents |
|---|---|
| Getting Started | Install, first transcription, platform notes |
| Available Models | STT/TTS catalog, hardware guide |
| CLI Reference | All commands with options |
| Configuration | Environment variables, `.env` |
| Contributing | Dev setup, PR workflow |
| Architecture | Package structure, adapter pattern |
| Adding Models | New STT/TTS backends |
| Testing | Test tiers, CI, cross-platform |
| Release Process | Version bump, PyPI publish |
## API Overview

### Speech-to-Text (OpenAI-compatible)

```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-tiny"
```
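The same call can be made from Python. Below is a stdlib-only sketch that builds the multipart request by hand; the endpoint and form-field names come from the curl example above, while the helper name is ours (with the `requests` library this collapses to a single `requests.post(url, files=..., data=...)` call):

```python
import io
import urllib.request
import uuid


def build_transcription_request(base_url, audio_path, model):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # "model" form field
    body.write(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="model"\r\n\r\n{model}\r\n'.encode()
    )
    # "file" form field carrying the raw audio bytes
    with open(audio_path, "rb") as f:
        body.write(
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{audio_path}"\r\n'
            f"Content-Type: application/octet-stream\r\n\r\n".encode()
        )
        body.write(f.read())
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# With the server running:
# req = build_transcription_request(
#     "http://localhost:8000", "audio.mp3", "Systran/faster-whisper-tiny"
# )
# print(urllib.request.urlopen(req).read())
```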
### Text-to-Speech (OpenAI-compatible)

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"pyttsx3","input":"Hello, world!","response_format":"wav"}' \
  --output speech.wav
```
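A stdlib Python sketch of the same request; the JSON schema (`model`, `input`, `response_format`) mirrors the curl example above, and the helper name is ours:

```python
import json
import urllib.request


def build_speech_request(base_url, model, text, response_format="wav"):
    """Build the JSON POST for /v1/audio/speech."""
    payload = json.dumps(
        {"model": model, "input": text, "response_format": response_format}
    ).encode()
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("http://localhost:8000", "pyttsx3", "Hello, world!")
# With the server running:
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```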
### Voice Cloning

```bash
curl -X POST http://localhost:8000/v1/audio/clone \
  -F "text=Synthesize in my voice." \
  -F "reference_audio=@speaker.wav" \
  --output clone.wav
```
### Model Management (Ollama-style)

```bash
curl http://localhost:8000/v1/models                                             # list
curl -X POST http://localhost:8000/v1/models/Systran/faster-whisper-tiny/download
curl -X DELETE http://localhost:8000/v1/models/Systran/faster-whisper-tiny
```
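The three management calls map onto HTTP verbs in the usual REST way. A stdlib sketch of the request builders (endpoints from the curl examples above; the function names are ours):

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes a local `vocal serve`


def list_models():
    """GET /v1/models — list installed models."""
    return urllib.request.Request(f"{BASE_URL}/v1/models")


def download_model(model_id):
    """POST /v1/models/{id}/download — pull a model."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/models/{model_id}/download", method="POST"
    )


def delete_model(model_id):
    """DELETE /v1/models/{id} — remove a model."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/models/{model_id}", method="DELETE"
    )


# Send any of these with urllib.request.urlopen(...) once the server is up.
```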
## SDK

```python
from vocal_sdk import VocalClient
from vocal_sdk.api.audio import text_to_speech_v1_audio_speech_post
from vocal_sdk.models import TTSRequest

client = VocalClient(base_url="http://localhost:8000")
audio = text_to_speech_v1_audio_speech_post.sync(
    client=client,
    body=TTSRequest(model="pyttsx3", input="Hello from the SDK."),
)

with open("output.wav", "wb") as f:
    f.write(audio)
```
## Cross-Platform

| Platform | TTS Engine | Notes |
|---|---|---|
| Windows | SAPI5 (pyttsx3) | Built-in, no extra install |
| macOS | NSSpeechSynthesizer | Built-in, no extra install |
| Linux / WSL | espeak-ng (pyttsx3) | `sudo apt install espeak-ng ffmpeg` |

All audio formats (mp3, wav, opus, aac, flac, pcm) work on all platforms via ffmpeg.
## Contributing

```bash
git clone https://github.com/niradler/vocal.git
cd vocal
make install
make lint && make test
```

See `docs/developer/contributing.md` for the full workflow.
## License

Server Side Public License (SSPL-1.0) — free to use and self-host. If you offer Vocal as a managed service to third parties, you must open-source your full service stack under the same license.
Built with FastAPI, faster-whisper, HuggingFace Hub, and uv.