Vocal
Generic Speech AI Platform - Ollama for Voice Models
Self-hosted, OpenAI-compatible speech AI with Ollama-style voice model management.
Vocal is an API-first speech AI platform with automatic OpenAPI spec generation, an auto-generated SDK, and Ollama-style model management, built on a generic registry pattern that supports multiple providers.
Quick Start (5 minutes)
```bash
# 1. Clone and set up
git clone <repo-url>
cd vocal
make install

# 2. Start the API
make serve

# 3. Visit the interactive docs
# Open: http://localhost:8000/docs

# 4. Use the SDK to transcribe
python sdk_example.py your_audio.mp3
```
That's it! Models download automatically on first use.
Pro tip: run `make help` to see all available commands.
Features
- API-First Architecture: FastAPI with an auto-generated OpenAPI spec
- Interactive Docs: Swagger UI at the `/docs` endpoint
- Auto-Generated SDK: Python SDK generated from the OpenAPI spec
- Ollama-Style: Model registry with pull/list/delete commands
- Fast Inference: faster-whisper (4x faster than OpenAI Whisper)
- GPU Acceleration: Automatic CUDA detection with VRAM optimization
- 99+ Languages: Multilingual transcription support
- Extensible: Generic provider pattern (HuggingFace, local, custom)
- OpenAI Compatible: `/v1/audio/transcriptions` endpoint
- Text-to-Speech: Neural TTS with Piper or system voices
- CLI Tool: Typer-based CLI with rich console output
- Production Ready: 23/23 E2E tests passing with real audio assets
Quick Start
1. Installation
```bash
git clone <repo-url>
cd vocal

# Option 1: Using the Makefile (recommended)
make install

# Option 2: Using uv directly
uv venv
uv sync
```
2. Start API Server
```bash
# Using the Makefile
make serve

# Or using uv directly
uv run uvicorn vocal_api.main:app --port 8000

# Development mode with auto-reload
make serve-dev
```
The API will be available at:
- API: http://localhost:8000
- Interactive Docs: http://localhost:8000/docs
- OpenAPI Spec: http://localhost:8000/openapi.json
- Health: http://localhost:8000/health
3. Use the SDK
```python
from vocal_sdk import VocalSDK

# Initialize the client
client = VocalSDK(base_url="http://localhost:8000")

# List models (Ollama-style)
models = client.models.list()
for model in models["models"]:
    print(f"{model['id']}: {model['status']}")

# Download a model if needed (Ollama-style pull)
client.models.download("Systran/faster-whisper-tiny")

# Transcribe audio (OpenAI-compatible)
result = client.audio.transcribe(
    file="audio.mp3",
    model="Systran/faster-whisper-tiny",
)
print(result["text"])

# Text-to-Speech
audio = client.audio.text_to_speech("Hello, world!")
with open("output.wav", "wb") as f:
    f.write(audio)
```
Or use the CLI:
```bash
# Transcribe audio
vocal run audio.mp3

# List models
vocal models list

# Download a model
vocal models pull Systran/faster-whisper-tiny

# Start the API server
vocal serve --port 8000
```
Or use the example:
```bash
uv run python sdk_example.py Recording.m4a
```
Architecture
See VOICESTACK_API_FIRST_ARCHITECTURE.md for detailed architecture documentation.
Key Principles:
- API-first design with auto-generated OpenAPI spec
- Generic registry pattern for extensibility
- Ollama-style model management
- OpenAI-compatible endpoints
- Type-safe throughout with Pydantic
```
vocal/
├── packages/
│   ├── core/                  # Model registry & adapters
│   │   └── vocal_core/
│   │       ├── registry/      # Generic model registry
│   │       │   ├── providers/ # HuggingFace, local, custom
│   │       │   └── model_info.py
│   │       └── adapters/      # STT/TTS adapters
│   │           └── stt/       # faster-whisper implementation
│   │
│   ├── api/                   # FastAPI server
│   │   └── vocal_api/
│   │       ├── models/        # Pydantic schemas
│   │       ├── routes/        # API endpoints
│   │       ├── services/      # Business logic
│   │       └── main.py        # FastAPI app
│   │
│   ├── sdk/                   # Auto-generated Python SDK
│   └── cli/                   # CLI using the SDK
│
├── pyproject.toml             # uv workspace config
└── .gitignore
```
API Endpoints
Model Management (Ollama-style)
GET /v1/models
List all available models
Query params:
- `status`: Filter by status (available, downloading, not_downloaded)
- `task`: Filter by task (stt, tts)
GET /v1/models/{model_id}
Get model information
POST /v1/models/{model_id}/download
Download a model (Ollama-style "pull")
GET /v1/models/{model_id}/download/status
Check download progress
DELETE /v1/models/{model_id}
Delete a downloaded model
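As a sketch of how a client might consume the list endpoint (the `id`, `status`, and `task` fields are taken from the docs above; the exact payload shape is an assumption, so check `/docs` for the authoritative schema):

```python
def available_stt_models(models_response: dict) -> list[str]:
    """Pick IDs of already-downloaded STT models from a GET /v1/models response."""
    return [
        m["id"]
        for m in models_response.get("models", [])
        if m.get("status") == "available" and m.get("task") == "stt"
    ]

# A payload shaped like the documented response:
sample = {
    "models": [
        {"id": "Systran/faster-whisper-tiny", "status": "available", "task": "stt"},
        {"id": "Systran/faster-whisper-large-v3", "status": "not_downloaded", "task": "stt"},
    ]
}
print(available_stt_models(sample))  # ['Systran/faster-whisper-tiny']
```

This mirrors the `status`/`task` query params: the same filtering can be done server-side with `GET /v1/models?status=available&task=stt`.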
Audio Transcription (OpenAI-compatible)
POST /v1/audio/transcriptions
Transcribe audio to text.
Parameters:
- `file` (required): Audio file (mp3, wav, m4a, etc.)
- `model` (required): Model ID (e.g., "Systran/faster-whisper-tiny")
- `language` (optional): 2-letter language code (e.g., "en", "es")
- `response_format` (optional): "json" (default), "text", "srt", "vtt"
- `temperature` (optional): Sampling temperature (0.0-1.0, default: 0.0)
Response:
```json
{
  "text": "Hello, how are you today?",
  "language": "en",
  "duration": 2.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, how are you today?"
    }
  ]
}
```
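The `segments` array above carries everything needed for the `srt` response format. As an illustration only (this helper is not part of Vocal), a client could render the documented JSON fields as SRT itself:

```python
def _srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm for SRT cue times."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render transcription segments (start, end, text) as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{_srt_timestamp(seg['start'])} --> {_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)

# The sample response from above:
srt = segments_to_srt([{"id": 0, "start": 0.0, "end": 2.5, "text": "Hello, how are you today?"}])
print(srt)
```

In practice you would just pass `response_format="srt"` and let the server do this; the sketch only shows what the segment fields mean.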
POST /v1/audio/translations
Translate audio to English text.
Text-to-Speech (OpenAI-compatible)
POST /v1/audio/speech
Convert text to speech.
Parameters:
- `model` (required): TTS model to use (e.g., "hexgrad/Kokoro-82M", "coqui/XTTS-v2")
- `input` (required): Text to synthesize
- `voice` (optional): Voice ID to use
- `speed` (optional): Speech speed multiplier (0.25-4.0, default: 1.0)
- `response_format` (optional): Audio format (default: "wav")
Response: Returns the audio file in the requested format, with headers:
- `X-Duration`: Audio duration in seconds
- `X-Sample-Rate`: Audio sample rate
GET /v1/audio/voices
List available TTS voices.
Response:
```json
{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "language": "en",
      "gender": null
    }
  ],
  "total": 1
}
```
Health & Docs
GET /health
Health check endpoint
GET /docs
Interactive Swagger UI for API testing
GET /openapi.json
OpenAPI specification (auto-generated)
Available Models
| Model ID | Size | Parameters | VRAM | Speed | Status |
|---|---|---|---|---|---|
| Systran/faster-whisper-tiny | ~75MB | 39M | 1GB+ | Fastest | CTranslate2 |
| Systran/faster-whisper-base | ~145MB | 74M | 1GB+ | Fast | CTranslate2 |
| Systran/faster-whisper-small | ~488MB | 244M | 2GB+ | Good | CTranslate2 |
| Systran/faster-whisper-medium | ~1.5GB | 769M | 5GB+ | Better | CTranslate2 |
| Systran/faster-whisper-large-v3 | ~3.1GB | 1.5B | 10GB+ | Best | CTranslate2 |
| Systran/faster-distil-whisper-large-v3 | ~756MB | 809M | 6GB+ | Fast & Good | CTranslate2 |
All models support 99+ languages including English, Spanish, French, German, Chinese, Japanese, Arabic, and more.
Note: These use the CTranslate2-optimized models from Systran for faster-whisper, which are ~4x faster than the original OpenAI Whisper models.
Performance & Optimization
Vocal automatically detects and optimizes for your hardware:
GPU Acceleration
When an NVIDIA GPU is available:
- Automatic Detection: The GPU is detected and used automatically
- Optimal Compute Types:
  - 8GB+ VRAM: `float16` (best quality)
  - 4-8GB VRAM: `int8_float16` (balanced)
  - <4GB VRAM: `int8` (most efficient)
- 4x-10x Faster: GPU inference is significantly faster than CPU
- Memory Management: Automatic GPU cache clearing
CPU Optimization
When a GPU is not available:
- Multi-threading: Uses an optimal number of CPU threads based on core count
- Quantization: `int8` quantization for faster CPU inference
- VAD Filtering: Voice Activity Detection for improved performance
Check Your Device
```bash
# View device info via the API
curl http://localhost:8000/v1/system/device
```
Or via the SDK:
```python
from vocal_sdk import VocalSDK

client = VocalSDK()
info = client._request("GET", "/v1/system/device")
print(info)
```
Example output:
```json
{
  "platform": "Windows",
  "cpu_count": 16,
  "cuda_available": true,
  "gpu_count": 1,
  "gpu_devices": [{
    "name": "NVIDIA GeForce RTX 4090",
    "vram_gb": 24.0,
    "compute_capability": "8.9"
  }]
}
```
Optimization Tips
- GPU Usage: Models automatically use GPU when available
- Model Selection:
  - `tiny`/`base` models: Work well on CPU
  - `small`/`medium`: Best on a GPU with 4GB+ VRAM
  - `large`: Requires a GPU with 8GB+ VRAM
- Batch Processing: Load model once, transcribe multiple files
- VAD Filter: Enabled by default for better performance
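The batch-processing tip above can be sketched with the SDK: keep one client (and therefore one loaded model) and reuse it across files. The helper names and the folder-scanning logic here are illustrative, not part of Vocal:

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a"}

def audio_files(folder: str) -> list[Path]:
    """Collect audio files in a folder so one loaded model can serve the batch."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in AUDIO_EXTS)

def transcribe_folder(client, folder: str, model: str = "Systran/faster-whisper-tiny") -> dict:
    """Transcribe every audio file with a single SDK client, loading the model once."""
    return {
        p.name: client.audio.transcribe(file=str(p), model=model)["text"]
        for p in audio_files(folder)
    }

# Usage (requires a running server):
#   from vocal_sdk import VocalSDK
#   texts = transcribe_folder(VocalSDK(), "recordings/")
```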
CLI Usage
The CLI provides an intuitive command-line interface for common tasks.
Transcription
```bash
# Transcribe an audio file
vocal run audio.mp3

# Specify a model
vocal run audio.mp3 --model Systran/faster-whisper-base

# Specify a language
vocal run audio.mp3 --language en

# Output formats
vocal run audio.mp3 --format text
vocal run audio.mp3 --format json
vocal run audio.mp3 --format srt
vocal run audio.mp3 --format vtt
```
Model Management
```bash
# List all models
vocal models list

# Filter by status
vocal models list --status available
vocal models list --status not_downloaded

# Download a model
vocal models pull Systran/faster-whisper-tiny

# Delete a model
vocal models delete Systran/faster-whisper-tiny
vocal models delete Systran/faster-whisper-tiny --force
```
Server Management
```bash
# Start the API server (default: http://0.0.0.0:8000)
vocal serve

# Custom host and port
vocal serve --host localhost --port 9000

# Enable auto-reload for development
vocal serve --reload
```
Development
Project Structure
The project uses a uv workspace with multiple packages:
- `packages/core`: Core model registry and adapters (no dependency on the API)
- `packages/api`: FastAPI server (depends on core)
- `packages/sdk`: Auto-generated SDK (generated from the API's OpenAPI spec)
- `packages/cli`: CLI tool (uses the SDK)
Running Tests
All tests use real audio assets from test_assets/audio/ with validated transcriptions.
Quick Validation (< 30 seconds)
```bash
# Using the Makefile
make test-quick

# Or directly
uv run python scripts/validate.py
```
Full E2E Test Suite (~ 2 minutes)
```bash
# Using the Makefile
make test

# With verbose output
make test-verbose

# Or using pytest directly
uv run python -m pytest tests/test_e2e.py -v
```
Current Status: 23/23 tests passing ✅
Test coverage includes:
- API health and device information (GPU detection)
- Model management (list, download, status, delete)
- Audio transcription with real M4A and MP3 files
- Text-to-Speech synthesis with speed control
- Error handling for invalid models and files
- Performance and model reuse optimization
Check GPU Support
```bash
make gpu-check
```
Code Quality
```bash
# Using the Makefile
make lint    # Check code quality
make format  # Format code
make check   # Lint + format check

# Or using ruff directly
uv run ruff format .
uv run ruff check .
```
Makefile Commands
Vocal includes a comprehensive Makefile for common tasks:
```bash
make help          # Show all available commands

# Setup
make install       # Install dependencies
make sync          # Sync dependencies

# Testing
make test          # Run full test suite
make test-quick    # Quick validation
make test-verbose  # Verbose test output
make gpu-check     # Check GPU detection

# Development
make serve         # Start API server
make serve-dev     # Start with auto-reload
make cli           # Show CLI help
make docs          # Open API docs in browser

# Code Quality
make lint          # Run linter
make format        # Format code
make check         # Lint + format check

# Cleanup
make clean         # Remove cache files
make clean-models  # Remove downloaded models

# Quick aliases
make t             # Alias for test
make s             # Alias for serve
make l             # Alias for lint
make f             # Alias for format
```
Implementation Status
- ✅ Phase 0: Core Foundation
  - Generic model registry with provider pattern
  - HuggingFace provider with automatic downloads
  - faster-whisper adapter (4x faster than OpenAI Whisper)
  - Model storage & caching
- ✅ Phase 1: API Layer
  - FastAPI with auto-generated OpenAPI spec
  - Model management endpoints (Ollama-style)
  - Transcription endpoints (OpenAI-compatible)
  - Interactive Swagger UI at `/docs`
  - Health & status endpoints
- ✅ Phase 2: SDK
  - Auto-generated from the OpenAPI spec
  - Clean Python client interface
  - Type-safe with Pydantic models
  - Namespaced APIs (models, audio)
- ✅ Phase 3: CLI
  - `vocal run` - Transcribe audio files
  - `vocal models list/pull/delete` - Model management
  - `vocal serve` - Start the API server
  - Rich console output with progress
- ✅ Phase 4: Text-to-Speech
  - TTS API endpoints (`/v1/audio/speech`)
  - Multiple adapters (pyttsx3, Piper)
  - Voice selection and management
  - Speed control and audio output
- ✅ Phase 5: GPU Optimization
  - Automatic CUDA detection
  - Dynamic compute type selection (float16/int8)
  - VRAM-based optimization
  - CPU multi-threading fallback
  - System device info endpoint
- ✅ Phase 6: Testing & Production Ready
  - 23 comprehensive E2E integration tests
  - Real audio asset validation (100% accuracy)
  - Full API stack coverage
  - TTS timeout handling
  - Error handling and edge cases
  - All tests passing: 23/23
Configuration
Environment Variables
Create a .env file:
```
APP_NAME=Vocal API
VERSION=0.1.0
DEBUG=true
CORS_ORIGINS=["*"]
MAX_UPLOAD_SIZE=26214400
```
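As a minimal sketch of how these variables could be read, assuming only the keys and defaults shown above (the real server may load them differently, e.g. via pydantic-settings):

```python
import json
import os

def load_settings(env: dict = os.environ) -> dict:
    """Read the documented .env keys, falling back to the values shown above."""
    return {
        "app_name": env.get("APP_NAME", "Vocal API"),
        "debug": env.get("DEBUG", "false").lower() == "true",
        "cors_origins": json.loads(env.get("CORS_ORIGINS", '["*"]')),
        "max_upload_size": int(env.get("MAX_UPLOAD_SIZE", "26214400")),  # 25 MiB
    }

print(load_settings({}))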
Model Storage
Models are cached at: ~/.cache/vocal/models/
Contributing
We welcome contributions! Here's how to get started:
Development Setup
```bash
# Clone the repository
git clone <repo-url>
cd vocal

# Set up the environment
uv venv
uv sync

# Install packages in development mode
uv add --editable packages/core
uv add --editable packages/api
uv add --editable packages/sdk

# Run tests
uv run pytest packages/core/tests -v
```
Making Changes
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass: `uv run pytest`
6. Update documentation if needed
7. Commit with a clear message: `git commit -m "feat: add feature X"`
8. Push and create a pull request
Code Style
- Follow PEP 8 guidelines
- Use type hints for all functions
- Add docstrings for public APIs
- Keep functions focused and testable
Adding New Models
To add support for a new model provider:
1. Create a new provider class in `packages/core/vocal_core/registry/providers/`
2. Implement the `ModelProvider` interface
3. Add tests
4. Update documentation
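A hypothetical sketch of what such a provider might look like. The class, its method names, and the directory-scanning behavior are all illustrative; the real `ModelProvider` interface lives in `packages/core/vocal_core/registry/providers/` and may differ:

```python
from pathlib import Path

class LocalDirProvider:
    """Hypothetical provider serving models that already exist on local disk."""

    def __init__(self, root: str):
        self.root = Path(root)

    def list_models(self) -> list[str]:
        """Each subdirectory of the root is treated as one model."""
        return sorted(p.name for p in self.root.iterdir() if p.is_dir())

    def download(self, model_id: str) -> Path:
        """Local models are already 'downloaded'; just resolve and validate the path."""
        path = self.root / model_id
        if not path.is_dir():
            raise FileNotFoundError(f"model {model_id!r} not found in {self.root}")
        return path
```

The point of the registry's provider pattern is that the API and SDK never care where a model comes from: HuggingFace, a local directory, or anything custom that satisfies the same interface.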
Regenerating SDK
When API changes:
```bash
# Start the API server
uv run uvicorn vocal_api.main:app --port 8000

# Download the new OpenAPI spec
curl http://localhost:8000/openapi.json -o packages/sdk/openapi.json

# The SDK client is hand-crafted; update it if needed
```
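Since the SDK client is hand-maintained, a quick sanity check on a freshly downloaded spec can catch drift early. A sketch, using only endpoints documented in this README:

```python
# Endpoints this README documents; extend as routes are added.
EXPECTED_PATHS = {
    "/v1/models",
    "/v1/audio/transcriptions",
    "/v1/audio/speech",
    "/health",
}

def missing_paths(spec: dict) -> set[str]:
    """Return documented endpoints absent from an OpenAPI spec dict."""
    return EXPECTED_PATHS - set(spec.get("paths", {}))

# Usage against the downloaded spec:
#   import json
#   spec = json.load(open("packages/sdk/openapi.json"))
#   assert not missing_paths(spec), missing_paths(spec)
```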
License
Server Side Public License (SSPL) v1
Vocal is open source but protects against exploitation:
- ✅ Free for personal and commercial use
- ✅ Free for self-hosting
- ✅ Free to modify and distribute
- ❌ Cannot be offered as SaaS without open-sourcing your infrastructure
See LICENSE for full details.
Roadmap
- Core model registry with provider pattern
- Model management API (list, download, delete)
- SDK generation from OpenAPI spec
- Interactive Swagger UI docs
- CLI tool (Typer-based)
- Text-to-Speech (TTS) support
- Streaming transcription
- WebSocket support for real-time transcription
- Rate limiting middleware
- Authentication (optional - JWT/API keys)
- Docker deployment
- Batch transcription
- Custom model providers
Credits
Built with:
- FastAPI - Web framework
- faster-whisper - STT engine
- HuggingFace Hub - Model distribution
- uv - Python package manager
Download files

File details: vocal_ai-0.3.1.tar.gz
- Size: 55.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3735e6743e39ad09f1f265a804e19562db1ea5baa66317ea745f3a941fc700c0 |
| MD5 | ddf891b9326e99816ab35968dbd4741d |
| BLAKE2b-256 | 303f7f6a8ffaf9b25f027a5b40001049c2553eecbfa63edc132013f556fefd69 |

File details: vocal_ai-0.3.1-py3-none-any.whl
- Size: 10.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11

| Algorithm | Hash digest |
|---|---|
| SHA256 | bf682ece22688c74bf4aa6d29cabf08b3ace1f82832a364fbd8b90218d07d3bb |
| MD5 | 6449ea8df7dd07b1b542748dc02a700a |
| BLAKE2b-256 | f1ad2481db7561b1898d73f15d6ef6052d17e0547161c3304652bde2d4449b43 |