
Vocal

Generic Speech AI Platform - Ollama for Voice Models

Vocal is an API-first speech AI platform with automatic OpenAPI spec generation, auto-generated SDK, and Ollama-style model management. Built with a generic registry pattern supporting multiple providers.

License: SSPL | Python 3.11+

🚀 Quick Start (5 minutes)

# 1. Clone and setup
git clone <repo-url>
cd vocal
make install

# 2. Start API
make serve

# 3. Visit interactive docs
# Open: http://localhost:8000/docs

# 4. Use SDK to transcribe
python sdk_example.py your_audio.mp3

That's it! Models auto-download on first use.

Pro tip: Run make help to see all available commands.

Features

  • 🎯 API-First Architecture: FastAPI with auto-generated OpenAPI spec
  • 📖 Interactive Docs: Swagger UI at /docs endpoint
  • 📦 Auto-Generated SDK: Python SDK generated from OpenAPI spec
  • 🔄 Ollama-Style: Model registry with pull/list/delete commands
  • 🚀 Fast Inference: faster-whisper (4x faster than OpenAI Whisper)
  • ⚡ GPU Acceleration: Automatic CUDA detection with VRAM optimization
  • 🌍 99+ Languages: Support for multilingual transcription
  • 🔌 Extensible: Generic provider pattern (HuggingFace, local, custom)
  • 🎤 OpenAI Compatible: /v1/audio/transcriptions endpoint
  • 🔊 Text-to-Speech: Neural TTS with Piper or system voices
  • 🎨 CLI Tool: Typer-based CLI with rich console output
  • ✅ Production Ready: 23/23 E2E tests passing with real audio assets

Quick Start

1. Installation

git clone <repo-url>
cd vocal

# Option 1: Using Makefile (recommended)
make install

# Option 2: Using uv directly
uv venv
uv sync

2. Start API Server

# Using Makefile
make serve

# Or using uv directly
uv run uvicorn vocal_api.main:app --port 8000

# Development mode with auto-reload
make serve-dev

The API will be available at http://localhost:8000, with interactive docs at /docs and the OpenAPI spec at /openapi.json.

3. Use the SDK

from vocal_sdk import VocalSDK

# Initialize client
client = VocalSDK(base_url="http://localhost:8000")

# List models (Ollama-style)
models = client.models.list()
for model in models['models']:
    print(f"{model['id']}: {model['status']}")

# Download model if needed (Ollama-style pull)
client.models.download("Systran/faster-whisper-tiny")

# Transcribe audio (OpenAI-compatible)
result = client.audio.transcribe(
    file="audio.mp3",
    model="Systran/faster-whisper-tiny"
)
print(result['text'])

# Text-to-Speech
audio = client.audio.text_to_speech("Hello, world!")
with open("output.wav", "wb") as f:
    f.write(audio)

Or use the CLI:

# Transcribe audio
vocal run audio.mp3

# List models
vocal models list

# Download model
vocal models pull Systran/faster-whisper-tiny

# Start API server
vocal serve --port 8000

Or use the example:

uv run python sdk_example.py Recording.m4a

Architecture

See VOICESTACK_API_FIRST_ARCHITECTURE.md for detailed architecture documentation.

Key Principles:

  • API-first design with auto-generated OpenAPI spec
  • Generic registry pattern for extensibility
  • Ollama-style model management
  • OpenAI-compatible endpoints
  • Type-safe throughout with Pydantic

vocal/
├── packages/
│   ├── core/           # Model registry & adapters ✅
│   │   └── vocal_core/
│   │       ├── registry/      # Generic model registry
│   │       │   ├── providers/ # HuggingFace, local, custom
│   │       │   └── model_info.py
│   │       └── adapters/      # STT/TTS adapters
│   │           └── stt/       # faster-whisper implementation
│   │
│   ├── api/            # FastAPI server ✅
│   │   └── vocal_api/
│   │       ├── models/        # Pydantic schemas
│   │       ├── routes/        # API endpoints
│   │       ├── services/      # Business logic
│   │       └── main.py        # FastAPI app
│   │
│   ├── sdk/            # Auto-generated Python SDK ⏳
│   └── cli/            # CLI using SDK ⏳
│
├── pyproject.toml      # uv workspace config
└── .gitignore

API Endpoints

Model Management (Ollama-style)

GET /v1/models

List all available models

Query params:

  • status: Filter by status (available, downloading, not_downloaded)
  • task: Filter by task (stt, tts)

GET /v1/models/{model_id}

Get model information

POST /v1/models/{model_id}/download

Download a model (Ollama-style "pull")

GET /v1/models/{model_id}/download/status

Check download progress

DELETE /v1/models/{model_id}

Delete a downloaded model
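The download-then-poll flow above maps to plain HTTP calls. A minimal sketch of the polling half, written with an injectable `get_status` callable so it is easy to test; the idea that the status payload carries a `status` field with the values from the list endpoint's filters (available / downloading / not_downloaded) is an assumption, not a documented schema:

```python
import time
from typing import Callable

def wait_until_available(get_status: Callable[[], str],
                         timeout_s: float = 600.0,
                         poll_s: float = 2.0) -> None:
    """Poll get_status() until it reports 'available' or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "available":
            return
        time.sleep(poll_s)
    raise TimeoutError("model did not finish downloading in time")

# Wiring against the endpoints above would look roughly like (untested sketch):
#   requests.post(f"{base}/v1/models/{model_id}/download")
#   wait_until_available(lambda: requests.get(
#       f"{base}/v1/models/{model_id}/download/status").json()["status"])
```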

Audio Transcription (OpenAI-compatible)

POST /v1/audio/transcriptions

Transcribe audio to text.

Parameters:

  • file (required): Audio file (mp3, wav, m4a, etc.)
  • model (required): Model ID (e.g., "Systran/faster-whisper-tiny")
  • language (optional): 2-letter language code (e.g., "en", "es")
  • response_format (optional): "json" (default), "text", "srt", "vtt"
  • temperature (optional): Sampling temperature (0.0-1.0, default: 0.0)

Response:

{
  "text": "Hello, how are you today?",
  "language": "en",
  "duration": 2.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, how are you today?"
    }
  ]
}
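When you request response_format "json", the segments array above carries everything needed to build subtitles client-side. A small illustrative helper (the function is not part of the SDK; the timestamp layout follows the standard SRT format):

```python
def to_srt(segments: list[dict]) -> str:
    """Render transcription segments (as in the JSON response above) as SRT."""
    def ts(seconds: float) -> str:
        # SRT timestamps: HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

segments = [{"id": 0, "start": 0.0, "end": 2.5, "text": "Hello, how are you today?"}]
print(to_srt(segments))
```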

POST /v1/audio/translations

Translate audio to English text.

Text-to-Speech (OpenAI-compatible)

POST /v1/audio/speech

Convert text to speech.

Parameters:

  • model (required): TTS model to use (e.g., "hexgrad/Kokoro-82M", "coqui/XTTS-v2")
  • input (required): Text to synthesize
  • voice (optional): Voice ID to use
  • speed (optional): Speech speed multiplier (0.25-4.0, default: 1.0)
  • response_format (optional): Audio format (default: "wav")

Response: Returns audio file in specified format with headers:

  • X-Duration: Audio duration in seconds
  • X-Sample-Rate: Audio sample rate

GET /v1/audio/voices

List available TTS voices.

Response:

{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "language": "en",
      "gender": null
    }
  ],
  "total": 1
}

Health & Docs

GET /health

Health check endpoint

GET /docs

Interactive Swagger UI for API testing

GET /openapi.json

OpenAPI specification (auto-generated)

Available Models

Model ID                                  Size    Parameters  VRAM   Speed        Engine
Systran/faster-whisper-tiny               ~75MB   39M         1GB+   Fastest      CTranslate2
Systran/faster-whisper-base               ~145MB  74M         1GB+   Fast         CTranslate2
Systran/faster-whisper-small              ~488MB  244M        2GB+   Good         CTranslate2
Systran/faster-whisper-medium             ~1.5GB  769M        5GB+   Better       CTranslate2
Systran/faster-whisper-large-v3           ~3.1GB  1.5B        10GB+  Best         CTranslate2
Systran/faster-distil-whisper-large-v3    ~756MB  809M        6GB+   Fast & Good  CTranslate2

All models support 99+ languages including English, Spanish, French, German, Chinese, Japanese, Arabic, and more.

Note: These use the CTranslate2-optimized models from Systran for faster-whisper, which are ~4x faster than the original OpenAI Whisper models.

Performance & Optimization

Vocal automatically detects and optimizes for your hardware:

GPU Acceleration

When NVIDIA GPU is available:

  • Automatic Detection: GPU is detected and used automatically
  • Optimal Compute Types:
    • 8GB+ VRAM: float16 (best quality)
    • 4-8GB VRAM: int8_float16 (balanced)
    • <4GB VRAM: int8 (most efficient)
  • 4x-10x Faster: GPU inference is significantly faster than CPU
  • Memory Management: Automatic GPU cache clearing

CPU Optimization

When GPU is not available:

  • Multi-threading: Uses optimal CPU threads based on core count
  • Quantization: int8 quantization for faster CPU inference
  • VAD Filtering: Voice Activity Detection for improved performance

Check Your Device

# View device info via API
curl http://localhost:8000/v1/system/device

# Or via SDK
from vocal_sdk import VocalSDK
client = VocalSDK()
info = client._request('GET', '/v1/system/device')
print(info)

Example output:

{
  "platform": "Windows",
  "cpu_count": 16,
  "cuda_available": true,
  "gpu_count": 1,
  "gpu_devices": [{
    "name": "NVIDIA GeForce RTX 4090",
    "vram_gb": 24.0,
    "compute_capability": "8.9"
  }]
}

Optimization Tips

  1. GPU Usage: Models automatically use GPU when available
  2. Model Selection:
    • tiny/base models: Work well on CPU
    • small/medium: Best on GPU with 4GB+ VRAM
    • large: Requires GPU with 8GB+ VRAM
  3. Batch Processing: Load model once, transcribe multiple files
  4. VAD Filter: Enabled by default for better performance
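Tip 3 in code: reuse one client (and therefore one loaded model) across many files rather than reconnecting per file. The helper below is a sketch that accepts anything exposing the audio.transcribe method shown earlier; the function itself is not part of the SDK:

```python
def transcribe_batch(client, paths, model="Systran/faster-whisper-tiny"):
    """Transcribe many files with one client so the model loads only once."""
    results = {}
    for path in paths:
        results[path] = client.audio.transcribe(file=path, model=model)["text"]
    return results

# Usage with the real SDK would look like:
#   from vocal_sdk import VocalSDK
#   texts = transcribe_batch(VocalSDK(), ["a.mp3", "b.wav"])
```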

CLI Usage

The CLI provides an intuitive command-line interface for common tasks.

Transcription

# Transcribe audio file
vocal run audio.mp3

# Specify model
vocal run audio.mp3 --model Systran/faster-whisper-base

# Specify language
vocal run audio.mp3 --language en

# Output formats
vocal run audio.mp3 --format text
vocal run audio.mp3 --format json
vocal run audio.mp3 --format srt
vocal run audio.mp3 --format vtt

Model Management

# List all models
vocal models list

# Filter by status
vocal models list --status available
vocal models list --status not_downloaded

# Download a model
vocal models pull Systran/faster-whisper-tiny

# Delete a model
vocal models delete Systran/faster-whisper-tiny
vocal models delete Systran/faster-whisper-tiny --force

Server Management

# Start API server (default: http://0.0.0.0:8000)
vocal serve

# Custom host and port
vocal serve --host localhost --port 9000

# Enable auto-reload for development
vocal serve --reload

Development

Project Structure

The project uses a uv workspace with multiple packages:

  • packages/core: Core model registry and adapters (no dependencies on API)
  • packages/api: FastAPI server (depends on core)
  • packages/sdk: Auto-generated SDK (generates from API OpenAPI spec)
  • packages/cli: CLI tool (uses SDK)

Running Tests

All tests use real audio assets from test_assets/audio/ with validated transcriptions.

Quick Validation (< 30 seconds)

# Using Makefile
make test-quick

# Or directly
uv run python scripts/validate.py

Full E2E Test Suite (~ 2 minutes)

# Using Makefile
make test

# With verbose output
make test-verbose

# Or using pytest directly
uv run python -m pytest tests/test_e2e.py -v

Current Status: 23/23 tests passing ✅

Test coverage includes:

  • API health and device information (GPU detection)
  • Model management (list, download, status, delete)
  • Audio transcription with real M4A and MP3 files
  • Text-to-Speech synthesis with speed control
  • Error handling for invalid models and files
  • Performance and model reuse optimization

Check GPU Support

make gpu-check

Code Quality

# Using Makefile
make lint          # Check code quality
make format        # Format code
make check         # Lint + format check

# Or using ruff directly
uv run ruff format .
uv run ruff check .

Makefile Commands

Vocal includes a comprehensive Makefile for common tasks:

make help          # Show all available commands

# Setup
make install       # Install dependencies
make sync          # Sync dependencies

# Testing
make test          # Run full test suite
make test-quick    # Quick validation
make test-verbose  # Verbose test output
make gpu-check     # Check GPU detection

# Development
make serve         # Start API server
make serve-dev     # Start with auto-reload
make cli           # Show CLI help
make docs          # Open API docs in browser

# Code Quality
make lint          # Run linter
make format        # Format code
make check         # Lint + format check

# Cleanup
make clean         # Remove cache files
make clean-models  # Remove downloaded models

# Quick aliases
make t             # Alias for test
make s             # Alias for serve
make l             # Alias for lint
make f             # Alias for format

Implementation Status

  • ✅ Phase 0: Core Foundation

    • Generic model registry with provider pattern
    • HuggingFace provider with automatic downloads
    • faster-whisper adapter (4x faster than OpenAI)
    • Model storage & caching
  • ✅ Phase 1: API Layer

    • FastAPI with auto-generated OpenAPI spec
    • Model management endpoints (Ollama-style)
    • Transcription endpoints (OpenAI-compatible)
    • Interactive Swagger UI at /docs
    • Health & status endpoints
  • ✅ Phase 2: SDK

    • Auto-generated from OpenAPI spec
    • Clean Python client interface
    • Type-safe with Pydantic models
    • Namespaced APIs (models, audio)
  • ✅ Phase 3: CLI

    • vocal run - Transcribe audio files
    • vocal models list/pull/delete - Model management
    • vocal serve - Start API server
    • Rich console output with progress
  • ✅ Phase 4: Text-to-Speech

    • TTS API endpoints (/v1/audio/speech)
    • Multiple adapters (pyttsx3, Piper)
    • Voice selection and management
    • Speed control and audio output
  • ✅ Phase 5: GPU Optimization

    • Automatic CUDA detection
    • Dynamic compute type selection (float16/int8)
    • VRAM-based optimization
    • CPU multi-threading fallback
    • System device info endpoint
  • ✅ Phase 6: Testing & Production Ready

    • 23 comprehensive E2E integration tests
    • Real audio asset validation (100% accuracy)
    • Full API stack coverage
    • TTS timeout handling
    • Error handling and edge cases
    • All tests passing: 23/23 ✅

Configuration

Environment Variables

Create a .env file:

APP_NAME=Vocal API
VERSION=0.1.0
DEBUG=true
CORS_ORIGINS=["*"]
MAX_UPLOAD_SIZE=26214400
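A minimal sketch of reading these variables with only the standard library. The field names and defaults come from the .env above; Vocal itself may parse them differently (for example via Pydantic settings), so treat this as illustrative:

```python
import json
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    app_name: str = "Vocal API"
    debug: bool = False
    cors_origins: list = field(default_factory=lambda: ["*"])
    max_upload_size: int = 26_214_400  # bytes (~25 MB)

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            app_name=os.getenv("APP_NAME", "Vocal API"),
            debug=os.getenv("DEBUG", "false").lower() == "true",
            # CORS_ORIGINS is written as a JSON list in the .env above
            cors_origins=json.loads(os.getenv("CORS_ORIGINS", '["*"]')),
            max_upload_size=int(os.getenv("MAX_UPLOAD_SIZE", "26214400")),
        )
```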

Model Storage

Models are cached at: ~/.cache/vocal/models/

Contributing

We welcome contributions! Here's how to get started:

Development Setup

# Clone the repository
git clone <repo-url>
cd vocal

# Set up environment
uv venv
uv sync

# Install packages in development mode
uv add --editable packages/core
uv add --editable packages/api
uv add --editable packages/sdk

# Run tests
uv run pytest packages/core/tests -v

Making Changes

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass: uv run pytest
  6. Update documentation if needed
  7. Commit with clear messages: git commit -m "feat: add feature X"
  8. Push and create a pull request

Code Style

  • Follow PEP 8 guidelines
  • Use type hints for all functions
  • Add docstrings for public APIs
  • Keep functions focused and testable

Adding New Models

To add support for a new model provider:

  1. Create a new provider class in packages/core/vocal_core/registry/providers/
  2. Implement the ModelProvider interface
  3. Add tests
  4. Update documentation
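A hedged sketch of step 2: a hypothetical provider that serves models from a local directory. The ModelProvider interface lives in packages/core/vocal_core/registry/providers/, but its actual abstract methods are not shown here, so the two methods below (list_models, download) and the ModelInfo fields are assumptions for illustration only:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelInfo:
    """Stand-in for vocal_core's model metadata (fields assumed)."""
    id: str
    task: str  # "stt" or "tts"

class LocalDirProvider:
    """Hypothetical provider exposing each subdirectory as a model."""

    def __init__(self, root: Path):
        self.root = root

    def list_models(self) -> list[ModelInfo]:
        return [ModelInfo(id=p.name, task="stt")
                for p in sorted(self.root.iterdir()) if p.is_dir()]

    def download(self, model_id: str) -> Path:
        # Models are already local, so "download" just resolves the path.
        path = self.root / model_id
        if not path.is_dir():
            raise FileNotFoundError(f"{model_id} not found in {self.root}")
        return path
```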

Regenerating SDK

When API changes:

# Start API server
uv run uvicorn vocal_api.main:app --port 8000

# Download new OpenAPI spec
curl http://localhost:8000/openapi.json -o packages/sdk/openapi.json

# The SDK client is hand-written; update it manually if the spec changed

License

Server Side Public License (SSPL) v1

Vocal is source-available and the license protects against exploitation:

  • ✅ Free for personal and commercial use
  • ✅ Free for self-hosting
  • ✅ Free to modify and distribute
  • ❌ Cannot offer as SaaS without open-sourcing your infrastructure

See LICENSE for full details.

Roadmap

  • Core model registry with provider pattern
  • Model management API (list, download, delete)
  • SDK generation from OpenAPI spec
  • Interactive Swagger UI docs
  • CLI tool (Typer-based)
  • Text-to-Speech (TTS) support
  • Streaming transcription
  • WebSocket support for real-time transcription
  • Rate limiting middleware
  • Authentication (optional - JWT/API keys)
  • Docker deployment
  • Batch transcription
  • Custom model providers

Credits

Built with: FastAPI, faster-whisper, Piper, Typer, Pydantic, and uv.
