
Vocal

Generic Speech AI Platform - Ollama for Voice Models

Vocal is an API-first speech AI platform with automatic OpenAPI spec generation, auto-generated SDK, and Ollama-style model management. Built with a generic registry pattern supporting multiple providers.

License: SSPL | Python 3.11+

🚀 Quick Start (5 minutes)

# 1. Clone and setup
git clone <repo-url>
cd vocal
make install

# 2. Start API
make serve

# 3. Visit interactive docs
# Open: http://localhost:8000/docs

# 4. Use SDK to transcribe
python sdk_example.py your_audio.mp3

That's it! Models auto-download on first use.

Pro tip: Run make help to see all available commands.

Features

  • 🎯 API-First Architecture: FastAPI with auto-generated OpenAPI spec
  • 📖 Interactive Docs: Swagger UI at /docs endpoint
  • 📦 Auto-Generated SDK: Python SDK generated from OpenAPI spec
  • 🔄 Ollama-Style: Model registry with pull/list/delete commands
  • 🚀 Fast Inference: faster-whisper (4x faster than OpenAI Whisper)
  • ⚡ GPU Acceleration: Automatic CUDA detection with VRAM optimization
  • 🌍 99+ Languages: Support for multilingual transcription
  • 🔌 Extensible: Generic provider pattern (HuggingFace, local, custom)
  • 🎤 OpenAI Compatible: /v1/audio/transcriptions endpoint
  • 🔊 Text-to-Speech: Neural TTS with Piper or system voices
  • 🎨 CLI Tool: Typer-based CLI with rich console output
  • ✅ Production Ready: 23/23 E2E tests passing with real audio assets

Quick Start

1. Installation

git clone <repo-url>
cd vocal

# Option 1: Using Makefile (recommended)
make install

# Option 2: Using uv directly
uv venv
uv sync

2. Start API Server

# Using Makefile
make serve

# Or using uv directly
uv run uvicorn vocal_api.main:app --port 8000

# Development mode with auto-reload
make serve-dev

The API will be available at http://localhost:8000, with interactive docs at /docs and the OpenAPI spec at /openapi.json.

3. Use the SDK

from vocal_sdk import VocalSDK

# Initialize client
client = VocalSDK(base_url="http://localhost:8000")

# List models (Ollama-style)
models = client.models.list()
for model in models['models']:
    print(f"{model['id']}: {model['status']}")

# Download model if needed (Ollama-style pull)
client.models.download("Systran/faster-whisper-tiny")

# Transcribe audio (OpenAI-compatible)
result = client.audio.transcribe(
    file="audio.mp3",
    model="Systran/faster-whisper-tiny"
)
print(result['text'])

# Text-to-Speech
audio = client.audio.text_to_speech("Hello, world!")
with open("output.wav", "wb") as f:
    f.write(audio)

Or use the CLI:

# Transcribe audio
vocal run audio.mp3

# List models
vocal models list

# Download model
vocal models pull Systran/faster-whisper-tiny

# Start API server
vocal serve --port 8000

Or use the example:

uv run python sdk_example.py Recording.m4a

Architecture

See VOICESTACK_API_FIRST_ARCHITECTURE.md for detailed architecture documentation.

Key Principles:

  • API-first design with auto-generated OpenAPI spec
  • Generic registry pattern for extensibility
  • Ollama-style model management
  • OpenAI-compatible endpoints
  • Type-safe throughout with Pydantic

vocal/
├── packages/
│   ├── core/           # Model registry & adapters ✅
│   │   └── vocal_core/
│   │       ├── registry/      # Generic model registry
│   │       │   ├── providers/ # HuggingFace, local, custom
│   │       │   └── model_info.py
│   │       └── adapters/      # STT/TTS adapters
│   │           └── stt/       # faster-whisper implementation
│   │
│   ├── api/            # FastAPI server ✅
│   │   └── vocal_api/
│   │       ├── models/        # Pydantic schemas
│   │       ├── routes/        # API endpoints
│   │       ├── services/      # Business logic
│   │       └── main.py        # FastAPI app
│   │
│   ├── sdk/            # Auto-generated Python SDK ⏳
│   └── cli/            # CLI using SDK ⏳
│
├── pyproject.toml      # uv workspace config
└── .gitignore

API Endpoints

Model Management (Ollama-style)

GET /v1/models

List all available models

Query params:

  • status: Filter by status (available, downloading, not_downloaded)
  • task: Filter by task (stt, tts)

GET /v1/models/{model_id}

Get model information

POST /v1/models/{model_id}/download

Download a model (Ollama-style "pull")

GET /v1/models/{model_id}/download/status

Check download progress

DELETE /v1/models/{model_id}

Delete a downloaded model
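The download-then-poll flow above maps to plain HTTP calls. A minimal sketch of the polling half, written with an injectable `get_status` callable so it is easy to test; the idea that the status payload carries a `status` field with the values from the list endpoint's filters (available / downloading / not_downloaded) is an assumption, not a documented schema:

```python
import time
from typing import Callable

def wait_until_available(get_status: Callable[[], str],
                         timeout_s: float = 600.0,
                         poll_s: float = 2.0) -> None:
    """Poll get_status() until it reports 'available' or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "available":
            return
        time.sleep(poll_s)
    raise TimeoutError("model did not finish downloading in time")

# Wiring against the endpoints above would look roughly like (untested sketch):
#   requests.post(f"{base}/v1/models/{model_id}/download")
#   wait_until_available(lambda: requests.get(
#       f"{base}/v1/models/{model_id}/download/status").json()["status"])
```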

Audio Transcription (OpenAI-compatible)

POST /v1/audio/transcriptions

Transcribe audio to text.

Parameters:

  • file (required): Audio file (mp3, wav, m4a, etc.)
  • model (required): Model ID (e.g., "Systran/faster-whisper-tiny")
  • language (optional): 2-letter language code (e.g., "en", "es")
  • response_format (optional): "json" (default), "text", "srt", "vtt"
  • temperature (optional): Sampling temperature (0.0-1.0, default: 0.0)

Response:

{
  "text": "Hello, how are you today?",
  "language": "en",
  "duration": 2.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, how are you today?"
    }
  ]
}
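When you request response_format "json", the segments array above carries everything needed to build subtitles client-side. A small illustrative helper (the function is not part of the SDK; the timestamp layout follows the standard SRT format):

```python
def to_srt(segments: list[dict]) -> str:
    """Render transcription segments (as in the JSON response above) as SRT."""
    def ts(seconds: float) -> str:
        # SRT timestamps: HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

segments = [{"id": 0, "start": 0.0, "end": 2.5, "text": "Hello, how are you today?"}]
print(to_srt(segments))
```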

POST /v1/audio/translations

Translate audio to English text.

Text-to-Speech (OpenAI-compatible)

POST /v1/audio/speech

Convert text to speech.

Parameters:

  • model (required): TTS model to use (e.g., "hexgrad/Kokoro-82M", "coqui/XTTS-v2")
  • input (required): Text to synthesize
  • voice (optional): Voice ID to use
  • speed (optional): Speech speed multiplier (0.25-4.0, default: 1.0)
  • response_format (optional): Audio format (default: "wav")

Response: Returns audio file in specified format with headers:

  • X-Duration: Audio duration in seconds
  • X-Sample-Rate: Audio sample rate

GET /v1/audio/voices

List available TTS voices.

Response:

{
  "voices": [
    {
      "id": "default",
      "name": "Default Voice",
      "language": "en",
      "gender": null
    }
  ],
  "total": 1
}

Health & Docs

GET /health

Health check endpoint

GET /docs

Interactive Swagger UI for API testing

GET /openapi.json

OpenAPI specification (auto-generated)

Available Models

Model ID                                  Size    Parameters  VRAM   Speed        Engine
Systran/faster-whisper-tiny               ~75MB   39M         1GB+   Fastest      CTranslate2
Systran/faster-whisper-base               ~145MB  74M         1GB+   Fast         CTranslate2
Systran/faster-whisper-small              ~488MB  244M        2GB+   Good         CTranslate2
Systran/faster-whisper-medium             ~1.5GB  769M        5GB+   Better       CTranslate2
Systran/faster-whisper-large-v3           ~3.1GB  1.5B        10GB+  Best         CTranslate2
Systran/faster-distil-whisper-large-v3    ~756MB  809M        6GB+   Fast & Good  CTranslate2

All models support 99+ languages including English, Spanish, French, German, Chinese, Japanese, Arabic, and more.

Note: These use the CTranslate2-optimized models from Systran for faster-whisper, which are ~4x faster than the original OpenAI Whisper models.

Performance & Optimization

Vocal automatically detects and optimizes for your hardware:

GPU Acceleration

When NVIDIA GPU is available:

  • Automatic Detection: GPU is detected and used automatically
  • Optimal Compute Types:
    • 8GB+ VRAM: float16 (best quality)
    • 4-8GB VRAM: int8_float16 (balanced)
    • <4GB VRAM: int8 (most efficient)
  • 4x-10x Faster: GPU inference is significantly faster than CPU
  • Memory Management: Automatic GPU cache clearing

CPU Optimization

When GPU is not available:

  • Multi-threading: Uses optimal CPU threads based on core count
  • Quantization: int8 quantization for faster CPU inference
  • VAD Filtering: Voice Activity Detection for improved performance

Check Your Device

# View device info via API
curl http://localhost:8000/v1/system/device

# Or via SDK
from vocal_sdk import VocalSDK
client = VocalSDK()
info = client._request('GET', '/v1/system/device')
print(info)

Example output:

{
  "platform": "Windows",
  "cpu_count": 16,
  "cuda_available": true,
  "gpu_count": 1,
  "gpu_devices": [{
    "name": "NVIDIA GeForce RTX 4090",
    "vram_gb": 24.0,
    "compute_capability": "8.9"
  }]
}

Optimization Tips

  1. GPU Usage: Models automatically use GPU when available
  2. Model Selection:
    • tiny/base models: Work well on CPU
    • small/medium: Best on GPU with 4GB+ VRAM
    • large: Requires GPU with 8GB+ VRAM
  3. Batch Processing: Load model once, transcribe multiple files
  4. VAD Filter: Enabled by default for better performance
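Tip 3 in code: reuse one client (and therefore one loaded model) across many files rather than reconnecting per file. The helper below is a sketch that accepts anything exposing the audio.transcribe method shown earlier; the function itself is not part of the SDK:

```python
def transcribe_batch(client, paths, model="Systran/faster-whisper-tiny"):
    """Transcribe many files with one client so the model loads only once."""
    results = {}
    for path in paths:
        results[path] = client.audio.transcribe(file=path, model=model)["text"]
    return results

# Usage with the real SDK would look like:
#   from vocal_sdk import VocalSDK
#   texts = transcribe_batch(VocalSDK(), ["a.mp3", "b.wav"])
```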

CLI Usage

The CLI provides an intuitive command-line interface for common tasks.

Transcription

# Transcribe audio file
vocal run audio.mp3

# Specify model
vocal run audio.mp3 --model Systran/faster-whisper-base

# Specify language
vocal run audio.mp3 --language en

# Output formats
vocal run audio.mp3 --format text
vocal run audio.mp3 --format json
vocal run audio.mp3 --format srt
vocal run audio.mp3 --format vtt

Model Management

# List all models
vocal models list

# Filter by status
vocal models list --status available
vocal models list --status not_downloaded

# Download a model
vocal models pull Systran/faster-whisper-tiny

# Delete a model
vocal models delete Systran/faster-whisper-tiny
vocal models delete Systran/faster-whisper-tiny --force

Server Management

# Start API server (default: http://0.0.0.0:8000)
vocal serve

# Custom host and port
vocal serve --host localhost --port 9000

# Enable auto-reload for development
vocal serve --reload

Development

Project Structure

The project uses a uv workspace with multiple packages:

  • packages/core: Core model registry and adapters (no dependencies on API)
  • packages/api: FastAPI server (depends on core)
  • packages/sdk: Auto-generated SDK (generates from API OpenAPI spec)
  • packages/cli: CLI tool (uses SDK)

Running Tests

All tests use real audio assets from test_assets/audio/ with validated transcriptions.

Quick Validation (< 30 seconds)

# Using Makefile
make test-quick

# Or directly
uv run python scripts/validate.py

Full E2E Test Suite (~ 2 minutes)

# Using Makefile
make test

# With verbose output
make test-verbose

# Or using pytest directly
uv run python -m pytest tests/test_e2e.py -v

Current Status: 23/23 tests passing ✅

Test coverage includes:

  • API health and device information (GPU detection)
  • Model management (list, download, status, delete)
  • Audio transcription with real M4A and MP3 files
  • Text-to-Speech synthesis with speed control
  • Error handling for invalid models and files
  • Performance and model reuse optimization

Check GPU Support

make gpu-check

Code Quality

# Using Makefile
make lint          # Check code quality
make format        # Format code
make check         # Lint + format check

# Or using ruff directly
uv run ruff format .
uv run ruff check .

Makefile Commands

Vocal includes a comprehensive Makefile for common tasks:

make help          # Show all available commands

# Setup
make install       # Install dependencies
make sync          # Sync dependencies

# Testing
make test          # Run full test suite
make test-quick    # Quick validation
make test-verbose  # Verbose test output
make gpu-check     # Check GPU detection

# Development
make serve         # Start API server
make serve-dev     # Start with auto-reload
make cli           # Show CLI help
make docs          # Open API docs in browser

# Code Quality
make lint          # Run linter
make format        # Format code
make check         # Lint + format check

# Cleanup
make clean         # Remove cache files
make clean-models  # Remove downloaded models

# Quick aliases
make t             # Alias for test
make s             # Alias for serve
make l             # Alias for lint
make f             # Alias for format

Implementation Status

  • ✅ Phase 0: Core Foundation

    • Generic model registry with provider pattern
    • HuggingFace provider with automatic downloads
    • faster-whisper adapter (4x faster than OpenAI)
    • Model storage & caching
  • ✅ Phase 1: API Layer

    • FastAPI with auto-generated OpenAPI spec
    • Model management endpoints (Ollama-style)
    • Transcription endpoints (OpenAI-compatible)
    • Interactive Swagger UI at /docs
    • Health & status endpoints
  • ✅ Phase 2: SDK

    • Auto-generated from OpenAPI spec
    • Clean Python client interface
    • Type-safe with Pydantic models
    • Namespaced APIs (models, audio)
  • ✅ Phase 3: CLI

    • vocal run - Transcribe audio files
    • vocal models list/pull/delete - Model management
    • vocal serve - Start API server
    • Rich console output with progress
  • ✅ Phase 4: Text-to-Speech

    • TTS API endpoints (/v1/audio/speech)
    • Multiple adapters (pyttsx3, Piper)
    • Voice selection and management
    • Speed control and audio output
  • ✅ Phase 5: GPU Optimization

    • Automatic CUDA detection
    • Dynamic compute type selection (float16/int8)
    • VRAM-based optimization
    • CPU multi-threading fallback
    • System device info endpoint
  • ✅ Phase 6: Testing & Production Ready

    • 23 comprehensive E2E integration tests
    • Real audio asset validation (100% accuracy)
    • Full API stack coverage
    • TTS timeout handling
    • Error handling and edge cases
    • All tests passing: 23/23 ✅

Configuration

Environment Variables

Create a .env file:

APP_NAME=Vocal API
VERSION=0.1.0
DEBUG=true
CORS_ORIGINS=["*"]
MAX_UPLOAD_SIZE=26214400
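A minimal sketch of reading these variables with only the standard library. The field names and defaults come from the .env above; Vocal itself may parse them differently (for example via Pydantic settings), so treat this as illustrative:

```python
import json
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    app_name: str = "Vocal API"
    debug: bool = False
    cors_origins: list = field(default_factory=lambda: ["*"])
    max_upload_size: int = 26_214_400  # bytes (~25 MB)

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            app_name=os.getenv("APP_NAME", "Vocal API"),
            debug=os.getenv("DEBUG", "false").lower() == "true",
            # CORS_ORIGINS is written as a JSON list in the .env above
            cors_origins=json.loads(os.getenv("CORS_ORIGINS", '["*"]')),
            max_upload_size=int(os.getenv("MAX_UPLOAD_SIZE", "26214400")),
        )
```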

Model Storage

Models are cached at: ~/.cache/vocal/models/

Contributing

We welcome contributions! Here's how to get started:

Development Setup

# Clone the repository
git clone <repo-url>
cd vocal

# Set up environment
uv venv
uv sync

# Install packages in development mode
uv add --editable packages/core
uv add --editable packages/api
uv add --editable packages/sdk

# Run tests
uv run pytest packages/core/tests -v

Making Changes

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass: uv run pytest
  6. Update documentation if needed
  7. Commit with clear messages: git commit -m "feat: add feature X"
  8. Push and create a pull request

Code Style

  • Follow PEP 8 guidelines
  • Use type hints for all functions
  • Add docstrings for public APIs
  • Keep functions focused and testable

Adding New Models

To add support for a new model provider:

  1. Create a new provider class in packages/core/vocal_core/registry/providers/
  2. Implement the ModelProvider interface
  3. Add tests
  4. Update documentation
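A hedged sketch of step 2: a hypothetical provider that serves models from a local directory. The ModelProvider interface lives in packages/core/vocal_core/registry/providers/, but its actual abstract methods are not shown here, so the two methods below (list_models, download) and the ModelInfo fields are assumptions for illustration only:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelInfo:
    """Stand-in for vocal_core's model metadata (fields assumed)."""
    id: str
    task: str  # "stt" or "tts"

class LocalDirProvider:
    """Hypothetical provider exposing each subdirectory as a model."""

    def __init__(self, root: Path):
        self.root = root

    def list_models(self) -> list[ModelInfo]:
        return [ModelInfo(id=p.name, task="stt")
                for p in sorted(self.root.iterdir()) if p.is_dir()]

    def download(self, model_id: str) -> Path:
        # Models are already local, so "download" just resolves the path.
        path = self.root / model_id
        if not path.is_dir():
            raise FileNotFoundError(f"{model_id} not found in {self.root}")
        return path
```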

Regenerating SDK

When API changes:

# Start API server
uv run uvicorn vocal_api.main:app --port 8000

# Download new OpenAPI spec
curl http://localhost:8000/openapi.json -o packages/sdk/openapi.json

# The SDK client is hand-written; update it manually if the spec changed

License

Server Side Public License (SSPL) v1

Vocal is source-available and the license protects against exploitation:

  • ✅ Free for personal and commercial use
  • ✅ Free for self-hosting
  • ✅ Free to modify and distribute
  • ❌ Cannot offer as SaaS without open-sourcing your infrastructure

See LICENSE for full details.

Roadmap

  • Core model registry with provider pattern
  • Model management API (list, download, delete)
  • SDK generation from OpenAPI spec
  • Interactive Swagger UI docs
  • CLI tool (Typer-based)
  • Text-to-Speech (TTS) support
  • Streaming transcription
  • WebSocket support for real-time transcription
  • Rate limiting middleware
  • Authentication (optional - JWT/API keys)
  • Docker deployment
  • Batch transcription
  • Custom model providers

Credits

Built with: FastAPI, faster-whisper, Piper, Typer, Pydantic, and uv.
