Aavaaz — production-grade speech-to-text platform built on WhisperLive
Aavaaz (आवाज़, "voice" in Hindi) is an open-source extension of WhisperLive that adds enterprise features to compete with Deepgram, ElevenLabs, and AssemblyAI.
Features
| Category | Capabilities |
|---|---|
| Transcription | Real-time WebSocket streaming, REST API (OpenAI-compatible), batch inference, multichannel audio |
| Intelligence | Speaker diarization, sentiment analysis, topic detection, entity extraction, summarization |
| Post-processing | Smart formatting, PII redaction, profanity filtering, noise reduction, utterance/paragraph segmentation |
| Platform | Webhook delivery, transcript search & tagging, storage backends (local/S3), ACL/auth, GDPR compliance, Prometheus metrics |
| Deployment | Docker, Helm charts, Terraform (AWS), serverless (Lambda), Modal (GPU), GPU auto-detection, model caching, SSE streaming |
Quick Start
Option 1: Install from PyPI (Recommended)
# Create a virtualenv (Python 3.12 required)
python3.12 -m venv .venv && source .venv/bin/activate
# Install aavaaz with WhisperLive + ML stack
pip install "aavaaz[whisper]"
# Start the server
aavaaz serve --model large-v3
# Transcribe a file
aavaaz transcribe audio.wav
Option 2: Using uv (Fast & Reproducible)
# Install uv: https://docs.astral.sh/uv/
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create venv with Python 3.12 (required; 3.14 not yet supported by ML deps)
uv venv .venv --python python3.12
source .venv/bin/activate
# Install from PyPI
uv pip install "aavaaz[whisper]"
# Start the server
aavaaz serve --model large-v3
# Transcribe a file
aavaaz transcribe audio.wav
Fedora 43+ / Python 3.14 note: The ML stack (PyTorch, faster-whisper) does not yet publish wheels for Python 3.14. Use python3.12 explicitly when creating the virtualenv. On Fedora: sudo dnf install python3.12
Option 3: Local Development Install
git clone git@github.com:collabora/aavaaz.git
cd aavaaz
python3.12 -m venv .venv && source .venv/bin/activate
# Local editable install
pip install -e .
# With WhisperLive + dev tooling
pip install -e ".[whisper,dev]"
Option 4: Using pip with Requirements Files
# Create a virtualenv (Python 3.12 required)
python3.12 -m venv .venv && source .venv/bin/activate
# Install base + ML stack (large ~20GB download for torch/onnx)
pip install -r requirements/whisper.txt
# Or install just base (fast, no ML):
# pip install -r requirements/base.txt
# Start the server (requires ML stack)
aavaaz serve --model large-v3
# Transcribe a file
aavaaz transcribe audio.wav
# OpenAI-compatible REST endpoint
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav -F model=large-v3
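The same request can be sent from Python. The sketch below uses only the standard library to build the multipart/form-data body that the curl command sends. The helper names (build_multipart, transcribe) are illustrative and not part of Aavaaz, and the reply is assumed to be JSON, as with OpenAI's transcription endpoint.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, model="large-v3",
               url="http://localhost:8000/v1/audio/transcriptions"):
    """POST an audio file to the endpoint above (assumes a JSON reply)."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"model": model}, "file", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running, transcribe("audio.wav") would return the parsed response. The exact response fields depend on the server and are not assumed here.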
Note on Storage
The full ML stack (torch, onnxruntime, torchaudio) requires ~20GB disk space. If you hit disk quota errors, consider:
- Using uv, which is faster and handles large downloads better
- Installing on a machine with more space
- Using serverless deployments (AWS Lambda / Modal) instead of local
Requirements Files
- requirements/base.txt: Core dependencies only (fastapi, uvicorn, boto3)
- requirements/whisper.txt: Full ML stack (torch, whisper-live>=0.9.0, etc.)
- requirements/dev.txt: Development tools (pytest, ruff, etc.)
Architecture
Aavaaz uses WhisperLive as its transcription engine and extends it via the plugin system:
┌─────────────────────────────────────────┐
│              Aavaaz Server              │
│   ┌─────────────────────────────────┐   │
│   │  REST API / WebSocket / Web UI  │   │
│   └──────────────┬──────────────────┘   │
│   ┌──────────────┴──────────────────┐   │
│   │         Plugin Pipeline         │   │
│   │   diarization → formatting →    │   │
│   │  PII redaction → intelligence   │   │
│   └──────────────┬──────────────────┘   │
│   ┌──────────────┴──────────────────┐   │
│   │     WhisperLive Core Engine     │   │
│   │   faster-whisper / TensorRT /   │   │
│   │            OpenVINO             │   │
│   └─────────────────────────────────┘   │
└─────────────────────────────────────────┘
Advanced Features
Word-Level Timestamps
Enable per-word timing and confidence scores in transcription segments:
from aavaaz import AavaazServer
server = AavaazServer()
server.serve(word_timestamps=True)
When enabled, each segment includes a words array:
{
  "segments": [{
    "start": "0.000", "end": "2.500", "text": "Hello world",
    "words": [
      {"word": "Hello", "start": "0.000", "end": "0.800", "probability": 0.95},
      {"word": " world", "start": "0.900", "end": "2.500", "probability": 0.88}
    ]
  }]
}
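Note that in the payload above the start and end values are strings while probability is a number. A minimal sketch of consuming this structure, here to flag low-confidence words for review, might look like the following. The function name and the 0.9 threshold are illustrative choices, not part of Aavaaz.

```python
import json


def low_confidence_words(payload, threshold=0.9):
    """Return (word, start_s, end_s, probability) for words below threshold.

    Timestamps arrive as strings in the payload, so they are converted
    to floats here. Leading spaces on words are stripped.
    """
    flagged = []
    for segment in payload.get("segments", []):
        for word in segment.get("words", []):
            if word["probability"] < threshold:
                flagged.append((
                    word["word"].strip(),
                    float(word["start"]),
                    float(word["end"]),
                    word["probability"],
                ))
    return flagged


# The example payload from above, parsed from JSON text.
sample = json.loads("""
{"segments": [{"start": "0.000", "end": "2.500", "text": "Hello world",
  "words": [
    {"word": "Hello", "start": "0.000", "end": "0.800", "probability": 0.95},
    {"word": " world", "start": "0.900", "end": "2.500", "probability": 0.88}
  ]}]}
""")

print(low_confidence_words(sample))  # [('world', 0.9, 2.5, 0.88)]
```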
Custom Vocabulary / Hotwords
Boost recognition of specific terms (product names, acronyms, domain jargon):
from aavaaz import AavaazServer
server = AavaazServer()
server.serve(hotwords="Aavaaz,TensorRT,OpenVINO")
The hotwords parameter is a comma-separated string passed directly to faster-whisper's keyword boosting. Also available in the REST API via the hotwords form field.
Speaker Diarization
Real-time speaker identification using pyannote.audio embeddings:
pip install pyannote.audio
from aavaaz import AavaazServer
server = AavaazServer()
server.serve(enable_diarization=True, max_speakers=4)
When enabled, completed segments include a speaker field:
{"start": "0.000", "end": "2.500", "text": "Hello", "speaker": "SPEAKER_00", "completed": true}
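On the client side, segments in this shape can be folded into speaker turns. The sketch below merges consecutive completed segments from the same speaker. It assumes segments arrive in time order as dictionaries with the fields shown above; the function name is hypothetical.

```python
def speaker_turns(segments):
    """Merge consecutive completed segments by the same speaker into turns."""
    turns = []
    for seg in segments:
        if not seg.get("completed"):
            continue  # skip partial segments, which may still change
        speaker = seg.get("speaker")
        end = float(seg["end"])
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = end
            turns[-1]["text"] += " " + text
        else:
            turns.append({
                "speaker": speaker,
                "start": float(seg["start"]),
                "end": end,
                "text": text,
            })
    return turns
```

Given two SPEAKER_00 segments followed by one SPEAKER_01 segment, this returns two turns, the first spanning both SPEAKER_00 segments.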
Authentication
Protect both REST API and WebSocket connections with a shared API key:
aavaaz serve --model large-v3 --api-key "my-secret-key"
- REST API: Requires an Authorization: Bearer my-secret-key header
- WebSocket: Requires either an Authorization: Bearer my-secret-key header or a ?token=my-secret-key query parameter
Unauthenticated connections receive HTTP 401 before any GPU resources are allocated.
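A client can prepare either form of credential with a couple of small helpers. This is a sketch: the function names are illustrative, and the WebSocket port (9090) and root path are assumptions based on the other examples in this README.

```python
from urllib.parse import quote, urlencode


def bearer_headers(api_key):
    """Header form, usable for both REST and WebSocket connections."""
    return {"Authorization": f"Bearer {api_key}"}


def websocket_url(host, port, api_key, secure=False):
    """Query-parameter form, for WebSocket clients that cannot set headers.

    The key is percent-encoded so characters such as '&' or spaces
    do not break the query string.
    """
    scheme = "wss" if secure else "ws"
    query = urlencode({"token": api_key}, quote_via=quote)
    return f"{scheme}://{host}:{port}/?{query}"


print(bearer_headers("my-secret-key"))
print(websocket_url("localhost", 9090, "my-secret-key"))
```

Prefer the header form where possible, since query strings are more likely to end up in access logs.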
Rate Limiting
Limit REST API requests per client IP (sliding 60-second window):
aavaaz serve --model large-v3 --rate-limit-rpm 60
Clients exceeding the limit receive HTTP 429.
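To make the sliding-window behavior concrete, here is a minimal sketch of the technique in Python. It is not Aavaaz's implementation: the class name and the injectable clock are assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most limit_rpm requests per client in any 60-second window."""

    def __init__(self, limit_rpm, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit_rpm
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client IP -> request timestamps

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the server would answer with HTTP 429 here
        hits.append(now)
        return True
```

Unlike a fixed per-minute counter, this approach never lets a client send a double burst across a minute boundary, because each request is judged against the 60 seconds immediately preceding it.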
Auto-Reconnect
Automatically reconnect when the WebSocket connection drops unexpectedly:
from whisper_live.client import TranscriptionClient
client = TranscriptionClient(
    "localhost", 9090,
    max_retries=5,
    retry_delay=3,
)
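The two parameters describe a bounded retry loop. The sketch below shows that pattern in isolation. It is not WhisperLive's code: it assumes a fixed delay in seconds between attempts and that max_retries counts retries after the initial attempt, and the function name is hypothetical.

```python
import time


def connect_with_retries(connect, max_retries=5, retry_delay=3,
                         sleep=time.sleep):
    """Call connect() until it succeeds or the retries are exhausted.

    Makes one initial attempt plus up to max_retries further attempts,
    waiting retry_delay seconds before each retry. The last
    ConnectionError is re-raised if every attempt fails.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(retry_delay)
```

A fixed delay is the simplest policy. For many clients reconnecting at once, exponential backoff with jitter is a common alternative that avoids synchronized reconnect storms.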
Batch Inference
Batch multiple client sessions into single GPU calls for higher throughput:
aavaaz serve --model large-v3 --batch-inference --batch-max-size 8 --batch-window-ms 50
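One plausible reading of the two tuning flags is that a batch is dispatched when it reaches the maximum size or when the collection window has elapsed, whichever comes first. The sketch below simulates that policy on a list of arrival times. It is generic micro-batching, not Aavaaz's scheduler, and the function name is hypothetical.

```python
def plan_batches(arrivals, max_size=8, window_ms=50):
    """Group requests into batches by size cap and collection window.

    arrivals is a time-ordered list of (arrival_ms, session_id) pairs.
    A batch opens when its first request arrives and closes when it is
    full or when a request arrives more than window_ms after it opened.
    """
    batches = []
    current, opened_at = [], None
    for t_ms, session in arrivals:
        if current and (len(current) >= max_size
                        or t_ms - opened_at > window_ms):
            batches.append(current)
            current = []
        if not current:
            opened_at = t_ms  # this request opens a new batch
        current.append(session)
    if current:
        batches.append(current)
    return batches


arrivals = [(0, "a"), (10, "b"), (49, "c"), (60, "d"), (61, "e")]
print(plan_batches(arrivals))              # [['a', 'b', 'c'], ['d', 'e']]
print(plan_batches(arrivals, max_size=2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

A larger window raises throughput by filling batches more often, at the cost of up to window_ms of added latency per request.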
Prometheus Metrics
Monitor server health with a Prometheus /metrics endpoint:
aavaaz serve --model large-v3 --metrics-port 9091
Tracks active connections, transcription latency, segment counts, and error rates.
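For a quick check without a full Prometheus setup, the endpoint's text output can be parsed directly. The sketch below handles the common case of the Prometheus text exposition format (samples without trailing timestamps). The metric names in the sample are hypothetical placeholders, not the names Aavaaz actually exports.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {sample_name: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Assumes
    samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.rsplit(None, 1)
        samples[key] = float(value)
    return samples


def fetch_metrics(url="http://localhost:9091/metrics"):
    """Fetch and parse the metrics endpoint started with --metrics-port."""
    with urllib.request.urlopen(url) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical sample output; real metric names may differ.
sample_text = """\
# HELP example_active_connections Currently open connections
# TYPE example_active_connections gauge
example_active_connections 3
example_errors_total{kind="decode"} 2
"""
print(parse_metrics(sample_text))
```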
SSE Streaming
Stream transcription results via Server-Sent Events from the REST API:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav -F stream=true
Returns real-time segment events as text/event-stream.
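A text/event-stream response is a sequence of "data:" lines, with a blank line ending each event. The sketch below extracts the data payloads from any iterable of decoded lines. It assumes each event's data is a JSON object; the "text" field in the example is an assumption about the payload, not a documented schema.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event in lines.

    Multiple data lines within one event are joined with newlines, and
    lines starting with ':' are comments (often used as keep-alives).
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:  # a blank line ends the current event
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:
        # Lenient: also emit a final event that lacks a trailing blank line.
        yield "\n".join(buffer)


stream = [
    'data: {"text": "Hello"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"text": "world"}\n',
    "\n",
]
for payload in iter_sse_data(stream):
    print(json.loads(payload)["text"])
```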
Plugin System
Extend the transcription pipeline with custom post-processors:
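from aavaaz import AavaazServer
from aavaaz.plugins import PluginRegistry

# Placeholder post-processor; the exact callback signature is an assumption
def my_post_processor_fn(segment):
    return segment

registry = PluginRegistry()
registry.register("my_plugin", my_post_processor_fn, priority=50)
server = AavaazServer(plugin_registry=registry)
server.serve()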
Plugins receive each transcription segment and can modify, enrich, or filter it before delivery to the client.
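As an illustration of what a post-processor might do, the sketch below masks email addresses in a segment's text. It assumes a plugin is a callable that takes a segment dictionary with a "text" field and returns the (possibly modified) segment. That signature is a guess based on the description above, so check the plugin documentation before relying on it.

```python
import re

# Deliberately simple pattern; real-world email matching is looser than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(segment):
    """Return a copy of the segment with email addresses masked."""
    redacted = dict(segment)  # leave the caller's segment untouched
    redacted["text"] = EMAIL_RE.sub("[EMAIL]", segment.get("text", ""))
    return redacted


segment = {"start": "0.000", "end": "2.500",
           "text": "Mail me at jane.doe@example.com today"}
print(redact_emails(segment)["text"])  # Mail me at [EMAIL] today
```

Under the same assumption, it would be registered like any other plugin, for example registry.register("redact_emails", redact_emails, priority=50).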
Scaling Guide
Single GPU
aavaaz serve --model large-v3 --batch-inference
Multi-GPU (Docker Compose)
services:
  aavaaz:
    image: collabora/aavaaz:latest
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    ports:
      - "9090:9090"
      - "8000:8000"
Kubernetes (Helm)
helm install aavaaz deploy/helm/aavaaz \
--set model=large-v3 \
--set replicas=3 \
--set gpu.enabled=true
AWS (Terraform)
cd deploy/terraform
terraform init
terraform apply -var="model=large-v3" -var="api_key=my-secret"
Provisions VPC, ALB, ECS with GPU instances (g5.xlarge), ECR, and CloudWatch. See deploy/terraform/README.md for full options.
AWS Lambda (Serverless)
For batch file transcription without managing servers:
Production role: Lambda is the batch transcription path.
# Build and push the Lambda container image
docker build -f Dockerfile.lambda --build-arg WHISPER_MODEL=small -t aavaaz-lambda .
# Deploy infrastructure
cd deploy/terraform-lambda
terraform init
terraform apply
# Upload audio — transcript appears automatically in the output bucket
aws s3 cp recording.wav s3://$(terraform output -raw audio_input_bucket)/
# Or use the REST API
curl -X POST $(terraform output -raw api_endpoint) \
-H "Content-Type: application/json" \
-d '{"audio_url": "s3://my-bucket/recording.wav"}'
See docs/SERVERLESS.md for full configuration, model selection, cost estimates, and limitations.
Modal (GPU Serverless)
Deploy on Modal for on-demand GPU transcription with zero infrastructure:
Production role: Modal is used for live WebSocket transcription (deploy/modal/app_live.py).
The deploy/modal/app.py endpoint is an optional GPU batch API.
cd deploy/modal
pip install modal
modal setup
modal deploy app_live.py
# Optional: deploy the GPU batch API endpoint
# modal deploy app.py
# Transcribe
# Live websocket URL is exposed by app_live.py deployment output.
Auto-scales to zero when idle; GPU containers spin up in seconds. See docs/MODAL.md for full configuration.
Development
git clone git@github.com:collabora/aavaaz.git
cd aavaaz
pip install -e ".[dev]"
pytest
License