Macaw OpenVoice

Voice runtime (STT + TTS) with OpenAI-compatible API

Production Voice Runtime Infrastructure

Real-time Speech-to-Text and Text-to-Speech with an OpenAI-compatible API, streaming session control, and an extensible execution architecture.

Overview

Macaw OpenVoice is a production-grade runtime for voice systems.

It standardizes and operationalizes the execution of Speech-to-Text (STT) and Text-to-Speech (TTS) models in real environments by providing:

  • a unified execution interface for multiple inference engines
  • real-time audio streaming with controlled latency
  • continuous session management
  • bidirectional speech interaction
  • operational observability
  • production-ready APIs

Macaw acts as the infrastructure layer between voice models and production applications, abstracting complexity related to streaming, synchronization, state management, and execution control.

Technology Positioning

Macaw OpenVoice plays the same role for voice systems that:

  • vLLM plays for LLM serving
  • Triton Inference Server plays for GPU inference
  • Ollama plays for local model execution

It transforms voice models into operational services.


Core Capabilities

Unified Interface

  • OpenAI-compatible Audio API
  • Real-time full-duplex WebSocket streaming
  • Local runtime CLI

Bidirectional Speech Streaming

  • simultaneous STT and TTS in the same session
  • automatic speech detection
  • barge-in support (interruptible speech)
  • automatic mute during synthesis

Session Management

  • state machine for continuous audio processing
  • ring buffer with persistence
  • crash recovery without context loss
  • cross-segment coherence
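The ring buffer named in this list can be pictured with a short sketch. This is a minimal in-memory illustration, not Macaw's implementation: the RingBuffer class and its method names are hypothetical, and the persistence and crash-recovery parts are deliberately left out.

```python
class RingBuffer:
    """Fixed-capacity byte buffer that overwrites the oldest audio first.

    Hypothetical illustration only; persistence is not modeled here.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._total = 0  # total number of bytes ever written

    def write(self, data: bytes) -> None:
        # Only the newest `capacity` bytes can survive, so skip older ones.
        if len(data) > self._capacity:
            self._total += len(data) - self._capacity
            data = data[-self._capacity:]
        pos = self._total % self._capacity
        first = min(len(data), self._capacity - pos)
        self._buf[pos:pos + first] = data[:first]
        # Wrap around to the start of the buffer for the remainder.
        rest = len(data) - first
        self._buf[0:rest] = data[first:]
        self._total += len(data)

    def read_last(self, n: int) -> bytes:
        """Return up to the `n` most recently written bytes, oldest first."""
        n = min(n, self._total, self._capacity)
        if n <= 0:
            return b""
        end = self._total % self._capacity
        start = (end - n) % self._capacity
        if start < end:
            return bytes(self._buf[start:end])
        return bytes(self._buf[start:]) + bytes(self._buf[:end])
```

A session could keep the last few seconds of audio this way, so that a segment can be re-decoded after a worker restart without asking the client to resend it.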

Audio Processing Pipeline

  • automatic resampling
  • DC offset removal
  • gain normalization
  • voice activity detection
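Two of the stages in this list, DC offset removal and gain normalization, are simple enough to sketch. The functions below are assumptions made for illustration: mean subtraction and peak normalization are one common way to do each step, and the source does not say which method Macaw uses. Input is assumed to be little-endian 16-bit mono PCM.

```python
import struct


def pcm16_to_float(pcm: bytes) -> list[float]:
    """Decode little-endian 16-bit PCM into floats in [-1.0, 1.0)."""
    n = len(pcm) // 2
    return [s / 32768.0 for s in struct.unpack(f"<{n}h", pcm[: n * 2])]


def remove_dc_offset(samples: list[float]) -> list[float]:
    """Subtract the mean so the waveform is centered on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def normalize_peak(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Chaining them as normalize_peak(remove_dc_offset(pcm16_to_float(frame))) mirrors the order shown in the architecture diagram, minus resampling.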

Multi-Engine Execution

  • multiple STT and TTS engines
  • subprocess isolation
  • declarative model registry
  • pluggable architecture

Operational Control

  • priority-based scheduler
  • dynamic batching
  • latency tracking
  • Prometheus metrics

Production Use Cases

Macaw is designed for real-world voice workloads:

  • real-time conversational voice agents
  • telephony automation (SIP / VoIP)
  • live transcription systems
  • embedded voice interfaces
  • multimodal assistants
  • interactive media streaming
  • continuous audio processing pipelines

Quick Start

# Install (the quotes stop shells such as zsh from expanding the brackets)
pip install "macaw-openvoice[server,grpc,faster-whisper]"

# Pull a model
macaw pull faster-whisper-tiny

# Start the runtime
macaw serve
$ macaw serve
  ╔══════════════════════════════════════════════╗
  ║         Macaw OpenVoice v1.0.0              ║
  ╚══════════════════════════════════════════════╝

INFO     Scanning models in ~/.macaw/models
INFO     Found 2 model(s): faster-whisper-tiny (STT), kokoro-v1 (TTS)
INFO     Spawning STT worker   faster-whisper-tiny  port=50051  engine=faster-whisper
INFO     Spawning TTS worker   kokoro-v1            port=50052  engine=kokoro
INFO     Scheduler started     aging=30.0s  batch_ms=75.0  batch_max=8
INFO     Uvicorn running on http://127.0.0.1:8000

Transcribe a file

# Via REST API
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny

# Via CLI
macaw transcribe audio.wav --model faster-whisper-tiny

Streaming via WebSocket

wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny"
# Send binary audio frames, receive JSON transcript events
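The same flow can be scripted from Python. The sketch below is an illustration under several assumptions: it uses the third-party websockets package, expects a mono 16-bit WAV file, paces frames at roughly real time, and simply prints whatever JSON events arrive. The function names and the 20 ms frame size are hypothetical choices, not values taken from Macaw.

```python
import asyncio
import json
import wave
from typing import Iterator


def pcm_frames(pcm: bytes, sample_rate: int, frame_ms: int = 20) -> Iterator[bytes]:
    """Split mono 16-bit PCM into chunks of about frame_ms milliseconds."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    if frame_bytes <= 0:
        raise ValueError("sample_rate and frame_ms give an empty frame")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


async def stream_wav(path: str, uri: str) -> None:
    """Send a WAV file as binary frames and print JSON events as they arrive."""
    import websockets  # third-party dependency: pip install websockets

    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        pcm = wav.readframes(wav.getnframes())

    async with websockets.connect(uri) as ws:

        async def send_audio() -> None:
            for frame in pcm_frames(pcm, sample_rate):
                await ws.send(frame)       # bytes go out as a binary frame
                await asyncio.sleep(0.02)  # pace the audio roughly in real time

        sender = asyncio.create_task(send_audio())
        try:
            async for message in ws:          # loops until the socket closes
                if isinstance(message, str):  # text frames carry JSON events
                    print(json.loads(message))
        finally:
            sender.cancel()
```

To try it against a local runtime, call asyncio.run(stream_wav("audio.wav", "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny")) and stop it with Ctrl+C once the transcript events stop.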

Text-to-Speech

# Assumes kokoro-v1 is already installed (macaw pull kokoro-v1)
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Hello, how can I help you?", "voice": "default"}' \
  --output speech.wav

Voice Cloning

# Pull the voice cloning model
macaw pull qwen3-tts-0.6b-base

# Clone a voice from ~3 seconds of reference audio
# (-w0 disables line wrapping in GNU base64; on macOS use: base64 -i reference.wav)
REF_AUDIO=$(base64 -w0 reference.wav)

curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"qwen3-tts-0.6b-base\", \"input\": \"Text with cloned voice\", \"language\": \"English\", \"ref_audio\": \"$REF_AUDIO\", \"ref_text\": \"Transcript of reference\"}" \
  --output cloned.wav
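Building the same request in Python avoids the shell quoting around the JSON body. This is a sketch using only the standard library; the field names (model, input, language, ref_audio, ref_text) come from the curl example above, while the helper names build_clone_request and synthesize are hypothetical.

```python
import base64
import json
import urllib.request


def build_clone_request(
    ref_audio: bytes,
    ref_text: str,
    text: str,
    model: str = "qwen3-tts-0.6b-base",
    language: str = "English",
) -> dict:
    """Assemble the JSON body, base64-encoding the reference audio."""
    return {
        "model": model,
        "input": text,
        "language": language,
        "ref_audio": base64.b64encode(ref_audio).decode("ascii"),
        "ref_text": ref_text,
    }


def synthesize(payload: dict, url: str = "http://localhost:8000/v1/audio/speech") -> bytes:
    """POST the payload and return the raw audio bytes from the response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A typical call reads reference.wav in binary mode, passes the bytes to build_clone_request together with its transcript, and writes the result of synthesize to cloned.wav.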

Architecture

                         Clients
          CLI / REST / WebSocket (full-duplex)
                           |
                           v
  +----------------------------------------------------+
  |              API Server (FastAPI)                  |
  |                                                    |
  |  POST /v1/audio/transcriptions    (STT batch)      |
  |  POST /v1/audio/translations      (STT translate)  |
  |  POST /v1/audio/speech            (TTS)            |
  |  GET  /v1/voices                  (list voices)    |
  |  POST /v1/voices                  (save voice)     |
  |  WS   /v1/realtime                (STT+TTS)        |
  +----------------------------------------------------+
  |              Scheduler                             |
  |  Priority queue (realtime > batch), cancellation,  |
  |  dynamic batching, latency tracking                |
  +----------------------------------------------------+
  |              Model Registry                        |
  |  Declarative manifest (macaw.yaml), lifecycle      |
  +----------+-------------------+---------------------+
             |                   |
    +--------+--------+  +------+-------+
    |  STT Workers    |  |  TTS Workers |
    |  (subprocess    |  |  (subprocess |
    |   gRPC)         |  |   gRPC)      |
    |                 |  |              |
    | Faster-Whisper  |  | Kokoro       |
    |                 |  | Qwen3-TTS    |
    +-----------------+  +--------------+
             |
  +----------+-------------------------------------+
  |  Audio Preprocessing Pipeline                  |
  |  Resample -> DC Remove -> Gain Normalize       |
  +------------------------------------------------+
  |  Session Manager (STT only)                    |
  |  6 states, ring buffer, WAL, LocalAgreement,   |
  |  cross-segment context, crash recovery         |
  +------------------------------------------------+
  |  VAD (Energy Pre-filter + Silero VAD)          |
  +------------------------------------------------+
  |  Post-Processing (ITN via NeMo)                |
  +------------------------------------------------+
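The dynamic batching named in the Scheduler box can be illustrated with a small collector. This is a sketch under assumptions: it treats the batch_ms=75.0 and batch_max=8 values from the startup log as a time window and a size cap, and the collect_batch function itself is hypothetical rather than part of Macaw.

```python
import queue
import time


def collect_batch(
    requests: queue.Queue,
    batch_max: int = 8,
    batch_ms: float = 75.0,
) -> list:
    """Gather up to batch_max items, waiting at most batch_ms after the first."""
    batch = [requests.get()]  # block until there is at least one request
    deadline = time.monotonic() + batch_ms / 1000.0
    while len(batch) < batch_max:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window elapsed: ship what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived inside the window
    return batch
```

The trade-off is the usual one: a longer window yields fuller batches and better throughput, at the cost of added latency for the first request in each batch.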

Supported Models

11 models are available via macaw pull across 3 engines, plus a built-in VAD.

Speech-to-Text

Model                    Engine          Size    Languages  Hardware         Install
faster-whisper-large-v3  Faster-Whisper  3 GB    100+       GPU recommended  macaw pull faster-whisper-large-v3
faster-whisper-medium    Faster-Whisper  1.5 GB  100+       GPU recommended  macaw pull faster-whisper-medium
faster-whisper-small     Faster-Whisper  512 MB  100+       CPU / GPU        macaw pull faster-whisper-small
faster-whisper-tiny      Faster-Whisper  256 MB  100+       CPU              macaw pull faster-whisper-tiny
distil-whisper-large-v3  Faster-Whisper  1.5 GB  English    GPU recommended  macaw pull distil-whisper-large-v3

All STT models support: streaming partials (LocalAgreement), word timestamps, language detection, translation (to English), and batch inference.
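The LocalAgreement policy behind streaming partials can be sketched as follows. In its usual two-hypothesis form, a word is confirmed once two consecutive hypotheses agree on it as part of their common prefix. The class below is a simplified illustration with hypothetical names; it assumes each hypothesis covers the whole current segment as a list of words, which may differ from how Macaw applies the policy.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest run of leading words on which both hypotheses agree."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Confirm words once two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self._previous: list[str] = []
        self.committed: list[str] = []

    def update(self, hypothesis: list[str]) -> list[str]:
        """Feed a new hypothesis and return the newly confirmed words."""
        agreed = common_prefix(self._previous, hypothesis)
        self._previous = list(hypothesis)
        # Confirmed words are never retracted, so only growth is applied.
        newly_confirmed = agreed[len(self.committed):]
        if newly_confirmed:
            self.committed = agreed
        return newly_confirmed
```

Words still in flux stay in the partial transcript, and only the agreed prefix moves into the stable output.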

Text-to-Speech

Model                  Engine     Size (params)  Languages  Capability                       Install
kokoro-v1              Kokoro     82M            8          Preset voices                    macaw pull kokoro-v1
qwen3-tts-0.6b-custom  Qwen3-TTS  0.6B           10         9 preset speakers                macaw pull qwen3-tts-0.6b-custom
qwen3-tts-1.7b-custom  Qwen3-TTS  1.7B           10         9 preset speakers                macaw pull qwen3-tts-1.7b-custom
qwen3-tts-0.6b-base    Qwen3-TTS  0.6B           10         Voice cloning (~3s ref)          macaw pull qwen3-tts-0.6b-base
qwen3-tts-1.7b-base    Qwen3-TTS  1.7B           10         Voice cloning (~3s ref)          macaw pull qwen3-tts-1.7b-base
qwen3-tts-1.7b-design  Qwen3-TTS  1.7B           10         Voice design (natural language)  macaw pull qwen3-tts-1.7b-design

Voice Activity Detection

Model       Type                            License
Silero VAD  Energy pre-filter + Neural VAD  MIT

VAD runs as a built-in library in the runtime (not a worker subprocess).
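The energy pre-filter stage can be sketched as a cheap loudness gate that runs before the neural model. The code below is an assumption-laden illustration: it measures RMS level in dBFS on little-endian 16-bit PCM and uses a hypothetical threshold of -40 dBFS. The source does not state how Macaw computes energy or which threshold it uses, and the Silero VAD call itself is not shown.

```python
import math
import struct


def frame_dbfs(pcm: bytes) -> float:
    """RMS level of a 16-bit PCM frame in dB relative to full scale."""
    n = len(pcm) // 2
    if n == 0:
        return float("-inf")
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    if rms == 0.0:
        return float("-inf")  # digital silence
    return 20.0 * math.log10(rms / 32768.0)


def passes_energy_gate(pcm: bytes, threshold_dbfs: float = -40.0) -> bool:
    """True if the frame is loud enough to be worth sending to the neural VAD."""
    return frame_dbfs(pcm) >= threshold_dbfs
```

Frames that fail the gate can be dropped immediately, so the neural VAD only runs on audio that might contain speech.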

Adding Your Own Engine

Adding a new engine requires ~400-700 lines of code and zero changes to the runtime core. See the Adding an Engine guide.

API Compatibility

Macaw implements the OpenAI Audio API contract, so existing SDKs work without modification:

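from openai import OpenAI

# api_key is a placeholder; the SDK only requires a non-empty value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Transcription (the context manager closes the file handle afterwards)
with open("audio.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="faster-whisper-tiny",
        file=audio_file,
    )
print(result.text)

# Text-to-Speech, written to disk as the audio arrives
with client.audio.speech.with_streaming_response.create(
    model="kokoro-v1",
    input="Hello, how can I help you?",
    voice="default",
) as response:
    response.stream_to_file("output.wav")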

WebSocket Protocol

The /v1/realtime endpoint supports full-duplex STT + TTS:

Client -> Server:
  Binary frames      PCM 16-bit audio (any sample rate)
  session.configure  Configure VAD, language, hot words, TTS model
  tts.speak          Trigger text-to-speech synthesis
  tts.cancel         Cancel active TTS

Server -> Client:
  session.created     Session established
  vad.speech_start    Speech detected
  transcript.partial  Intermediate hypothesis
  transcript.final    Confirmed segment (with ITN)
  vad.speech_end      Speech ended
  tts.speaking_start  TTS started (STT muted)
  Binary frames       TTS audio output
  tts.speaking_end    TTS finished (STT unmuted)
  error               Error with recoverable flag
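A client usually needs to track a little state from these events, for example whether STT is currently muted because TTS is speaking. The sketch below shows one way to do that. The event names come from the list above, but the JSON shape is an assumption: the "type" and "text" field names, the SessionState class, and the helper functions are all hypothetical and should be checked against the protocol documentation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Client-side view of the session, updated from server events."""

    user_speaking: bool = False
    stt_muted: bool = False
    finals: list[str] = field(default_factory=list)


def handle_event(state: SessionState, raw: str) -> None:
    """Apply one JSON text message to the state (assumed 'type' field)."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "vad.speech_start":
        state.user_speaking = True
    elif kind == "vad.speech_end":
        state.user_speaking = False
    elif kind == "tts.speaking_start":
        state.stt_muted = True   # server mutes STT while it speaks
    elif kind == "tts.speaking_end":
        state.stt_muted = False
    elif kind == "transcript.final":
        state.finals.append(event.get("text", ""))  # assumed 'text' field


def tts_speak_message(text: str) -> str:
    """Build a tts.speak command (field names are assumptions)."""
    return json.dumps({"type": "tts.speak", "text": text})
```

For barge-in, a client could send tts.cancel when vad.speech_start arrives while stt_muted is still true, although whether the server emits that event during synthesis depends on its mute behavior.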

CLI

macaw serve                                   # Start API server
macaw transcribe audio.wav                    # Transcribe file
macaw transcribe audio.wav --format srt       # Generate subtitles
macaw transcribe --stream                     # Stream from microphone
macaw translate audio.wav                     # Translate to English
macaw list                                    # List installed models
macaw pull faster-whisper-tiny                # Download a model
macaw inspect faster-whisper-tiny             # Model details

Demo

An interactive demo with a React/Next.js frontend is included:

./demo/start.sh

This starts the FastAPI backend (port 9000) and the Next.js frontend (port 3000) together. The demo includes a dashboard for batch transcriptions, real-time streaming STT with VAD visualization, and a TTS playground. See demo/README.md for details.

Development

# Setup (requires Python 3.11+ and uv)
uv venv --python 3.12
uv sync --all-extras

# Development workflow
make check       # format + lint + typecheck
make test-unit   # unit tests (preferred during development)
make test        # all tests (1707 passing)
make ci          # full pipeline: format + lint + typecheck + test

Documentation

Full documentation is available at usemacaw.github.io/macaw-openvoice.

Contributing

We welcome contributions! Please read our Contributing Guide before submitting a pull request.

License

Apache License 2.0
