macaw-openvoice

Voice runtime (STT + TTS) with OpenAI-compatible API

Maintainers

paulohenriquevn

Macaw OpenVoice

Production Voice Runtime Infrastructure

Real-time Speech-to-Text and Text-to-Speech with an OpenAI-compatible API, streaming session control, and an extensible execution architecture.

Overview

Macaw OpenVoice is a production-grade runtime for voice systems.

It standardizes and operationalizes the execution of Speech-to-Text (STT) and Text-to-Speech (TTS) models in real environments by providing:

a unified execution interface for multiple inference engines
real-time audio streaming with controlled latency
continuous session management
bidirectional speech interaction
operational observability
production-ready APIs

Macaw acts as the infrastructure layer between voice models and production applications, abstracting complexity related to streaming, synchronization, state management, and execution control.

Technology Positioning

Macaw OpenVoice plays the same role for voice systems that:

vLLM plays for LLM serving
Triton Inference Server plays for GPU inference
Ollama plays for local model execution

It transforms voice models into operational services.

Core Capabilities

Unified Interface

OpenAI-compatible Audio API
Real-time full-duplex WebSocket streaming
Local runtime CLI

Bidirectional Speech Streaming

simultaneous STT and TTS in the same session
automatic speech detection
barge-in support (interruptible speech)
automatic mute during synthesis

Session Management

state machine for continuous audio processing
ring buffer with persistence
crash recovery without context loss
cross-segment coherence
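
The ring buffer in the list above can be pictured with a minimal sketch. It illustrates the general data structure only and is not taken from Macaw's code: the class name `AudioRingBuffer` and its methods are hypothetical, and persistence is left out.

```python
class AudioRingBuffer:
    """Fixed-capacity byte buffer that overwrites the oldest audio when full.

    Hypothetical illustration; not Macaw's implementation.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._total_written = 0  # absolute count of bytes ever written

    def write(self, data: bytes) -> None:
        # Only the last `capacity` bytes of an oversized write can survive.
        skipped = max(0, len(data) - self._capacity)
        tail = data[skipped:]
        start = (self._total_written + skipped) % self._capacity
        first = min(len(tail), self._capacity - start)
        self._buf[start:start + first] = tail[:first]
        # Whatever did not fit before the end wraps to the front.
        self._buf[:len(tail) - first] = tail[first:]
        self._total_written += len(data)

    def read_last(self, n: int) -> bytes:
        """Return the most recent n bytes (fewer if less is available)."""
        available = min(self._total_written, self._capacity)
        n = min(n, available)
        if n == 0:
            return b""
        end = self._total_written % self._capacity
        start = (end - n) % self._capacity
        if start < end:
            return bytes(self._buf[start:end])
        return bytes(self._buf[start:] + self._buf[:end])
```

Tracking an absolute write counter instead of a wrapped index keeps the arithmetic simple and would also let a recovery routine know how much audio has been seen in total.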

Audio Processing Pipeline

automatic resampling
DC offset removal
gain normalization
voice activity detection
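
A rough sketch of the first three stages, in the order the architecture diagram gives them (resample, DC removal, gain normalization). The function names, the 16 kHz target rate, and the 0.9 peak target are assumptions made for illustration; a production resampler would also apply an anti-aliasing filter.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def remove_dc_offset(samples: list[float]) -> list[float]:
    """Subtract the mean so the waveform is centred on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def normalize_gain(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


def preprocess(samples: list[float], src_rate: int, dst_rate: int = 16000) -> list[float]:
    """Resample -> DC remove -> gain normalize."""
    resampled = resample_linear(samples, src_rate, dst_rate)
    return normalize_gain(remove_dc_offset(resampled))
```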

Multi-Engine Execution

multiple STT and TTS engines
subprocess isolation
declarative model registry
pluggable architecture

Operational Control

priority-based scheduler
dynamic batching
latency tracking
Prometheus metrics
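
To make priority scheduling concrete, here is a small sketch of a two-level queue with aging. It assumes that "aging" means a batch request that has waited long enough is treated like a realtime one; that reading, the class name `AgingScheduler`, and the linear scan are illustrative choices, not Macaw's scheduler.

```python
import itertools

REALTIME, BATCH = 0, 1  # lower value is served first


class AgingScheduler:
    """Priority queue where old batch requests are promoted (hypothetical)."""

    def __init__(self, aging_s: float = 30.0) -> None:
        self._aging_s = aging_s
        self._seq = itertools.count()  # FIFO tie-breaker within a priority
        self._pending = []             # entries: (priority, seq, enqueued_at, item)

    def submit(self, item, priority: int, now: float) -> None:
        self._pending.append((priority, next(self._seq), now, item))

    def next_item(self, now: float):
        """Return the next item to run, or None if nothing is pending."""
        if not self._pending:
            return None

        def effective(index: int):
            priority, seq, enqueued_at, _ = self._pending[index]
            if priority == BATCH and now - enqueued_at >= self._aging_s:
                priority = REALTIME  # waited too long: promote
            return (priority, seq)

        best = min(range(len(self._pending)), key=effective)
        return self._pending.pop(best)[3]
```

The scan is O(n) per pop, which keeps the promotion rule easy to read; a heap with periodic re-prioritization would be the usual choice for large queues.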

Production Use Cases

Macaw is designed for real-world voice workloads:

real-time conversational voice agents
telephony automation (SIP / VoIP)
live transcription systems
embedded voice interfaces
multimodal assistants
interactive media streaming
continuous audio processing pipelines

Quick Start

# Install
pip install macaw-openvoice[server,grpc,faster-whisper]

# Pull an STT model and a TTS model (the examples below use both)
macaw pull faster-whisper-tiny
macaw pull kokoro-v1

# Start the runtime
macaw serve

$ macaw serve
  ╔══════════════════════════════════════════════╗
  ║         Macaw OpenVoice v1.0.0              ║
  ╚══════════════════════════════════════════════╝

INFO     Scanning models in ~/.macaw/models
INFO     Found 2 model(s): faster-whisper-tiny (STT), kokoro-v1 (TTS)
INFO     Spawning STT worker   faster-whisper-tiny  port=50051  engine=faster-whisper
INFO     Spawning TTS worker   kokoro-v1            port=50052  engine=kokoro
INFO     Scheduler started     aging=30.0s  batch_ms=75.0  batch_max=8
INFO     Uvicorn running on http://127.0.0.1:8000
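
The scheduler line in the startup output reports `batch_ms=75.0` and `batch_max=8`. One common reading of such parameters, assumed here rather than confirmed by the source, is that a batch is dispatched once it holds `batch_max` requests or once `batch_ms` milliseconds have passed since its first request. A minimal sketch with a hypothetical `DynamicBatcher` class:

```python
class DynamicBatcher:
    """Collect requests until the batch is full or its time window expires."""

    def __init__(self, batch_ms: float = 75.0, batch_max: int = 8) -> None:
        self._window_s = batch_ms / 1000.0
        self._batch_max = batch_max
        self._items = []
        self._opened_at = None  # arrival time of the first item in the batch

    def add(self, item, now: float):
        """Queue an item; return a full batch if this item completes one."""
        if not self._items:
            self._opened_at = now
        self._items.append(item)
        if len(self._items) >= self._batch_max:
            return self._flush()
        return None

    def poll(self, now: float):
        """Return the pending batch once its time window has elapsed."""
        if self._items and now - self._opened_at >= self._window_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self._items, self._opened_at = self._items, [], None
        return batch
```

Passing `now` explicitly instead of reading a clock inside the class keeps the sketch deterministic and easy to test.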

Transcribe a file

# Via REST API
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=faster-whisper-tiny

# Via CLI
macaw transcribe audio.wav --model faster-whisper-tiny
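
For callers without curl or an SDK, the same multipart upload can be sketched with the Python standard library alone. The helper names below are hypothetical; the endpoint path and the `file` and `model` form fields come from the curl command above, and reading a `text` field from the response assumes the OpenAI transcription response shape.

```python
import json
import urllib.request
import uuid


def build_transcription_request(audio_bytes: bytes, filename: str, model: str,
                                base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain form field carrying the model name.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode(),
        # File field carrying the raw audio bytes.
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         'Content-Type: application/octet-stream\r\n\r\n').encode() + audio_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str, model: str = "faster-whisper-tiny") -> str:
    """Upload a file to a running server and return the transcript text."""
    with open(path, "rb") as f:
        request = build_transcription_request(f.read(), path, model)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]  # assumes an OpenAI-style JSON body
```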

Streaming via WebSocket

wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny"
# Send binary audio frames, receive JSON transcript events
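
The binary frames are raw 16-bit PCM, so a WAV file has to be split into chunks before it is sent. The sketch below does only that step with the standard library; the function name `pcm_frames` and the 20 ms frame length are assumptions, and the actual sending is left to whichever WebSocket client is in use.

```python
import wave


def pcm_frames(wav_path: str, frame_ms: int = 20):
    """Yield raw PCM chunks of roughly frame_ms milliseconds each."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames_per_chunk = max(1, wav.getframerate() * frame_ms // 1000)
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk
```

Each yielded chunk can then be passed to the client's binary send call, one WebSocket message per chunk.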

Text-to-Speech

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro-v1", "input": "Hello, how can I help you?", "voice": "default"}' \
  --output speech.wav

Voice Cloning

# Pull the voice cloning model
macaw pull qwen3-tts-0.6b-base

# Clone a voice from ~3 seconds of reference audio
# -w0 disables line wrapping (GNU coreutils); on macOS use: base64 -i reference.wav
REF_AUDIO=$(base64 -w0 reference.wav)

curl -s http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"qwen3-tts-0.6b-base\", \"input\": \"Text with cloned voice\", \"language\": \"English\", \"ref_audio\": \"$REF_AUDIO\", \"ref_text\": \"Transcript of reference\"}" \
  --output cloned.wav
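
Building the JSON body in Python avoids shell quoting and command-line length limits when the reference audio is large. This is a sketch under the assumption that the request fields are exactly those in the curl command above (`model`, `input`, `language`, `ref_audio`, `ref_text`); the helper names are hypothetical.

```python
import base64
import json
import urllib.request


def build_clone_request(ref_audio: bytes, ref_text: str, text: str,
                        model: str = "qwen3-tts-0.6b-base",
                        language: str = "English",
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the JSON POST for /v1/audio/speech with base64 reference audio."""
    payload = {
        "model": model,
        "input": text,
        "language": language,
        "ref_audio": base64.b64encode(ref_audio).decode("ascii"),
        "ref_text": ref_text,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def clone_to_file(ref_wav_path: str, ref_text: str, text: str,
                  out_path: str = "cloned.wav") -> None:
    """Send the request to a running server and save the returned audio."""
    with open(ref_wav_path, "rb") as f:
        request = build_clone_request(f.read(), ref_text, text)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```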

Architecture

                         Clients
          CLI / REST / WebSocket (full-duplex)
                           |
                           v
  +----------------------------------------------------+
  |              API Server (FastAPI)                  |
  |                                                    |
  |  POST /v1/audio/transcriptions    (STT batch)      |
  |  POST /v1/audio/translations      (STT translate)  |
  |  POST /v1/audio/speech            (TTS)            |
  |  GET  /v1/voices                  (list voices)    |
  |  POST /v1/voices                  (save voice)     |
  |  WS   /v1/realtime                (STT+TTS)        |
  +----------------------------------------------------+
  |              Scheduler                             |
  |  Priority queue (realtime > batch), cancellation,  |
  |  dynamic batching, latency tracking                |
  +----------------------------------------------------+
  |              Model Registry                        |
  |  Declarative manifest (macaw.yaml), lifecycle      |
  +----------+-------------------+---------------------+
             |                   |
    +--------+--------+  +-------+------+
    |  STT Workers    |  |  TTS Workers |
    |  (subprocess    |  |  (subprocess |
    |   gRPC)         |  |   gRPC)      |
    |                 |  |              |
    | Faster-Whisper  |  | Kokoro       |
    |                 |  | Qwen3-TTS    |
    +-----------------+  +--------------+
             |
  +----------+-------------------------------------+
  |  Audio Preprocessing Pipeline                  |
  |  Resample -> DC Remove -> Gain Normalize       |
  +------------------------------------------------+
  |  Session Manager (STT only)                    |
  |  6 states, ring buffer, WAL, LocalAgreement,   |
  |  cross-segment context, crash recovery         |
  +------------------------------------------------+
  |  VAD (Energy Pre-filter + Silero VAD)          |
  +------------------------------------------------+
  |  Post-Processing (ITN via NeMo)                |
  +------------------------------------------------+

Supported Models

11 models available via macaw pull, across 3 engines + built-in VAD.

Speech-to-Text

Model	Engine	Size	Languages	Hardware	Install
faster-whisper-large-v3	Faster-Whisper	3 GB	100+	GPU recommended	`macaw pull faster-whisper-large-v3`
faster-whisper-medium	Faster-Whisper	1.5 GB	100+	GPU recommended	`macaw pull faster-whisper-medium`
faster-whisper-small	Faster-Whisper	512 MB	100+	CPU / GPU	`macaw pull faster-whisper-small`
faster-whisper-tiny	Faster-Whisper	256 MB	100+	CPU	`macaw pull faster-whisper-tiny`
distil-whisper-large-v3	Faster-Whisper	1.5 GB	English	GPU recommended	`macaw pull distil-whisper-large-v3`

All STT models support: streaming partials (LocalAgreement), word timestamps, language detection, translation (to English), and batch inference.
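
LocalAgreement is a streaming policy in which a word is emitted only after consecutive hypotheses agree on it. Below is a minimal word-level sketch of the two-hypothesis variant (commit the longest common prefix of the last two hypotheses). It illustrates the general technique and is not taken from Macaw's code; the class and method names are hypothetical.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest shared prefix of two word sequences."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement:
    """Confirm words once two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self._previous: list[str] = []
        self.confirmed: list[str] = []

    def update(self, hypothesis: list[str]) -> list[str]:
        """Feed the latest full hypothesis; return newly confirmed words."""
        agreed = common_prefix(self._previous, hypothesis)
        self._previous = list(hypothesis)
        # Words already confirmed are never retracted or re-emitted.
        newly = agreed[len(self.confirmed):]
        self.confirmed.extend(newly)
        return newly
```

In this scheme the unconfirmed remainder of the latest hypothesis would be what a client sees as a partial, while confirmed words are stable.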

Text-to-Speech

Model	Engine	Parameters	Languages	Capability	Install
kokoro-v1	Kokoro	82M	8	Preset voices	`macaw pull kokoro-v1`
qwen3-tts-0.6b-custom	Qwen3-TTS	0.6B	10	9 preset speakers	`macaw pull qwen3-tts-0.6b-custom`
qwen3-tts-1.7b-custom	Qwen3-TTS	1.7B	10	9 preset speakers	`macaw pull qwen3-tts-1.7b-custom`
qwen3-tts-0.6b-base	Qwen3-TTS	0.6B	10	Voice cloning (~3s ref)	`macaw pull qwen3-tts-0.6b-base`
qwen3-tts-1.7b-base	Qwen3-TTS	1.7B	10	Voice cloning (~3s ref)	`macaw pull qwen3-tts-1.7b-base`
qwen3-tts-1.7b-design	Qwen3-TTS	1.7B	10	Voice design (natural language)	`macaw pull qwen3-tts-1.7b-design`

Voice Activity Detection

Model	Type	License
Silero VAD	Energy pre-filter + Neural VAD	MIT

VAD runs as a built-in library in the runtime (not a worker subprocess).
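
An energy pre-filter is typically a cheap loudness check that lets the neural VAD skip frames that are clearly silent. The sketch below shows one way to do it on 16-bit PCM; the function names and the -40 dBFS threshold are assumptions for illustration, not Macaw's values.

```python
import math
import struct


def rms_dbfs(pcm16: bytes) -> float:
    """RMS level of little-endian 16-bit PCM, in dB relative to full scale."""
    count = len(pcm16) // 2
    if count == 0:
        return float("-inf")
    samples = struct.unpack(f"<{count}h", pcm16[:count * 2])
    rms = math.sqrt(sum(s * s for s in samples) / count)
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / 32768.0)


def passes_energy_gate(pcm16: bytes, threshold_dbfs: float = -40.0) -> bool:
    """True if the frame is loud enough to be worth running the neural VAD on."""
    return rms_dbfs(pcm16) >= threshold_dbfs
```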

Adding Your Own Engine

Adding a new engine requires ~400-700 lines of code and zero changes to the runtime core. See the Adding an Engine guide.

API Compatibility

Macaw implements the OpenAI Audio API contract, so the official OpenAI SDKs work once base_url points at the Macaw server:

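from openai import OpenAI

# Point the official SDK at the local Macaw server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Transcription: the context manager closes the file after the upload
with open("audio.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="faster-whisper-tiny",
        file=audio_file,
    )
print(result.text)

# Text-to-Speech: write the returned audio bytes to disk
response = client.audio.speech.create(
    model="kokoro-v1",
    input="Hello, how can I help you?",
    voice="default",
)
response.write_to_file("output.wav")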

WebSocket Protocol

The /v1/realtime endpoint supports full-duplex STT + TTS:

Client -> Server:
  Binary frames     PCM 16-bit audio (any sample rate)
  session.configure  Configure VAD, language, hot words, TTS model
  tts.speak          Trigger text-to-speech synthesis
  tts.cancel         Cancel active TTS

Server -> Client:
  session.created     Session established
  vad.speech_start    Speech detected
  transcript.partial  Intermediate hypothesis
  transcript.final    Confirmed segment (with ITN)
  vad.speech_end      Speech ended
  tts.speaking_start  TTS started (STT muted)
  Binary frames       TTS audio output
  tts.speaking_end    TTS finished (STT unmuted)
  error               Error with recoverable flag
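
On the client side, the event list above maps onto a small handler that separates binary TTS audio from JSON events. The sketch assumes each JSON message carries its event name in a `type` field, transcripts in a `text` field, and the error flag in a `recoverable` field; the source lists only the event names, so all three field names are assumptions.

```python
import json


class RealtimeClientState:
    """Track session state from /v1/realtime server messages (hypothetical)."""

    def __init__(self) -> None:
        self.tts_active = False       # True between tts.speaking_start and tts.speaking_end
        self.finals: list[str] = []   # confirmed transcript segments
        self.tts_audio = bytearray()  # concatenated TTS audio output

    def handle(self, message) -> None:
        # Binary frames from the server are TTS audio output.
        if isinstance(message, (bytes, bytearray)):
            self.tts_audio.extend(message)
            return
        event = json.loads(message)
        kind = event.get("type")  # assumed field name
        if kind == "tts.speaking_start":
            self.tts_active = True   # server mutes STT while speaking
        elif kind == "tts.speaking_end":
            self.tts_active = False
        elif kind == "transcript.final":
            self.finals.append(event.get("text", ""))  # assumed field name
        elif kind == "error" and not event.get("recoverable", False):
            raise RuntimeError(f"unrecoverable server error: {event}")
```

A receive loop would call `handle` on every incoming message; partial transcripts and VAD events are ignored here but could be dispatched the same way.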

CLI

macaw serve                                   # Start API server
macaw transcribe audio.wav                    # Transcribe file
macaw transcribe audio.wav --format srt       # Generate subtitles
macaw transcribe --stream                     # Stream from microphone
macaw translate audio.wav                     # Translate to English
macaw list                                    # List installed models
macaw pull faster-whisper-tiny                # Download a model
macaw inspect faster-whisper-tiny             # Model details

Demo

An interactive demo with a React/Next.js frontend is included in the repository:

./demo/start.sh

This starts the FastAPI backend (port 9000) and the Next.js frontend (port 3000) together. The demo includes a dashboard for batch transcriptions, real-time streaming STT with VAD visualization, and a TTS playground. See demo/README.md for details.

Development

# Setup (requires Python 3.11+ and uv)
uv venv --python 3.12
uv sync --all-extras

# Development workflow
make check       # format + lint + typecheck
make test-unit   # unit tests (preferred during development)
make test        # all tests (1707 passing)
make ci          # full pipeline: format + lint + typecheck + test

Documentation

Full documentation is available at usemacaw.github.io/macaw-openvoice.

Contributing

We welcome contributions! Please read our Contributing Guide before submitting a pull request.

Contact

Website: usemacaw.io
Email: hello@usemacaw.io
GitHub: github.com/usemacaw/macaw-openvoice

License

Apache License 2.0

Release history

0.1.7 (this version): Feb 15, 2026
0.1.6: Feb 13, 2026
0.1.5: Feb 13, 2026
0.1.4: Feb 12, 2026

Download files

Source Distribution

macaw_openvoice-0.1.7.tar.gz (615.4 kB), uploaded Feb 15, 2026, Source

Built Distribution

macaw_openvoice-0.1.7-py3-none-any.whl (181.2 kB), uploaded Feb 15, 2026, Python 3

File details

Details for the file macaw_openvoice-0.1.7.tar.gz.

File metadata

Download URL: macaw_openvoice-0.1.7.tar.gz
Upload date: Feb 15, 2026
Size: 615.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for macaw_openvoice-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`64d2ad7b2ecb07b45b85eaa9d8542f6425879e06a200adfe0bdf79ce9a38cd0d`
MD5	`576bbf149150d4708e549d0c83fa49b6`
BLAKE2b-256	`803f354c3981e2e4c462dc05abdda5000a471327d413f14a4e7f7aea09d542fe`

Provenance

The following attestation bundles were made for macaw_openvoice-0.1.7.tar.gz:

Publisher: release.yml on usemacaw/macaw-openvoice

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: macaw_openvoice-0.1.7.tar.gz
- Subject digest: 64d2ad7b2ecb07b45b85eaa9d8542f6425879e06a200adfe0bdf79ce9a38cd0d
- Sigstore transparency entry: 953549803
- Sigstore integration time: Feb 15, 2026
Source repository:
- Permalink: usemacaw/macaw-openvoice@25cf539ddf18a8c5fc398ae604aeeae8a9cebbac
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/usemacaw
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@25cf539ddf18a8c5fc398ae604aeeae8a9cebbac
- Trigger Event: push

File details

Details for the file macaw_openvoice-0.1.7-py3-none-any.whl.

File metadata

Download URL: macaw_openvoice-0.1.7-py3-none-any.whl
Upload date: Feb 15, 2026
Size: 181.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for macaw_openvoice-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07f9b07199d566c0f1e249cfc50c32e223dce9ca0da7f340c398e1416efe91bf`
MD5	`76fb9f1b0e81f717e414eb16eaff6f0f`
BLAKE2b-256	`d867ecbc426a5423a8a7420e01e0dc7efcf3721ab06fa30b56939e17d3bfd340`

Provenance

The following attestation bundles were made for macaw_openvoice-0.1.7-py3-none-any.whl:

Publisher: release.yml on usemacaw/macaw-openvoice

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: macaw_openvoice-0.1.7-py3-none-any.whl
- Subject digest: 07f9b07199d566c0f1e249cfc50c32e223dce9ca0da7f340c398e1416efe91bf
- Sigstore transparency entry: 953549805
- Sigstore integration time: Feb 15, 2026
Source repository:
- Permalink: usemacaw/macaw-openvoice@25cf539ddf18a8c5fc398ae604aeeae8a9cebbac
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/usemacaw
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@25cf539ddf18a8c5fc398ae604aeeae8a9cebbac
- Trigger Event: push
