OpenAI-compatible HTTP server for OmniVoice TTS

omnivoice-server

OpenAI-compatible HTTP server for OmniVoice text-to-speech.

Author: zamery (@maemreyo) | Email: matthew.ngo1114@gmail.com

⚠️ Early Development Notice

This is a new repository built on top of OmniVoice (released 2026). Both the upstream model and this server wrapper are under active development. Expect:

API changes and breaking updates

Performance improvements as PyTorch MPS support matures

New features and bug fixes

Documentation updates

Current Status: Functional on CPU and CUDA. MPS (Apple Silicon) has known issues. See Verification Status below.

Features

OpenAI-compatible API - Drop-in replacement for OpenAI TTS endpoints
Three voice modes:
- Auto: Model selects voice automatically
- Design: Specify voice attributes (gender, age, accent, pitch, style)
- Clone: Voice cloning from reference audio
Voice profile management - Save and reuse cloned voices
Streaming synthesis - Low-latency sentence-level streaming
Concurrent requests - Configurable thread pool for parallel synthesis
Multiple audio formats - WAV and raw PCM output
Speed control - 0.25x to 4.0x playback speed
Optional authentication - Bearer token support
Production-ready - Request timeouts, health checks, metrics

Quick Start

Prerequisites

PyTorch must be installed before installing omnivoice-server. The correct PyTorch variant depends on your hardware:

# CPU only (works everywhere, but slow)
pip install torchcodec==0.11 torch==2.8.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cpu

# NVIDIA GPU (CUDA) - recommended for production
pip install torchcodec==0.11 torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --index-url https://download.pytorch.org/whl/cu128

# Apple Silicon (MPS) - currently broken, use CPU instead
# See docs/verification/MPS_ISSUE.md for details

For other CUDA versions or more options, see the official PyTorch installation guide.

Installation

# Option 1: Install from PyPI (recommended)
pip install omnivoice-server

# Option 2: Install with uv (faster)
uv tool install omnivoice-server

# Option 3: Install from GitHub (latest development version)
pip install git+https://github.com/maemreyo/omnivoice-server.git

# Option 4: Clone and install locally for development
git clone https://github.com/maemreyo/omnivoice-server.git
cd omnivoice-server
pip install -e .

Start the Server

# Basic usage (downloads model on first run)
omnivoice-server

# With custom settings
omnivoice-server --host 0.0.0.0 --port 8880 --device cuda

# With authentication
export OMNIVOICE_API_KEY="your-secret-key"
omnivoice-server

The server will start at http://127.0.0.1:8880 by default.
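When an API key is configured, clients need to present it with each request. The sketch below builds (without sending) a request for /v1/audio/speech using only the Python standard library. It assumes the key is passed in the usual `Authorization: Bearer <key>` header; the helper name `build_speech_request` is illustrative and not part of omnivoice-server.

```python
import json
import urllib.request


def build_speech_request(text, api_key=None, base_url="http://127.0.0.1:8880"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the key is sent as a standard Bearer token header.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": "omnivoice", "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech", data=body, headers=headers, method="POST"
    )


request = build_speech_request("Hello!", api_key="your-secret-key")
print(request.get_method(), request.full_url)
# To send it against a running server:
# audio = urllib.request.urlopen(request).read()
```

Leaving `api_key` unset omits the header, which matches a server started without authentication.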

/v1/audio/speech accepts explicit instructions, plus OpenAI-style preset names in voice or speaker. If none of these are provided, it falls back to the default design prompt: male, middle-aged, moderate pitch, british accent.

⚠️ Verification Status

Last Updated: 2026-04-04
Status: ✅ Working (CPU only)

Quick Summary

✅ System works - Produces clear, high-quality audio for English and Vietnamese
❌ MPS broken - Apple Silicon GPU has PyTorch bugs, use CPU instead
⚠️ CPU slow - RTF=4.92 (5x slower than real-time, ~10s per voice)
✅ No memory leaks - Stable memory usage verified

Benchmark Results (CPU)

Metric	Value	Status
Latency (mean)	10.2 seconds	⚠️ Slow
RTF (Real-Time Factor)	4.92	⚠️ 5x slower than real-time
Memory leak	None	✅ Stable
Audio quality	Excellent	✅ Clear speech
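
The real-time factor in the table is conventionally defined as synthesis time divided by the duration of the audio produced. A minimal sketch of that arithmetic, assuming the reported mean latency and RTF describe the same benchmark utterances (the function names are illustrative and not part of omnivoice-server):

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def implied_audio_seconds(synthesis_seconds, rtf):
    """Invert the RTF definition to recover the audio duration."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return synthesis_seconds / rtf


# Figures from the CPU benchmark table: 10.2 s mean latency, RTF 4.92.
audio_seconds = implied_audio_seconds(10.2, 4.92)
print(f"implied audio length: {audio_seconds:.2f} s")
print(f"CPU vs CUDA speed ratio: {4.92 / 0.2:.1f}x")
```

Under that assumption the benchmark utterances were roughly 2 seconds long, and an RTF near 0.2 on CUDA works out to about 24.6 times faster than CPU, in line with the 20-25x figure in the Production Recommendation.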

Production Recommendation

For production, deploy on NVIDIA GPU (CUDA):

20-25x faster than CPU (RTF~0.2)
Cloud options: AWS g5.xlarge (~$1/hr), GCP T4/V100, RunPod (~$0.40/hr)

Detailed reports: See docs/verification/ for full verification results and technical details.

Audio Samples

Listen to verified voice samples:

English (Female, American accent) - 199KB

Download English sample

Vietnamese (Female) - 203KB

Download Vietnamese sample

Both samples demonstrate clear, natural speech quality when synthesized on CPU.

First Request

curl -X POST http://127.0.0.1:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "omnivoice",
    "input": "Hello, this is OmniVoice text-to-speech!"
  }' \
  --output speech.wav

API Usage

Basic Synthesis

import httpx

response = httpx.post(
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "model": "omnivoice",
        "input": "Hello world!",
        "response_format": "wav"
    }
)

with open("output.wav", "wb") as f:
    f.write(response.content)

Voice Design

Specify voice attributes to design a custom voice:

response = httpx.post(
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "model": "omnivoice",
        "input": "This voice has specific attributes.",
        "instructions": "female,british accent,young adult,high pitch"
    }
)

instructions is the strongest control and overrides preset selection. If instructions is absent, /v1/audio/speech also accepts OpenAI-style preset names in voice or speaker, such as alloy, nova, onyx, and shimmer. Unknown values are ignored, and the server falls back to the default design prompt male, middle-aged, moderate pitch, british accent.

Available attributes:

Gender: male, female
Age: child, teenager, young adult, middle-aged, elderly
Pitch: very low pitch, low pitch, moderate pitch, high pitch, very high pitch
Style: whisper
Accent (English): american accent, british accent, australian accent, chinese accent, canadian accent, indian accent, korean accent, portuguese accent, russian accent, japanese accent
Dialect (Chinese): 河南话, 陕西话, 四川话, 贵州话, 云南话, 桂林话, 济南话, 石家庄话, 甘肃话, 宁夏话, 青岛话, 东北话
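
A small client-side helper can catch typos in attribute names before a request is sent. This is a sketch only: it covers the gender, age, pitch, style, and English accent values listed above (Chinese dialects are omitted for brevity), and it assumes attributes are joined with commas as in the example request. The name `build_instructions` is illustrative; the server may accept combinations or orderings this helper does not model.

```python
ATTRIBUTES = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {
        "very low pitch", "low pitch", "moderate pitch",
        "high pitch", "very high pitch",
    },
    "style": {"whisper"},
    "accent": {
        "american accent", "british accent", "australian accent",
        "chinese accent", "canadian accent", "indian accent",
        "korean accent", "portuguese accent", "russian accent",
        "japanese accent",
    },
}


def build_instructions(**choices):
    """Validate each attribute and join them into an instructions string."""
    parts = []
    for category, value in choices.items():
        allowed = ATTRIBUTES.get(category)
        if allowed is None:
            raise ValueError(f"unknown attribute category: {category!r}")
        if value not in allowed:
            raise ValueError(f"unsupported {category}: {value!r}")
        parts.append(value)
    return ",".join(parts)


print(build_instructions(gender="female", accent="british accent", pitch="high pitch"))
```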

OpenAI-compatible local presets:

alloy, ash, ballad, cedar, coral, echo, fable, marin, nova, onyx, sage, shimmer, verse

Preset mapping table:

Preset	Local design prompt
`alloy`	`female, young adult, moderate pitch, american accent`
`ash`	`male, young adult, low pitch, american accent`
`ballad`	`male, middle-aged, low pitch, british accent`
`cedar`	`male, middle-aged, low pitch, american accent`
`coral`	`female, young adult, high pitch, australian accent`
`echo`	`male, middle-aged, moderate pitch, canadian accent`
`fable`	`female, middle-aged, moderate pitch, british accent`
`marin`	`female, middle-aged, moderate pitch, canadian accent`
`nova`	`female, young adult, high pitch, american accent`
`onyx`	`male, middle-aged, very low pitch, british accent`
`sage`	`female, elderly, low pitch, british accent`
`shimmer`	`female, young adult, very high pitch, american accent`
`verse`	`male, young adult, moderate pitch, british accent`

Voice Cloning

Option 1: Save a Profile (Reusable)

# Create a profile
with open("reference.wav", "rb") as f:
    response = httpx.post(
        "http://127.0.0.1:8880/v1/voices/profiles",
        data={
            "profile_id": "my_voice",
            "ref_text": "This is the reference text."
        },
        files={"ref_audio": f}
    )

# Profiles are stored for management and inspection.
# For synthesis, use POST /v1/audio/speech/clone with reference audio.

Option 2: One-Shot Cloning

with open("reference.wav", "rb") as f:
    response = httpx.post(
        "http://127.0.0.1:8880/v1/audio/speech/clone",
        data={
            "text": "This is one-shot cloning.",
            "ref_text": "Reference text."
        },
        files={"ref_audio": f}
    )

Streaming

Stream audio in real-time for lower latency:

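import httpx

with httpx.stream(
    "POST",
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "model": "omnivoice",
        "input": "Long text to stream...",
        "stream": True
    }
) as response:
    # Fail early on HTTP errors instead of treating an error body as audio
    response.raise_for_status()
    for chunk in response.iter_bytes():
        # Process PCM audio chunks; play_audio is a placeholder
        # for your own playback function
        play_audio(chunk)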

See examples/streaming_player.py for a complete example.
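
If you want to keep streamed audio rather than play it, the raw PCM chunks can be wrapped in a WAV container once the stream ends. The sketch below uses the standard-library `wave` module. The sample rate, channel count, and sample width are assumptions (24 kHz, mono, 16-bit), since this README does not state the stream's PCM format; check the server's actual output before relying on them. The function name is illustrative.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate=24000, channels=1, sample_width=2) -> bytes:
    """Wrap raw PCM chunks in a WAV container.

    The default format values are assumptions, not documented server behavior.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # Join first so a sample split across two chunks stays intact.
        wav.writeframes(b"".join(chunks))
    return buffer.getvalue()


# Stand-in for chunks collected from response.iter_bytes().
chunks = [b"\x00\x00" * 100, b"\x00\x00" * 140]
wav_bytes = pcm_chunks_to_wav(chunks)
print(len(wav_bytes), "bytes of WAV data")
```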

CLI Usage

# Start server with defaults
omnivoice-server

# Custom host and port
omnivoice-server --host 0.0.0.0 --port 8880

# Use GPU
omnivoice-server --device cuda

# Adjust inference quality (higher = better quality, slower)
omnivoice-server --num-step 32

# Enable authentication
omnivoice-server --api-key your-secret-key

# Adjust concurrency
omnivoice-server --max-concurrent 4

# Custom model path
omnivoice-server --model-id /path/to/local/model

Environment Variables

All CLI options can be set via environment variables with the OMNIVOICE_ prefix:

export OMNIVOICE_HOST=0.0.0.0
export OMNIVOICE_PORT=8880
export OMNIVOICE_DEVICE=cuda
export OMNIVOICE_API_KEY=your-secret-key
export OMNIVOICE_NUM_STEP=32
export OMNIVOICE_MAX_CONCURRENT=4

omnivoice-server

Configuration

Option	Env Var	Default	Description
`--host`	`OMNIVOICE_HOST`	`127.0.0.1`	Bind host
`--port`	`OMNIVOICE_PORT`	`8880`	Bind port
`--device`	`OMNIVOICE_DEVICE`	`cpu`	Device: cpu, cuda (MPS broken)
`--num-step`	`OMNIVOICE_NUM_STEP`	`32`	Inference steps (1-64, higher=better quality)
`--max-concurrent`	`OMNIVOICE_MAX_CONCURRENT`	`2`	Max concurrent requests
`--api-key`	`OMNIVOICE_API_KEY`	`""`	Bearer token (empty = no auth)
`--model-id`	`OMNIVOICE_MODEL_ID`	`k2-fsa/OmniVoice`	HuggingFace repo or local path
`--profile-dir`	`OMNIVOICE_PROFILE_DIR`	`~/.omnivoice/profiles`	Voice profiles directory
`--log-level`	`OMNIVOICE_LOG_LEVEL`	`info`	Logging level
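
The table follows a regular naming pattern: strip the leading dashes, replace the remaining dashes with underscores, upper-case the result, and add the OMNIVOICE_ prefix. The sketch below encodes that pattern for use in deployment scripts. It is not the server's own configuration code, and the function names are illustrative.

```python
import os


def env_var_for(option: str) -> str:
    """Map a CLI flag such as --max-concurrent to its environment variable."""
    return "OMNIVOICE_" + option.lstrip("-").replace("-", "_").upper()


def read_setting(option: str, default: str, environ=None) -> str:
    """Look up a setting in the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    return environ.get(env_var_for(option), default)


for flag in ("--host", "--num-step", "--max-concurrent", "--profile-dir"):
    print(flag, "->", env_var_for(flag))
```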

API Reference

Endpoints

`POST /v1/audio/speech`

Generate speech from text (OpenAI-compatible).

Request body:

{
  "model": "omnivoice",
  "input": "Text to synthesize",
  "voice": "alloy",
  "speaker": "onyx",
  "instructions": "female,british accent",
  "response_format": "wav",
  "speed": 1.0,
  "stream": false,
  "num_step": 32
}

Precedence:

instructions
speaker preset
voice preset
server default prompt
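
The precedence list can be read as a simple fall-through. The sketch below mirrors it on the client side using three entries from the preset mapping table. It is an illustration of the documented order, not the server's implementation, and it assumes an unknown speaker value falls through to voice before reaching the default.

```python
DEFAULT_PROMPT = "male, middle-aged, moderate pitch, british accent"

# Three entries copied from the preset mapping table in this README.
PRESETS = {
    "alloy": "female, young adult, moderate pitch, american accent",
    "nova": "female, young adult, high pitch, american accent",
    "onyx": "male, middle-aged, very low pitch, british accent",
}


def resolve_design_prompt(instructions=None, speaker=None, voice=None):
    """Apply the order: instructions, speaker preset, voice preset, default."""
    if instructions:
        return instructions
    for name in (speaker, voice):
        if name in PRESETS:
            return PRESETS[name]
    # Unknown or missing values fall back to the default design prompt.
    return DEFAULT_PROMPT


print(resolve_design_prompt(voice="alloy", speaker="onyx"))
print(resolve_design_prompt(voice="not-a-preset"))
```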

Response: Audio file (WAV or PCM)

`POST /v1/audio/speech/clone`

One-shot voice cloning (multipart form).

Form fields:

text (required): Text to synthesize
ref_audio (required): Reference audio file
ref_text (optional): Reference transcript
speed (optional): Playback speed (default: 1.0)
num_step (optional): Inference steps

Response: Audio file (WAV)

`GET /v1/voices`

List available voices and profiles.

Response:

{
  "voices": [
    {
      "id": "auto",
      "type": "auto",
      "description": "Ignored by /v1/audio/speech; server default design prompt is male, middle-aged, moderate pitch, british accent"
    },
    {"id": "design:<attributes>", "type": "design", "description": "..."},
    {"id": "alloy", "type": "preset", "description": "..."},
    {"id": "clone:my_voice", "type": "clone", "profile_id": "my_voice"}
  ],
  "design_attributes": {...},
  "total": 3
}
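
A client will often want the voice list grouped by type, for example to show presets and cloned profiles separately. The sketch below works on a payload shaped like the response above; the sample data is a trimmed stand-in rather than real server output, and the function name is illustrative.

```python
from collections import defaultdict


def group_voice_ids(payload: dict) -> dict:
    """Group voice ids from a /v1/voices-style payload by their type."""
    grouped = defaultdict(list)
    for voice in payload.get("voices", []):
        grouped[voice["type"]].append(voice["id"])
    return dict(grouped)


# Stand-in payload following the documented response shape.
sample = {
    "voices": [
        {"id": "auto", "type": "auto"},
        {"id": "alloy", "type": "preset"},
        {"id": "nova", "type": "preset"},
        {"id": "clone:my_voice", "type": "clone", "profile_id": "my_voice"},
    ]
}

print(group_voice_ids(sample))
```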

`POST /v1/voices/profiles`

Create a voice cloning profile.

Form fields:

profile_id (required): Unique identifier (alphanumeric, dashes, underscores)
ref_audio (required): Reference audio file
ref_text (optional): Reference transcript
overwrite (optional): Overwrite existing profile (default: false)

Response:

{
  "profile_id": "my_voice",
  "created_at": "2026-04-04T12:00:00Z",
  "ref_text": "Reference text"
}
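
Since profile_id is limited to alphanumeric characters, dashes, and underscores, a client can reject bad identifiers before uploading audio. The sketch below checks only that character rule; any length limit or other constraint the server applies is unknown here, and the function name is illustrative.

```python
import re

# Letters, digits, dashes, and underscores, per the profile_id description.
_PROFILE_ID = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_profile_id(profile_id: str) -> bool:
    """Return True if the id uses only the documented character set."""
    return bool(_PROFILE_ID.fullmatch(profile_id))


for candidate in ("my_voice", "narrator-01", "bad id!", ""):
    print(repr(candidate), is_valid_profile_id(candidate))
```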

`GET /v1/voices/profiles/{profile_id}`

Get profile details.

`PATCH /v1/voices/profiles/{profile_id}`

Update profile (ref_audio and/or ref_text).

`DELETE /v1/voices/profiles/{profile_id}`

Delete a profile.

`GET /v1/models`

List available models (OpenAI-compatible).

`GET /health`

Health check endpoint.
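
Because the model is downloaded and loaded on first run, scripts that start the server may need to wait before sending requests. The sketch below polls /health until it answers. It assumes an HTTP 200 status means the server is ready, since the response body is not documented here; the function names and the retry defaults are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://127.0.0.1:8880/health", timeout=2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(probe=http_probe, attempts=30, delay_seconds=1.0,
                       sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, pausing between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Call `wait_until_healthy()` after launching the server; passing a custom `probe` or `sleep` makes the loop easy to test without a network.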

`GET /metrics`

Prometheus-style metrics.

Advanced Features

Non-Verbal Symbols

OmniVoice natively supports non-verbal symbols inline in text. These are pass-through features from the upstream model:

response = httpx.post(
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "input": "Hello [laughter] this is amazing [breath] really cool [sigh]"
    }
)

Supported symbols:

[laughter] - Natural laughter
[breath] - Breathing sound
[sigh] - Sighing sound
Other non-verbal expressions supported by OmniVoice

Pronunciation Correction

For Chinese text, you can provide pinyin hints for pronunciation correction:

response = httpx.post(
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "input": "这是拼音(pīn yīn)提示的例子"
    }
)

The server passes these hints directly to OmniVoice without modification.

Advanced Generation Parameters

Fine-tune synthesis quality and characteristics with per-request parameters:

response = httpx.post(
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "input": "Hello world",
        "num_step": 32,                 # Inference steps (1-64, higher=better quality)
        "guidance_scale": 3.0,          # CFG scale (0-10, higher=stronger conditioning)
        "denoise": True,                # Enable denoising (recommended)
        "t_shift": 0.1,                 # Noise schedule shift (0-2, affects quality/speed)
        "position_temperature": 5.0,    # Voice diversity (0=deterministic, higher=more variation)
        "class_temperature": 0.0,       # Token sampling temperature (0=greedy, higher=random)
        "duration": 3.5                 # Fixed output duration in seconds (overrides speed)
    }
)

Voice Consistency & Reproducibility:

For deterministic, reproducible output (same voice every time):

{
    "position_temperature": 0.0,  # Greedy/deterministic voice rendering
    "class_temperature": 0.0      # Greedy token sampling
}

This is especially useful for:

Streaming with consistent voice across sentences
Reproducible synthesis for testing
Fixed voice character in production

Higher position_temperature (default 5.0) produces more variation from the default design prompt and may cause inconsistency when streaming.

Fixed Duration for Video Sync:

Use duration to generate audio of exact length for syncing with video or animations:

{
    "duration": 5.0  # Generate exactly 5 seconds of audio
}

When both duration and speed are provided, duration takes precedence and speed is ignored.

These parameters override server defaults on a per-request basis.
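
The ranges and the duration-over-speed rule described above can be checked on the client before a request is sent. The sketch below uses the ranges quoted in this README (num_step 1-64, guidance_scale 0-10, t_shift 0-2, speed 0.25-4.0). It is a convenience helper with an illustrative name, not server code, and the server remains the authority on what it accepts.

```python
RANGES = {
    "num_step": (1, 64),
    "guidance_scale": (0.0, 10.0),
    "t_shift": (0.0, 2.0),
    "speed": (0.25, 4.0),
}


def build_generation_params(**params) -> dict:
    """Validate documented ranges and drop speed when duration is given."""
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    if params.get("duration") is not None:
        # duration takes precedence, so speed would be ignored anyway.
        params.pop("speed", None)
    return params


print(build_generation_params(num_step=32, speed=1.5, duration=5.0))
```

The returned dict can be merged into the JSON body alongside `input`.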

Examples

See the examples/ directory:

python_client.py - Comprehensive Python client examples
streaming_player.py - Real-time streaming audio player
curl_examples.sh - cURL command examples

Run examples:

# Python client
cd examples
python python_client.py

# Streaming player (requires pyaudio)
pip install pyaudio
python streaming_player.py "Hello, this is streaming audio!"

# cURL examples
chmod +x curl_examples.sh
./curl_examples.sh

Docker Deployment

Quick Start with Docker Compose

# Start the server
docker-compose up -d

# View logs
docker-compose logs -f

# Stop the server
docker-compose down

The server will be available at http://localhost:8880. Voice profiles are persisted in the ./profiles directory.

Build and Run Manually

# Build the image
docker build -t omnivoice-server .

# Run the container
docker run -d \
  -p 8880:8880 \
  -v "$(pwd)/profiles:/app/profiles" \
  -e OMNIVOICE_API_KEY=your-secret-key \
  --name omnivoice \
  omnivoice-server

# View logs
docker logs -f omnivoice

Configuration

Set environment variables in docker-compose.yml or pass them with -e:

OMNIVOICE_HOST=0.0.0.0 - Bind host (must be 0.0.0.0 in Docker)
OMNIVOICE_PORT=8880 - Server port
OMNIVOICE_DEVICE=cpu - Device (cpu, cuda)
OMNIVOICE_NUM_STEP=32 - Inference steps
OMNIVOICE_API_KEY=secret - Optional authentication

For CUDA GPU support, see comments in docker-compose.yml.

Development

Setup

# Clone repository
git clone https://github.com/maemreyo/omnivoice-server.git
cd omnivoice-server

# Install with dev dependencies
pip install -e ".[dev]"

Run Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=omnivoice_server --cov-report=term-missing

# Run specific test
pytest tests/test_streaming.py -v

Code Quality

# Lint
ruff check omnivoice_server/ tests/

# Format
ruff format omnivoice_server/ tests/

# Type check
mypy omnivoice_server/

CI/CD

GitHub Actions workflow runs on every push:

Linting (ruff)
Type checking (mypy)
Tests (pytest)
Python 3.10, 3.11, 3.12

Hardware Requirements

CPU: 4+ cores recommended
RAM: 8GB minimum, 16GB recommended
GPU:
- ✅ NVIDIA GPU with CUDA - Recommended for production (20-25x faster than CPU)
- ❌ Apple Silicon (MPS) - Currently broken due to PyTorch bugs, do not use
- ✅ CPU - Works but slow (5x slower than real-time)
Storage: 3GB for model cache

Device Comparison

Device	Audio Quality	Speed (RTF)	Status
CPU	✅ Excellent	4.92 (slow)	Use for dev
MPS (Apple Silicon)	❌ Broken	N/A	Do not use
CUDA (NVIDIA GPU)	✅ Excellent	~0.2 (fast)	Use for prod

Note: Default device is now cpu due to MPS issues. See docs/verification/MPS_ISSUE.md for technical details.

Performance

Verified benchmark results (CPU, num_step=32):

Metric	Value
Latency	10.2 seconds per voice
RTF (Real-Time Factor)	4.92
Memory	Stable, no leaks

Expected performance on different hardware:

Hardware	num_step	Latency (short text)	RTF
CPU (Intel i7)	32	~10s	4.92
GPU (RTX 3090)	32	~0.5s	~0.2
Apple M1 Max (MPS)	32	❌ Broken audio	N/A

Streaming mode reduces perceived latency by sending audio as soon as the first sentence is ready.
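
To see why sentence-level streaming helps, consider how a long input breaks into units that can be synthesized and sent one at a time. The splitter below is a deliberately naive illustration: it breaks on whitespace after ., !, or ? and does not handle abbreviations or CJK punctuation. How the server actually segments text is not described here, so treat this as a sketch with an illustrative function name.

```python
import re


def split_sentences(text: str) -> list:
    """Naively split text after sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = "Streaming starts early. Each sentence is sent when ready! Does it help?"
for index, sentence in enumerate(split_sentences(text), start=1):
    print(index, sentence)
```

With sentence-level units, the listener waits only for the first sentence to be synthesized rather than for the whole text.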

Troubleshooting

Model Download Issues

The model is downloaded from HuggingFace on first run. If you encounter issues:

# Pre-download the model
python -c "from omnivoice import OmniVoice; OmniVoice.from_pretrained('k2-fsa/OmniVoice')"

# Or use a local model
omnivoice-server --model-id /path/to/local/model

CUDA Out of Memory

Reduce concurrent requests or use CPU:

omnivoice-server --max-concurrent 1 --device cpu

Audio Quality Issues

Increase inference steps above the default of 32 for better quality, at the cost of slower synthesis (the accepted range is 1-64):

omnivoice-server --num-step 64

Known Limitations

Streaming Voice Consistency

When using stream=True, each sentence is synthesized independently from the same instructions or default design prompt. With non-zero temperature settings, timbre can still drift across chunks because there is no shared state between sentence-level synthesis calls.

Workarounds:

Set position_temperature=0 for deterministic voice rendering (recommended):

with httpx.stream(
    "POST",
    "http://127.0.0.1:8880/v1/audio/speech",
    json={
        "input": "Long text...",
        "stream": True,
        "position_temperature": 0.0  # Deterministic voice rendering
    }
) as response:
    for chunk in response.iter_bytes():
        play_audio(chunk)

This minimizes chunk-to-chunk variation and provides more consistent streaming output.

Use one-shot voice cloning for consistent results:

with open("reference.wav", "rb") as f:
    response = httpx.post(
        "http://127.0.0.1:8880/v1/audio/speech/clone",
        data={"text": "Long text..."},
        files={"ref_audio": f}
    )
if response.status_code == 200:
    audio_bytes = response.content

Use explicit instructions for a stable voice character:

{
    "instructions": "female,british accent",
    "stream": True
}

This limitation is inherent to the sentence-by-sentence streaming architecture and does not affect non-streaming synthesis.

Documentation

Comprehensive technical documentation is available in the docs/ directory:

Document	Description
verification/VERIFICATION_RESULTS.md	⭐ Verification results and benchmark data
verification/MPS_ISSUE.md	Technical analysis of Apple Silicon MPS bug
system/ecosystem.md	System context, hardware requirements, deployment
system/specification.md	Complete system specification
architecture/overview.md	Architecture diagrams and component maps
design/dataflow.md	Data flow and API design details

License

MIT

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes with tests
Run code quality checks
Submit a pull request

Acknowledgments

Built on top of OmniVoice by k2-fsa.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions

Release history

0.2.4 - May 12, 2026
0.2.3 - Apr 25, 2026
0.2.2 - Apr 20, 2026
0.2.1 - Apr 18, 2026
0.2.0 - Apr 17, 2026
0.1.2 - Apr 17, 2026
0.1.1 - Apr 16, 2026 (this version)
0.1.0 - Apr 5, 2026
