strands-omnivoice
Multilingual zero-shot TTS toolkit for Strands Agents — 600+ languages, voice cloning, and voice design as agent tools.
Wraps k2-fsa/OmniVoice, a diffusion-language-model TTS that supports 600+ languages with a real-time factor (RTF) as low as 0.025, as a set of @tool functions that any Strands Agent can call.
✨ Features
- 600+ languages: broadest zero-shot TTS coverage available
- Voice cloning: clone any speaker from 3–10 s of reference audio
- Voice design: describe the speaker via attributes (female, british accent, whisper)
- Auto voice: let the model pick a voice
- Built-in ASR: transcribe reference audio with the bundled Whisper model
- Batch synthesis: generate many WAVs in one call, sharing a loaded model
- Inline tags: [laughter], [sigh], pinyin (ZHE2), CMU phonemes ([B EY1 S])
- Apple Silicon + CUDA + CPU: auto-device with STRANDS_OMNIVOICE_DEVICE override
- Singleton loader: every tool shares one cached checkpoint, no reloads
📦 Install
pip install strands-omnivoice
That installs strands-omnivoice plus its omnivoice>=0.1.5 runtime. Pick a PyTorch flavour matching your hardware:
# NVIDIA CUDA (Linux/Windows)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
# Apple Silicon (MPS)
pip install torch==2.8.0 torchaudio==2.8.0
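After installing, it can help to confirm which backend PyTorch actually sees before loading the model. The snippet below is a minimal sketch using only standard PyTorch calls; the helper name torch_backends is hypothetical and not part of strands-omnivoice.

```python
def torch_backends():
    """Report the installed torch version and which accelerators it can see."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return {"torch": None, "cuda": False, "mps": False}
    return {
        "torch": torch.__version__,
        "cuda": torch.cuda.is_available(),
        "mps": torch.backends.mps.is_available(),
    }


print(torch_backends())
```

If both cuda and mps come back False, synthesis would presumably fall back to CPU.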
Developer setup
git clone https://github.com/cagataycali/strands-omnivoice && cd strands-omnivoice
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -q
🚀 Quick Start
from strands import Agent
from strands_omnivoice import (
    omnivoice_tts, omnivoice_clone, omnivoice_design,
    omnivoice_sysinfo, audio_play,
)
agent = Agent(tools=[
    omnivoice_tts, omnivoice_clone, omnivoice_design,
    omnivoice_sysinfo, audio_play,
])
# Auto voice
agent("Synthesize 'Hello world' to /tmp/hello.wav and play it.")
# Voice cloning
agent("Clone the speaker in /tmp/ref.wav and say 'Bonjour le monde' to /tmp/fr.wav.")
# Voice design
agent("Make a british female elderly whisper saying 'Once upon a time' to /tmp/story.wav.")
🧰 Tools
| Tool | Purpose |
|---|---|
| omnivoice_tts | Auto-voice synthesis: text → WAV |
| omnivoice_clone | Voice cloning from a 3–10 s reference clip |
| omnivoice_design | Voice design via attributes (gender, age, pitch, accent, dialect) |
| omnivoice_batch | Multi-item synthesis sharing a single loaded model |
| omnivoice_transcribe | ASR via OmniVoice's bundled Whisper model |
| omnivoice_load_model | Pre-warm / reload the model |
| omnivoice_unload_model | Drop cached weights and free GPU memory |
| omnivoice_download_model | Snapshot-download the checkpoint without loading |
| omnivoice_sysinfo | Device, dtype, OmniVoice version, loaded-state diagnostics |
| omnivoice_list_languages | Browse the 600+ supported languages |
| audio_probe | Inspect any audio file (duration / SR / channels / format) |
| audio_play | Play a WAV via the host's default player (afplay/aplay/paplay/ffplay) |
| omnivoice_demo_serve | Launch the upstream Gradio web UI as a background process |
All tools return the standard Strands tool result shape — they compose freely inside Agent(tools=[...]).
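For orientation, a Strands tool result is commonly a dict with a status field and a list of content blocks. The sketch below assumes that shape; the helpers ok and err are illustrative stand-ins and may differ from the builders this package actually ships in _common.py.

```python
from typing import Any, Dict


def ok(text: str) -> Dict[str, Any]:
    """Build a success result carrying a single text block (assumed shape)."""
    return {"status": "success", "content": [{"text": text}]}


def err(message: str) -> Dict[str, Any]:
    """Build an error result carrying a single text block (assumed shape)."""
    return {"status": "error", "content": [{"text": message}]}


result = ok("wrote /tmp/hello.wav")
if result["status"] == "success":
    print(result["content"][0]["text"])
```

Because every result shares the same keys, a caller can branch on status without knowing which tool produced it.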
🎛️ Voice Design — Attribute Reference
instruct= accepts a comma-separated list of attributes. Categories below are mutually exclusive within each row; combine across rows freely.
| Category | Values |
|---|---|
| Gender | male, female |
| Age | child, teenager, young adult, middle-aged, elderly |
| Pitch | very low pitch, low pitch, moderate pitch, high pitch, very high pitch |
| Style | whisper |
| English accent (EN text only) | american, british, australian, canadian, indian, chinese, korean, portuguese, russian, japanese accent |
| Chinese dialect (ZH text only) | 四川话, 陕西话, 东北话, 云南话, 河南话, ... |
Examples:
"female, young adult, high pitch, british accent"
"male, elderly, low pitch, whisper"
"女, 青年, 四川话"
See the upstream voice-design docs for the full table.
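The "one value per row" rule can be checked before calling omnivoice_design. The following is a rough sketch, not part of the package: check_instruct and CATEGORIES are hypothetical names, the accent values are assumed to be written as "<name> accent" as in the example above, and the Chinese dialect row is left out because the table elides part of it. Unrecognized values pass through unchecked.

```python
from typing import Dict, List, Set

# Values copied from the attribute table; accent spelling is an assumption.
CATEGORIES: Dict[str, Set[str]] = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {"very low pitch", "low pitch", "moderate pitch",
              "high pitch", "very high pitch"},
    "style": {"whisper"},
    "accent": {f"{name} accent" for name in (
        "american", "british", "australian", "canadian", "indian",
        "chinese", "korean", "portuguese", "russian", "japanese")},
}


def check_instruct(instruct: str) -> List[str]:
    """Return conflict messages; an empty list means no conflict was found."""
    seen: Dict[str, str] = {}
    problems: List[str] = []
    for raw in instruct.split(","):
        value = raw.strip().lower()
        for category, allowed in CATEGORIES.items():
            if value in allowed:
                if category in seen and seen[category] != value:
                    problems.append(
                        f"{category}: '{seen[category]}' conflicts with '{value}'")
                seen.setdefault(category, value)
    return problems


print(check_instruct("female, young adult, high pitch, british accent"))
print(check_instruct("male, female, whisper"))
```

The second call reports a gender conflict, since male and female sit in the same row.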
🔊 Inline Tags & Pronunciation Control
agent("""omnivoice_tts text="[laughter] You really got me." output="/tmp/laugh.wav" """)
# Chinese — pinyin pronunciation override
agent("""omnivoice_tts text="这批货物打ZHE2出售。" output="/tmp/pinyin.wav" """)
# English — CMU phoneme override
agent("""omnivoice_tts text="He plays the [B EY1 S] guitar." output="/tmp/cmu.wav" """)
Supported tags: [laughter], [sigh], [confirmation-en], [question-en], [question-ah/oh/ei/yi], [surprise-ah/oh/wa/yo], [dissatisfaction-hnn].
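A typo in a tag is easy to miss, so it may be worth scanning text for unknown bracketed tags before synthesis. The sketch below is illustrative only: unknown_tags is a hypothetical helper, it assumes the slash-separated entries above expand to one tag each (for example [question-ah], [question-oh]), and it assumes a bracketed run of uppercase symbols with optional stress digits is a CMU phoneme override rather than a tag.

```python
import re
from typing import List

# Assumed expansion of the supported-tag list above.
SUPPORTED_TAGS = {
    "laughter", "sigh", "confirmation-en", "question-en",
    "question-ah", "question-oh", "question-ei", "question-yi",
    "surprise-ah", "surprise-oh", "surprise-wa", "surprise-yo",
    "dissatisfaction-hnn",
}

BRACKETED = re.compile(r"\[([^\[\]]+)\]")
# Uppercase phoneme symbols, each with an optional stress digit 0-2.
CMU_OVERRIDE = re.compile(r"[A-Z]+[0-2]?(?: [A-Z]+[0-2]?)*")


def unknown_tags(text: str) -> List[str]:
    """Return bracketed items that are neither known tags nor CMU overrides."""
    unknown = []
    for body in BRACKETED.findall(text):
        if CMU_OVERRIDE.fullmatch(body):
            continue  # treated as a phoneme override, not a tag
        if body not in SUPPORTED_TAGS:
            unknown.append(body)
    return unknown


print(unknown_tags("[laughter] You really got me."))
print(unknown_tags("[giggle] He plays the [B EY1 S] guitar."))
```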
⚙️ Configuration
Environment variables override defaults:
| Var | Default | Description |
|---|---|---|
| STRANDS_OMNIVOICE_MODEL | k2-fsa/OmniVoice | HF repo or local checkpoint path |
| STRANDS_OMNIVOICE_DEVICE | auto (cuda → mps → cpu) | Force device |
| STRANDS_OMNIVOICE_DTYPE | auto | float16, float32, bfloat16 |
Or pass per-call via model_id= / device= arguments to any tool.
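The precedence described here (explicit argument, then environment variable, then the cuda → mps → cpu fallback) can be sketched as a small pure function. This is an assumption about how the pieces combine, not the package's actual code; resolve_device and its parameters are hypothetical, and backend availability is passed in as booleans so the logic stays easy to test.

```python
import os
from typing import Mapping, Optional


def resolve_device(
    cuda_available: bool,
    mps_available: bool,
    device: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a device: per-call argument, then env var, then auto fallback."""
    env = os.environ if env is None else env
    for choice in (device, env.get("STRANDS_OMNIVOICE_DEVICE")):
        if choice and choice.strip().lower() != "auto":
            return choice.strip().lower()
    # Auto mode: prefer CUDA, then Apple MPS, then CPU.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


print(resolve_device(False, True, env={}))
print(resolve_device(True, True, env={"STRANDS_OMNIVOICE_DEVICE": "cpu"}))
```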
🧪 Testing the Agent
python agent.py "Show sysinfo, then synthesize 'Привет мир' to /tmp/ru.wav and play it."
Without args, agent.py lists every registered tool.
🏗️ Architecture
strands_omnivoice/
├── __init__.py # exports: 13 tools + loader API
├── _common.py # ToolResult builders (ok/err) + path helpers
├── _loader.py # singleton OmniVoice loader (thread-safe)
└── tools/
├── tts.py # auto-voice synthesis
├── clone.py # voice cloning
├── design.py # voice design (attributes)
├── batch.py # multi-item generation
├── transcribe.py # ASR
├── model_lifecycle.py # load / unload / download
├── info.py # sysinfo + list_languages
├── audio_utils.py # probe + play
└── demo_server.py # Gradio UI launcher
The loader caches one model per (model_id, device) key — every tool gets the same instance, so a workflow that calls omnivoice_clone then omnivoice_design only loads weights once.
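The caching behaviour can be illustrated with a small lock-guarded dictionary. This is a sketch of the general pattern under stated assumptions, not the contents of _loader.py: ModelCache and fake_load are hypothetical names, and a stand-in loader replaces the real OmniVoice load so the example runs without any weights.

```python
import threading
from typing import Callable, Dict, List, Tuple


class ModelCache:
    """Keeps one loaded object per (model_id, device) key."""

    def __init__(self, load_fn: Callable[[str, str], object]) -> None:
        self._load_fn = load_fn
        self._lock = threading.Lock()
        self._models: Dict[Tuple[str, str], object] = {}

    def get(self, model_id: str, device: str) -> object:
        key = (model_id, device)
        with self._lock:  # serialize loads so a key is never loaded twice
            if key not in self._models:
                self._models[key] = self._load_fn(model_id, device)
            return self._models[key]

    def unload(self, model_id: str, device: str) -> bool:
        """Drop a cached entry; returns True if something was removed."""
        with self._lock:
            return self._models.pop((model_id, device), None) is not None


loads: List[Tuple[str, str]] = []


def fake_load(model_id: str, device: str) -> object:
    loads.append((model_id, device))  # record each real load
    return object()


cache = ModelCache(fake_load)
first = cache.get("k2-fsa/OmniVoice", "cpu")
second = cache.get("k2-fsa/OmniVoice", "cpu")
print(first is second, len(loads))
```

Holding the lock for the whole load keeps the sketch simple, at the cost of blocking requests for other keys while one model loads.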
🤝 Acknowledgments
- k2-fsa/OmniVoice — the upstream model. Massive credit to Han Zhu and the k2-fsa team.
- Strands Agents — the agent framework.
- strands-cosmos — sister project that inspired this scaffold.
📄 License
Apache 2.0 — same as upstream OmniVoice. See LICENSE.
Disclaimer: as with the upstream model, you are strictly prohibited from using this for unauthorized voice cloning, impersonation, fraud, or any illegal/unethical activity. Use responsibly.