
OmniVoice multilingual zero-shot TTS toolkit for Strands Agents — voice cloning, voice design, and 600+ language synthesis as agent tools


strands-omnivoice


Multilingual zero-shot TTS toolkit for Strands Agents — 600+ languages, voice cloning, and voice design as agent tools.

Wraps k2-fsa/OmniVoice, a state-of-the-art diffusion-language-model TTS system that supports 600+ languages with a real-time factor (RTF) as low as 0.025, as a set of @tool functions that any Strands Agent can call.


✨ Features

  • 600+ languages — broadest zero-shot TTS coverage available
  • Voice cloning — clone any speaker from 3–10s of reference audio
  • Voice design — describe the speaker via attributes (female, british accent, whisper)
  • Auto voice — let the model pick a voice
  • Built-in ASR — transcribe reference audio with the bundled Whisper model
  • Batch synthesis — generate many WAVs in one call, sharing a loaded model
  • Inline tags and pronunciation control: [laughter], [sigh], pinyin (ZHE2), CMU phonemes ([B EY1 S])
  • Apple Silicon + CUDA + CPU — auto-device with STRANDS_OMNIVOICE_DEVICE override
  • Singleton loader — every tool shares one cached checkpoint, no reloads

📦 Install

pip install strands-omnivoice

This installs strands-omnivoice and its runtime dependency, omnivoice>=0.1.5. Then install the PyTorch build that matches your hardware:

# NVIDIA CUDA (Linux/Windows)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon (MPS)
pip install torch==2.8.0 torchaudio==2.8.0

Developer setup

git clone https://github.com/cagataycali/strands-omnivoice && cd strands-omnivoice
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -q

🚀 Quick Start

from strands import Agent
from strands_omnivoice import (
    omnivoice_tts, omnivoice_clone, omnivoice_design,
    omnivoice_sysinfo, audio_play,
)

agent = Agent(tools=[
    omnivoice_tts, omnivoice_clone, omnivoice_design,
    omnivoice_sysinfo, audio_play,
])

# Auto voice
agent("Synthesize 'Hello world' to /tmp/hello.wav and play it.")

# Voice cloning
agent("Clone the speaker in /tmp/ref.wav and say 'Bonjour le monde' to /tmp/fr.wav.")

# Voice design
agent("Make a british female elderly whisper saying 'Once upon a time' to /tmp/story.wav.")

🧰 Tools

| Tool | Purpose |
|---|---|
| omnivoice_tts | Auto-voice synthesis (text → WAV) |
| omnivoice_clone | Voice cloning from a 3–10 s reference clip |
| omnivoice_design | Voice design via attributes (gender, age, pitch, accent, dialect) |
| omnivoice_batch | Multi-item synthesis sharing a single loaded model |
| omnivoice_transcribe | ASR via OmniVoice's bundled Whisper model |
| omnivoice_load_model | Pre-warm / reload the model |
| omnivoice_unload_model | Drop cached weights and free GPU memory |
| omnivoice_download_model | Snapshot-download the checkpoint without loading |
| omnivoice_sysinfo | Device, dtype, OmniVoice version, loaded-state diagnostics |
| omnivoice_list_languages | Browse the 600+ supported languages |
| audio_probe | Inspect any audio file (duration / SR / channels / format) |
| audio_play | Play a WAV via the host's default player (afplay/aplay/paplay/ffplay) |
| omnivoice_demo_serve | Launch the upstream Gradio web UI as a background process |

All tools return the standard Strands tool result shape — they compose freely inside Agent(tools=[...]).
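As a rough illustration of that result shape, the sketch below builds success and error results as plain dictionaries with a status and a list of content blocks. The helper names ok and err mirror the builders mentioned for _common.py, but their signatures and the optional "json" payload are assumptions made for this example, not the package's actual code.

```python
from typing import Any


def ok(text: str, **data: Any) -> dict:
    """Build a success result: a status plus a list of content blocks."""
    content: list = [{"text": text}]
    if data:
        # Hypothetical structured payload, e.g. the output path.
        content.append({"json": data})
    return {"status": "success", "content": content}


def err(message: str) -> dict:
    """Build an error result carrying a human-readable message."""
    return {"status": "error", "content": [{"text": message}]}


result = ok("Wrote /tmp/hello.wav", path="/tmp/hello.wav")
failure = err("Reference audio not found")
```

Because every tool would return the same two-key shape, an agent can chain tools without special-casing any of them.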


🎛️ Voice Design — Attribute Reference

instruct= accepts a comma-separated list of attributes. Values within a row are mutually exclusive; values from different rows can be combined freely.

| Category | Values |
|---|---|
| Gender | male, female |
| Age | child, teenager, young adult, middle-aged, elderly |
| Pitch | very low pitch, low pitch, moderate pitch, high pitch, very high pitch |
| Style | whisper |
| English accent (EN text only) | american, british, australian, canadian, indian, chinese, korean, portuguese, russian, japanese accent |
| Chinese dialect (ZH text only) | 四川话, 陕西话, 东北话, 云南话, 河南话, ... |

Examples:

"female, young adult, high pitch, british accent"
"male, elderly, low pitch, whisper"
"女, 青年, 四川话"

See the upstream voice-design docs for the full table.
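The one-value-per-category rule can be checked before calling the tool. The sketch below is a hypothetical helper, not part of strands-omnivoice: it covers only the English values listed in the table, assumes accents are written as "<name> accent" (as in "british accent"), and ignores any attribute it does not recognise, since the upstream table is larger.

```python
# Values copied from the attribute table; the Chinese dialect row is omitted
# because the source list is truncated.
ACCENTS = ("american", "british", "australian", "canadian", "indian",
           "chinese", "korean", "portuguese", "russian", "japanese")

CATEGORIES = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {"very low pitch", "low pitch", "moderate pitch",
              "high pitch", "very high pitch"},
    "style": {"whisper"},
    # Assumption: each accent is spelled "<name> accent".
    "accent": {f"{name} accent" for name in ACCENTS},
}


def find_conflicts(instruct: str) -> dict:
    """Return categories that received more than one value."""
    seen: dict = {}
    for raw in instruct.split(","):
        value = raw.strip().lower()
        for category, allowed in CATEGORIES.items():
            if value in allowed:
                seen.setdefault(category, []).append(value)
    return {cat: vals for cat, vals in seen.items() if len(vals) > 1}


conflicts = find_conflicts("male, female, whisper")
```

An empty dictionary means the string uses at most one recognised value per category.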


🔊 Inline Tags & Pronunciation Control

agent("""omnivoice_tts text="[laughter] You really got me." output="/tmp/laugh.wav" """)

# Chinese — pinyin pronunciation override
agent("""omnivoice_tts text="这批货物打ZHE2出售。" output="/tmp/pinyin.wav" """)

# English — CMU phoneme override
agent("""omnivoice_tts text="He plays the [B EY1 S] guitar." output="/tmp/cmu.wav" """)

Supported tags: [laughter], [sigh], [confirmation-en], [question-en], [question-ah/oh/ei/yi], [surprise-ah/oh/wa/yo], [dissatisfaction-hnn].
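A misspelled tag is easy to miss in a long prompt, so a pre-flight check can help. The following is a minimal sketch of a hypothetical helper, not a function provided by the package. It assumes that slash-separated entries such as [question-ah/oh/ei/yi] are shorthand for four separate tags, and that an all-upper-case bracket body is a CMU phoneme override rather than a tag.

```python
import re

# Assumption: "[question-ah/oh/ei/yi]" expands to question-ah, question-oh, ...
KNOWN_TAGS = {
    "laughter", "sigh", "confirmation-en", "question-en", "dissatisfaction-hnn",
    *(f"question-{s}" for s in ("ah", "oh", "ei", "yi")),
    *(f"surprise-{s}" for s in ("ah", "oh", "wa", "yo")),
}

BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def unknown_tags(text: str) -> list:
    """Return bracketed bodies that are neither known tags nor phonemes."""
    unknown = []
    for body in BRACKETED.findall(text):
        # Bodies like "B EY1 S" are treated as CMU phoneme overrides.
        if body == body.upper():
            continue
        if body not in KNOWN_TAGS:
            unknown.append(body)
    return unknown


problems = unknown_tags("[giggle] That was close. [sigh]")
```

Here problems holds the bodies that would need fixing before synthesis.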


⚙️ Configuration

Environment variables override defaults:

| Var | Default | Description |
|---|---|---|
| STRANDS_OMNIVOICE_MODEL | k2-fsa/OmniVoice | HF repo or local checkpoint path |
| STRANDS_OMNIVOICE_DEVICE | auto (cuda → mps → cpu) | Force device |
| STRANDS_OMNIVOICE_DTYPE | auto | float16, float32, bfloat16 |

Or pass per-call via model_id= / device= arguments to any tool.
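The device fallback can be pictured as a small resolution function. This is a sketch under stated assumptions, not the package's loader: the function name, its parameters, and the precedence order (per-call argument, then environment variable, then auto-detection) are all hypothetical. Availability is passed in as booleans so the example runs without PyTorch installed.

```python
import os
from typing import Mapping, Optional


def resolve_device(
    cuda_available: bool,
    mps_available: bool,
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a device: explicit argument, then env override, then auto."""
    if explicit:
        return explicit
    env = os.environ if env is None else env
    forced = env.get("STRANDS_OMNIVOICE_DEVICE", "auto").strip().lower()
    if forced and forced != "auto":
        return forced
    # Auto order from the table above: cuda → mps → cpu.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


device = resolve_device(cuda_available=False, mps_available=True, env={})
```

With PyTorch installed, the two flags could come from torch.cuda.is_available() and torch.backends.mps.is_available().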


🧪 Testing the Agent

python agent.py "Show sysinfo, then synthesize 'Привет мир' to /tmp/ru.wav and play it."

Without args, agent.py lists every registered tool.


🏗️ Architecture

strands_omnivoice/
├── __init__.py           # exports: 13 tools + loader API
├── _common.py            # ToolResult builders (ok/err) + path helpers
├── _loader.py            # singleton OmniVoice loader (thread-safe)
└── tools/
    ├── tts.py            # auto-voice synthesis
    ├── clone.py          # voice cloning
    ├── design.py         # voice design (attributes)
    ├── batch.py          # multi-item generation
    ├── transcribe.py     # ASR
    ├── model_lifecycle.py  # load / unload / download
    ├── info.py           # sysinfo + list_languages
    ├── audio_utils.py    # probe + play
    └── demo_server.py    # Gradio UI launcher

The loader caches one model per (model_id, device) key — every tool gets the same instance, so a workflow that calls omnivoice_clone then omnivoice_design only loads weights once.
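The caching behaviour described here can be sketched as a lock-guarded dictionary keyed by (model_id, device). The code below is illustrative only: get_model, unload, and the factory argument are hypothetical names, and the real _loader.py may be structured differently. A stand-in factory replaces the actual checkpoint load so the example runs anywhere.

```python
import threading
from typing import Any, Callable, Dict, Tuple

_CACHE: Dict[Tuple[str, str], Any] = {}
_LOCK = threading.Lock()


def get_model(model_id: str, device: str,
              factory: Callable[[str, str], Any]) -> Any:
    """Return the cached model for this key, loading it at most once."""
    key = (model_id, device)
    with _LOCK:  # serialise loads so concurrent callers share one instance
        if key not in _CACHE:
            _CACHE[key] = factory(model_id, device)
        return _CACHE[key]


def unload(model_id: str, device: str) -> bool:
    """Drop a cached model; report whether anything was removed."""
    with _LOCK:
        return _CACHE.pop((model_id, device), None) is not None


calls = []


def fake_factory(model_id: str, device: str) -> object:
    # Stand-in for the expensive checkpoint load.
    calls.append((model_id, device))
    return object()


first = get_model("k2-fsa/OmniVoice", "cpu", fake_factory)
second = get_model("k2-fsa/OmniVoice", "cpu", fake_factory)
```

Holding the lock during the load keeps the sketch simple; a production loader might use per-key locks so that loading one model does not block lookups for another.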




📄 License

Apache 2.0 — same as upstream OmniVoice. See LICENSE.

Disclaimer: as with the upstream model, you are strictly prohibited from using this for unauthorized voice cloning, impersonation, fraud, or any illegal/unethical activity. Use responsibly.
