Whisper-style CLI for Qwen3-TTS text-to-speech
qwen-tts-cli
Whisper-style CLI for Qwen3-TTS text-to-speech. One command, instant speech.
Install
# Apple Silicon (recommended for Mac — 6x faster)
pip install "qwen-tts-cli[mlx]"
# CUDA / CPU
pip install "qwen-tts-cli[transformers]"
Usage
# Just speak
qwen-tts "Hello, world!"
# Choose a speaker and style
qwen-tts "I can't believe it!" --speaker Aiden --instruct "Speak with excitement"
# Save to a specific file
qwen-tts "Good morning." -o greeting.wav
# Use the larger model
qwen-tts "Higher quality voice." --model 1.7B
# Force a specific backend (auto-detected by default)
qwen-tts "Fast on Mac!" --backend mlx
# Clone a voice from a 3-second sample
qwen-tts "Now I sound like someone else." --clone reference.wav --ref-text "Transcript of the reference audio."
# Design a voice from a description
qwen-tts "Hi there!" --design --instruct "A warm, deep male voice with a calm tone"
# Read from a file
qwen-tts -f article.txt
# Read from stdin
echo "Pipe text in" | qwen-tts -
# List available speakers
qwen-tts --list-speakers
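For scripting, the commands above can be driven from Python. The following is a minimal sketch, not part of qwen-tts-cli: `build_command` and `speak` are hypothetical helper names, and the code assumes `qwen-tts` is installed and on your `PATH`. It uses only flags documented on this page.

```python
import subprocess


def build_command(text, speaker=None, instruct=None, output=None, model=None):
    """Assemble an argv list for the qwen-tts CLI (hypothetical helper)."""
    cmd = ["qwen-tts", text]
    if speaker:
        cmd += ["--speaker", speaker]
    if instruct:
        cmd += ["--instruct", instruct]
    if output:
        cmd += ["-o", output]
    if model:
        cmd += ["--model", model]
    return cmd


def speak(text, **options):
    """Run qwen-tts; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(text, **options), check=True)


# Example (requires qwen-tts-cli to be installed), equivalent to
#   qwen-tts "Good morning." --speaker Aiden -o greeting.wav
# speak("Good morning.", speaker="Aiden", output="greeting.wav")
```

Passing the arguments as a list avoids shell quoting problems with text that contains apostrophes or quotes.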
Streaming (MLX backend)
Stream long text as real-time audio instead of waiting for full generation:
# Token-level streaming — lowest latency, plays audio as tokens are generated
qwen-tts --stream "The ocean has always called to us."
# Sentence chunking — generates each sentence fully, no mid-word cuts
qwen-tts --chunk-sentences 1 -f article.txt
# Two sentences per chunk — longer prosody, fewer pauses
qwen-tts --chunk-sentences 2 -f article.txt
# Paragraph chunking — one chunk per paragraph (split on blank lines)
qwen-tts --chunk-paragraphs -f article.txt
# Hybrid mode — sentence chunks with token streaming within each (best for slower hardware)
qwen-tts --stream --chunk-sentences 2 -f article.txt
# Hybrid with paragraphs
qwen-tts --stream --chunk-paragraphs -f article.txt
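To make the chunking modes concrete, here is a rough Python sketch of how text could be grouped into sentence or paragraph chunks before synthesis. It is an illustration only: the function names are hypothetical, the sentence splitter is a naive punctuation rule, and the CLI's actual segmentation may differ.

```python
import re


def split_sentences(text):
    """Split after ., !, or ? when followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text, n):
    """Group sentences into chunks of n, in the spirit of --chunk-sentences N."""
    if n < 1:
        raise ValueError("n must be at least 1")
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]


def chunk_paragraphs(text):
    """Split on blank lines, in the spirit of --chunk-paragraphs."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]
```

Larger chunks give the model more context for prosody, at the cost of a longer wait before the first audio of each chunk.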
Options
positional arguments:
text Text to speak. Use "-" to read from stdin.
options:
-f, --file FILE Read text from a file
-o, --output FILE Output audio file (default: output.wav)
-m, --model SIZE Model: 0.6B, 1.7B, or full HF ID (default: 0.6B)
-b, --backend BACKEND Inference backend: transformers, mlx (default: auto)
-s, --speaker NAME Speaker voice (default: Ryan)
-l, --language LANG Language (default: Auto)
-i, --instruct TEXT Style/emotion instruction
--device DEVICE Force device: cuda:0, mps, cpu (default: auto, transformers only)
--play / --no-play Play audio after generation (default: on for macOS)
--list-speakers List available speakers and exit
voice cloning:
--clone AUDIO Reference audio for voice cloning
--ref-text TEXT Transcript of reference audio
voice design:
--design Design a voice using --instruct description
streaming (MLX backend only):
--stream Token-level streaming (combine with chunking for hybrid mode)
--stream-interval SECS Seconds per token-level chunk (default: 2.0)
--chunk-sentences N Stream in chunks of N sentences
--chunk-paragraphs Stream in paragraph chunks (split on blank lines)
Speakers
| Speaker | Description | Language |
|---|---|---|
| Ryan | Dynamic rhythmic male | English |
| Aiden | Sunny clear male | English |
| Vivian | Bright young female | Chinese |
| Serena | Warm gentle female | Chinese |
| Uncle_Fu | Seasoned mellow male | Chinese |
| Dylan | Clear natural male | Chinese (Beijing) |
| Eric | Lively bright male | Chinese (Sichuan) |
| Ono_Anna | Playful light female | Japanese |
| Sohee | Warm emotional female | Korean |
Supported languages
Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.
Backends
Transformers (default)
Uses PyTorch + HuggingFace Transformers. Works on all platforms.
| Platform | Device | Precision |
|---|---|---|
| NVIDIA GPU | cuda | bfloat16 |
| Apple Silicon | mps | float32 |
| CPU | cpu | float32 |
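The table above can be read as a simple selection rule. The sketch below encodes it in Python; `pick_device` is a hypothetical name, and the preference order (CUDA, then MPS, then CPU) is an assumption rather than something this page states.

```python
def pick_device(cuda_available, mps_available):
    """Return (device, precision) following the platform table.

    Assumption: CUDA is preferred over MPS, and CPU is the fallback.
    """
    if cuda_available:
        return ("cuda", "bfloat16")
    if mps_available:
        return ("mps", "float32")
    return ("cpu", "float32")
```

With PyTorch installed, the two flags could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.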
MLX (Apple Silicon)
Uses mlx-audio with quantized models from mlx-community for native Apple Silicon acceleration. All modes (speak, clone, design) are supported.
qwen-tts "Hello!" --backend mlx
qwen-tts "Hello!" --backend mlx --model 0.6B # smaller, faster
qwen-tts "Hi!" --backend mlx --design --instruct "warm" # voice design
| Size | Mode | Quant | HuggingFace ID |
|---|---|---|---|
| 0.6B | speak | 6-bit | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-6bit |
| 0.6B | clone | 4-bit | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit |
| 1.7B | speak | 4-bit | mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-4bit |
| 1.7B | clone | 8-bit | mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit |
| 1.7B | design | 8-bit | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit |
Additional quantization variants (4bit, 5bit, 6bit, 8bit, bf16) are available on HuggingFace for all families. Use benchmark_quant.py to find the best variant for your hardware.
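The size and mode combinations in the table can be expressed as a lookup. This is a sketch for illustration: `MLX_MODELS` and `resolve_mlx_model` are hypothetical names, and how the CLI itself handles a combination that is not listed (such as 0.6B design) is not documented here.

```python
# (size, mode) -> HuggingFace ID, copied from the table above.
MLX_MODELS = {
    ("0.6B", "speak"): "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-6bit",
    ("0.6B", "clone"): "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ("1.7B", "speak"): "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-4bit",
    ("1.7B", "clone"): "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
    ("1.7B", "design"): "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit",
}


def resolve_mlx_model(size, mode):
    """Look up the listed model ID, or raise ValueError if none is listed."""
    try:
        return MLX_MODELS[(size, mode)]
    except KeyError:
        raise ValueError(
            f"No MLX model listed for size={size!r}, mode={mode!r}"
        ) from None
```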
Benchmark (Apple Silicon)
Tested on a 16 GB M1 MacBook Pro with the same input text (about 14 s of audio output):
| Model | Load time | Avg generation time | RTF |
|---|---|---|---|
| Transformers 0.6B (mps) | 10.6s | 61.4s | 4.36 |
| Transformers 1.7B (mps) | 85.0s | 117.7s | 8.08 |
| MLX 1.7B 8-bit | 2.3s | 10.2s | 1.00 |
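In this benchmark, MLX 1.7B 8-bit generates about 6x faster than Transformers 0.6B and about 11.5x faster than the equivalent Transformers 1.7B model, while using less memory. RTF (real-time factor) is generation time divided by the duration of the audio produced, so lower is faster, and 1.0 means generation runs at real-time speed.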
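The speed ratios can be checked directly from the table. The sketch below assumes RTF is generation time divided by audio duration, which is consistent with the reported values; the helper names are illustrative, and the exact audio duration of each run is not given on this page.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF under the assumed definition: generation time / audio duration."""
    return generation_seconds / audio_seconds


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Average generation times (seconds) from the benchmark table.
AVG_GEN = {
    "Transformers 0.6B (mps)": 61.4,
    "Transformers 1.7B (mps)": 117.7,
    "MLX 1.7B 8-bit": 10.2,
}

mlx = AVG_GEN["MLX 1.7B 8-bit"]
for name, seconds in AVG_GEN.items():
    if name != "MLX 1.7B 8-bit":
        print(f"MLX 1.7B 8-bit vs {name}: {speedup(seconds, mlx):.1f}x")
```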
License
Apache-2.0 (same as Qwen3-TTS)
Download files
File details
Details for the file qwen_tts_cli-0.5.0.tar.gz.
File metadata
- Download URL: qwen_tts_cli-0.5.0.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ad47d14a0a7ab87e6c64a78ba491e19ebffdcb7c6b3e513355a21d0a2c584b2 |
| MD5 | 90fef34a74efdf0ad6e55a99730cd600 |
| BLAKE2b-256 | 0a3435cd56796961af9326dfb68693357df1b06984638090dc81e9e9aaaa1a49 |
File details
Details for the file qwen_tts_cli-0.5.0-py3-none-any.whl.
File metadata
- Download URL: qwen_tts_cli-0.5.0-py3-none-any.whl
- Upload date:
- Size: 15.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07d7d8a2f7dc3a9f9c450ba71d325135b289616cd7c8a37d4aebd92a6a1b26e3 |
| MD5 | a975d1344d473bc0caff4680e1cbbd2a |
| BLAKE2b-256 | e70eab37fd0f80c5af52000b513ba790a932acc47f7e6f2c6d47ea35301bc662 |
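A downloaded file can be checked against the SHA256 digests listed above. This is a minimal sketch using Python's standard `hashlib`; `sha256_of` and `matches` are hypothetical helper names, and the local path in the commented example is an assumption about where you saved the file.

```python
import hashlib

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "qwen_tts_cli-0.5.0.tar.gz":
        "5ad47d14a0a7ab87e6c64a78ba491e19ebffdcb7c6b3e513355a21d0a2c584b2",
    "qwen_tts_cli-0.5.0-py3-none-any.whl":
        "07d7d8a2f7dc3a9f9c450ba71d325135b289616cd7c8a37d4aebd92a6a1b26e3",
}


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA-256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.lower()


# Example, assuming the sdist was saved in the current directory:
# name = "qwen_tts_cli-0.5.0.tar.gz"
# print(matches(name, EXPECTED_SHA256[name]))
```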