# Qwen3 TTS Web App

A FastAPI + vanilla JS UI to run Qwen3-TTS locally: custom voices, voice design, voice cloning, and per-request model selection.
## Documentation
- Overview
- Quickstart
- API
- Configuration
- Docker
- Architecture
- Troubleshooting
- AI Output Disclaimer
- Anti-Fraud Warning
- Not a Companion
## Prerequisites

- Python 3.10+ with a GPU-enabled PyTorch build (a GPU is strongly recommended).
- Disk space and bandwidth for model downloads (several GB on first load).
- Optional: FlashAttention 2 if your GPU supports it (`pip install -U flash-attn --no-build-isolation`).
## Setup

```shell
pip install -r requirements.txt
```

If your machine cannot download weights at runtime, pre-download a model (e.g. `huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice`) and point `QWEN_TTS_MODEL` to that path.
## Run

```shell
uvicorn app.main:app --reload --port 8000
```

Open http://localhost:8000 for the UI. API endpoints live under `/api/*`.
## Docker

### Build + run (CPU)

```shell
docker build -t qwen-tts .
docker run --rm -p 8000:8000 qwen-tts
```

### Build + run (GPU)

Requires the NVIDIA Container Toolkit and a CUDA-capable host.

```shell
docker build -t qwen-tts .
docker run --rm --gpus all -e QWEN_TTS_DEVICE=cuda:0 -p 8000:8000 qwen-tts
```

### Docker Compose

```shell
docker compose up --build
```

Compose defaults to GPU (`QWEN_TTS_DEVICE=cuda:0`). For CPU-only, set `QWEN_TTS_DEVICE=cpu` in docker-compose.yml.
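The CPU-only override can be sketched as a compose fragment. `QWEN_TTS_DEVICE` is the documented variable; the service name and build details are illustrative assumptions, not taken from this repository's docker-compose.yml:

```yaml
# Illustrative docker-compose.yml fragment (service name and build are assumptions).
services:
  qwen-tts:
    build: .
    ports:
      - "8000:8000"
    environment:
      QWEN_TTS_DEVICE: cpu  # documented env var; set cuda:0 (with GPU access) for GPU runs
```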
## Configuration (env vars)

- `QWEN_TTS_MODEL` — default model id or local path (default: `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`).
- `QWEN_TTS_DEVICE` — device map (default: `cuda:0` if available, else `cpu`).
- `QWEN_TTS_USE_FLASH` — set to `1` to try FlashAttention 2.
- `QWEN_TTS_CUSTOM_MODEL` — override the default for Custom Voice mode (else uses `QWEN_TTS_MODEL`).
- `QWEN_TTS_VD_MODEL` — override the default for Voice Design mode (default: `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`).
- `QWEN_TTS_CLONE_MODEL` — override the default for Voice Clone mode (default: `Qwen/Qwen3-TTS-12Hz-1.7B-Base`).
- `QWEN_TTS_VIDEO_FONT` — full path to a font file for video transcript rendering (useful for CJK/foreign text).
Requests can override `model_id` and `device` per call, but the UI auto-selects the recommended model for each mode per the upstream README.
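As a sketch of how these defaults compose, assuming the resolution order described above (per-mode override first, then the global `QWEN_TTS_MODEL` for Custom Voice, then the documented defaults) — the app's actual resolution code may differ:

```python
import os

# Documented per-mode defaults from the configuration list above.
DEFAULTS = {
    "custom_voice": "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    "voice_design": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "voice_clone": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
}

# Documented per-mode override env vars.
OVERRIDES = {
    "custom_voice": "QWEN_TTS_CUSTOM_MODEL",
    "voice_design": "QWEN_TTS_VD_MODEL",
    "voice_clone": "QWEN_TTS_CLONE_MODEL",
}

def resolve_model(mode: str, env=os.environ) -> str:
    """Pick the model id for a mode: the per-mode override wins, then the
    global QWEN_TTS_MODEL (Custom Voice only), then the documented default."""
    override = env.get(OVERRIDES[mode])
    if override:
        return override
    if mode == "custom_voice" and env.get("QWEN_TTS_MODEL"):
        return env["QWEN_TTS_MODEL"]
    return DEFAULTS[mode]
```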
## Model quick reference (from upstream README)

- Custom Voice: `Qwen/Qwen3-TTS-12Hz-{0.6B,1.7B}-CustomVoice` (speaker list included).
- Voice Design: `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` (describe a persona; no speaker list).
- Voice Clone: `Qwen/Qwen3-TTS-12Hz-{0.6B,1.7B}-Base` (provide reference audio + transcript).
- Tokenizer (encode/decode only): `Qwen/Qwen3-TTS-Tokenizer-12Hz`.
## Features

- Custom Voice: pick a provided speaker, language, and optional style prompt.
- Voice Design: describe a persona and language; the model invents the voice.
- Voice Clone: supply reference audio (URL/path/base64) plus a transcript to clone a voice.
- Model selection: choose any released model id or local directory per request.
- UI: shows available speakers/languages, plays audio inline, and offers WAV download.
- Recording/upload for cloning: record in-browser or upload; the UI converts to WAV before sending.
- Saved voices: build a reusable voice profile (clone prompt) once and reuse it without re-uploading audio.
- MP3 download: generation stays WAV; pick MP3 in the UI to convert the generated clip on demand (requires `pydub` + `ffmpeg`).
- Video export: render a vertical/square/landscape MP4 with waveform/spectrum visuals and transcript (requires `ffmpeg` with `drawtext`).
## API Examples

### Custom Voice

```shell
curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o custom.wav \
  -d '{
    "mode": "custom_voice",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "language": "English",
    "speaker": "Ryan",
    "instruct": "Energetic podcast intro with a smile.",
    "text": "Welcome back to our weekend build session. Grab your coffee and let us ship!"
  }'
```
### Voice Design

```shell
curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o design.wav \
  -d '{
    "mode": "voice_design",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "language": "English",
    "instruct": "Late-night radio host, warm baritone, unhurried pace with soft consonants.",
    "text": "You are tuned to 88.5 FM. Outside the city is sleeping, but we are still here with you."
  }'
```
### Voice Clone

```shell
curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o clone.wav \
  -d '{
    "mode": "voice_clone",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "language": "English",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you.",
    "text": "This is a cloned voice reading a new paragraph. We can keep the tone calm and measured."
  }'
```
For quick experiments without a transcript, set `"x_vector_only_mode": true` and omit `ref_text` (quality may drop).
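The transcript rule can be captured in a small payload builder. This helper is illustrative and not part of the app; field names follow the API examples above:

```python
def build_clone_payload(text, ref_audio, ref_text=None, x_vector_only=False,
                        model_id="Qwen/Qwen3-TTS-12Hz-1.7B-Base", language="English"):
    """Build a /api/tts voice-clone payload, enforcing the transcript rule:
    ref_text is required unless x_vector_only_mode is enabled."""
    if ref_text is None and not x_vector_only:
        raise ValueError("ref_text is required unless x_vector_only_mode is true")
    payload = {
        "mode": "voice_clone",
        "model_id": model_id,
        "language": language,
        "ref_audio": ref_audio,
        "text": text,
    }
    if x_vector_only:
        payload["x_vector_only_mode"] = True
    else:
        payload["ref_text"] = ref_text
    return payload
```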
### Save a voice profile (reuse clone prompt)

```shell
curl -X POST http://localhost:8000/api/voice_profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my_radio_host",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
  }'
```
Then synthesize with that cached prompt:

```shell
curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o clone_with_profile.wav \
  -d '{
    "mode": "voice_clone",
    "voice_profile": "my_radio_host",
    "text": "We can keep reusing this voice without re-uploading audio.",
    "language": "English"
  }'
```
## Voice Design → Clone Reuse

- Use the Voice Design model to synthesize a short clip with the desired persona.
- Feed that clip and its text back as `ref_audio`/`ref_text` with `mode: "voice_clone"` using the Base model.

This keeps a consistent designed voice for longer scripts.
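The two steps above can be sketched as a small pipeline. The payload fields follow the API examples in this README; the `post` transport itself is a hypothetical callable (e.g. a thin wrapper around any HTTP client) injected so the flow stays client-agnostic:

```python
import base64

def design_then_clone(post, persona, seed_text, long_text, language="English"):
    """Synthesize a short designed clip, then reuse it as a clone reference.

    `post` is any callable (path, payload_dict) -> wav_bytes, e.g. a thin
    wrapper around an HTTP POST against this app's /api/tts endpoint.
    """
    # Step 1: short clip in the designed voice.
    seed_wav = post("/api/tts", {
        "mode": "voice_design",
        "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
        "language": language,
        "instruct": persona,
        "text": seed_text,
    })
    # Step 2: feed the clip back as a base64 data URI, with its own text as ref_text.
    return post("/api/tts", {
        "mode": "voice_clone",
        "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        "language": language,
        "ref_audio": "data:audio/wav;base64," + base64.b64encode(seed_wav).decode("ascii"),
        "ref_text": seed_text,
        "text": long_text,
    })
```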
## Frontend

The UI exposes the same options: pick a mode, enter a model id/path, language, speaker (Custom Voice), style (Voice Design), or reference audio/transcript (Voice Clone). It streams back a WAV, plays it inline, and offers a download link.
## Notes

- GPU + bfloat16/float16 greatly reduces latency and memory use; CPU runs will be slow.
- Reference audio can be a public URL, local path, or base64 data URI. Keep it clean and roughly 3–10 s for best cloning results.
- The page pulls a Google Font; remove the `<link>` in frontend/index.html if you need offline-only assets.
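For the base64 option, a local file can be wrapped into a data URI like this (a small helper sketch, not part of the app):

```python
import base64
import pathlib

def wav_to_data_uri(path) -> str:
    """Encode a local WAV file as a base64 data URI usable as ref_audio."""
    data = pathlib.Path(path).read_bytes()
    return "data:audio/wav;base64," + base64.b64encode(data).decode("ascii")
```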
## File details

Details for the file qwen_tts_webui-1.0.2.tar.gz.

### File metadata

- Download URL: qwen_tts_webui-1.0.2.tar.gz
- Size: 62.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7f9f808fe5f302edb453e5d89ba08836cda3956277b71da0eb7be99c6a641e36` |
| MD5 | `2401a3d1b0067cb6f7aeb9189c23bdfa` |
| BLAKE2b-256 | `e00c9a4d37dc6e75b58c28a7644538d3d44e5c48bc027a15712bd77ec52b7d04` |
### Provenance

The following attestation bundles were made for qwen_tts_webui-1.0.2.tar.gz:

Publisher: pypi-publish.yml on h1ddenpr0cess20/qwen-tts-webui

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: qwen_tts_webui-1.0.2.tar.gz
- Subject digest: `7f9f808fe5f302edb453e5d89ba08836cda3956277b71da0eb7be99c6a641e36`
- Sigstore transparency entry: 1109248506
- Permalink: h1ddenpr0cess20/qwen-tts-webui@7e5e1f1dc8b8827007e5bcb877643ff43f731c0c
- Branch / Tag: refs/tags/v1.0.2
- Owner: https://github.com/h1ddenpr0cess20
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@7e5e1f1dc8b8827007e5bcb877643ff43f731c0c
- Trigger Event: release
## File details

Details for the file qwen_tts_webui-1.0.2-py3-none-any.whl.

### File metadata

- Download URL: qwen_tts_webui-1.0.2-py3-none-any.whl
- Size: 40.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9631611e64df471833fa001bdad9d7ff005ba184d18d489917358309d4dd86d9` |
| MD5 | `85ec124bbcbddb7cae049567a44e6168` |
| BLAKE2b-256 | `2025a7f02847d28dbb88170e4321c63b06a0e545391d07d1d26f0b68594820fb` |
### Provenance

The following attestation bundles were made for qwen_tts_webui-1.0.2-py3-none-any.whl:

Publisher: pypi-publish.yml on h1ddenpr0cess20/qwen-tts-webui

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: qwen_tts_webui-1.0.2-py3-none-any.whl
- Subject digest: `9631611e64df471833fa001bdad9d7ff005ba184d18d489917358309d4dd86d9`
- Sigstore transparency entry: 1109248514
- Permalink: h1ddenpr0cess20/qwen-tts-webui@7e5e1f1dc8b8827007e5bcb877643ff43f731c0c
- Branch / Tag: refs/tags/v1.0.2
- Owner: https://github.com/h1ddenpr0cess20
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@7e5e1f1dc8b8827007e5bcb877643ff43f731c0c
- Trigger Event: release