FastAPI web UI for Qwen3-TTS: custom voices, voice design, voice cloning, and per-request model selection.

Qwen3 TTS Web App

A FastAPI + vanilla JS UI to run Qwen3-TTS locally: custom voices, voice design, voice cloning, and per-request model selection.

Documentation

Prerequisites

  • Python 3.10+ with a GPU-enabled PyTorch build (GPU strongly recommended).
  • Disk/bandwidth for model downloads (several GB on first load).
  • Optional: FlashAttention 2 if your GPU supports it (pip install -U flash-attn --no-build-isolation).

Setup

pip install -r requirements.txt

If your machine cannot download weights at runtime, pre-download a model (e.g. huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice) and set QWEN_TTS_MODEL to that path.
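The same pre-download can be scripted from Python. This is a minimal sketch, assuming the huggingface_hub package is installed; local_dir_for and predownload are hypothetical helper names, not part of this project.

```python
import os
from pathlib import Path


def local_dir_for(repo_id: str, root: str = ".") -> Path:
    """Map 'Qwen/<name>' to '<root>/<name>', mirroring the --local-dir used above."""
    return Path(root) / repo_id.split("/")[-1]


def predownload(repo_id: str, root: str = ".") -> Path:
    """Download a model snapshot and point QWEN_TTS_MODEL at the local copy."""
    # Imported lazily so local_dir_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    # This only affects the current process and its children; export the
    # variable in your shell or container for the server to pick it up.
    os.environ["QWEN_TTS_MODEL"] = str(target)
    return target
```

Calling predownload("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice") needs network access once; later runs can then use the local directory.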

Run

uvicorn app.main:app --reload --port 8000

Open http://localhost:8000 for the UI. API endpoints live under /api/*.

Docker

Build + run (CPU)

docker build -t qwen-tts .
docker run --rm -p 8000:8000 qwen-tts

Build + run (GPU)

Requires NVIDIA Container Toolkit and a CUDA-capable host.

docker build -t qwen-tts .
docker run --rm --gpus all -e QWEN_TTS_DEVICE=cuda:0 -p 8000:8000 qwen-tts

Docker Compose

docker compose up --build

Compose defaults to GPU (QWEN_TTS_DEVICE=cuda:0). For CPU-only, set QWEN_TTS_DEVICE=cpu in docker-compose.yml.

Configuration (env vars)

  • QWEN_TTS_MODEL — default model id or local path (default: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice).
  • QWEN_TTS_DEVICE — device map (default: cuda:0 if available, else cpu).
  • QWEN_TTS_USE_FLASH — set to 1 to try FlashAttention 2.
  • QWEN_TTS_CUSTOM_MODEL — override default for Custom Voice mode (else uses QWEN_TTS_MODEL).
  • QWEN_TTS_VD_MODEL — override default for Voice Design mode (default: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign).
  • QWEN_TTS_CLONE_MODEL — override default for Voice Clone mode (default: Qwen/Qwen3-TTS-12Hz-1.7B-Base).
  • QWEN_TTS_VIDEO_FONT — full path to a font file for video transcript rendering (useful for CJK/foreign text).

Each request can override model_id and device. The UI auto-selects the recommended model for each mode, following the upstream README.
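To make the precedence concrete, here is a sketch of how the lookup could work: per-request value first, then the mode-specific variable, then the documented default. The function names and the exact ordering are assumptions for illustration; the app's real resolution logic may differ.

```python
import os
from typing import Mapping, Optional

# Defaults as listed in the configuration section.
DEFAULT_MODEL = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"
MODE_ENV = {
    # mode: (env var, built-in default or None to fall back to QWEN_TTS_MODEL)
    "custom_voice": ("QWEN_TTS_CUSTOM_MODEL", None),
    "voice_design": ("QWEN_TTS_VD_MODEL", "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"),
    "voice_clone": ("QWEN_TTS_CLONE_MODEL", "Qwen/Qwen3-TTS-12Hz-1.7B-Base"),
}


def resolve_model(mode: str, request_model_id: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: pick the model id for one request."""
    env = os.environ if env is None else env
    if request_model_id:  # a per-request override wins
        return request_model_id
    var, default = MODE_ENV[mode]
    if env.get(var):  # then the mode-specific env var
        return env[var]
    if default is not None:  # then the documented per-mode default
        return default
    return env.get("QWEN_TTS_MODEL") or DEFAULT_MODEL


def resolve_device(request_device: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None,
                   cuda_available: bool = False) -> str:
    """Hypothetical helper: request value, then QWEN_TTS_DEVICE, then auto-detect."""
    env = os.environ if env is None else env
    return request_device or env.get("QWEN_TTS_DEVICE") or (
        "cuda:0" if cuda_available else "cpu")
```

In a real app, cuda_available would typically come from torch.cuda.is_available(); it is a plain parameter here so the sketch runs without PyTorch.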

Model quick reference (from upstream README)

  • Custom Voice: Qwen/Qwen3-TTS-12Hz-{0.6B,1.7B}-CustomVoice (speaker list included).
  • Voice Design: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign (describe persona; no speaker list).
  • Voice Clone: Qwen/Qwen3-TTS-12Hz-{0.6B,1.7B}-Base (provide ref audio + transcript).
  • Tokenizer (encode/decode only): Qwen/Qwen3-TTS-Tokenizer-12Hz.
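The {0.6B,1.7B} notation above is shell-style brace shorthand for two separate model ids. A small sketch that expands it into the concrete ids you would pass as model_id (expand_braces is a hypothetical helper written for this example):

```python
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand shell-style {a,b} alternatives into concrete strings."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for option in match.group(1).split(","):
        # Recurse so patterns with more than one brace group also work.
        results.extend(expand_braces(head + option + tail))
    return results


# Model families from the quick reference, keyed by the API "mode" value.
MODELS_BY_MODE = {
    "custom_voice": expand_braces("Qwen/Qwen3-TTS-12Hz-{0.6B,1.7B}-CustomVoice"),
    "voice_design": expand_braces("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"),
    "voice_clone": expand_braces("Qwen/Qwen3-TTS-12Hz-{0.6B,1.7B}-Base"),
}
```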

Features

  • Custom Voice: pick a provided speaker, language, and optional style prompt.
  • Voice Design: describe a persona and language; the model invents the voice.
  • Voice Clone: supply a reference audio (URL/path/base64) plus transcript to clone a voice.
  • Model selection: choose any released model id or local directory per request.
  • UI: shows available speakers/languages, plays inline, and offers WAV download.
  • Recording/Upload for cloning: record in-browser or upload; the UI converts to WAV before sending.
  • Saved voices: build a reusable voice profile (clone prompt) once and reuse it without re-uploading audio.
  • MP3 download: generation stays WAV; pick MP3 in the UI to convert the generated clip on demand (requires pydub and ffmpeg).
  • Video export: render a vertical/square/landscape MP4 with waveform/spectrum visuals and transcript (requires ffmpeg with drawtext).
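Since MP3 download and video export depend on ffmpeg, and video export on its drawtext filter, it can help to check the local build up front. The sketch below parses the output of ffmpeg -filters; both function names are hypothetical, and this is not how the app itself performs the check.

```python
import shutil
import subprocess


def has_filter(filters_output: str, name: str) -> bool:
    """Return True if `name` is listed as a filter in `ffmpeg -filters` output."""
    for line in filters_output.splitlines():
        fields = line.split()
        # Filter rows look like: " T.C drawtext  V->V  Draw text on top of ..."
        if len(fields) >= 2 and fields[1] == name:
            return True
    return False


def ffmpeg_supports_drawtext() -> bool:
    """Check whether an ffmpeg on PATH was built with the drawtext filter."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return False
    proc = subprocess.run([exe, "-hide_banner", "-filters"],
                          capture_output=True, text=True, check=False)
    return has_filter(proc.stdout, "drawtext")
```

If ffmpeg_supports_drawtext() returns False, the transcript overlay in exported videos is the feature most likely to fail.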

API Examples

Custom Voice

curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o custom.wav \
  -d '{
    "mode": "custom_voice",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "language": "English",
    "speaker": "Ryan",
    "instruct": "Energetic podcast intro with a smile.",
    "text": "Welcome back to our weekend build session. Grab your coffee and let us ship!"
  }'
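The same call can be made from Python with only the standard library. This sketch assumes the server is running on localhost:8000 and that, as with curl -o above, the response body is the WAV file; build_tts_request and synthesize_to_file are hypothetical helper names.

```python
import json
import urllib.request


def build_tts_request(payload: dict,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /api/tts request with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/api/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(payload: dict, out_path: str,
                       base_url: str = "http://localhost:8000") -> None:
    """Send the request and write the response bytes to out_path."""
    request = build_tts_request(payload, base_url)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as fh:
        fh.write(response.read())


custom_voice_payload = {
    "mode": "custom_voice",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "language": "English",
    "speaker": "Ryan",
    "instruct": "Energetic podcast intro with a smile.",
    "text": "Welcome back to our weekend build session.",
}
# With the server running: synthesize_to_file(custom_voice_payload, "custom.wav")
```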

Voice Design

curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o design.wav \
  -d '{
    "mode": "voice_design",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "language": "English",
    "instruct": "Late-night radio host, warm baritone, unhurried pace with soft consonants.",
    "text": "You are tuned to 88.5 FM. Outside the city is sleeping, but we are still here with you."
  }'

Voice Clone

curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o clone.wav \
  -d '{
    "mode": "voice_clone",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "language": "English",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you.",
    "text": "This is a cloned voice reading a new paragraph. We can keep the tone calm and measured."
  }'

For quick experiments without a transcript, set "x_vector_only_mode": true and omit ref_text (quality may drop).
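A small payload builder can encode the choice between the two cloning modes: send ref_text when a transcript exists, otherwise fall back to x_vector_only_mode. The field names come from the examples above; clone_payload itself is a hypothetical helper, and treating a missing transcript as an automatic fallback is a choice made for this sketch.

```python
from typing import Optional


def clone_payload(text: str, ref_audio: str, ref_text: Optional[str] = None,
                  language: str = "English",
                  model_id: Optional[str] = None) -> dict:
    """Build a voice_clone request body for POST /api/tts."""
    payload = {
        "mode": "voice_clone",
        "language": language,
        "ref_audio": ref_audio,
        "text": text,
    }
    if model_id:
        payload["model_id"] = model_id
    if ref_text:
        # Transcript available: the higher-quality path.
        payload["ref_text"] = ref_text
    else:
        # No transcript: quick-experiment mode, quality may drop.
        payload["x_vector_only_mode"] = True
    return payload
```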

Save a voice profile (reuse clone prompt)

curl -X POST http://localhost:8000/api/voice_profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my_radio_host",
    "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
  }'

Then synthesize with that cached prompt:

curl -X POST http://localhost:8000/api/tts \
  -H "Content-Type: application/json" \
  -o clone_with_profile.wav \
  -d '{
    "mode": "voice_clone",
    "voice_profile": "my_radio_host",
    "text": "We can keep reusing this voice without re-uploading audio.",
    "language": "English"
  }'

Voice Design → Clone Reuse

  1. Use the Voice Design model to synthesize a short clip with the desired persona.
  2. Feed that clip and its text back as ref_audio/ref_text with mode: "voice_clone" using the Base model.
     This keeps the designed voice consistent across longer scripts.
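The two steps can be chained in a script. In this sketch, post is any callable you supply that sends a JSON payload to /api/tts and returns the WAV bytes; design_then_clone is a hypothetical helper, and the data:audio/wav;base64, prefix is an assumption about the data URI form the server accepts.

```python
import base64
from typing import Callable, List

Post = Callable[[dict], bytes]  # payload in, WAV bytes out


def design_then_clone(post: Post, persona: str, seed_text: str,
                      script_lines: List[str],
                      language: str = "English") -> List[bytes]:
    """Design a voice once, then reuse it as the clone reference for each line."""
    # Step 1: synthesize a short clip with the Voice Design model.
    seed_wav = post({
        "mode": "voice_design",
        "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
        "language": language,
        "instruct": persona,
        "text": seed_text,
    })
    # Step 2: feed the clip and its text back as the clone reference.
    ref_audio = "data:audio/wav;base64," + base64.b64encode(seed_wav).decode("ascii")
    clips = []
    for line in script_lines:
        clips.append(post({
            "mode": "voice_clone",
            "model_id": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
            "language": language,
            "ref_audio": ref_audio,
            "ref_text": seed_text,
            "text": line,
        }))
    return clips
```

For long scripts, saving the seed clip as a voice profile avoids resending the base64 audio with every request.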

Frontend

The UI exposes the same options: pick mode, enter model id/path, language, speaker (custom voice), style (voice design), or ref audio/transcript (voice clone). It streams back a WAV, plays inline, and offers a download link.

Notes

  • Running on a GPU with bfloat16/float16 greatly reduces latency and memory use; CPU runs will be slow.
  • Reference audio can be a public URL, local path, or base64 data URI. Keep it clean and ~3–10s for best cloning.
  • The page pulls a Google Font; remove the <link> in frontend/index.html if you need offline-only assets.
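To check a reference clip against the ~3–10s guideline before uploading, the standard-library wave module is enough for uncompressed PCM WAV files. Both helpers below are hypothetical, and the bounds are only the rule of thumb from the note above, not limits enforced by the server.

```python
import io
import wave


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of PCM WAV data in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_good_reference_length(data: bytes, min_s: float = 3.0,
                             max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the suggested range."""
    return min_s <= wav_duration_seconds(data) <= max_s
```

Other formats (MP3, compressed WAV) are not readable by the wave module and would need converting first.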
