
Faster Qwen3-TTS

Real-time Qwen3-TTS inference using CUDA graph capture. No Flash Attention, no vLLM, no Triton. Just torch.cuda.CUDAGraph. Supports both streaming and non-streaming generation.

Results

Benchmarks include tokenization + inference (apples-to-apples with baseline). RTF > 1.0 = faster than real-time. TTFA measured as time to first playable audio chunk using streaming (chunk_size=8).
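For reference, RTF here is seconds of audio produced per second of wall-clock time. A quick sanity check (the helper below is illustrative, not part of the package):

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio generated per second of wall time."""
    return audio_seconds / wall_seconds

# 10 s of audio synthesized in 1.8 s of wall time is comfortably real-time:
print(round(rtf(10.0, 1.8), 2))  # 5.56
```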

CustomVoice Models

CustomVoice uses predefined speaker IDs (no reference audio). These benchmarks use the first available speaker ID from the model.

| Model | CUDA Graphs RTF | CUDA Graphs TTFA |
|---|---|---|
| 0.6B CustomVoice | 5.53 | 154ms |
| 1.7B CustomVoice | 4.78 | 171ms |

0.6B Model

| GPU | Baseline RTF | Baseline TTFA | CUDA Graphs RTF | CUDA Graphs TTFA | Speedup (RTF / TTFA) |
|---|---|---|---|---|---|
| Jetson AGX Orin 64GB | 0.175 | 2,572ms | 1.57 | 556ms | 9.0x / 4.6x |
| Jetson Thor | 0.803 | 862ms | 1.50 | 505ms | 1.9x / 1.7x |
| DGX Spark (GB10) | 1.19 | 631ms | 2.26 | 364ms | 1.9x / 1.7x |
| RTX 4090 | 1.34 | 462ms | 5.56 | 152ms | 4.1x / 3.0x |
| H100 80GB HBM3 | 0.59 | 1,049ms | 4.19 | 224ms | 7.1x / 4.7x |

1.7B Model

| GPU | Baseline RTF | Baseline TTFA | CUDA Graphs RTF | CUDA Graphs TTFA | Speedup (RTF / TTFA) |
|---|---|---|---|---|---|
| Jetson AGX Orin 64GB | 0.130 | 2,594ms | 1.27 | 650ms | 9.8x / 4.0x |
| Jetson Thor | 0.772 | 912ms | 1.26 | 595ms | 1.6x / 1.5x |
| DGX Spark (GB10) | 0.975 | 749ms | 1.66 | 464ms | 1.7x / 1.6x |
| RTX 4090 | 1.32 | 468ms | 4.85 | 170ms | 3.7x / 2.8x |
| H100 80GB HBM3 | 0.59 | 1,045ms | 3.98 | 236ms | 6.7x / 4.4x |

Note: Baseline TTFA values are streaming TTFA from the community Qwen3-TTS-streaming fork (which adds streaming support). The official Qwen3-TTS repo does not currently support streaming, so its “TTFA” is effectively time-to-full-audio: with RTF near 1.0, you wait for the entire sentence or paragraph to be synthesized before hearing anything. The CUDA graphs numbers use generate_voice_clone_streaming(chunk_size=8) for TTFA. Both measurements include text tokenization for a fair comparison. The Speedup column shows throughput improvement / TTFA improvement. The streaming fork reports additional speedups that appear tied to torch.compile; we could not reproduce those on Jetson-class devices, where torch.compile is unavailable.

GPU architecture notes: RTX 4090 (2.5 GHz clocks) outperforms H100 (1.8 GHz) for single-stream workloads. H100's lower baseline (RTF 0.59 vs 4090's 1.34) reflects design optimization for batch processing rather than single-stream inference.

Quick Start

git clone https://github.com/andimarafioti/faster-qwen3-tts
cd faster-qwen3-tts
./setup.sh       # creates venv with uv, installs deps, downloads models
./benchmark.sh   # runs full benchmark, saves JSON + audio samples

Requires: Python 3.10+, NVIDIA GPU with CUDA, uv.

Install (PyPI)

pip install faster-qwen3-tts

Note: this pulls the qwen-tts dependency from its official GitHub repo, since qwen-tts is not yet published on PyPI.

Install from source:

pip install -e .

Benchmark a specific model

./benchmark.sh 0.6B
./benchmark.sh 1.7B
./benchmark.sh both   # default

Results are saved as bench_results_<GPU_NAME>.json and audio samples as sample_0.6B.wav / sample_1.7B.wav.

How It Works

Qwen3-TTS runs two autoregressive transformers per decode step:

  1. Talker (28 layers): generates the first codebook token from text
  2. Code Predictor (5 layers): generates 15 additional codebook tokens

A single step involves ~500 small CUDA kernel launches with Python overhead between them. The GPU spends more time waiting for the next kernel than computing.

CUDA graphs capture the entire decode step and replay it as a single GPU operation:

  1. Static KV cache: pre-allocated fixed-size tensors (no dynamic allocation)
  2. Model's own forward: SDPA + RoPE via the model's native attention layers
  3. Graph capture: torch.cuda.CUDAGraph for both predictor and talker
  4. Padded attention: attention mask handles variable-length KV within fixed buffers
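The capture-and-replay pattern can be sketched with plain torch.cuda.CUDAGraph. This is a minimal illustration of the technique, not the project's actual code; `model` and `static_input` stand in for a talker or predictor decode step and its pre-allocated input buffer:

```python
import torch

def make_graphed_step(model, static_input):
    """Capture one decode step into a CUDA graph; return a replay function.
    Assumes the step reads/writes only pre-allocated static buffers
    (e.g. a static KV cache), so each replay sees the updated contents."""
    # Warm up on a side stream so lazy allocations happen before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_output = model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)  # recorded, not executed eagerly

    def step(new_input):
        static_input.copy_(new_input)  # refresh inputs in place
        g.replay()                     # hundreds of kernels, one launch
        return static_output

    return step
```

The key constraint is that tensor addresses are frozen at capture time, which is why the static KV cache and fixed-size buffers in steps 1 and 4 are prerequisites for step 3.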

Per-component breakdown (Jetson AGX Orin, 0.6B)

| Component | Before | After |
|---|---|---|
| Talker (28 layers) | 75ms | 12ms |
| Predictor (15 steps) | 190ms | 26ms |
| Overhead | 65ms | 16ms |
| Total per step | 330ms | 54ms |
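These numbers are internally consistent with the streaming results: at the 12Hz codec rate, each decode step yields roughly 83ms of audio, so a back-of-envelope check gives:

```python
AUDIO_MS_PER_STEP = 1000.0 / 12  # 12Hz codec: one decode step ≈ 83 ms of audio

step_before, step_after = 330.0, 54.0  # ms per decode step, from the table above
print(round(step_before / step_after, 1))        # 6.1x per-step speedup
print(round(AUDIO_MS_PER_STEP / step_after, 2))  # RTF ≈ 1.54, in line with the ~1.5-1.6 benchmarked
```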

Streaming

CUDA graphs support streaming output — audio chunks are yielded during generation with the same per-step performance as non-streaming mode.

Chunk size vs performance (Jetson AGX Orin, 0.6B)

| chunk_size | TTFA | RTF | Audio per chunk |
|---|---|---|---|
| 1 | 240ms | 0.750 | 83ms |
| 2 | 266ms | 1.042 | 167ms |
| 4 | 362ms | 1.251 | 333ms |
| 8 | 556ms | 1.384 | 667ms |
| 12 | 753ms | 1.449 | 1000ms |
| Non-streaming | — | 1.36 | all at once |

Smaller chunks = lower latency but more decode overhead. chunk_size=2 is the smallest that stays real-time on Jetson.
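The "Audio per chunk" column follows directly from the 12Hz codec rate; a sketch of the arithmetic:

```python
CODEC_HZ = 12  # Qwen3-TTS 12Hz codec: one decode step = 1/12 s of audio

def audio_ms_per_chunk(chunk_size: int) -> float:
    """Milliseconds of audio produced by one chunk of decode steps."""
    return chunk_size * 1000.0 / CODEC_HZ

for n in (1, 2, 4, 8, 12):
    print(n, round(audio_ms_per_chunk(n)))  # 83, 167, 333, 667, 1000 — matches the table
```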

Usage

from faster_qwen3_tts import FasterQwen3TTS

model = FasterQwen3TTS.from_pretrained("Qwen/Qwen3-TTS-12Hz-0.6B-Base")

# Streaming — yields audio chunks during generation
for audio_chunk, sr, timing in model.generate_voice_clone_streaming(
    text="Hello world!", language="English",
    ref_audio="ref.wav", ref_text="...",
    chunk_size=8,  # 8 steps ≈ 667ms of audio per chunk
):
    play(audio_chunk, sr)  # process/send each chunk immediately

# Non-streaming — returns all audio at once (unchanged API)
audio_list, sr = model.generate_voice_clone(
    text="Hello world!", language="English",
    ref_audio="ref.wav", ref_text="...",
)

CLI

Voice cloning (reference audio):

faster-qwen3-tts clone \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --text "Hello world!" \
  --language English \
  --ref-audio ref.wav \
  --ref-text "Reference transcript" \
  --output out.wav

CustomVoice (predefined speaker IDs):

faster-qwen3-tts custom --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --list-speakers
faster-qwen3-tts custom \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --speaker aiden \
  --text "Hello!" \
  --language English \
  --output out.wav

VoiceDesign (instruction-based):

faster-qwen3-tts design \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --instruct "Warm, confident narrator with slight British accent" \
  --text "Welcome to the show." \
  --language English \
  --output out.wav

Streaming (prints RTF after write):

faster-qwen3-tts custom \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --speaker aiden \
  --text "Hello!" \
  --language English \
  --output out.wav \
  --streaming

Server mode (keep model hot, stop with exit):

faster-qwen3-tts serve \
  --mode custom \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --speaker aiden \
  --language English \
  --streaming

How streaming works

The CUDA graphs are unchanged — both predictor and talker graphs are replayed per step. The streaming generator yields codec ID chunks every chunk_size steps, and the model wrapper decodes each chunk to audio using a sliding window with 25-frame left context (matching the upstream codec's chunked_decode pattern) to avoid boundary artifacts.
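The sliding-window indexing can be illustrated as follows. This is a simplified sketch of the idea (a constant 25-frame left context with chunk-aligned windows), not the upstream chunked_decode implementation:

```python
LEFT_CONTEXT = 25  # codec frames of left context kept when decoding each chunk

def decode_windows(total_frames: int, chunk: int):
    """Yield (start, end, emit_from): decode frames [start, end) but keep
    only the audio for [emit_from, end), discarding the context prefix so
    chunk boundaries don't produce artifacts.
    (Sketch: assumes total_frames is a multiple of chunk.)"""
    for end in range(chunk, total_frames + 1, chunk):
        emit_from = end - chunk
        start = max(0, emit_from - LEFT_CONTEXT)
        yield (start, end, emit_from)

print(list(decode_windows(40, 8))[-1])  # (7, 40, 32): 25 frames of context before frame 32
```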

Voice Cloning: ICL Phoneme Artifact

In ICL (In-Context Learning) mode — the default voice cloning path — the model's prefill sequence ends with the last codec token of the reference audio. The model conditions its first generated token on whatever phoneme the reference audio happens to end on. If the reference ends mid-word or on a consonant cluster, that phoneme bleeds into the very start of the generated speech.

The fix is applied automatically. The wrapper appends 0.5 seconds of silence to the reference audio before encoding it. This ensures the last codec tokens in the prefill represent silence, giving the model a clean starting point regardless of how the reference recording ends — no changes to your calling code required.
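The padding itself is simple; a minimal NumPy sketch of the idea (the function name is illustrative, not the package's API):

```python
import numpy as np

def pad_ref_with_silence(wav: np.ndarray, sr: int, pad_seconds: float = 0.5) -> np.ndarray:
    """Append trailing silence so the reference's final codec tokens
    encode silence instead of whatever phoneme the recording ends on."""
    silence = np.zeros(int(sr * pad_seconds), dtype=wav.dtype)
    return np.concatenate([wav, silence])

wav = np.random.randn(16000).astype(np.float32)  # 1 s of audio at 16 kHz
padded = pad_ref_with_silence(wav, sr=16000)
print(padded.shape[0])  # 24000 samples = 1.5 s
```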

Voice Cloning with Precomputed Speaker Embeddings

For production use, extract the speaker embedding once and reuse it:

# 1. Extract speaker embedding from reference audio (one-time, ~10s)
python examples/extract_speaker.py --ref_audio voice.wav --output speaker.pt

# 2. Generate speech with CUDA graphs (real-time)
python examples/generate_with_embedding.py --speaker speaker.pt --text "Hello!" --language English --output en.wav
python examples/generate_with_embedding.py --speaker speaker.pt --text "Bonjour!" --language French --output fr.wav
python examples/generate_with_embedding.py --speaker speaker.pt --text "Hallo!" --language German --output de.wav

The speaker embedding is a 4KB file (2048-dim bf16 vector). In x_vector_only mode:

  • No accent bleed: native pronunciation per language
  • Shorter prefill: 10 tokens vs ~80+ in full ICL clone mode
  • No ref audio at runtime: just the 4KB embedding file
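The 4KB figure checks out: 2048 bfloat16 values at 2 bytes each. A sketch of saving and reloading such an embedding with torch (illustrative stand-in values; extract_speaker.py handles the actual extraction):

```python
import torch

emb = torch.zeros(2048, dtype=torch.bfloat16)  # x-vector speaker embedding (stand-in values)
print(emb.numel() * emb.element_size())        # 4096 bytes ≈ 4 KB on disk (plus container overhead)

torch.save(emb, "speaker.pt")                  # one-time extraction step
loaded = torch.load("speaker.pt")              # reuse at generation time
assert loaded.shape == (2048,) and loaded.dtype == torch.bfloat16
```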

Files

faster_qwen3_tts/
  model.py                        # Wrapper API
  generate.py                     # Non-streaming generation loop
  streaming.py                    # Streaming generation loop
  predictor_graph.py              # Predictor CUDA graph with StaticCache
  talker_graph.py                 # Talker CUDA graph with StaticCache
examples/
  extract_speaker.py              # Extract speaker embedding from ref audio
  generate_with_embedding.py      # Generate with precomputed speaker embedding
benchmarks/
  throughput.py                   # Throughput benchmark (RTF + audio samples)
  streaming.py                    # Streaming benchmark (TTFA + chunk timing)
  chunk_sweep.py                  # Chunk size sweep (RTF vs latency tradeoff)
  baseline.py                     # Baseline qwen-tts benchmark
  custom_voice.py                 # CustomVoice benchmark
benchmark.sh                      # Run benchmarks
setup.sh                          # Setup venv + download models

License

MIT

Acknowledgments
