Real-time Qwen3-TTS inference using manual CUDA graph capture

Faster Qwen3-TTS

Real-time Qwen3-TTS inference using CUDA graph capture. No Flash Attention, no vLLM, no Triton. Just torch.cuda.CUDAGraph. Supports both streaming and non-streaming generation.

Results

Benchmarks include tokenization + inference (apples-to-apples with baseline). RTF > 1.0 = faster than real-time. TTFA measured as time to first playable audio chunk using streaming (chunk_size=8).
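
As a rough illustration of how these two metrics can be measured around any streaming generator, here is a minimal sketch. It is not the project's benchmark code: measure_stream and fake_stream are hypothetical helpers, and the fake generator only simulates decode time with sleep. RTF follows the convention used in this README (seconds of audio produced per second of wall-clock time).

```python
import time

def measure_stream(chunks, sample_rate):
    """Consume an iterable of sample chunks; return (ttfa_seconds, rtf)."""
    # Generators are lazy, so the work starts at the first next() call below.
    start = time.perf_counter()
    ttfa = None
    total_samples = 0
    for samples in chunks:
        if ttfa is None:
            ttfa = time.perf_counter() - start  # time to first playable chunk
        total_samples += len(samples)
    elapsed = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    rtf = audio_seconds / elapsed  # > 1.0 means faster than real time
    return ttfa, rtf

def fake_stream(n_chunks, chunk_samples, delay_s):
    """Hypothetical stand-in for a TTS stream: sleeps, then yields silence."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield [0.0] * chunk_samples

# Three chunks of 0.1 s audio each, with about 10 ms of simulated work per chunk.
ttfa, rtf = measure_stream(fake_stream(3, 2400, 0.01), sample_rate=24000)
print(f"TTFA: {ttfa * 1000:.0f} ms, RTF: {rtf:.2f}")
```

To match the methodology above, the timer should start before tokenization, so the real generator call belongs inside the timed region.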

0.6B Model

GPU	Baseline RTF	Baseline TTFA	CUDA Graphs RTF	CUDA Graphs TTFA	Speedup (RTF / TTFA)
Jetson AGX Orin 64GB	0.175	2,572ms	1.57	556ms	9.0x / 4.6x
Jetson Thor	0.803	862ms	1.50	505ms	1.9x / 1.7x
DGX Spark (GB10)	1.19	631ms	2.26	364ms	1.9x / 1.7x
RTX 4090	1.34	462ms	5.56	152ms	4.1x / 3.0x
H100 80GB HBM3	0.59	1,049ms	4.19	224ms	7.1x / 4.7x

1.7B Model

GPU	Baseline RTF	Baseline TTFA	CUDA Graphs RTF	CUDA Graphs TTFA	Speedup (RTF / TTFA)
Jetson AGX Orin 64GB	0.130	2,594ms	1.27	650ms	9.8x / 4.0x
Jetson Thor	0.772	912ms	1.26	595ms	1.6x / 1.5x
DGX Spark (GB10)	0.975	749ms	1.66	464ms	1.7x / 1.6x
RTX 4090	1.32	468ms	4.85	170ms	3.7x / 2.8x
H100 80GB HBM3	0.59	1,045ms	3.98	236ms	6.7x / 4.4x

Note: Baseline TTFA values are streaming TTFA from the community Qwen3-TTS-streaming fork, which adds streaming support. The official Qwen3-TTS repo does not currently support streaming, so its “TTFA” is effectively time to full audio: with RTF near 1.0, you wait about as long as the whole sentence or paragraph takes to speak before you hear anything. The CUDA graphs numbers use generate_voice_clone_streaming(chunk_size=8) for TTFA. Both sides include text tokenization for a fair comparison. The Speedup column shows the throughput (RTF) improvement followed by the TTFA improvement. The streaming fork reports additional speedups that appear tied to torch.compile; we couldn’t reproduce those on Jetson-class devices, where torch.compile isn’t available.

GPU architecture notes: The RTX 4090 (2.5 GHz clocks) outperforms the H100 (1.8 GHz) on single-stream workloads. The H100's lower baseline (RTF 0.59 vs. the 4090's 1.34) reflects a design optimized for batch processing rather than single-stream inference.

Demo UI

A minimal web UI that streams audio in real time and shows TTFA and RTF live:

pip install -e ".[demo]"
python demo/server.py --model Qwen/Qwen3-TTS-12Hz-0.6B-Base
# open http://localhost:7860

Features: voice clone (upload any WAV), voice design (1.7B-VoiceDesign model), streaming/non-streaming toggle, adjustable chunk size, live TTFA/RTF metrics, WAV download.

Quick Start

git clone https://github.com/andimarafioti/faster-qwen3-tts
cd faster-qwen3-tts
./setup.sh       # creates venv with uv, installs deps, downloads models
./benchmark.sh   # runs full benchmark, saves JSON + audio samples

Requires: Python 3.10+, NVIDIA GPU with CUDA, uv.

Install (PyPI)

pip install faster-qwen3-tts

Note: This also installs the qwen-tts PyPI package (>=0.1.1) as a dependency.

Install from source:

pip install -e .

Benchmark a specific model

./benchmark.sh 0.6B
./benchmark.sh 1.7B
./benchmark.sh both   # default

Results are saved as bench_results_<GPU_NAME>.json and audio samples as sample_0.6B.wav / sample_1.7B.wav.

How It Works

Qwen3-TTS runs two autoregressive transformers per decode step:

Talker (28 layers): generates the first codebook token from text
Code Predictor (5 layers): generates 15 additional codebook tokens

A single step involves ~500 small CUDA kernel launches with Python overhead between them. The GPU spends more time waiting for the next kernel than computing.

CUDA graphs capture the entire decode step and replay it as a single GPU operation:

Static KV cache: pre-allocated fixed-size tensors (no dynamic allocation)
Model's own forward: SDPA + RoPE via the model's native attention layers
Graph capture: torch.cuda.CUDAGraph for both predictor and talker
Padded attention: attention mask handles variable-length KV within fixed buffers
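
A captured CUDA graph replays the same kernels on the same tensor shapes and memory addresses, which is why the KV cache has a fixed size and a mask hides the slots that are not filled yet. The sketch below illustrates only that masking idea in plain Python; it is not the project's implementation. MAX_LEN and masked_attention_weights are illustrative names, and the real model does this on GPU tensors inside its own attention layers.

```python
import math

MAX_LEN = 8  # fixed buffer length, chosen only for illustration

def masked_attention_weights(scores, valid_len):
    """Softmax over a fixed-size score buffer, ignoring slots >= valid_len."""
    masked = [s if i < valid_len else float("-inf") for i, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Three valid positions; the remaining slots hold stale values that must not count.
scores = [0.5, 1.0, -0.3] + [9.9] * (MAX_LEN - 3)
weights = masked_attention_weights(scores, valid_len=3)
print([round(w, 3) for w in weights])
```

The buffer length never changes from step to step; only valid_len grows, so the same captured computation can be replayed for every decode step.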

Per-component breakdown (Jetson AGX Orin, 0.6B)

Component	Before (baseline)	After (CUDA graphs)
Talker (28 layers)	75ms	12ms
Predictor (15 steps)	190ms	26ms
Overhead	65ms	16ms
Total per step	330ms	54ms
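
The per-component speedups implied by this table can be recomputed with a few lines of Python. The numbers are copied from the table above; nothing here is newly measured.

```python
# (before_ms, after_ms) per decode step, copied from the table above.
breakdown_ms = {
    "Talker (28 layers)": (75, 12),
    "Predictor (15 steps)": (190, 26),
    "Overhead": (65, 16),
}

speedups = {name: before / after for name, (before, after) in breakdown_ms.items()}
total_before = sum(before for before, _ in breakdown_ms.values())
total_after = sum(after for _, after in breakdown_ms.values())
total_speedup = total_before / total_after

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
print(f"Total per step: {total_before}ms -> {total_after}ms ({total_speedup:.2f}x)")
```

The components sum to the totals in the last row (330ms and 54ms), so the table is internally consistent.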

Streaming

CUDA graphs support streaming output — audio chunks are yielded during generation with the same per-step performance as non-streaming mode.

Chunk size vs performance (Jetson AGX Orin, 0.6B)

chunk_size	TTFA	RTF	Audio per chunk
1	240ms	0.750	83ms
2	266ms	1.042	167ms
4	362ms	1.251	333ms
8	556ms	1.384	667ms
12	753ms	1.449	1000ms
Non-streaming	—	1.57	all at once

Smaller chunks give lower latency but more decode overhead per second of audio. chunk_size=2 is the smallest setting that stays real-time (RTF ≥ 1.0) on the Jetson AGX Orin.
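
The "Audio per chunk" column follows directly from the codec frame rate. The sketch below assumes 12 frames per second, which is inferred from the "12Hz" in the model name and from the 83ms figure for chunk_size=1; it then picks the smallest real-time chunk size from the measured rows.

```python
FRAME_RATE_HZ = 12  # assumption: inferred from the "12Hz" model name and the table

def audio_ms_per_chunk(chunk_size):
    """Audio duration covered by one chunk of chunk_size decode steps."""
    return chunk_size * 1000 / FRAME_RATE_HZ

# chunk_size -> (TTFA in ms, RTF), copied from the table above.
measured = {1: (240, 0.750), 2: (266, 1.042), 4: (362, 1.251),
            8: (556, 1.384), 12: (753, 1.449)}

realtime_sizes = [size for size, (_, rtf) in measured.items() if rtf >= 1.0]
smallest_realtime = min(realtime_sizes)

for size in measured:
    print(f"chunk_size={size}: {audio_ms_per_chunk(size):.0f} ms of audio per chunk")
print(f"smallest real-time chunk_size: {smallest_realtime}")
```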

Model mode parity: In hot-path (post CUDA-graph capture) runs, the different model modes are effectively the same speed. Use benchmarks/compare_modes.py to reproduce. Example on 0.6B, chunk_size=8:

Mode	TTFA (ms)	RTF	ms/step
VoiceClone xvec	152 ± 11	5.470 ± 0.032	15.2 ± 0.1
VoiceClone full ICL	149 ± 1	5.497 ± 0.026	15.2 ± 0.1
CustomVoice	148 ± 1	5.537 ± 0.020	15.0 ± 0.1
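
As a sanity check, the RTF column can be approximated from ms/step alone: each decode step produces one codec frame, so RTF is roughly the frame duration divided by the step time. The sketch assumes 12 frames per second of audio (inferred from the model name); the small gap to the reported values is expected because the reported RTF covers the whole run, not only the decode steps.

```python
FRAME_MS = 1000 / 12  # assumption: 12 codec frames per second of audio

# mode -> (ms per step, reported RTF), copied from the table above.
reported = {
    "VoiceClone xvec": (15.2, 5.470),
    "VoiceClone full ICL": (15.2, 5.497),
    "CustomVoice": (15.0, 5.537),
}

implied_rtf = {mode: FRAME_MS / ms_per_step
               for mode, (ms_per_step, _) in reported.items()}

for mode, (_, rtf) in reported.items():
    print(f"{mode}: implied RTF {implied_rtf[mode]:.2f}, reported {rtf:.2f}")
```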

Usage

from faster_qwen3_tts import FasterQwen3TTS

model = FasterQwen3TTS.from_pretrained("Qwen/Qwen3-TTS-12Hz-0.6B-Base")

# Streaming — yields audio chunks during generation
for audio_chunk, sr, timing in model.generate_voice_clone_streaming(
    text="Hello world!", language="English",
    ref_audio="ref.wav", ref_text="...",
    chunk_size=8,  # 8 steps ≈ 667ms of audio per chunk
):
    play(audio_chunk, sr)  # play() is a placeholder for your own code: process or send each chunk immediately

# Non-streaming — returns all audio at once (unchanged API)
audio_list, sr = model.generate_voice_clone(
    text="Hello world!", language="English",
    ref_audio="ref.wav", ref_text="...",
)
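
One possible way to consume the streaming generator is to append each chunk to a WAV file as it arrives. The sketch below uses only the standard library and makes assumptions that should be checked against the actual output: each chunk is taken to be a 1-D sequence of mono float samples in [-1, 1]. write_stream_to_wav and demo_chunks are hypothetical helpers, not part of the package.

```python
import os
import struct
import tempfile
import wave

def write_stream_to_wav(path, chunks):
    """Write (samples, sample_rate) pairs to a 16-bit mono WAV as they arrive."""
    wav = None
    try:
        for samples, sr in chunks:
            if wav is None:
                # Open lazily so the sample rate comes from the first chunk.
                wav = wave.open(path, "wb")
                wav.setnchannels(1)
                wav.setsampwidth(2)
                wav.setframerate(int(sr))
            # Clamp to [-1, 1] and convert to signed 16-bit PCM.
            pcm = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in samples]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    finally:
        if wav is not None:
            wav.close()

def demo_chunks():
    """Hypothetical stand-in for the streaming generator."""
    yield [0.0, 0.5, -0.5], 24000
    yield [1.0, -1.0], 24000

out_path = os.path.join(tempfile.gettempdir(), "stream_demo.wav")
write_stream_to_wav(out_path, demo_chunks())
with wave.open(out_path, "rb") as f:
    n_frames, rate = f.getnframes(), f.getframerate()
print(n_frames, rate)
```

With the real model, the generator's (audio_chunk, sr, timing) triples would need to be reduced to pairs first, for example with a generator expression that drops the timing element.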

CLI

Voice cloning (reference audio):

faster-qwen3-tts clone \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --text "Hello world!" \
  --language English \
  --ref-audio ref.wav \
  --ref-text "Reference transcript" \
  --output out.wav

CustomVoice (predefined speaker IDs):

faster-qwen3-tts custom --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --list-speakers
faster-qwen3-tts custom \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --speaker aiden \
  --text "Hello!" \
  --language English \
  --output out.wav

VoiceDesign (instruction-based):

faster-qwen3-tts design \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --instruct "Warm, confident narrator with slight British accent" \
  --text "Welcome to the show." \
  --language English \
  --output out.wav

Streaming (prints RTF after write):

faster-qwen3-tts custom \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --speaker aiden \
  --text "Hello!" \
  --language English \
  --output out.wav \
  --streaming

Server mode (keep model hot, stop with exit):

faster-qwen3-tts serve \
  --mode custom \
  --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --speaker aiden \
  --language English \
  --streaming

How streaming works

The CUDA graphs are unchanged — both predictor and talker graphs are replayed per step. The streaming generator yields codec ID chunks every chunk_size steps, and the model wrapper decodes each chunk to audio using a sliding window with 25-frame left context (matching the upstream codec's chunked_decode pattern) to avoid boundary artifacts.
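
The sliding-window idea can be sketched as follows. This is a toy model, not the wrapper's code: toy_decode stands in for the codec decoder, SAMPLES_PER_FRAME is an arbitrary small number, and the full frame list is known up front for simplicity, whereas the real generator produces frames incrementally. Only the 25-frame left context comes from the description above.

```python
LEFT_CONTEXT = 25      # frames of left context, as described above
SAMPLES_PER_FRAME = 4  # toy value; a real codec emits far more samples per frame

def toy_decode(frames):
    """Stand-in for the codec decoder: each frame becomes SAMPLES_PER_FRAME samples."""
    return [float(f) for f in frames for _ in range(SAMPLES_PER_FRAME)]

def decode_streaming(all_frames, chunk_size):
    """Yield audio for each new chunk, decoded together with earlier context frames."""
    for start in range(0, len(all_frames), chunk_size):
        end = min(start + chunk_size, len(all_frames))
        ctx_start = max(0, start - LEFT_CONTEXT)
        window_audio = toy_decode(all_frames[ctx_start:end])
        # Drop the samples that belong to context frames; they were emitted earlier.
        skip = (start - ctx_start) * SAMPLES_PER_FRAME
        yield window_audio[skip:]

frames = list(range(60))
streamed = [s for chunk in decode_streaming(frames, chunk_size=8) for s in chunk]
print(len(streamed), len(toy_decode(frames)))
```

In this toy version the streamed output is identical to a single full decode, because each frame decodes independently. A real decoder has receptive-field effects at window edges, which is what the left context is there to absorb.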

Voice Cloning: ICL Phoneme Artifact

In ICL (In-Context Learning) mode — the default voice cloning path — the model's prefill sequence ends with the last codec token of the reference audio. The model conditions its first generated token on whatever phoneme the reference audio happens to end on. If the reference ends mid-word or on a consonant cluster, that phoneme bleeds into the very start of the generated speech.

The fix is applied automatically. The wrapper appends 0.5 seconds of silence to the reference audio before encoding it. This ensures the last codec tokens in the prefill represent silence, giving the model a clean starting point regardless of how the reference recording ends — no changes to your calling code required.
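
Conceptually, the padding step looks like the sketch below. It is an illustration only: pad_with_silence is a hypothetical helper operating on a plain list of samples, and the 24000 Hz sample rate is an arbitrary example value. The wrapper's actual code may differ.

```python
def pad_with_silence(samples, sample_rate, seconds=0.5):
    """Return the samples followed by `seconds` of digital silence (zeros)."""
    n_pad = int(round(sample_rate * seconds))
    return list(samples) + [0.0] * n_pad

ref = [0.1, -0.2, 0.3]  # stand-in for reference audio samples
padded = pad_with_silence(ref, sample_rate=24000)
print(len(ref), len(padded))
```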

Voice Cloning with Precomputed Speaker Embeddings

For production use, extract the speaker embedding once and reuse it:

# 1. Extract speaker embedding from reference audio (one-time, ~10s)
python examples/extract_speaker.py --ref_audio voice.wav --output speaker.pt

# 2. Generate speech with CUDA graphs (real-time)
python examples/generate_with_embedding.py --speaker speaker.pt --text "Hello!" --language English --output en.wav
python examples/generate_with_embedding.py --speaker speaker.pt --text "Bonjour!" --language French --output fr.wav
python examples/generate_with_embedding.py --speaker speaker.pt --text "Hallo!" --language German --output de.wav

The speaker embedding is a 4KB file (2048-dim bf16 vector). In x_vector_only mode:

No accent bleed: native pronunciation per language
Shorter prefill: 10 tokens vs ~80+ in full ICL clone mode
No ref audio at runtime: just the 4KB embedding file
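
The 4KB figure follows from the vector size: bfloat16 uses 2 bytes per value. A quick check of the raw payload size (a saved .pt file adds some serialization overhead on top):

```python
EMBED_DIM = 2048    # embedding dimension, as stated above
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format

payload_bytes = EMBED_DIM * BYTES_PER_BF16
payload_kib = payload_bytes / 1024
print(payload_bytes, payload_kib)
```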

License

MIT

Acknowledgments

Qwen3-TTS by the Qwen team
Qwen3-TTS-streaming for ideas and code we adapted for streaming
nano-qwen3tts-vllm for inspiration on CUDA graph usage
NVIDIA for providing the Jetson AGX Orin board
