Real-time text-to-speech on Intel NPU/CPU via OpenVINO

BabelVox

Real-time text-to-speech on Intel NPU via OpenVINO. Runs Qwen3-TTS 0.6B inference on the Intel NPU (AI Boost), with the small code predictor on CPU, and reaches RTF=1.0x (real-time) speech synthesis on a Lunar Lake ultrabook.

No PyTorch at runtime. Dependencies: openvino, numpy, librosa, soundfile, scipy, transformers (tokenizer only).

Installation

pip install babelvox

Or from source:

git clone https://github.com/Djwarf/babelvox.git
cd babelvox
pip install -e .

Quick start

Models are downloaded automatically from HuggingFace on first run (~2.5 GB, cached for future use).

As a library

import soundfile as sf
from babelvox import BabelVox

# Models auto-download on first use
tts = BabelVox(device="NPU", precision="int8",
               use_cp_kv_cache=True, talker_buckets=[64, 128, 256])

# generate() returns the waveform and its sample rate
wav, sr = tts.generate("Don't panic.", language="English")

sf.write("output.wav", wav, sr)

From the command line

# Real-time on NPU with all optimizations (models auto-download)
babelvox \
  --device NPU \
  --int8 \
  --cp-kv-cache \
  --talker-buckets "64,128,256" \
  --text "Hello, this is real-time speech synthesis on an Intel NPU." \
  --output hello.wav

# CPU-only (no NPU required, ~1.1x RTF)
babelvox \
  --device CPU \
  --int8 \
  --cp-kv-cache \
  --text "Hello world" \
  --output hello.wav
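The choice between the two invocations above can also be made at runtime. The sketch below asks OpenVINO which devices it can see and falls back to CPU when no NPU is listed. `pick_device` and `detect_devices` are hypothetical helpers, not part of BabelVox; `openvino.Core().available_devices` is the standard OpenVINO Python property for listing devices.

```python
def pick_device(available):
    """Return "NPU" if one is listed, otherwise "CPU".

    Hypothetical helper for illustration; not part of the BabelVox API.
    """
    # Device names may carry a suffix such as "GPU.0", so compare the prefix.
    names = {name.split(".")[0] for name in available}
    return "NPU" if "NPU" in names else "CPU"


def detect_devices():
    """List the devices OpenVINO reports, or [] if OpenVINO is not installed."""
    try:
        import openvino as ov
    except ImportError:
        return []
    return list(ov.Core().available_devices)


# Usage: device = pick_device(detect_devices())
```

The returned string can be passed to `--device` on the command line or to the `device` argument shown in the library example.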

Model export (one-time setup)

BabelVox needs pre-exported OpenVINO IR models. The export scripts in tools/ require PyTorch and the original Qwen3-TTS model (~2.4 GB download):

pip install torch qwen-tts nncf

# Export OpenVINO IR models
python tools/export_tts_lm.py
python tools/export_speaker_encoder.py
python tools/export_decoder.py
python tools/export_tokenizer_encoder.py
python tools/export_cp_kvcache.py
python tools/export_weights.py

# Quantize to INT8 (recommended)
python tools/quantize_models.py --int8

After export, PyTorch is no longer needed.
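The export steps can also be driven from a single Python script so that a failing step stops the sequence early. This is a sketch only: the script names and the `--int8` flag are copied from the commands above, while `EXPORT_SCRIPTS`, `export_commands`, and `run_export` are hypothetical names, and it assumes it is run from the repository root.

```python
import subprocess
import sys

# Script names copied from the export commands above, in the same order.
EXPORT_SCRIPTS = [
    "tools/export_tts_lm.py",
    "tools/export_speaker_encoder.py",
    "tools/export_decoder.py",
    "tools/export_tokenizer_encoder.py",
    "tools/export_cp_kvcache.py",
    "tools/export_weights.py",
]


def export_commands(quantize=True):
    """Build the argv lists for every export step, optionally with INT8."""
    cmds = [[sys.executable, script] for script in EXPORT_SCRIPTS]
    if quantize:
        cmds.append([sys.executable, "tools/quantize_models.py", "--int8"])
    return cmds


def run_export(quantize=True):
    """Run each step in order; check=True raises on the first failure."""
    for cmd in export_commands(quantize):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Usage (from the repository root): run_export()
```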

Performance

Optimization progression

| Optimization | RTF | Per-step | Notes |
|---|---|---|---|
| FP16 NPU baseline | 3.0x | 246 ms | Full-recompute, padded to 256 tokens |
| + INT8 quantization | 2.1x | 156 ms | NNCF INT8_SYM weight compression |
| + CP KV cache | 1.4x | 106 ms | Eliminates redundant code predictor recomputation |
| + Multi-bucket talker | 1.0x | ~80 ms | Dynamically picks smallest NPU shape per step |

RTF = Real-Time Factor: compute time divided by the duration of the audio produced. RTF=1.0x means generating 1 second of audio takes 1 second of compute; lower is faster.
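As a worked instance of the definition, RTF can be computed from the wall-clock time of a synthesis call and the length of the returned audio. `real_time_factor` is a hypothetical helper, and the numbers in the example are illustrative, not measurements.

```python
def real_time_factor(num_samples, sample_rate, elapsed_s):
    """Seconds of compute spent per second of generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("audio must have a positive length and sample rate")
    audio_s = num_samples / sample_rate
    return elapsed_s / audio_s


# Illustrative numbers: 3 s of audio at 24 kHz produced in 3.3 s of compute.
rtf = real_time_factor(num_samples=72_000, sample_rate=24_000, elapsed_s=3.3)
print(f"RTF = {rtf:.2f}x")
```

With BabelVox, one way to obtain the inputs is to wrap the `tts.generate(...)` call from the quick start in `time.perf_counter()` readings and pass `len(wav)` and `sr`, assuming `wav` is a one-dimensional sample array.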

Where the time goes (INT8 + CP KV cache, 256-token bucket)

| Component | Device | Time | Share |
|---|---|---|---|
| Talker (28-layer transformer) | NPU | 61 ms | 57% |
| Code predictor (15 groups) | CPU | 45 ms | 43% |
| Numpy overhead (embeddings, sampling) | CPU | <1 ms | <1% |

Multi-bucket scaling

Talker time on the NPU grows roughly linearly with sequence length. Pre-compiling the talker at several sizes and routing each step to the smallest bucket that fits avoids computing over padding, cutting talker time from 43 ms at 256 tokens to 15 ms at 64:

| Bucket size | Talker time | Total (+ 45 ms CP) | Effective RTF |
|---|---|---|---|
| 64 | 15 ms | 60 ms | 0.72x |
| 128 | 22 ms | 67 ms | 0.80x |
| 192 | 31 ms | 76 ms | 0.91x |
| 256 | 43 ms | 88 ms | 1.06x |
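The routing rule and the arithmetic behind the table can be sketched as follows. `pick_bucket` and `effective_rtf` are hypothetical helpers, not BabelVox functions. The figure of 12 codec frames per second is an assumption inferred from the model name (Qwen3-TTS-12Hz) and from the table values, and the sketch further assumes each generation step emits one frame.

```python
import bisect

# Talker time per bucket and code predictor time, in ms, from the table above.
TALKER_MS = {64: 15, 128: 22, 192: 31, 256: 43}
CODE_PREDICTOR_MS = 45
FRAME_RATE_HZ = 12.0  # assumption: one codec frame per step, 12 frames/s


def pick_bucket(seq_len, buckets):
    """Return the smallest bucket size that can hold seq_len tokens."""
    ordered = sorted(buckets)
    if not ordered:
        raise ValueError("no buckets configured")
    idx = bisect.bisect_left(ordered, seq_len)
    if idx == len(ordered):
        raise ValueError(f"length {seq_len} exceeds largest bucket {ordered[-1]}")
    return ordered[idx]


def effective_rtf(step_ms, frame_rate_hz=FRAME_RATE_HZ):
    """Per-step compute time divided by the audio duration of one frame."""
    frame_ms = 1000.0 / frame_rate_hz
    return step_ms / frame_ms


for seq_len in (40, 100, 150, 250):
    bucket = pick_bucket(seq_len, TALKER_MS)
    total_ms = TALKER_MS[bucket] + CODE_PREDICTOR_MS
    print(seq_len, bucket, total_ms, round(effective_rtf(total_ms), 2))
```

Under these assumptions the computed values match the Effective RTF column, which suggests the column was derived the same way, though the source does not say so.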

Hardware tested

  • CPU: Intel Core Ultra 7 258V (Lunar Lake)
  • NPU: Intel AI Boost (~13 TOPS)
  • RAM: 32 GB LPDDR5x
  • Device: Samsung Galaxy Book5 Pro

Architecture

Qwen3-TTS uses 5 model components orchestrated in an autoregressive loop:

Text --> Tokenizer --> Text Embeddings --> Talker (28L transformer) --> Codec code_0
                                               |
                       Speaker Embedding ------+    code_0 --> Code Predictor (5L) --> codes 1-15
                       (from reference audio)            \--> repeat 15x with KV cache
                                                                     |
                                           All 16 codes --> Tokenizer Decoder --> Waveform

| Component | Layers | Hidden | Heads | Device | INT8 size |
|---|---|---|---|---|---|
| Talker | 28 | 1024 | 16Q/8KV | NPU | 444 MB |
| Code predictor | 5 | 1024 | 16Q/8KV | CPU | 79 MB |
| Tokenizer encoder | -- | -- | -- | NPU | 48 MB |
| Tokenizer decoder | -- | -- | -- | NPU | 114 MB |
| Speaker encoder | -- | -- | -- | NPU | 9 MB |
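The control flow in the diagram can be sketched with stand-in functions. Everything here is a toy: `talker_step`, `predict_code`, `EOS_CODE`, and the arithmetic inside them are hypothetical placeholders for the OpenVINO models, chosen only so the loop runs. The structure it illustrates comes from the diagram: one talker step yields code_0, the code predictor runs 15 times to yield codes 1-15, and the 16 codes form one frame for the tokenizer decoder.

```python
NUM_CODE_GROUPS = 16  # code_0 from the talker plus codes 1-15
EOS_CODE = 0          # hypothetical end-of-speech marker for this toy


def talker_step(frames):
    """Stand-in for the talker: returns code_0 for the next frame."""
    # Toy rule: emit five frames, then signal the end of speech.
    return EOS_CODE if len(frames) >= 5 else 1 + len(frames)


def predict_code(frame_so_far):
    """Stand-in for one code predictor pass: returns the next code."""
    return (frame_so_far[-1] * 7 + len(frame_so_far)) % 97 + 1


def generate_frames(max_steps=200):
    """Autoregressive loop: one talker step, then 15 code predictor passes."""
    frames = []
    for _ in range(max_steps):
        code_0 = talker_step(frames)
        if code_0 == EOS_CODE:
            break
        frame = [code_0]
        while len(frame) < NUM_CODE_GROUPS:
            frame.append(predict_code(frame))
        frames.append(frame)
    return frames  # in the real pipeline these go to the tokenizer decoder


frames = generate_frames()
print(len(frames), "frames of", len(frames[0]), "codes")
```

The 15 inner passes per frame are why the code predictor accounts for a large share of each step despite having only 5 layers, and why caching its keys and values pays off.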

CLI reference

| Flag | Default | Description |
|---|---|---|
| --device | CPU | CPU or NPU |
| --int8 | off | Use INT8 quantized models |
| --precision | fp16 | fp16, int8, int4, or fp32 |
| --cp-kv-cache | off | KV cache for code predictor (recommended) |
| --talker-buckets | none | Comma-separated NPU bucket sizes (e.g. 64,128,256) |
| --kv-cache | off | KV cache for talker (not recommended on NPU) |
| --max-tokens | 200 | Maximum generation steps |
| --max-talker-seq | 256 | Fixed talker padding (when not using buckets) |
| --max-decoder-frames | 256 | Max codec frames for audio decoder |
| --max-kv-len | 256 | KV cache buffer size (if --kv-cache) |
| --text | demo text | Text to synthesize |
| --language | English | Language for synthesis |
| --ref-audio | none | Reference audio for voice cloning |
| --output / -o | output.wav | Output WAV file path |
| --export-dir | openvino_export | Directory with exported models |
| --model-path | Qwen/Qwen3-TTS-12Hz-0.6B-Base | HuggingFace model (tokenizer) |
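For scripted use, the CLI can be invoked from Python. The sketch below builds an argument list using only flags from the table above; `build_command` is a hypothetical helper, and the file names are placeholders.

```python
import subprocess


def build_command(text, output="output.wav", device="CPU", int8=True,
                  cp_kv_cache=True, talker_buckets=None, ref_audio=None):
    """Assemble a babelvox command line from flags in the CLI reference."""
    cmd = ["babelvox", "--device", device]
    if int8:
        cmd.append("--int8")
    if cp_kv_cache:
        cmd.append("--cp-kv-cache")
    if talker_buckets:
        # The CLI expects a comma-separated list such as 64,128,256.
        cmd += ["--talker-buckets", ",".join(str(b) for b in talker_buckets)]
    if ref_audio:
        cmd += ["--ref-audio", ref_audio]
    cmd += ["--text", text, "--output", output]
    return cmd


cmd = build_command("Hello world", output="hello.wav", device="NPU",
                    talker_buckets=[64, 128, 256])
print(" ".join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or punctuation.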

Acknowledgments

Based on Qwen3-TTS by Alibaba Qwen Team (Apache-2.0).

License

Apache-2.0
