Real-time text-to-speech on Intel NPU/CPU via OpenVINO

BabelVox

Real-time text-to-speech on Intel NPU via OpenVINO. Runs Qwen3-TTS 0.6B inference on the Intel NPU (AI Boost), with the code predictor on the CPU, and reaches RTF = 1.0x (real-time) speech synthesis on a Lunar Lake ultrabook.

No PyTorch at runtime. Dependencies: openvino, numpy, librosa, soundfile, scipy, transformers (tokenizer only).

Installation

pip install babelvox

Or from source:

git clone https://github.com/Djwarf/babelvox.git
cd babelvox
pip install -e .

Quick start

Models are downloaded automatically from HuggingFace on first run (~2.5 GB, cached for future use).

As a library

import soundfile as sf

from babelvox import BabelVox

# Models auto-download on first use
tts = BabelVox(device="NPU", precision="int8",
               use_cp_kv_cache=True, talker_buckets=[64, 128, 256])

wav, sr = tts.generate("Don't panic.", language="English")

# Write the synthesized audio to disk
sf.write("output.wav", wav, sr)

From the command line

# Real-time on NPU with all optimizations (models auto-download)
babelvox \
  --device NPU \
  --int8 \
  --cp-kv-cache \
  --talker-buckets "64,128,256" \
  --text "Hello, this is real-time speech synthesis on an Intel NPU." \
  --output hello.wav

# CPU-only (no NPU required, ~1.1x RTF)
babelvox \
  --device CPU \
  --int8 \
  --cp-kv-cache \
  --text "Hello world" \
  --output hello.wav
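
When scripting around the two commands above, it can help to fall back to CPU on machines without an NPU. The following is a minimal sketch, not part of BabelVox: `pick_device` is a hypothetical helper that takes a list of OpenVINO device names and returns the value to pass as `--device` (or `device=` in the library).

```python
def pick_device(available: list[str], preferred: str = "NPU") -> str:
    """Return `preferred` if it appears among the device names, else "CPU"."""
    for name in available:
        # OpenVINO may report indexed names such as "GPU.0"; compare the base name.
        if name.split(".")[0] == preferred:
            return preferred
    return "CPU"


print(pick_device(["CPU", "GPU", "NPU"]))
print(pick_device(["CPU", "GPU.0"]))
```

On a machine with OpenVINO installed, the list can come from `openvino.Core().available_devices`.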

Model export (one-time setup)

BabelVox needs pre-exported OpenVINO IR models. The export scripts in tools/ require PyTorch and the original Qwen3-TTS model (~2.4 GB download):

pip install torch qwen-tts nncf

# Export OpenVINO IR models
python tools/export_tts_lm.py
python tools/export_speaker_encoder.py
python tools/export_decoder.py
python tools/export_tokenizer_encoder.py
python tools/export_cp_kvcache.py
python tools/export_weights.py

# Quantize to INT8 (recommended)
python tools/quantize_models.py --int8

After export, PyTorch is no longer needed.

Performance

Optimization progression

Optimization	RTF	Per-step	Notes
FP16 NPU baseline	3.0x	246 ms	Full-recompute, padded to 256 tokens
+ INT8 quantization	2.1x	156 ms	NNCF INT8_SYM weight compression
+ CP KV cache	1.4x	106 ms	Eliminates redundant code predictor recomputation
+ Multi-bucket talker	1.0x	~80 ms	Dynamically picks smallest NPU shape per step

RTF = Real-Time Factor: compute time divided by the duration of the generated audio. RTF = 1.0x means generating 1 second of audio takes 1 second of compute; lower is faster.
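
As a rough illustration of that definition, RTF can be computed from the wall-clock synthesis time and the length of the generated audio. The helper name `real_time_factor` and the sample values are made up for this sketch and are not part of the BabelVox API.

```python
def real_time_factor(compute_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return compute time divided by audio duration (lower is faster)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("audio must have a positive length and sample rate")
    audio_seconds = num_samples / sample_rate
    return compute_seconds / audio_seconds


# Illustrative values: 2.0 s of compute for 48000 samples at 24 kHz (2.0 s of audio).
rtf = real_time_factor(2.0, 48000, 24000)
print(f"RTF = {rtf:.2f}x")
```

With the library, `compute_seconds` would be the time measured around `tts.generate(...)`, and, assuming `wav` is a 1-D array, `len(wav)` and `sr` supply the other two arguments.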

Where the time goes (INT8 + CP KV cache, 256-token bucket)

Component	Device	Time	Share
Talker (28-layer transformer)	NPU	61 ms	57%
Code predictor (15 groups)	CPU	45 ms	43%
Numpy overhead (embeddings, sampling)	CPU	<1 ms	<1%

Multi-bucket scaling

Talker time on the NPU grows with the padded sequence length (15 ms at 64 tokens, 43 ms at 256). Pre-compiling the talker at several sizes and routing each step to the smallest bucket that fits avoids most of the wasted compute:

Bucket size	Talker time	Total (+ 45 ms code predictor)	Effective RTF
64	15 ms	60 ms	0.72x
128	22 ms	67 ms	0.80x
192	31 ms	76 ms	0.91x
256	43 ms	88 ms	1.06x
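
The routing rule and the RTF column can be sketched in a few lines. This is not the BabelVox implementation: `pick_bucket` and `effective_rtf` are hypothetical names, the timings are copied from the table above, and the 12 Hz frame rate is an assumption read off the model name (`Qwen3-TTS-12Hz-0.6B-Base`). Under that assumption the sketch reproduces the Effective RTF values in the table.

```python
from bisect import bisect_left

# Talker time per bucket and code predictor time, copied from the table above.
TALKER_MS = {64: 15, 128: 22, 192: 31, 256: 43}
CODE_PREDICTOR_MS = 45
# Assumption: each step yields one codec frame at 12 Hz, about 83.3 ms of audio.
FRAME_MS = 1000 / 12


def pick_bucket(seq_len: int, buckets: list[int]) -> int:
    """Return the smallest bucket that can hold seq_len tokens."""
    ordered = sorted(buckets)
    idx = bisect_left(ordered, seq_len)
    if idx == len(ordered):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket {ordered[-1]}")
    return ordered[idx]


def effective_rtf(bucket: int) -> float:
    """Per-step compute time divided by the audio produced per step."""
    return (TALKER_MS[bucket] + CODE_PREDICTOR_MS) / FRAME_MS


for seq_len in (40, 100, 130, 256):
    bucket = pick_bucket(seq_len, list(TALKER_MS))
    print(seq_len, bucket, round(effective_rtf(bucket), 2))
```

Short sequences land in the 64-token bucket and stay well under real time; only steps that need the 256-token bucket exceed RTF 1.0x.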

Hardware tested

CPU: Intel Core Ultra 7 258V (Lunar Lake)
NPU: Intel AI Boost (~13 TOPS)
RAM: 32 GB LPDDR5x
Device: Samsung Galaxy Book5 Pro

Architecture

Qwen3-TTS uses 5 model components orchestrated in an autoregressive loop:

Text --> Tokenizer --> Text Embeddings --> Talker (28L transformer) --> Codec code_0
                                               |
                       Speaker Embedding ------+    code_0 --> Code Predictor (5L) --> codes 1-15
                       (from reference audio)            \--> repeat 15x with KV cache
                                                                     |
                                           All 16 codes --> Tokenizer Decoder --> Waveform

Component	Layers	Hidden	Heads	Device	INT8 size
Talker	28	1024	16Q/8KV	NPU	444 MB
Code predictor	5	1024	16Q/8KV	CPU	79 MB
Tokenizer encoder	--	--	--	NPU	48 MB
Tokenizer decoder	--	--	--	NPU	114 MB
Speaker encoder	--	--	--	NPU	9 MB
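
The control flow in the diagram can be sketched with plain callables standing in for the models. Everything here is illustrative: `generate_frames`, `talker_step`, and `predict_next_code` are hypothetical names, the end-of-sequence check is an assumed stop condition, and the real pipeline passes embeddings and KV caches rather than bare integers.

```python
NUM_CODES = 16  # code_0 from the talker plus codes 1-15 from the code predictor


def generate_frames(talker_step, predict_next_code, max_steps, eos_code):
    """Collect one 16-code frame per step until eos_code or max_steps.

    talker_step(frames) -> int and predict_next_code(codes) -> int are
    hypothetical stand-ins for the talker and one code predictor call.
    """
    frames = []
    for _ in range(max_steps):
        code_0 = talker_step(frames)
        if code_0 == eos_code:  # assumed stop condition
            break
        codes = [code_0]
        # 15 code predictor calls, each conditioned on the codes so far.
        while len(codes) < NUM_CODES:
            codes.append(predict_next_code(codes))
        frames.append(codes)
    return frames  # the tokenizer decoder would turn these into a waveform


# Toy stand-ins so the loop runs without any models.
def toy_talker(frames):
    return 999 if len(frames) == 3 else 7


def toy_predictor(codes):
    return (codes[-1] + 1) % 100


frames = generate_frames(toy_talker, toy_predictor, max_steps=200, eos_code=999)
print(len(frames), len(frames[0]))
```

The inner loop is why the code predictor benefits from a KV cache: each of its 15 calls extends the same short sequence by one code.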

CLI reference

Flag	Default	Description
`--device`	`CPU`	`CPU` or `NPU`
`--int8`	off	Use INT8 quantized models
`--precision`	`fp16`	`fp16`, `int8`, `int4`, or `fp32`
`--cp-kv-cache`	off	KV cache for code predictor (recommended)
`--talker-buckets`	none	Comma-separated NPU bucket sizes (e.g. `64,128,256`)
`--kv-cache`	off	KV cache for talker (not recommended on NPU)
`--max-tokens`	200	Maximum generation steps
`--max-talker-seq`	256	Fixed talker padding (when not using buckets)
`--max-decoder-frames`	256	Max codec frames for audio decoder
`--max-kv-len`	256	KV cache buffer size (if `--kv-cache`)
`--text`	demo text	Text to synthesize
`--language`	English	Language for synthesis
`--ref-audio`	none	Reference audio for voice cloning
`--output` / `-o`	`output.wav`	Output WAV file path
`--export-dir`	`openvino_export`	Directory with exported models
`--model-path`	`Qwen/Qwen3-TTS-12Hz-0.6B-Base`	HuggingFace model (tokenizer)
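
For readers wrapping the CLI, a subset of the flags above can be mirrored with `argparse`. This is a sketch built from the table, not the actual BabelVox argument parser; `build_parser` and `parse_buckets` are hypothetical names, and only some flags are included.

```python
import argparse


def parse_buckets(value: str) -> list[int]:
    """Turn a string such as "64,128,256" into a sorted list of ints."""
    return sorted(int(part) for part in value.split(",") if part.strip())


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="babelvox-sketch")
    parser.add_argument("--device", choices=["CPU", "NPU"], default="CPU")
    parser.add_argument("--precision", choices=["fp16", "int8", "int4", "fp32"], default="fp16")
    parser.add_argument("--int8", action="store_true")
    parser.add_argument("--cp-kv-cache", action="store_true")
    parser.add_argument("--talker-buckets", type=parse_buckets, default=None)
    parser.add_argument("--max-tokens", type=int, default=200)
    parser.add_argument("--text", default=None)  # the real CLI falls back to a demo text
    parser.add_argument("-o", "--output", default="output.wav")
    return parser


args = build_parser().parse_args(
    ["--device", "NPU", "--int8", "--talker-buckets", "64,128,256", "--text", "Hello"]
)
print(args.device, args.int8, args.talker_buckets, args.output)
```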

Acknowledgments

Based on Qwen3-TTS by Alibaba Qwen Team (Apache-2.0).

License

Apache-2.0

Release history

1.7.1 (Mar 22, 2026)
1.7.0 (Mar 22, 2026)
1.6.0 (Mar 22, 2026)
1.5.0 (Mar 22, 2026)
1.4.0 (Mar 22, 2026)
1.3.0 (Mar 20, 2026)
1.2.0 (Mar 20, 2026)
1.0.0 (Mar 19, 2026)
0.11.0 (Mar 19, 2026)
0.10.0 (Mar 19, 2026)
0.9.0 (Mar 19, 2026)
0.8.0 (Mar 19, 2026)
0.6.1 (Mar 18, 2026)
0.6.0 (Feb 17, 2026)
0.5.1 (Feb 17, 2026)
0.5.0 (Feb 17, 2026)
0.4.1 (Feb 17, 2026)
0.4.0 (Feb 17, 2026)
0.3.0 (Feb 17, 2026)
0.2.2 (Feb 17, 2026)
0.2.1 (Feb 17, 2026)
0.2.0 (Feb 17, 2026), this version
0.1.0 (Feb 17, 2026)

Download files

Source distribution: babelvox-0.2.0.tar.gz (19.4 kB), uploaded Feb 17, 2026
Built distribution: babelvox-0.2.0-py3-none-any.whl (18.1 kB), uploaded Feb 17, 2026 (Python 3)

File details

Details for the file babelvox-0.2.0.tar.gz.

File metadata

Download URL: babelvox-0.2.0.tar.gz
Upload date: Feb 17, 2026
Size: 19.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for babelvox-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a8561aababbd77755bf620f1fa21d544a87e6848a4e7127a2e94b89bb0d131a5`
MD5	`dcf95ce7d4807246f226b0400045ebe7`
BLAKE2b-256	`0d266e202de4a5139195ce1bf2b35ea53b6bb1ab17b45109c2e9d1b9be22717d`
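
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; `sha256_of` is a hypothetical helper name, and the archive is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for babelvox-0.2.0.tar.gz.
EXPECTED_SHA256 = "a8561aababbd77755bf620f1fa21d544a87e6848a4e7127a2e94b89bb0d131a5"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("babelvox-0.2.0.tar.gz")  # assumed location of the download
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
```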

File details

Details for the file babelvox-0.2.0-py3-none-any.whl.

File metadata

Download URL: babelvox-0.2.0-py3-none-any.whl
Upload date: Feb 17, 2026
Size: 18.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for babelvox-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`73bc6627f51cddb2c4fa8f7c491dbee2aa9cab9b8579966820f1687e57abf9e6`
MD5	`7e9132a2374585c49a0577208799cf51`
BLAKE2b-256	`05f578331c7bfdc9a0eaae696e44f507acdc772edc569c63c8712a1bd2847c61`
