BabelVox
Real-time text-to-speech on Intel NPU/CPU via OpenVINO. Runs Qwen3-TTS 0.6B inference on the Intel NPU (AI Boost), achieving RTF=1.0x (real-time) speech synthesis on a Lunar Lake ultrabook.
No PyTorch at runtime. Dependencies: openvino, numpy, librosa, soundfile, scipy, transformers (tokenizer only), num2words, defusedxml.
Installation
pip install babelvox
Or from source:
git clone https://github.com/Djwarf/babelvox.git
cd babelvox
pip install -e .
Quick start
Models (~2.5 GB) are downloaded automatically from HuggingFace on first run and cached for future use. No manual setup needed.
As a library
from babelvox import BabelVox
import soundfile as sf
# CPU — works on any machine (models auto-download on first use)
tts = BabelVox(precision="int8", use_cp_kv_cache=True)
wav, sr = tts.generate("Don't panic.", language="English")
sf.write("output.wav", wav, sr)
For Intel NPU (Lunar Lake or later), enable hardware acceleration and model caching:
tts = BabelVox(device="NPU", precision="int8",
use_cp_kv_cache=True, talker_buckets=[64, 128, 256],
cache_dir="./ov_cache")
From the command line
# CPU (works anywhere, ~1.1x RTF)
babelvox --int8 --cp-kv-cache --text "Hello world" --output hello.wav
# Intel NPU (real-time, RTF=1.0x)
babelvox --device NPU --int8 --cp-kv-cache --talker-buckets "64,128,256" \
    --cache-dir ./ov_cache \
    --text "Hello, this is real-time speech synthesis on an Intel NPU." \
    --output hello.wav
Features
Voice cloning
Clone any voice from a short reference audio clip (3-10 seconds). For best results, provide a transcription of the reference audio with ref_text:
wav, sr = tts.generate(
    "This sounds like someone else.",
    ref_audio="reference.wav",
    ref_text="The words spoken in reference dot wav.",
    language="English",
)
babelvox --int8 --cp-kv-cache --ref-audio reference.wav \
    --ref-text "The words spoken in reference dot wav." \
    --text "This sounds like someone else." --output cloned.wav
Voice persistence
Without reference audio, each generate() call may produce a different voice. To keep a consistent voice across multiple calls:
# Extract once from reference audio, reuse for all subsequent calls
tts.default_speaker = tts.extract_speaker_embedding("voice.wav")
wav1, sr = tts.generate("First sentence.")
wav2, sr = tts.generate("Second sentence.") # same voice
# Or pass a speaker embedding directly per call
embed = tts.extract_speaker_embedding("voice.wav")
wav, sr = tts.generate("Hello", speaker_embed=embed)
Model caching
On NPU, OpenVINO compiles models at startup, which can take minutes on the first run. Use cache_dir to cache the compiled models so subsequent launches are instant:
tts = BabelVox(device="NPU", cache_dir="./ov_cache", precision="int8")
# First run: ~200s compile. Second run: instant.
babelvox --device NPU --cache-dir ./ov_cache --int8 --cp-kv-cache
Streaming generation
Start playback immediately instead of waiting for the full utterance. generate_stream() yields waveform chunks every N codec frames (~1 second per 12 frames):
import sounddevice as sd
for chunk, sr in tts.generate_stream("A long paragraph of text...",
                                     chunk_frames=12):
    sd.play(chunk, sr)
    sd.wait()
Silence-aware chunking
Split at natural pauses between words instead of fixed intervals. Produces complete words/phrases per chunk with crossfade overlap for click-free playback:
for chunk, sr in tts.generate_stream(
    "A long paragraph of text...",
    split_on_silence=True,     # yield at pauses between words
    min_chunk_frames=6,        # at least 0.5s per chunk
    max_chunk_frames=48,       # at most 4s per chunk
    silence_threshold=0.02,    # RMS energy threshold for silence
    crossfade_samples=1200,    # 50ms overlap for click-free joins
):
    sd.play(chunk, sr)
    sd.wait()
For near-zero gap between chunks, overlap generation and playback with a thread:
import threading, queue
audio_q = queue.Queue()
def producer():
    for chunk, sr in tts.generate_stream("Long text...",
                                         split_on_silence=True):
        audio_q.put((chunk, sr))
    audio_q.put(None)  # sentinel: signals end of generation

threading.Thread(target=producer).start()

while True:
    item = audio_q.get()
    if item is None:
        break
    sd.play(item[0], item[1])
    sd.wait()
Text preprocessing & SSML
BabelVox automatically normalizes text before synthesis — expanding abbreviations (Dr. → Doctor), numbers ($4.50 → four dollars and fifty cents), dates, times, and phone numbers into spoken form. Unicode punctuation is cleaned up and repeated punctuation is collapsed.
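No extra API call is needed; raw text can be passed straight to generate(). A small illustration (the exact spoken rendering is the normalizer's choice):

# Spoken roughly as "Doctor Smith paid four dollars and fifty cents ..."
wav, sr = tts.generate("Dr. Smith paid $4.50 on Jan. 5.", language="English")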
For fine-grained control, pass SSML markup:
ssml = """<speak>
The price is <say-as interpret-as="number">$4.50</say-as>.
<break time="500ms"/>
Call <say-as interpret-as="telephone">555-123-4567</say-as> for details.
<sub alias="World Wide Web Consortium">W3C</sub> approved.
</speak>"""
wav, sr = tts.generate(ssml, ssml=True)
babelvox --ssml --text '<speak>Hello.<break time="500ms"/>World.</speak>' -o out.wav
Supported SSML tags:
| Tag | Effect |
|---|---|
| `<break time="500ms"/>` | Insert pause (maps to punctuation) |
| `<break strength="strong"/>` | Insert pause by strength level |
| `<sub alias="...">` | Replace text with alias |
| `<say-as interpret-as="number\|date\|time\|telephone\|spell-out">` | Normalize to spoken form |
| `<emphasis>`, `<prosody>`, `<phoneme>` | Parsed (effects coming in v0.11.0) |
Onomatopoeia (boom, crash, sizzle, etc.) is also detected and annotated for future prosody control.
10 languages
Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish.
wav, sr = tts.generate("Bonjour le monde.", language="French")
wav, sr = tts.generate("Hallo Welt.", language="German")
Sampling controls
Fine-tune the generation quality and diversity:
wav, sr = tts.generate(
    "Hello world",
    temperature=0.9,           # higher = more expressive, lower = more stable
    top_k=50,                  # limit sampling to top-k tokens
    top_p=1.0,                 # nucleus sampling threshold
    repetition_penalty=1.05,   # discourage repeated audio patterns
    max_new_tokens=512,        # max generation steps (1 step = 1/12 sec audio)
)
HTTP API (cross-language integration)
Run BabelVox as an HTTP server so any language (JavaScript, Go, Rust, etc.) can call it:
babelvox --serve --int8 --cp-kv-cache --port 8765
Then from any client:
curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "English"}' \
  -o hello.wav
From JavaScript:
const res = await fetch("http://localhost:8765/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello world" }),
});
const blob = await res.blob();
const audio = new Audio(URL.createObjectURL(blob));
audio.play();
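Or from Python, using the requests package (a sketch against the /tts contract described below; any HTTP client works):

import requests

resp = requests.post(
    "http://localhost:8765/tts",
    json={"text": "Hello world", "language": "English"},
    timeout=120,  # long texts can take a while to synthesize
)
resp.raise_for_status()
with open("hello.wav", "wb") as f:
    f.write(resp.content)  # response body is WAV bytes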
Endpoints:
| Method | Path | Description |
|---|---|---|
| POST | `/tts` | Synthesize speech: JSON body in, WAV bytes out |
| GET | `/tts/stream` | Streaming SSE: audio chunks as base64 events |
| GET | `/health` | Health check: returns `{"status": "ok"}` |
POST /tts request body:
| Field | Required | Default | Description |
|---|---|---|---|
| `text` | yes | — | Text to synthesize |
| `language` | no | `"English"` | One of the 10 supported languages |
| `ref_audio` | no | `null` | Path to reference WAV for voice cloning |
| `ref_text` | no | `null` | Transcription of reference audio (improves cloning) |
| `max_new_tokens` | no | `512` | Max generation steps |
| `temperature` | no | `0.9` | Sampling temperature |
| `top_k` | no | `50` | Top-k sampling |
| `top_p` | no | `1.0` | Nucleus sampling threshold |
| `repetition_penalty` | no | `1.05` | Penalty for repeated tokens |
| `ssml` | no | `false` | Treat text as SSML markup |
SSE streaming
Stream audio chunks in real-time via Server-Sent Events (no extra dependencies):
curl -N "http://localhost:8765/tts/stream?text=Hello+world&format=pcm_s16le"
Events: start (sample rate + format), audio (base64-encoded chunk), done (total duration), error.
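A minimal Python consumer sketch, assuming the standard SSE event:/data: line format and the requests package (handling of the start and done events is elided):

import base64
import requests

with requests.get(
    "http://localhost:8765/tts/stream",
    params={"text": "Hello world", "format": "pcm_s16le"},
    stream=True,
) as resp:
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:") and event == "audio":
            pcm = base64.b64decode(line.split(":", 1)[1].strip())
            # feed the raw 16-bit PCM bytes to an audio sink here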
WebSocket streaming
For bidirectional real-time streaming with cancel support, install the ws extra:
pip install babelvox[ws]
babelvox --serve --int8 --cp-kv-cache --port 8765 --ws-port 8766
Connect and send JSON requests, receive binary audio chunks:
const ws = new WebSocket("ws://localhost:8766");
ws.onopen = () => ws.send(JSON.stringify({ text: "Hello world", format: "pcm_s16le" }));
ws.onmessage = (e) => {
  if (typeof e.data === "string") {
    const msg = JSON.parse(e.data);
    console.log(msg.event); // "start", "done", or "error"
  } else {
    // Binary audio chunk: play with the Web Audio API
  }
};
// Cancel mid-stream:
ws.send(JSON.stringify({ event: "cancel" }));
Formats: pcm_s16le (raw 16-bit PCM, lowest latency) or wav_chunks (each chunk is a complete WAV file).
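The same exchange from Python, sketched with the third-party websockets package; per the protocol above, control events arrive as JSON text frames and audio as binary frames:

import asyncio
import json
import websockets  # pip install websockets (any WebSocket client works)

async def speak(text):
    async with websockets.connect("ws://localhost:8766") as ws:
        await ws.send(json.dumps({"text": text, "format": "pcm_s16le"}))
        async for msg in ws:
            if isinstance(msg, str):  # JSON control event
                if json.loads(msg).get("event") in ("done", "error"):
                    break
            else:                     # binary PCM chunk
                pass  # feed msg (bytes) to an audio sink

asyncio.run(speak("Hello world"))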
Pre-download models
Cache models ahead of time instead of on first use:
from babelvox import download_models
path = download_models() # downloads ~2.5 GB to HuggingFace cache
API reference
BabelVox(model_path, export_dir, device, precision, ...)
| Parameter | Default | Description |
|---|---|---|
| `model_path` | `"Qwen/Qwen3-TTS-12Hz-0.6B-Base"` | HuggingFace model (tokenizer only) |
| `export_dir` | `None` (auto-download) | Path to exported OpenVINO models |
| `device` | `"CPU"` | `"CPU"` or `"NPU"` |
| `precision` | `"fp16"` | `"fp16"`, `"int8"`, `"int4"`, or `"fp32"` |
| `use_cp_kv_cache` | `False` | KV cache for code predictor (recommended) |
| `talker_buckets` | `None` | NPU bucket sizes, e.g. `[64, 128, 256]` |
| `cache_dir` | `None` | OpenVINO compiled model cache directory |
tts.generate(text, language, ref_audio, ...) returns (waveform, sample_rate)
| Parameter | Default | Description |
|---|---|---|
| `text` | required | Text to synthesize |
| `language` | `"English"` | One of the 10 supported languages |
| `ref_audio` | `None` | Path to reference WAV for voice cloning |
| `ref_text` | `None` | Transcription of the reference audio (improves cloning) |
| `speaker_embed` | `None` | Pre-extracted speaker embedding (numpy array) |
| `max_new_tokens` | `512` | Max generation steps (12 steps = 1 sec audio) |
| `temperature` | `0.9` | Sampling temperature (0 = greedy) |
| `top_k` | `50` | Top-k sampling |
| `top_p` | `1.0` | Nucleus sampling threshold |
| `repetition_penalty` | `1.05` | Penalty for repeated tokens |
| `ssml` | `False` | Treat text as SSML markup |
tts.generate_stream(text, language, ..., chunk_frames=12) — same args as generate(), plus:
| Parameter | Default | Description |
|---|---|---|
| `chunk_frames` | `12` | Codec frames per chunk when not using silence detection (12 = ~1 sec) |
| `split_on_silence` | `False` | Yield at natural pauses between words instead of fixed intervals |
| `min_chunk_frames` | `6` | Minimum frames before considering a silence split (~0.5s) |
| `max_chunk_frames` | `48` | Force yield after this many frames even without silence (~4s) |
| `silence_threshold` | `0.02` | RMS energy threshold for silence detection |
| `crossfade_samples` | `1200` | Overlap samples at chunk edges for click-free joins (50ms at 24kHz) |
Yields (waveform_chunk, 24000) tuples as audio is generated.
tts.extract_speaker_embedding(audio_path) returns numpy array (1, 1024)
tts.default_speaker — set to a speaker embedding for consistent voice across calls
Exporting models yourself (optional)
The pre-built INT8 models are downloaded automatically. If you want to export from scratch (e.g., for a different quantization), the export scripts in tools/ require PyTorch:
pip install torch qwen-tts nncf
python tools/export_tts_lm.py
python tools/export_speaker_encoder.py
python tools/export_decoder.py
python tools/export_tokenizer_encoder.py
python tools/export_cp_kvcache.py
python tools/export_weights.py
python tools/quantize_models.py --int8
Performance
Optimization progression
| Optimization | RTF | Per-step | Notes |
|---|---|---|---|
| FP16 NPU baseline | 3.0x | 246 ms | Full-recompute, padded to 256 tokens |
| + INT8 quantization | 2.1x | 156 ms | NNCF INT8_SYM weight compression |
| + CP KV cache | 1.4x | 106 ms | Eliminates redundant code predictor recomputation |
| + Multi-bucket talker | 1.0x | ~80 ms | Dynamically picks smallest NPU shape per step |
RTF = Real-Time Factor: compute time divided by audio duration. RTF=1.0x means generating 1 second of audio takes 1 second of compute; lower is faster.
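To reproduce the RTF figure on your own hardware, time a generation and divide compute time by audio duration (a quick sketch using the library API shown above):

import time

t0 = time.perf_counter()
wav, sr = tts.generate("A sentence long enough to time meaningfully.")
elapsed = time.perf_counter() - t0

rtf = elapsed / (len(wav) / sr)  # compute seconds per second of audio
print(f"RTF = {rtf:.2f}x")       # 1.0x or below means real-time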
Where the time goes (INT8 + CP KV cache, 256-token bucket)
| Component | Device | Time | Share |
|---|---|---|---|
| Talker (28-layer transformer) | NPU | 61 ms | 57% |
| Code predictor (15 groups) | CPU | 45 ms | 43% |
| Numpy overhead (embeddings, sampling) | CPU | <1 ms | <1% |
Multi-bucket scaling
The talker scales linearly with sequence length on NPU. Pre-compiling at multiple sizes and routing to the smallest bucket that fits dramatically reduces wasted compute:
| Bucket size | Talker time | Total (+ 45ms CP) | Effective RTF |
|---|---|---|---|
| 64 | 15 ms | 60 ms | 0.72x |
| 128 | 22 ms | 67 ms | 0.80x |
| 192 | 31 ms | 76 ms | 0.91x |
| 256 | 43 ms | 88 ms | 1.06x |
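The routing itself is simple; a minimal sketch of the idea (illustrative only, not the library's internal API):

def pick_bucket(seq_len, buckets=(64, 128, 192, 256)):
    """Pick the smallest pre-compiled bucket that fits the current sequence."""
    for b in sorted(buckets):
        if seq_len <= b:
            return b
    return max(buckets)  # sequence longer than all buckets: use the largest

# A 70-token sequence runs in the 128 bucket (22 ms/step)
# rather than the full 256 bucket (43 ms/step).
assert pick_bucket(70) == 128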
Hardware tested
- CPU: Intel Core Ultra 7 258V (Lunar Lake)
- NPU: Intel AI Boost (~13 TOPS)
- RAM: 32 GB LPDDR5x
- Device: Samsung Galaxy Book5 Pro
Architecture
Qwen3-TTS uses 5 model components orchestrated in an autoregressive loop:
Text --> Tokenizer --> Text Embeddings ---+
                                          +--> Talker (28L transformer) --> code_0
Speaker Embedding ------------------------+
(from reference audio)

code_0 --> Code Predictor (5L) --> codes 1-15    (run 15x with KV cache)

All 16 codes --> Tokenizer Decoder --> Waveform
| Component | Layers | Hidden | Heads | Device | INT8 size |
|---|---|---|---|---|---|
| Talker | 28 | 1024 | 16Q/8KV | NPU | 444 MB |
| Code predictor | 5 | 1024 | 16Q/8KV | CPU | 79 MB |
| Tokenizer encoder | -- | -- | -- | NPU | 48 MB |
| Tokenizer decoder | -- | -- | -- | NPU | 114 MB |
| Speaker encoder | -- | -- | -- | NPU | 9 MB |
CLI reference
| Flag | Default | Description |
|---|---|---|
| `--device` | `CPU` | `CPU` or `NPU` |
| `--int8` | off | Use INT8 quantized models |
| `--precision` | `fp16` | `fp16`, `int8`, `int4`, or `fp32` |
| `--cp-kv-cache` | off | KV cache for code predictor (recommended) |
| `--talker-buckets` | none | Comma-separated NPU bucket sizes (e.g. `64,128,256`) |
| `--kv-cache` | off | KV cache for talker (not recommended on NPU) |
| `--cache-dir` | none | OpenVINO compiled model cache directory |
| `--max-tokens` | 200 | Maximum generation steps |
| `--max-talker-seq` | 256 | Fixed talker padding (when not using buckets) |
| `--max-decoder-frames` | 256 | Max codec frames for audio decoder |
| `--max-kv-len` | 256 | KV cache buffer size (if `--kv-cache`) |
| `--ssml` | off | Treat `--text` as SSML markup |
| `--text` | demo text | Text to synthesize |
| `--language` | `English` | Language for synthesis |
| `--ref-audio` | none | Reference audio for voice cloning |
| `--ref-text` | none | Transcription of reference audio (improves cloning) |
| `--serve` | off | Start HTTP server instead of generating once |
| `--host` | `0.0.0.0` | Server bind address |
| `--port` | `8765` | Server port |
| `--ws-port` | none | WebSocket server port (requires `babelvox[ws]`) |
| `--output` / `-o` | `output.wav` | Output WAV file path |
| `--export-dir` | auto-download | Directory with exported models (downloads from HuggingFace if not set) |
| `--model-path` | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | HuggingFace model (tokenizer) |
Acknowledgments
Based on Qwen3-TTS by Alibaba Qwen Team (Apache-2.0).
License
Apache-2.0