Real-time text-to-speech on Intel NPU/CPU via OpenVINO

BabelVox

Real-time text-to-speech on Intel NPU via OpenVINO. Runs Qwen3-TTS 0.6B inference on the Intel NPU (AI Boost), with the code predictor on CPU, and reaches RTF=1.0x (real-time) speech synthesis on a Lunar Lake ultrabook.

No PyTorch at runtime. Dependencies: openvino, numpy, librosa, soundfile, scipy, transformers (tokenizer only), num2words, defusedxml.

Installation

pip install babelvox

Or from source:

git clone https://github.com/Djwarf/babelvox.git
cd babelvox
pip install -e .

Quick start

Models (~2.5 GB) are downloaded automatically from HuggingFace on first run and cached for future use. No manual setup needed.

As a library

from babelvox import BabelVox
import soundfile as sf

# CPU — works on any machine (models auto-download on first use)
tts = BabelVox(precision="int8", use_cp_kv_cache=True)
wav, sr = tts.generate("Don't panic.", language="English")
sf.write("output.wav", wav, sr)

For Intel NPU (Lunar Lake or later), enable hardware acceleration and model caching:

tts = BabelVox(device="NPU", precision="int8",
               use_cp_kv_cache=True, talker_buckets=[64, 128, 256],
               cache_dir="./ov_cache")

From the command line

# CPU (works anywhere, ~1.1x RTF)
babelvox --int8 --cp-kv-cache --text "Hello world" --output hello.wav

# Intel NPU (real-time, RTF=1.0x)
babelvox --device NPU --int8 --cp-kv-cache --talker-buckets "64,128,256" \
  --cache-dir ./ov_cache \
  --text "Hello, this is real-time speech synthesis on an Intel NPU." \
  --output hello.wav

Features

Voice cloning

Clone any voice from a short reference audio clip (3-10 seconds). For best results, provide a transcription of the reference audio with ref_text:

wav, sr = tts.generate(
    "This sounds like someone else.",
    ref_audio="reference.wav",
    ref_text="The words spoken in reference dot wav.",
    language="English",
)

The same from the command line:

babelvox --int8 --cp-kv-cache --ref-audio reference.wav \
  --ref-text "The words spoken in reference dot wav." \
  --text "This sounds like someone else." --output cloned.wav

Speaker profiles

Save, load, and reuse named speaker voices across sessions:

from babelvox import BabelVox, SpeakerLibrary

tts = BabelVox(precision="int8", use_cp_kv_cache=True)
tts.speaker_library = SpeakerLibrary("~/.babelvox/speakers")

# Save a speaker from reference audio
tts.save_speaker("alice", "alice.wav", language="English", gender="female")

# Use by name
wav, sr = tts.generate("Hello from Alice", speaker="alice")

The same from the command line:

babelvox --save-speaker alice --ref-audio alice.wav --language English
babelvox --speaker alice --text "Hello from Alice" -o hello.wav
babelvox --list-speakers

Mix or interpolate voices:

from babelvox import mix_speakers, interpolate_speakers

lib = tts.speaker_library
a = lib.load("alice").embedding
b = lib.load("bob").embedding
mixed = mix_speakers([a, b], [0.7, 0.3])      # 70% alice, 30% bob
blended = interpolate_speakers(a, b, 0.5)      # 50/50 blend
wav, sr = tts.generate("Hello", speaker_embed=mixed)
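
To make the idea concrete, here is a minimal NumPy sketch of what a weighted mix and a linear interpolation of two embeddings could look like. It is not BabelVox's implementation: mix_sketch and lerp_sketch are hypothetical names, the weights are assumed to be normalized so they sum to 1, and random vectors stand in for real (1, 1024) speaker embeddings.

```python
import numpy as np

def mix_sketch(embeddings, weights):
    """Weighted average of same-shaped embeddings (weights normalized to sum to 1)."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()
    stacked = np.stack([np.asarray(e, dtype=np.float32) for e in embeddings])
    # Contract the weight vector against the first (speaker) axis.
    return np.tensordot(w, stacked, axes=1)

def lerp_sketch(a, b, t):
    """Linear interpolation: t=0 returns a, t=1 returns b."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return (1.0 - t) * a + t * b

# Random stand-ins for two real speaker embeddings (assumption for illustration).
rng = np.random.default_rng(0)
a = rng.standard_normal((1, 1024)).astype(np.float32)
b = rng.standard_normal((1, 1024)).astype(np.float32)

mixed = mix_sketch([a, b], [0.7, 0.3])
blended = lerp_sketch(a, b, 0.5)
```

Under these assumptions a 50/50 interpolation is the same as a mix with equal weights, which is a quick sanity check when experimenting with blends.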

Server API:

Method | Path | Description
GET | /speakers | List all saved speaker profiles
POST | /speakers | Save a speaker ({"name": "alice", "ref_audio": "..."})
DELETE | /speakers/{name} | Delete a speaker profile
POST | /tts/batch | Batch synthesis ({"items": [{"text": "...", "speaker": "..."}]})

Use "speaker": "alice" in POST /tts to synthesize with a saved voice.

Voice persistence

Without reference audio, each generate() call may produce a different voice. To keep a consistent voice across multiple calls:

# Extract once from reference audio, reuse for all subsequent calls
tts.default_speaker = tts.extract_speaker_embedding("voice.wav")
wav1, sr = tts.generate("First sentence.")
wav2, sr = tts.generate("Second sentence.")  # same voice

# Or pass a speaker embedding directly per call
embed = tts.extract_speaker_embedding("voice.wav")
wav, sr = tts.generate("Hello", speaker_embed=embed)

Model caching

On NPU, OpenVINO compiles models at startup, which can take several minutes on the first run. Set cache_dir to cache the compiled models so that subsequent launches skip compilation:

tts = BabelVox(device="NPU", cache_dir="./ov_cache", precision="int8")
# First run: ~200s compile. Second run: instant.

The same from the command line:

babelvox --device NPU --cache-dir ./ov_cache --int8 --cp-kv-cache

Streaming generation

Start playback immediately instead of waiting for the full utterance. generate_stream() yields waveform chunks every N codec frames (~1 second per 12 frames):

import sounddevice as sd

for chunk, sr in tts.generate_stream("A long paragraph of text...",
                                      chunk_frames=12):
    sd.play(chunk, sr)
    sd.wait()
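
If you also want the full utterance on disk, the streamed chunks can be collected as they arrive. The helper below is a sketch, not part of the BabelVox API: collect_chunks is a hypothetical name, chunks are assumed to be mono and non-overlapping, and a fake generator stands in for tts.generate_stream() so the example runs without a model.

```python
import numpy as np

def collect_chunks(stream):
    """Concatenate (chunk, sample_rate) pairs from a generator into one waveform."""
    chunks, sample_rate = [], None
    for chunk, sr in stream:
        if sample_rate is None:
            sample_rate = sr
        elif sr != sample_rate:
            raise ValueError(f"sample rate changed mid-stream: {sample_rate} -> {sr}")
        # Flatten to 1-D float32 (assumes mono audio).
        chunks.append(np.asarray(chunk, dtype=np.float32).reshape(-1))
    if not chunks:
        return np.zeros(0, dtype=np.float32), sample_rate
    return np.concatenate(chunks), sample_rate

def fake_stream():
    # Stand-in for tts.generate_stream(...): three 1-second chunks at 24 kHz.
    for _ in range(3):
        yield np.zeros(24000, dtype=np.float32), 24000

wav, sr = collect_chunks(fake_stream())
```

With a real model, pass tts.generate_stream("...") instead of fake_stream() and write the result with sf.write("output.wav", wav, sr).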

Silence-aware chunking

Split at natural pauses between words instead of fixed intervals. Produces complete words/phrases per chunk with crossfade overlap for click-free playback:

for chunk, sr in tts.generate_stream(
    "A long paragraph of text...",
    split_on_silence=True,   # yield at pauses between words
    min_chunk_frames=6,      # at least 0.5s per chunk
    max_chunk_frames=48,     # at most 4s per chunk
    silence_threshold=0.02,  # RMS energy threshold for silence
    crossfade_samples=1200,  # 50ms overlap for click-free joins
):
    sd.play(chunk, sr)
    sd.wait()

For near-zero gap between chunks, overlap generation and playback with a thread:

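import queue
import threading

import sounddevice as sd

audio_q = queue.Queue()

def producer():
    # Generate chunks in the background while the main thread plays audio.
    try:
        for chunk, sr in tts.generate_stream("Long text...",
                                              split_on_silence=True):
            audio_q.put((chunk, sr))
    finally:
        # Always send the sentinel so the consumer loop exits, even on error.
        audio_q.put(None)

threading.Thread(target=producer, daemon=True).start()

while True:
    item = audio_q.get()
    if item is None:
        break
    chunk, sr = item
    sd.play(chunk, sr)
    sd.wait()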
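
For intuition about the crossfade_samples setting, the sketch below joins two chunks with a linear crossfade. This is an illustration only: the source does not say which fade curve BabelVox uses, crossfade_join is a hypothetical helper, and constant signals stand in for real audio.

```python
import numpy as np

def crossfade_join(left, right, overlap):
    """Join two mono chunks, blending the last `overlap` samples of `left`
    with the first `overlap` samples of `right` using linear ramps."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return np.concatenate([left, right])
    fade_out = np.linspace(1.0, 0.0, overlap, dtype=np.float32)
    fade_in = 1.0 - fade_out
    blended = left[-overlap:] * fade_out + right[:overlap] * fade_in
    return np.concatenate([left[:-overlap], blended, right[overlap:]])

sr = 24000
left = np.ones(sr, dtype=np.float32)        # 1 s of constant signal
right = np.ones(sr, dtype=np.float32)
joined = crossfade_join(left, right, 1200)  # 1200 samples = 50 ms at 24 kHz
```

Because the two ramps sum to 1 at every sample, a signal that is continuous across the boundary passes through unchanged, while a discontinuity is smoothed over the 50 ms window instead of producing a click.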

Text preprocessing & SSML

BabelVox automatically normalizes text before synthesis, expanding abbreviations (Dr. → Doctor), numbers ($4.50 → four dollars and fifty cents), dates, times, and phone numbers into spoken form. Unicode punctuation is cleaned up and repeated punctuation is collapsed.

For fine-grained control, pass SSML markup:

ssml = """<speak>
  The price is <say-as interpret-as="number">$4.50</say-as>.
  <break time="500ms"/>
  Call <say-as interpret-as="telephone">555-123-4567</say-as> for details.
  <sub alias="World Wide Web Consortium">W3C</sub> approved.
</speak>"""
wav, sr = tts.generate(ssml, ssml=True)

The same from the command line:

babelvox --ssml --text '<speak>Hello.<break time="500ms"/>World.</speak>' -o out.wav

Supported SSML tags:

Tag Effect
<break time="500ms"/> Insert pause (maps to punctuation)
<break strength="strong"/> Insert pause by strength level
<sub alias="..."> Replace text with alias
<say-as interpret-as="number|date|time|telephone|spell-out"> Normalize to spoken form
<emphasis>, <prosody>, <phoneme> Parsed (effects coming in v0.11.0)

Onomatopoeia (boom, crash, sizzle, etc.) is also detected and annotated for future prosody control.
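
As a rough picture of the punctuation cleanup step, the standard-library sketch below maps a few Unicode quotes to ASCII and collapses repeated marks. The character map and the regular expression are assumptions chosen for illustration; BabelVox's actual normalization rules are not documented here and may differ.

```python
import re

# Assumed mapping of a few common Unicode punctuation marks to ASCII.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})

def tidy_punctuation(text):
    """Map Unicode quotes to ASCII, collapse repeated marks and whitespace."""
    text = text.translate(_PUNCT_MAP)
    # Collapse runs of the same mark: "!!!" becomes "!", "??" becomes "?".
    text = re.sub(r"([!?,;:])\1+", r"\1", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

cleaned = tidy_punctuation("\u201cWait\u201d!!!  Really??")
```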

10 languages

Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish.

wav, sr = tts.generate("Bonjour le monde.", language="French")
wav, sr = tts.generate("Hallo Welt.", language="German")

Sampling controls

Fine-tune the generation quality and diversity:

wav, sr = tts.generate(
    "Hello world",
    temperature=0.9,          # higher = more expressive, lower = more stable
    top_k=50,                 # limit sampling to top-k tokens
    top_p=1.0,                # nucleus sampling threshold
    repetition_penalty=1.05,  # discourage repeated audio patterns
    max_new_tokens=512,       # max generation steps (1 step = 1/12 sec audio)
)
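
To show how temperature, top_k, and top_p typically interact, here is a generic NumPy sketch of one sampling step over a logits vector. It illustrates the standard technique rather than BabelVox's own sampler: sample_token is a hypothetical function, repetition_penalty is left out, and the order of the filters is an assumption.

```python
import numpy as np

def sample_token(logits, temperature=0.9, top_k=50, top_p=1.0, rng=None):
    """Pick one token id from a 1-D logits vector."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(logits))           # greedy decoding
    if rng is None:
        rng = np.random.default_rng()
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    if top_p < 1.0:
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs = probs * mask
        probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))

rng = np.random.default_rng(0)
token = sample_token(np.array([0.1, 5.0, 0.2, 0.3]), top_k=2, rng=rng)
```

Lowering temperature sharpens the distribution toward the most likely token, while top_k and top_p trim the tail before sampling, which is why lower values tend to sound more stable and higher values more varied.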

HTTP API (cross-language integration)

Run BabelVox as an HTTP server so any language (JavaScript, Go, Rust, etc.) can call it:

babelvox --serve --int8 --cp-kv-cache --port 8765

Then from any client:

curl -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "English"}' \
  -o hello.wav
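
From Python, the same request can be made with only the standard library. This is a sketch under the assumptions already shown above (server on localhost:8765, JSON body in, WAV bytes out); build_tts_request and synthesize_to_file are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request

def build_tts_request(text, language="English", base_url="http://localhost:8765"):
    """Build the POST /tts request without sending it."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize_to_file(text, path, **kwargs):
    """Send the request and save the returned WAV bytes to `path`."""
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        with open(path, "wb") as f:
            f.write(resp.read())

req = build_tts_request("Hello world")
# synthesize_to_file("Hello world", "hello.wav")  # requires a running server
```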

From JavaScript:

const res = await fetch("http://localhost:8765/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello world" }),
});
const blob = await res.blob();
const audio = new Audio(URL.createObjectURL(blob));
audio.play();

Endpoints:

Method | Path | Description
POST | /tts | Synthesize speech: JSON body in, WAV bytes out
GET | /tts/stream | Streaming SSE: audio chunks as base64 events
GET | /health | Health check: returns {"status": "ok"}

POST /tts request body:

Field | Required | Default | Description
text | yes | -- | Text to synthesize
language | no | "English" | One of the 10 supported languages
ref_audio | no | null | Path to reference WAV for voice cloning
ref_text | no | null | Transcription of reference audio (improves cloning)
max_new_tokens | no | 512 | Max generation steps
temperature | no | 0.9 | Sampling temperature
top_k | no | 50 | Top-k sampling
top_p | no | 1.0 | Nucleus sampling threshold
repetition_penalty | no | 1.05 | Penalty for repeated tokens
ssml | no | false | Treat text as SSML markup

SSE streaming

Stream audio chunks in real-time via Server-Sent Events (no extra dependencies):

curl -N "http://localhost:8765/tts/stream?text=Hello+world&format=pcm_s16le"

Events: start (sample rate + format), audio (base64-encoded chunk), done (total duration), error.
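
A client has to split the stream into events before it can act on them. The sketch below is a small generic parser for the Server-Sent Events line format, using only the standard library. The sample payloads are invented for illustration: the source names the event types but does not specify the exact fields inside each data line, so treat those as assumptions.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                       # a blank line ends one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):         # comment or keep-alive line
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)

# Invented sample stream (payload contents are placeholders).
sample = [
    "event: start\n", 'data: {"sample_rate": 24000}\n', "\n",
    ": keep-alive\n",
    "event: audio\n", "data: AAAA\n", "\n",
    "event: done\n", "data: {}\n", "\n",
]
events = list(parse_sse(sample))
```

In a real client the lines would come from the HTTP response body, and each audio payload would be passed through base64.b64decode before playback.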

WebSocket streaming

For bidirectional real-time streaming with cancel support, install the ws extra:

pip install babelvox[ws]
babelvox --serve --int8 --cp-kv-cache --port 8765 --ws-port 8766

Connect and send JSON requests, receive binary audio chunks:

const ws = new WebSocket("ws://localhost:8766");
ws.onopen = () => ws.send(JSON.stringify({ text: "Hello world", format: "pcm_s16le" }));
ws.onmessage = (e) => {
  if (typeof e.data === "string") {
    const msg = JSON.parse(e.data);
    console.log(msg.event); // "start", "done", or "error"
  } else {
    // Binary audio chunk — play with Web Audio API
  }
};
// Cancel mid-stream (call only after the connection is open):
const cancel = () => ws.send(JSON.stringify({ event: "cancel" }));

Formats: pcm_s16le (raw 16-bit PCM, lowest latency) or wav_chunks (each chunk is a complete WAV file).
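
A Python client receiving pcm_s16le chunks needs to turn the raw bytes into samples before playing or saving them. The sketch below shows the usual conversion with NumPy; pcm_s16le_to_float32 is a hypothetical helper, and mono audio is assumed.

```python
import numpy as np

def pcm_s16le_to_float32(data):
    """Decode raw little-endian 16-bit PCM bytes to float32 samples in [-1, 1)."""
    samples = np.frombuffer(data, dtype="<i2")
    return samples.astype(np.float32) / 32768.0

# Four hand-built samples standing in for a chunk received from the server.
raw = np.array([0, 16384, -32768, 32767], dtype="<i2").tobytes()
audio = pcm_s16le_to_float32(raw)
```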

Pre-download models

Cache models ahead of time instead of on first use:

from babelvox import download_models
path = download_models()  # downloads ~2.5 GB to HuggingFace cache

API reference

BabelVox(model_path, export_dir, device, precision, ...)

Parameter | Default | Description
model_path | "Qwen/Qwen3-TTS-12Hz-0.6B-Base" | HuggingFace model (tokenizer only)
export_dir | None (auto-download) | Path to exported OpenVINO models
device | "CPU" | "CPU" or "NPU"
precision | "fp16" | "fp16", "int8", "int4", or "fp32"
use_cp_kv_cache | False | KV cache for code predictor (recommended)
talker_buckets | None | NPU bucket sizes, e.g. [64, 128, 256]
cache_dir | None | OpenVINO compiled model cache directory

tts.generate(text, language, ref_audio, ...) returns (waveform, sample_rate)

Parameter | Default | Description
text | required | Text to synthesize
language | "English" | One of the 10 supported languages
ref_audio | None | Path to reference WAV for voice cloning
ref_text | None | Transcription of the reference audio (improves cloning)
speaker_embed | None | Pre-extracted speaker embedding (numpy array)
max_new_tokens | 512 | Max generation steps (12 steps = 1 sec audio)
temperature | 0.9 | Sampling temperature (0 = greedy)
top_k | 50 | Top-k sampling
top_p | 1.0 | Nucleus sampling threshold
repetition_penalty | 1.05 | Penalty for repeated tokens
ssml | False | Treat text as SSML markup

tts.generate_stream(text, language, ..., chunk_frames=12) — same args as generate(), plus:

Parameter | Default | Description
chunk_frames | 12 | Codec frames per chunk when not using silence detection (12 = ~1 sec)
split_on_silence | False | Yield at natural pauses between words instead of fixed intervals
min_chunk_frames | 6 | Minimum frames before considering a silence split (~0.5s)
max_chunk_frames | 48 | Force yield after this many frames even without silence (~4s)
silence_threshold | 0.02 | RMS energy threshold for silence detection
crossfade_samples | 1200 | Overlap samples at chunk edges for click-free joins (50ms at 24kHz)

Yields (waveform_chunk, 24000) tuples as audio is generated.

tts.extract_speaker_embedding(audio_path) returns numpy array (1, 1024)

tts.default_speaker — set to a speaker embedding for consistent voice across calls

Exporting models yourself (optional)

The pre-built INT8 models are downloaded automatically. If you want to export from scratch (e.g., for a different quantization), the export scripts in tools/ require PyTorch:

pip install torch qwen-tts nncf
python tools/export_tts_lm.py
python tools/export_speaker_encoder.py
python tools/export_decoder.py
python tools/export_tokenizer_encoder.py
python tools/export_cp_kvcache.py
python tools/export_weights.py
python tools/quantize_models.py --int8

Performance

Optimization progression

Optimization | RTF | Per-step | Notes
FP16 NPU baseline | 3.0x | 246 ms | Full-recompute, padded to 256 tokens
+ INT8 quantization | 2.1x | 156 ms | NNCF INT8_SYM weight compression
+ CP KV cache | 1.4x | 106 ms | Eliminates redundant code predictor recomputation
+ Multi-bucket talker | 1.0x | ~80 ms | Dynamically picks smallest NPU shape per step

RTF = Real-Time Factor. RTF=1.0x means generating 1 second of audio takes 1 second of compute.

Where the time goes (INT8 + CP KV cache, 256-token bucket)

Component | Device | Time | Share
Talker (28-layer transformer) | NPU | 61 ms | 57%
Code predictor (15 groups) | CPU | 45 ms | 43%
Numpy overhead (embeddings, sampling) | CPU | <1 ms | <1%

Multi-bucket scaling

The talker scales linearly with sequence length on NPU. Pre-compiling at multiple sizes and routing to the smallest bucket that fits dramatically reduces wasted compute:

Bucket size | Talker time | Total (+ 45ms CP) | Effective RTF
64 | 15 ms | 60 ms | 0.72x
128 | 22 ms | 67 ms | 0.80x
192 | 31 ms | 76 ms | 0.91x
256 | 43 ms | 88 ms | 1.06x
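
The routing rule and the arithmetic behind the Effective RTF column can be sketched in a few lines. With 12 generation steps per second of audio, RTF is the per-step time in milliseconds times 12, divided by 1000. The timings below are copied from the table; pick_bucket and effective_rtf are hypothetical helpers, and raising an error when the sequence exceeds the largest bucket is an assumption, not documented BabelVox behavior.

```python
import bisect

FRAME_RATE_HZ = 12    # generation steps (codec frames) per second of audio
CP_MS = 45            # code predictor time per step, from the table
TALKER_MS = {64: 15, 128: 22, 192: 31, 256: 43}   # talker time per bucket, from the table

def pick_bucket(seq_len, buckets):
    """Return the smallest bucket size that fits seq_len."""
    ordered = sorted(buckets)
    i = bisect.bisect_left(ordered, seq_len)
    if i == len(ordered):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket {ordered[-1]}")
    return ordered[i]

def effective_rtf(step_ms):
    """Seconds of compute per second of audio for a given per-step time."""
    return step_ms * FRAME_RATE_HZ / 1000.0

bucket = pick_bucket(100, TALKER_MS)             # a 100-token sequence needs the 128 bucket
rtf = effective_rtf(TALKER_MS[bucket] + CP_MS)   # (22 + 45) ms * 12 / 1000 = 0.804
```

Early steps, when the sequence is short, land in the small buckets and run faster than real time, which offsets the later steps that need the 256 bucket.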

Hardware tested

  • CPU: Intel Core Ultra 7 258V (Lunar Lake)
  • NPU: Intel AI Boost (~13 TOPS)
  • RAM: 32 GB LPDDR5x
  • Device: Samsung Galaxy Book5 Pro

Architecture

Qwen3-TTS uses 5 model components orchestrated in an autoregressive loop:

Text --> Tokenizer --> Text Embeddings --> Talker (28L transformer) --> Codec code_0
                                               |
                       Speaker Embedding ------+    code_0 --> Code Predictor (5L) --> codes 1-15
                       (from reference audio)            \--> repeat 15x with KV cache
                                                                     |
                                           All 16 codes --> Tokenizer Decoder --> Waveform

Component | Layers | Hidden | Heads | Device | INT8 size
Talker | 28 | 1024 | 16Q/8KV | NPU | 444 MB
Code predictor | 5 | 1024 | 16Q/8KV | CPU | 79 MB
Tokenizer encoder | -- | -- | -- | NPU | 48 MB
Tokenizer decoder | -- | -- | -- | NPU | 114 MB
Speaker encoder | -- | -- | -- | NPU | 9 MB

CLI reference

Flag | Default | Description
--device | CPU | CPU or NPU
--int8 | off | Use INT8 quantized models
--precision | fp16 | fp16, int8, int4, or fp32
--cp-kv-cache | off | KV cache for code predictor (recommended)
--talker-buckets | none | Comma-separated NPU bucket sizes (e.g. 64,128,256)
--kv-cache | off | KV cache for talker (not recommended on NPU)
--cache-dir | none | OpenVINO compiled model cache directory
--max-tokens | 200 | Maximum generation steps
--max-talker-seq | 256 | Fixed talker padding (when not using buckets)
--max-decoder-frames | 256 | Max codec frames for audio decoder
--max-kv-len | 256 | KV cache buffer size (if --kv-cache)
--ssml | off | Treat --text as SSML markup
--text | demo text | Text to synthesize
--language | English | Language for synthesis
--ref-audio | none | Reference audio for voice cloning
--ref-text | none | Transcription of reference audio (improves cloning)
--speaker | none | Use a named speaker profile
--speaker-dir | ~/.babelvox/speakers | Speaker library directory
--save-speaker | none | Save speaker from --ref-audio as named profile
--list-speakers | off | List saved speaker profiles and exit
--serve | off | Start HTTP server instead of generating once
--host | 0.0.0.0 | Server bind address
--port | 8765 | Server port
--ws-port | none | WebSocket server port (requires babelvox[ws])
--output / -o | output.wav | Output WAV file path
--export-dir | auto-download | Directory with exported models (downloads from HuggingFace if not set)
--model-path | Qwen/Qwen3-TTS-12Hz-0.6B-Base | HuggingFace model (tokenizer)

Acknowledgments

Based on Qwen3-TTS by Alibaba Qwen Team (Apache-2.0).

License

Apache-2.0
