Full-Stream Zero-shot TTS model with Extremely Low Latency and Speaking-rate Control

VoXtream2: Full-stream TTS with dynamic speaking rate control

We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control: the target rate can be changed mid-utterance while speech is being generated.

For audio examples, see our demo page.

Try VoXtream2 in your browser on our Hugging Face 🤗 Space.

Key features

Dynamic speed control: Distribution matching and classifier-free guidance enable fine-grained speaking-rate control that can be adjusted while the model generates speech.
Streaming performance: Runs 4x faster than real time and achieves 74 ms first-packet latency in full-stream mode on a consumer GPU.
Translingual capability: Prompt text masking allows acoustic prompts in any language.

Updates

2026/04/30:
- Added an interactive demo of dynamic speaking-rate control. You can now adjust the speaking rate in real time while the model produces speech. Run voxtream-app locally after installing the package, or try our HuggingFace space.
- Added a cache reset in SynkAttention. This fixes a bug where the model generated noise because it relied on an invalid prompt cache.
2026/04/08: Added a frame repeat counter, which reduces hallucinations caused by the model getting stuck on the same frame. It is controlled by the frame_repeat_counter parameter in SpeechGeneratorConfig. The recommended range is 12-25; use lower values for stricter control.
2026/03: We released VoXtream2.
2026/01: VoXtream is accepted for an oral presentation at ICASSP 2026.
2025/09: We released VoXtream. It is now available on the voxtream branch.
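To make the frame repeat counter from the 2026/04/08 update concrete, here is a toy sketch of the general idea: count consecutive identical frames and stop once the count reaches a limit. This is an assumption about the mechanism, not VoXtream's implementation; `stop_on_repeats` and its arguments are hypothetical, and frames are represented as plain tuples of token IDs.

```python
from typing import Iterable, Iterator, Tuple

Frame = Tuple[int, ...]


def stop_on_repeats(frames: Iterable[Frame], max_repeats: int) -> Iterator[Frame]:
    """Yield frames until the same frame occurs max_repeats times in a row.

    Hypothetical illustration only; the occurrence that reaches the limit
    is not yielded.
    """
    previous = None
    run_length = 0
    for frame in frames:
        # Extend the current run if the frame repeats, otherwise start a new run.
        run_length = run_length + 1 if frame == previous else 1
        if run_length >= max_repeats:
            return  # treat a long run of identical frames as a stuck model
        previous = frame
        yield frame


stuck = [(1,), (2,), (2,), (2,), (2,), (3,)]
print(list(stop_on_repeats(stuck, max_repeats=3)))  # [(1,), (2,), (2,)]
```

In this sketch a lower limit cuts the stream sooner, which matches the note that lower values give stricter control.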

Installation

eSpeak NG phonemizer

# For Debian-like distributions (e.g. Ubuntu, Mint)
apt-get install espeak-ng
# For RedHat-like distributions (e.g. CentOS, Fedora)
yum install espeak-ng
# For macOS
brew install espeak-ng

Pip package

pip install "voxtream>=0.2.2"
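Before running the model, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the package: `check_install` is a hypothetical helper that looks for the `espeak-ng` binary on `PATH` and reads the installed `voxtream` version from the package metadata.

```python
import shutil
from importlib import metadata


def check_install(binary: str = "espeak-ng", package: str = "voxtream") -> dict:
    """Report whether the phonemizer binary and the pip package are visible.

    Hypothetical helper for illustration; it is not part of voxtream.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return {
        "espeak_ng_path": shutil.which(binary),  # None if the binary is not on PATH
        "voxtream_version": version,
    }


if __name__ == "__main__":
    print(check_install())
```

A `None` in either field points to the installation step that still needs to be run.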

Usage

Prompt audio: a file containing 3-10 seconds of the target voice. The maximum supported length is 20 seconds (longer audio will be trimmed).
Text: What you want the model to say. The maximum supported length is 1000 characters (longer text will be trimmed).
Speaking rate (optional): target speaking rate in syllables per second.
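The limits above can be checked before calling the model. The sketch below is an illustration only: `prepare_text`, `prompt_duration_seconds`, and `prompt_status` are hypothetical helpers, the constants mirror the limits stated above, and the duration check uses the standard-library `wave` module, so it assumes an uncompressed PCM WAV prompt.

```python
import wave

MAX_TEXT_CHARS = 1000      # longer text is trimmed, per the limits above
MAX_PROMPT_SECONDS = 20.0  # longer prompt audio is trimmed
RECOMMENDED_PROMPT_SECONDS = (3.0, 10.0)


def prepare_text(text: str, max_chars: int = MAX_TEXT_CHARS) -> str:
    """Trim text to the supported length (hypothetical helper)."""
    return text[:max_chars]


def prompt_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def prompt_status(duration: float) -> str:
    """Classify a prompt duration against the documented limits."""
    low, high = RECOMMENDED_PROMPT_SECONDS
    if duration > MAX_PROMPT_SECONDS:
        return "will be trimmed"
    if low <= duration <= high:
        return "recommended"
    return "outside recommended range"


print(prompt_status(5.0))   # recommended
print(prompt_status(25.0))  # will be trimmed
```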

Notes:

The model was tested on Ubuntu 22.04, CUDA 12, and PyTorch 2.4.
The model requires 2.2 GB of VRAM (enabling speech enhancement adds 2 GB).
The maximum generation length is 1 minute.
The first run may take longer because it downloads the model weights and warms up the model graph.
If you experience problems with CUDA Graphs, please check this issue.

Command line

Output streaming

voxtream \
    --prompt-audio assets/audio/english_male.wav \
    --text "In general, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"

Full streaming (slow speech, 2 syllables per second)

voxtream \
    --prompt-audio assets/audio/english_female.wav \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream_2sps.wav" \
    --full-stream \
    --spk-rate 2.0

Acoustic prompt enhancement

voxtream \
    --prompt-audio assets/audio/english_male.wav \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_enhanced.wav" \
    --prompt-enhancement

Python API

import json
from itertools import repeat
from pathlib import Path

import numpy as np
import soundfile as sf

from voxtream.utils.generator import (
    set_seed,
    text_generator,
)
from voxtream.generator import SpeechGenerator, SpeechGeneratorConfig


# Seed the random number generators
set_seed()

# Load the generator and speaking-rate configurations
with open('configs/generator.json') as f:
    config = SpeechGeneratorConfig(**json.load(f))

with open('configs/speaking_rate.json') as f:
    spk_rate_config = json.load(f)

speech_generator = SpeechGenerator(config, spk_rate_config)

# Output streaming, no speaking rate control
speech_stream = speech_generator.generate_stream(
    prompt_audio_path=Path('assets/audio/english_male.wav'),
    text="In general, however, some method is then needed to evaluate each approximation.",
)

audio_frames = [audio_frame for audio_frame, _ in speech_stream]
sf.write('output_stream.wav', np.concatenate(audio_frames), config.mimi_sr)

# Full streaming & fixed speaking rate control (2 syllables per second)
speech_stream = speech_generator.generate_stream(
    prompt_audio_path=Path('assets/audio/english_female.wav'),
    text=text_generator("Staff do not always do enough to prevent violence."),
    speaking_rate=repeat(2.0),
)

audio_frames = [audio_frame for audio_frame, _ in speech_stream]
sf.write('full_stream_2sps.wav', np.concatenate(audio_frames), config.mimi_sr)
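The example above passes `repeat(2.0)`, an endless iterator, as `speaking_rate`. A rate that changes mid-utterance can be expressed in the same way. The sketch below builds such an iterator with the standard library only; `rate_schedule` is a hypothetical helper, and it assumes that `generate_stream` draws one value per generation step, which the text above does not confirm.

```python
from itertools import chain, islice, repeat
from typing import Iterator, Sequence, Tuple


def rate_schedule(
    segments: Sequence[Tuple[float, int]], final_rate: float
) -> Iterator[float]:
    """Return an iterator over each (rate, steps) segment, then final_rate forever."""
    finite = chain.from_iterable(repeat(rate, steps) for rate, steps in segments)
    return chain(finite, repeat(final_rate))


# Slow start (2 syllables per second) for 3 steps, then 4 syllables per second.
schedule = rate_schedule([(2.0, 3)], final_rate=4.0)
print(list(islice(schedule, 5)))  # [2.0, 2.0, 2.0, 4.0, 4.0]
```

Under that assumption, the iterator could be passed as `speaking_rate=schedule` in place of `repeat(2.0)`.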

Gradio demo

To start the Gradio web demo, run:

voxtream-app

Websocket

To start a WebSocket server, run:

voxtream-server

To send a request to the server, run:

python voxtream/client.py

The client sends the path to the audio prompt and the text to the server, and plays audio from the output stream as soon as it arrives.

Evaluation

To reproduce the evaluation metrics from the paper, see the evaluation section.

Training

Build the Docker container. If you have a different version of Docker Compose installed, use docker compose -f ... instead.

docker-compose -f .devcontainer/docker-compose.yaml build voxtream

Run training with the train.py script. Specify the GPU IDs that should be visible inside the container, e.g. GPU_IDS=0,1, and choose a batch size that fits your GPU: the default is 64 (tested on an H200), and a batch size of 12 fits on an RTX3090. The dataset (80 GB) is downloaded automatically to the HF cache directory. It is loaded into RAM during training, so make sure you can allocate ~80 GB of RAM per GPU. Results are stored in the ./experiments directory.

Example of running training on 2 GPUs with batch size 12:

GPU_IDS=0,1 docker-compose -f .devcontainer/docker-compose.yaml run voxtream python voxtream/train.py batch_size=12
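When launching several runs, the command above can be assembled programmatically. The sketch below only builds the environment and the argument list; `build_train_command` is a hypothetical helper, and the compose file path, service name, and `batch_size=` override are copied from the command above.

```python
import os
from typing import Dict, List, Sequence, Tuple

COMPOSE_FILE = ".devcontainer/docker-compose.yaml"


def build_train_command(
    gpu_ids: Sequence[int], batch_size: int
) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment, argv) for a training run inside the container."""
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    # Copy the current environment and add the GPU_IDS variable to it.
    env = dict(os.environ, GPU_IDS=",".join(str(i) for i in gpu_ids))
    argv = [
        "docker-compose", "-f", COMPOSE_FILE, "run", "voxtream",
        "python", "voxtream/train.py", f"batch_size={batch_size}",
    ]
    return env, argv


env, argv = build_train_command([0, 1], batch_size=12)
print("GPU_IDS=" + env["GPU_IDS"], " ".join(argv))
```

Passing the result to `subprocess.run(argv, env=env, check=True)` would start the run.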

Custom dataset

To prepare a custom training dataset, see the dataset section.

Benchmark

To measure the model's real-time factor (RTF) and first-packet latency (FPL), run voxtream-benchmark. You can compile the model for faster inference with the --compile flag (note that the initial compilation takes some time).

Device	Compiled	FPL, ms	RTF
RTX3090	No	74	0.256
RTX3090	Yes	63	0.173
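For reference, the two metrics in the table can be computed from any stream of audio frames. The sketch below is not the `voxtream-benchmark` implementation: `measure_stream` is a hypothetical helper that takes FPL as the time until the first frame arrives and RTF as the total wall-clock time divided by the duration of the generated audio. The sample rate and frame size in the usage example are placeholder values, and `fake_stream` stands in for a real generator.

```python
import time
from typing import Iterable, Iterator, List, Sequence, Tuple


def measure_stream(
    frames: Iterable[Sequence[float]], sample_rate: int
) -> Tuple[float, float]:
    """Return (first-packet latency in ms, real-time factor) for a frame stream."""
    start = time.perf_counter()
    first_frame_time = None
    total_samples = 0
    for frame in frames:
        if first_frame_time is None:
            first_frame_time = time.perf_counter()
        total_samples += len(frame)
    elapsed = time.perf_counter() - start
    if first_frame_time is None or total_samples == 0:
        raise ValueError("the stream produced no audio")
    fpl_ms = (first_frame_time - start) * 1000.0
    rtf = elapsed / (total_samples / sample_rate)
    return fpl_ms, rtf


def speedup(rtf: float) -> float:
    """Convert a real-time factor into a 'times faster than real time' figure."""
    return 1.0 / rtf


def fake_stream(n_frames: int, frame_size: int, delay: float) -> Iterator[List[float]]:
    """Stand-in generator that sleeps before emitting each silent frame."""
    for _ in range(n_frames):
        time.sleep(delay)
        yield [0.0] * frame_size


fpl_ms, rtf = measure_stream(fake_stream(5, 1920, 0.01), sample_rate=24000)
print(f"FPL: {fpl_ms:.1f} ms, RTF: {rtf:.3f}, speed-up: {speedup(rtf):.1f}x")
```

With the figures from the table, `speedup(0.256)` is about 3.9 and `speedup(0.173)` is about 5.8.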

TODO

Add finetuning instructions

License

The code in this repository is provided under the MIT License.

The Depth Transformer component from SesameAI-CSM is included under the Apache 2.0 License (see LICENSE-APACHE and NOTICE).

The model weights were trained on data licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). Redistribution of the weights must include proper attribution to the original dataset creators (see ATTRIBUTION.md).

Acknowledgements

Mimi: Streaming audio codec from Kyutai
CSM: Conversational speech model from Sesame
ReDimNet: Speaker recognition model from IDR&D
Sidon: Speech enhancement model from SaruLab

Citation

@inproceedings{torgashov2026voxtream,
  title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  note={to appear},
  url={https://arxiv.org/abs/2509.15969}
}

@article{torgashov2026voxtream2,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
  journal   = {arXiv:2603.13518},
  year      = {2026}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

Release history

Version	Release date
0.2.3 (this version)	Apr 30, 2026
0.2.2	Apr 30, 2026
0.2.1	Apr 8, 2026
0.2.0	Mar 17, 2026
0.1.5	Oct 4, 2025
0.1.4	Sep 27, 2025
0.1.3	Sep 26, 2025
0.1.2	Sep 26, 2025
0.1.0	Sep 25, 2025

