Full-Stream Zero-shot TTS model with Extremely Low Latency

VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

Key features

  • Streaming: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
  • Speed: Runs 5x faster than real time and achieves a first-packet latency of 102 ms on GPU (A100 with the compiled model; see Benchmark).
  • Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models and of models trained on larger datasets.
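
To make the 80 ms chunk size concrete, the sketch below converts a chunk duration into a sample count. The 24 kHz sample rate is an assumption (it is the rate commonly associated with the Mimi codec that config.mimi_sr in the Python API seems to refer to); substitute the value from your own config.

```python
def samples_per_chunk(chunk_ms: float, sample_rate_hz: int) -> int:
    """Number of audio samples in one chunk of the given duration."""
    return round(chunk_ms * sample_rate_hz / 1000)


# Assumed sample rate; read config.mimi_sr for the real value.
ASSUMED_SAMPLE_RATE_HZ = 24_000

print(samples_per_chunk(80, ASSUMED_SAMPLE_RATE_HZ))  # 1920 samples per chunk
print(1000 / 80)  # 12.5 chunks per second
```

Under that assumption, each streamed chunk carries 1920 samples and the model emits 12.5 chunks per second of speech.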

Try VoXtream ⚡ in your browser on HuggingFace 🤗 spaces.

Installation

pip install voxtream

Usage

Command line

Output streaming

voxtream \
    --prompt-audio assets/audio/male.wav \
    --prompt-text "The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla." \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"
  • Note: VoXtream requires around 2 GB of VRAM. The first run may take additional time while the model weights are downloaded.

Full streaming

voxtream \
    --prompt-audio assets/audio/female.wav \
    --prompt-text "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her." \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream.wav" \
    --full-stream

Python API

import json
from pathlib import Path

import numpy as np
import soundfile as sf

from voxtream.utils.generator import set_seed, text_generator
from voxtream.generator import SpeechGenerator, SpeechGeneratorConfig


set_seed()
with open('configs/generator.json') as f:
    config = SpeechGeneratorConfig(**json.load(f))

speech_generator = SpeechGenerator(config)

# Output streaming
speech_stream = speech_generator.generate_stream(
    prompt_text="The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla.",
    prompt_audio_path=Path('assets/audio/male.wav'),
    text="In general, however, some method is then needed to evaluate each approximation."
)

audio_frames = [audio_frame for audio_frame, _ in speech_stream]
sf.write('output_stream.wav', np.concatenate(audio_frames), config.mimi_sr)

# Full streaming
speech_stream = speech_generator.generate_stream(
    prompt_text="Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her.",
    prompt_audio_path=Path('assets/audio/female.wav'),
    text=text_generator("Staff do not always do enough to prevent violence.")
)

audio_frames = [audio_frame for audio_frame, _ in speech_stream]
sf.write('full_stream.wav', np.concatenate(audio_frames), config.mimi_sr)
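
The examples above collect every frame before writing the file. When the point of streaming is low latency, it can be useful to time the arrival of the first frame. The helper below is a minimal sketch that works on any iterator of (audio_frame, extra) pairs, which is the shape the loops above unpack. The names collect_with_latency and fake_stream are hypothetical; fake_stream only stands in for generate_stream so the snippet runs without the model.

```python
import time
from typing import Any, Iterable, Iterator, List, Optional, Tuple


def collect_with_latency(
    stream: Iterable[Tuple[Any, Any]],
) -> Tuple[List[Any], Optional[float]]:
    """Drain a stream of (audio_frame, extra) pairs.

    Returns the frames and the seconds elapsed until the first frame
    arrived (None if the stream produced nothing).
    """
    start = time.perf_counter()
    first_frame_latency = None
    frames = []
    for audio_frame, _ in stream:
        if first_frame_latency is None:
            first_frame_latency = time.perf_counter() - start
        frames.append(audio_frame)
    return frames, first_frame_latency


def fake_stream(n_frames: int, frame_len: int = 1920) -> Iterator[Tuple[List[float], None]]:
    """Hypothetical stand-in for generate_stream: yields silent frames."""
    for _ in range(n_frames):
        time.sleep(0.01)  # pretend the model needs 10 ms per frame
        yield [0.0] * frame_len, None


frames, latency = collect_with_latency(fake_stream(5))
print(len(frames), f"frames, first one after {latency * 1000:.1f} ms")
```

With the real generator, passing speech_stream instead of fake_stream(5) should give frames that can be concatenated with np.concatenate as in the examples above. The measured value includes prompt processing, so it is not directly comparable to the benchmark figures.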

Gradio demo

voxtream-app

Training

  • Build the Docker container. If you have a different version of Docker Compose installed, use docker compose -f ... instead.
docker-compose -f .devcontainer/docker-compose.yaml build voxtream
  • Run training using the train.py script. Specify the GPU IDs that should be visible inside the container, e.g. GPU_IDS=0,1. Choose the batch size according to your GPU: the default is 32 (tested on an RTX3090), 64 fits on an A100 40GB, and 128 fits on an A100 80GB. The dataset (20 GB) is downloaded automatically to the HF cache directory. It is loaded into RAM during training, so make sure you can allocate about 20 GB of RAM per GPU. Results are stored in the ./experiments directory.

Example of running the training using 2 GPUs with batch size 32:

GPU_IDS=0,1 docker-compose -f .devcontainer/docker-compose.yaml run voxtream python voxtream/train.py batch_size=32
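
The batch-size and memory guidance above can be captured in a small helper when scripting launches. This is only a sketch: the function names and the memory thresholds are assumptions for illustration and are not part of the VoXtream codebase.

```python
def suggest_batch_size(vram_gb: float) -> int:
    """Map GPU memory to the batch sizes quoted in the training notes."""
    if vram_gb >= 80:
        return 128  # A100 80GB
    if vram_gb >= 40:
        return 64   # A100 40GB
    return 32       # default, tested on an RTX3090 (24 GB)


def host_ram_needed_gb(num_gpus: int, per_gpu_gb: float = 20.0) -> float:
    """Approximate host RAM needed if the dataset is held once per GPU."""
    return num_gpus * per_gpu_gb


gpu_ids = "0,1"  # same format as the GPU_IDS variable
num_gpus = len(gpu_ids.split(","))
print(suggest_batch_size(24))        # 32
print(host_ram_needed_gb(num_gpus))  # 40.0
```

For the two-GPU example above, this suggests budgeting roughly 40 GB of host RAM.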

Benchmark

To evaluate the model's real-time factor (RTF) and first-packet latency (FPL), run voxtream-benchmark. You can compile the model for faster inference with the --compile flag (note that the initial compilation takes some time).

Device    Compiled   FPL, ms   RTF
A100      no         176       1.00
A100      yes        102       0.17
RTX3090   no         205       1.19
RTX3090   yes        123       0.19
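
When reading the table, note that RTF is conventionally the time spent generating divided by the duration of the audio produced, so values below 1 mean faster than real time and 1/RTF gives the speed-up. The snippet below applies that conventional definition to the figures in the table; it is assumed, not confirmed here, that voxtream-benchmark computes RTF the same way.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Conventional RTF: generation time divided by audio duration."""
    return processing_s / audio_s


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# (device, compiled) -> RTF, copied from the benchmark table.
benchmark_rtf = {
    ("A100", False): 1.00,
    ("A100", True): 0.17,
    ("RTX3090", False): 1.19,
    ("RTX3090", True): 0.19,
}

for (device, compiled), rtf in benchmark_rtf.items():
    label = "compiled" if compiled else "eager"
    print(f"{device:8s} {label:9s} {speedup(rtf):.2f}x real time")
```

By this definition the compiled model runs at more than 5x real time on both GPUs, while the uncompiled model on the RTX3090 is slightly slower than real time.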

TODO

  • Add a neural phoneme aligner. Remove MFA dependency
  • Add PyPI package
  • Gradio demo
  • HuggingFace Spaces demo
  • Evaluation scripts

License

The code in this repository is provided under the MIT License.

The Depth Transformer component from SesameAI-CSM is included under the Apache 2.0 License (see LICENSE-APACHE and NOTICE).

The model weights were trained on data licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). Redistribution of the weights must include proper attribution to the original dataset creators (see ATTRIBUTION.md).

Acknowledgements

Citation

@article{torgashov2025voxtream,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal   = {arXiv:2509.15969},
  year      = {2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
