Full-Stream Zero-shot TTS model with Extremely Low Latency

VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

Key features

  • Streaming: Supports a full-stream scenario in which the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
  • Speed: Runs 5x faster than real time and achieves a first-packet latency of 102 ms on GPU.
  • Quality and efficiency: Trained on only 9k hours of data, it matches or surpasses the quality and intelligibility of larger models and of models trained on large datasets.
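To make the streaming contract concrete, here is a minimal sketch of the two sides of a full-stream session: text arriving word by word, and audio leaving in fixed 80 ms chunks. It does not call the model. The helper names `word_stream` and `chunks_needed` are hypothetical and are not part of the voxtream package.

```python
import math
from typing import Iterator

# Chunk duration taken from the feature list above.
CHUNK_MS = 80


def word_stream(text: str) -> Iterator[str]:
    """Yield the text one word at a time, the way a full-stream
    client would receive it (hypothetical helper, not a voxtream API)."""
    for word in text.split():
        yield word


def chunks_needed(audio_ms: float, chunk_ms: int = CHUNK_MS) -> int:
    """Number of fixed-size chunks required to cover audio_ms of speech."""
    return math.ceil(audio_ms / chunk_ms)


if __name__ == "__main__":
    for word in word_stream("Staff do not always do enough to prevent violence."):
        print(word)
    # One second of speech spans 12.5 chunks, so 13 chunks are emitted.
    print(chunks_needed(1000))
```

At 80 ms per chunk, one second of speech corresponds to 12.5 chunks, so a listener hears audio in small increments rather than waiting for the whole sentence.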

Installation

pip install voxtream

Usage

Output streaming

voxtream \
    --prompt-audio assets/audio/male.wav \
    --prompt-text "The liquor was first created as 'Brandy Milk', produced with milk, brandy and vanilla." \
    --text "In general, however, some method is then needed to evaluate each approximation." \
    --output "output_stream.wav"
  • Note: The initial run may take additional time while the model weights are downloaded.

Full streaming

voxtream \
    --prompt-audio assets/audio/female.wav \
    --prompt-text "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her." \
    --text "Staff do not always do enough to prevent violence." \
    --output "full_stream.wav" \
    --full-stream
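If you want to drive the CLI from a script, one option is to build the argument list in Python and pass it to `subprocess`. The sketch below uses only the flags shown in the commands above. The function names `build_voxtream_command` and `synthesize` are hypothetical wrappers, not part of the package, and it assumes the `voxtream` executable is on your PATH.

```python
import subprocess
from typing import List


def build_voxtream_command(
    prompt_audio: str,
    prompt_text: str,
    text: str,
    output: str,
    full_stream: bool = False,
) -> List[str]:
    """Assemble the voxtream CLI call using only the documented flags."""
    cmd = [
        "voxtream",
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "--text", text,
        "--output", output,
    ]
    if full_stream:
        # Switch from output streaming to full streaming.
        cmd.append("--full-stream")
    return cmd


def synthesize(*args, **kwargs) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_voxtream_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    synthesize(
        "assets/audio/female.wav",
        "Betty Cooper helps Archie with cleaning a store room, when Reggie attacks her.",
        "Staff do not always do enough to prevent violence.",
        "full_stream.wav",
        full_stream=True,
    )
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain apostrophes or quotation marks.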

Training

  • Build the Docker container. If you have a different version of Docker Compose installed, use docker compose -f ... instead.
docker-compose -f .devcontainer/docker-compose.yaml build voxtream
  • Run training with the train.py script. Specify the GPU IDs that should be visible inside the container, e.g. GPU_IDS=0,1. Choose the batch size according to your GPU: the default batch size is 32 (tested on an RTX3090), 64 fits on an A100-40GB, and 128 fits on an A100-80GB. The dataset (20 GB) is downloaded automatically to the HF cache directory. The data is loaded into RAM during training, so make sure you can allocate ~20 GB of RAM per GPU. Results are stored in the ./experiments directory.

Example of running the training using 2 GPUs with batch size 32:

GPU_IDS=0,1 docker-compose -f .devcontainer/docker-compose.yaml run voxtream python voxtream/train.py batch_size=32
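The sizing guidance above can be captured in a small helper. This is only a sketch: `suggest_batch_size`, `ram_needed_gb`, and `training_command` are hypothetical names, and the memory thresholds simply restate the three tested configurations (the RTX3090 has 24 GB of memory). GPUs with less memory than an RTX3090 are not covered by the source, so the default of 32 may not fit on them.

```python
from typing import Sequence

COMPOSE_FILE = ".devcontainer/docker-compose.yaml"
RAM_PER_GPU_GB = 20  # approximate figure from the training notes


def suggest_batch_size(gpu_memory_gb: float) -> int:
    """Map GPU memory to the batch sizes listed in the training notes."""
    if gpu_memory_gb >= 80:
        return 128  # A100-80GB
    if gpu_memory_gb >= 40:
        return 64   # A100-40GB
    return 32       # default, tested on RTX3090; smaller GPUs are untested


def ram_needed_gb(num_gpus: int) -> int:
    """Approximate host RAM needed, at ~20 GB per GPU."""
    return RAM_PER_GPU_GB * num_gpus


def training_command(gpu_ids: Sequence[int], batch_size: int) -> str:
    """Format the docker-compose training command shown above."""
    ids = ",".join(str(i) for i in gpu_ids)
    return (
        f"GPU_IDS={ids} docker-compose -f {COMPOSE_FILE} run voxtream "
        f"python voxtream/train.py batch_size={batch_size}"
    )


if __name__ == "__main__":
    gpus = [0, 1]
    print(training_command(gpus, suggest_batch_size(24)))
    print(f"Host RAM needed: ~{ram_needed_gb(len(gpus))} GB")
```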

Benchmark

To evaluate the model's real-time factor (RTF) and first-packet latency (FPL), run voxtream-benchmark. You can compile the model for faster inference with the --compile flag (note that the initial compilation takes some time).

Device    Compiled   FPL, ms   RTF
A100      no         176       1.00
A100      yes        102       0.17
RTX3090   no         205       1.19
RTX3090   yes        123       0.19
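For reading these numbers, the sketch below uses the common definition of RTF as processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. That definition is an assumption here: the benchmark script's exact measurement procedure is not described in this document. The function names are hypothetical.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    return processing_s / audio_s


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# (device, compiled, FPL in ms, RTF), copied from the benchmark results above.
RESULTS = [
    ("A100", False, 176, 1.00),
    ("A100", True, 102, 0.17),
    ("RTX3090", False, 205, 1.19),
    ("RTX3090", True, 123, 0.19),
]

if __name__ == "__main__":
    for device, compiled, fpl_ms, rtf in RESULTS:
        mode = "compiled" if compiled else "eager"
        print(f"{device:8s} {mode:9s} FPL={fpl_ms} ms  {speedup(rtf):.2f}x real time")
```

Under this definition, the compiled A100 result (RTF 0.17) corresponds to roughly 5.9x real time, which is consistent with the "5x faster than real time" claim in the feature list.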

TODO

  • Add a neural phoneme aligner. Remove MFA dependency
  • Add PyPI package
  • Gradio demo
  • HuggingFace Spaces demo
  • Evaluation scripts

License

The code in this repository is provided under the MIT License.

The Depth Transformer component from SesameAI-CSM is included under the Apache 2.0 License (see LICENSE-APACHE and NOTICE).

The model weights were trained on data licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). Redistribution of the weights must include proper attribution to the original dataset creators (see ATTRIBUTION.md).

Acknowledgements

Citation

@article{torgashov2025voxtream,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal   = {arXiv:2509.15969},
  year      = {2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you could be in violation of copyright laws.
