TurboDiffusion: a video generation acceleration framework that accelerates end-to-end video generation by 100-205x with negligible video quality loss.
TurboDiffusion
This repository provides the official implementation of TurboDiffusion, a video generation acceleration framework that accelerates end-to-end video generation by $100 \sim 205\times$ with negligible loss in video quality. The repository currently contains the model checkpoints and inference code; the training code will be released in the future.
TurboDiffusion Technical Report: Accelerating Diffusion Models for Video Generation
| Original | TurboDiffusion |
|---|---|
| E2E Time: 166s | E2E Time: 1.8s |
Available Models
| Model Name | Checkpoint Link | Best Resolution |
|---|---|---|
| TurboDiffusion-Wan2.1-T2V-1.3B-480P | Huggingface Model | 480p |
| TurboDiffusion-Wan2.1-T2V-14B-480P | Huggingface Model | 480p |
| TurboDiffusion-Wan2.1-T2V-14B-720P | Huggingface Model | 720p |
| TurboDiffusion-Wan2.2-I2V-A14B-720P | Huggingface Model | 480p or 720p |
Note: All checkpoints support generating videos at 480p or 720p. The "Best Resolution" column indicates the resolution at which the model provides the best video quality.
Installation
Base environment: python>=3.9, torch>=2.7.0
Install TurboDiffusion with pip:
conda create -n turbodiffusion python=3.12
conda activate turbodiffusion
pip install turbodiffusion --no-build-isolation
Or you can compile from source:
git clone https://github.com/thu-ml/TurboDiffusion.git
cd TurboDiffusion
git submodule update --init --recursive
pip install -e . --no-build-isolation
To enable SageSLA, install SpargeAttn first:
pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation
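The base requirements listed under Installation can be checked from Python before running inference. The following is a minimal sketch and not part of TurboDiffusion: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the thresholds are the ones stated above (python>=3.9, torch>=2.7.0). Pre-release suffixes are ignored, so a build such as `2.7.0a0` is treated as `2.7.0`.

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)
MIN_TORCH = (2, 7, 0)


def parse_version(text):
    """Turn a string such as '2.7.1+cu128' into (2, 7, 1), dropping any suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version_text, minimum):
    """Compare numerically, so that '2.10.0' counts as newer than '2.7.0'."""
    return parse_version(version_text) >= minimum


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.9 or newer is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("torch is not installed")
    else:
        if not meets_minimum(torch_version, MIN_TORCH):
            problems.append(f"torch {torch_version} is older than 2.7.0")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment meets the base requirements"]:
        print(line)
```

The check reads the installed package metadata instead of importing `torch`, so it stays fast and works even when no GPU is visible.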
Inference
- Download the Wan2.1 VAE and umT5 text encoder checkpoints from the official Wan2.1 repository on Huggingface:
mkdir checkpoints
cd checkpoints
wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/Wan2.1_VAE.pth
wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/models_t5_umt5-xxl-enc-bf16.pth
- Download our finetuned checkpoints:
wget https://huggingface.co/TurboDiffusion/TurboDiffusion-Wan2.1-T2V-14B-720P/resolve/main/TurboDiffusion-Wan2.1-T2V-14B-720P.pth
For the 14B model on GPUs with less than 40GB of memory (e.g. RTX 5090), we recommend the quantized version to avoid OOM:
wget https://huggingface.co/TurboDiffusion/TurboDiffusion-Wan2.1-T2V-14B-720P/resolve/main/TurboDiffusion-Wan2.1-T2V-14B-720P-quant.pth
Note:
- The quantized version may introduce accuracy loss and extra quantization overhead, so we suggest using the unquantized version when possible.
- On GPUs with less than 30GB of memory (e.g. RTX 4090), even the quantized checkpoint is not guaranteed to avoid OOM.
For the I2V model, download both the high-noise and low-noise checkpoints:
wget https://huggingface.co/TurboDiffusion/TurboDiffusion-Wan2.2-I2V-A14B-720P/resolve/main/TurboDiffusion-Wan2.2-I2V-A14B-high-720P.pth
wget https://huggingface.co/TurboDiffusion/TurboDiffusion-Wan2.2-I2V-A14B-720P/resolve/main/TurboDiffusion-Wan2.2-I2V-A14B-low-720P.pth
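The memory guidance for the 14B checkpoints can be written down as a small selection rule. This is a sketch under stated assumptions, not project code: the function names are hypothetical, the 40GB and 30GB thresholds come from the notes above, and the URL pattern is only confirmed by the two TurboDiffusion-Wan2.1-T2V-14B-720P links shown in this step.

```python
HF_BASE = "https://huggingface.co/TurboDiffusion"
DEFAULT_MODEL = "TurboDiffusion-Wan2.1-T2V-14B-720P"


def t2v_checkpoint_url(model_name, quantized):
    """Build the download URL; assumes the '<name>[-quant].pth' pattern seen above."""
    suffix = "-quant" if quantized else ""
    return f"{HF_BASE}/{model_name}/resolve/main/{model_name}{suffix}.pth"


def pick_14b_checkpoint(gpu_memory_gb, model_name=DEFAULT_MODEL):
    """Return (url, quantized, warning) following the README's memory guidance."""
    quantized = gpu_memory_gb < 40  # below 40GB the quantized file is recommended
    warning = None
    if gpu_memory_gb < 30:  # below 30GB even the quantized file may not fit
        warning = "the quantized checkpoint may still run out of memory"
    return t2v_checkpoint_url(model_name, quantized), quantized, warning


if __name__ == "__main__":
    for memory in (24, 32, 80):
        url, quantized, warning = pick_14b_checkpoint(memory)
        print(memory, "GB ->", url.rsplit("/", 1)[-1], "|", warning or "ok")
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_properties(0).total_memory` reports the memory in bytes, which can be divided by `1024**3` to get the value passed in here. Remember to pass `--quant_linear` at inference time whenever a quantized checkpoint is chosen.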
- Use the inference script for the T2V model:
export PYTHONPATH=turbodiffusion

# Arguments:
#   --dit_path            Path to the finetuned TurboDiffusion checkpoint
#   --model               Model to use: Wan2.1-1.3B or Wan2.1-14B (default: Wan2.1-1.3B)
#   --num_samples         Number of videos to generate (default: 1)
#   --num_steps           Sampling steps, 1–4 (default: 4)
#   --sigma_max           Initial sigma for rCM (default: 80); larger choices (e.g., 1600) reduce diversity but may enhance quality
#   --vae_path            Path to Wan2.1 VAE (default: checkpoints/Wan2.1_VAE.pth)
#   --text_encoder_path   Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
#   --num_frames          Number of frames to generate (default: 77)
#   --prompt              Text prompt for video generation
#   --resolution          Output resolution: "480p" or "720p" (default: 480p)
#   --aspect_ratio        Aspect ratio in W:H format (default: 16:9)
#   --seed                Random seed for reproducibility (default: 0)
#   --save_path           Output file path including extension (default: output/generated_video.mp4)
#   --attention_type      Attention module to use: original, sla or sagesla (default: sagesla)
#   --sla_topk            Top-k ratio for SLA/SageSLA attention (default: 0.15)
#   --quant_linear        Enable quantization for linear layers, pass this if using a quantized checkpoint
#   --default_norm        Use the original LayerNorm and RMSNorm of Wan models

python turbodiffusion/inference/wan2.1_t2v_infer.py \
    --model Wan2.1-14B \
    --dit_path checkpoints/modified/TurboDiffusion-Wan2.1-T2V-14B-720P-quant.pth \
    --resolution 720p \
    --prompt "An alarm clock" \
    --num_samples 1 \
    --num_steps 4 \
    --quant_linear \
    --attention_type sagesla \
    --sla_topk 0.15
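When the T2V script is driven from another Python program, it can help to assemble the argument list in one place and validate the documented ranges first. The sketch below is an illustration only: `build_t2v_command` is a hypothetical helper, it uses only flags listed in the argument table above, and its defaults mirror the defaults documented there.

```python
import shlex

T2V_SCRIPT = "turbodiffusion/inference/wan2.1_t2v_infer.py"


def build_t2v_command(dit_path, prompt, model="Wan2.1-1.3B", resolution="480p",
                      num_steps=4, num_samples=1, attention_type="sagesla",
                      sla_topk=0.15, quant_linear=False):
    """Return an argv list for the T2V inference script (hypothetical helper)."""
    if not 1 <= num_steps <= 4:
        raise ValueError("num_steps must be between 1 and 4")
    if resolution not in ("480p", "720p"):
        raise ValueError("resolution must be '480p' or '720p'")
    if attention_type not in ("original", "sla", "sagesla"):
        raise ValueError("attention_type must be original, sla or sagesla")
    argv = [
        "python", T2V_SCRIPT,
        "--model", model,
        "--dit_path", dit_path,
        "--resolution", resolution,
        "--prompt", prompt,
        "--num_samples", str(num_samples),
        "--num_steps", str(num_steps),
        "--attention_type", attention_type,
        "--sla_topk", str(sla_topk),
    ]
    if quant_linear:  # only for quantized checkpoints
        argv.append("--quant_linear")
    return argv


if __name__ == "__main__":
    command = build_t2v_command(
        "checkpoints/modified/TurboDiffusion-Wan2.1-T2V-14B-720P-quant.pth",
        "An alarm clock", model="Wan2.1-14B", resolution="720p", quant_linear=True)
    print(shlex.join(command))  # shell-quoted form, handy for logging
```

The list can be passed to `subprocess.run(command, check=True, env={**os.environ, "PYTHONPATH": "turbodiffusion"})`; passing a list instead of a shell string avoids quoting problems with long prompts.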
Or the script for the I2V model:
export PYTHONPATH=turbodiffusion

#   --image_path              Path to the input image
#   --high_noise_model_path   Path to the high noise TurboDiffusion checkpoint
#   --low_noise_model_path    Path to the low noise TurboDiffusion checkpoint
#   --boundary                Timestep boundary for switching from high to low noise model (default: 0.9)
#   --model                   Model to use: Wan2.2-A14B (default: Wan2.2-A14B)
#   --num_samples             Number of videos to generate (default: 1)
#   --num_steps               Sampling steps, 1–4 (default: 4)
#   --sigma_max               Initial sigma for rCM (default: 200); larger choices (e.g., 1600) reduce diversity but may enhance quality
#   --vae_path                Path to Wan2.2 VAE (default: checkpoints/Wan2.2_VAE.pth)
#   --text_encoder_path       Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
#   --num_frames              Number of frames to generate (default: 77)
#   --prompt                  Text prompt for video generation
#   --resolution              Output resolution: "480p" or "720p" (default: 720p)
#   --aspect_ratio            Aspect ratio in W:H format (default: 16:9)
#   --adaptive_resolution     Enable adaptive resolution based on input image size
#   --ode                     Use ODE for sampling (sharper but less robust than SDE)
#   --seed                    Random seed for reproducibility (default: 0)
#   --save_path               Output file path including extension (default: output/generated_video.mp4)
#   --attention_type          Attention module to use: original, sla or sagesla (default: sagesla)
#   --sla_topk                Top-k ratio for SLA/SageSLA attention (default: 0.18)
#   --quant_linear            Enable quantization for linear layers, pass this if using a quantized checkpoint
#   --default_norm            Use the original LayerNorm and RMSNorm of Wan models

python turbodiffusion/inference/wan2.2_i2v_infer.py \
    --model Wan2.2-A14B \
    --low_noise_model_path checkpoints/TurboDiffusion-Wan2.2-I2V-A14B-low-720P-quant.pth \
    --high_noise_model_path checkpoints/TurboDiffusion-Wan2.2-I2V-A14B-high-720P-quant.pth \
    --resolution 720p \
    --adaptive_resolution \
    --image_path assets/i2v_input.jpg \
    --prompt "POV selfie video, ultra-messy and extremely fast. A white cat in sunglasses stands on a surfboard with a neutral look when the board suddenly whips sideways, throwing cat and camera into the water; the frame dives sharply downward, swallowed by violent bursts of bubbles, spinning turbulence, and smeared water streaks as the camera sinks. Shadows thicken, pressure ripples distort the edges, and loose bubbles rush upward past the lens, showing the camera is still sinking. Then the cat kicks upward with explosive speed, dragging the view through churning bubbles and rapidly brightening water as sunlight floods back in; the camera races upward, water streaming off the lens, and finally breaks the surface in a sudden blast of light and spray, snapping back into a crooked, frantic selfie as the cat resurfaces." \
    --num_samples 1 \
    --num_steps 4 \
    --quant_linear \
    --attention_type sagesla \
    --sla_topk 0.18 \
    --ode
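The `--boundary` option describes a hand-off between the high-noise and low-noise I2V checkpoints. The sketch below shows one plausible reading of that description and is not taken from the TurboDiffusion source. It assumes timesteps are normalized to [0, 1] with 1 being the noisiest, and that the high-noise model handles every timestep at or above the boundary; the function name `select_i2v_model` and the sample schedule are hypothetical.

```python
def select_i2v_model(timestep, boundary=0.9):
    """Pick which checkpoint denoises a given step (assumed convention, see above)."""
    if not 0.0 <= timestep <= 1.0:
        raise ValueError("timestep is assumed to be normalized to [0, 1]")
    # Assumption: steps at or above the boundary are 'high noise'.
    return "high_noise" if timestep >= boundary else "low_noise"


if __name__ == "__main__":
    # Hypothetical 4-step schedule, listed from noisiest to cleanest.
    for timestep in (1.0, 0.95, 0.7, 0.3):
        print(f"t={timestep:.2f} -> {select_i2v_model(timestep)}")
```

Under these assumptions, raising the boundary hands more of the sampling steps to the low-noise model, and lowering it keeps the high-noise model active for longer.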
Evaluation
We evaluate video generation on a single RTX 5090 GPU. E2E Time is the end-to-end latency of generating a 5-second video.
Wan-2.1-T2V-1.3B-480P
| Original | FastVideo | TurboDiffusion |
|---|---|---|
| E2E Time: 166s | E2E Time: 6s | E2E Time: 1.8s |
Wan-2.2-I2V-14B-720P
| Original | TurboDiffusion |
|---|---|
| E2E Time: 4183s | E2E Time: 35.4s |
Wan-2.1-T2V-14B-720P
| Original | FastVideo | TurboDiffusion |
|---|---|---|
| E2E Time: 4648s | E2E Time: 83.8s | E2E Time: 22.7s |
Wan-2.1-T2V-14B-480P
| Original | FastVideo | TurboDiffusion |
|---|---|---|
| E2E Time: 1635s | E2E Time: 30.5s | E2E Time: 9.4s |
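The speedup implied by each pair of timings is the baseline latency divided by the TurboDiffusion latency. The short sketch below computes it from the E2E times reported in this section; the dictionary layout and function names are illustrative choices, and the numbers are copied from the captions above.

```python
# End-to-end latencies in seconds, as reported in the Evaluation section.
E2E_SECONDS = {
    "Wan-2.1-T2V-1.3B-480P": {"Original": 166.0, "FastVideo": 6.0, "TurboDiffusion": 1.8},
    "Wan-2.2-I2V-14B-720P": {"Original": 4183.0, "TurboDiffusion": 35.4},
    "Wan-2.1-T2V-14B-720P": {"Original": 4648.0, "FastVideo": 83.8, "TurboDiffusion": 22.7},
    "Wan-2.1-T2V-14B-480P": {"Original": 1635.0, "FastVideo": 30.5, "TurboDiffusion": 9.4},
}


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline latency to accelerated latency."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated latency must be positive")
    return baseline_seconds / accelerated_seconds


def speedup_table(timings):
    """For each model, TurboDiffusion's speedup over every other listed method."""
    table = {}
    for model, by_method in timings.items():
        turbo = by_method["TurboDiffusion"]
        table[model] = {
            method: speedup(seconds, turbo)
            for method, seconds in by_method.items()
            if method != "TurboDiffusion"
        }
    return table


if __name__ == "__main__":
    for model, ratios in speedup_table(E2E_SECONDS).items():
        summary = ", ".join(f"{name}: {ratio:.1f}x" for name, ratio in ratios.items())
        print(f"{model}: {summary}")
```

With these numbers, the largest ratio over the original model is about 205x, for Wan-2.1-T2V-14B-720P.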
Citation
If you use this code or find our work valuable, please cite:
@inproceedings{zhang2025sageattention,
title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
@article{zhang2025sla,
title={SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention},
author={Zhang, Jintao and Wang, Haoxu and Jiang, Kai and Yang, Shuo and Zheng, Kaiwen and Xi, Haocheng and Wang, Ziteng and Zhu, Hongzhou and Zhao, Min and Stoica, Ion and Gonzalez, Joseph E. and Zhu, Jun and Chen, Jianfei},
journal={arXiv preprint arXiv:2509.24006},
year={2025}
}
@article{zheng2025rcm,
title={Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency},
author={Zheng, Kaiwen and Wang, Yuji and Ma, Qianli and Chen, Huayu and Zhang, Jintao and Balaji, Yogesh and Chen, Jianfei and Liu, Ming-Yu and Zhu, Jun and Zhang, Qinsheng},
journal={arXiv preprint arXiv:2510.08431},
year={2025}
}
@inproceedings{zhang2024sageattention2,
title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}
Source Distribution
File details
Details for the file turbodiffusion-0.1.2.tar.gz.
File metadata
- Download URL: turbodiffusion-0.1.2.tar.gz
- Upload date:
- Size: 5.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 293b5e7f1d3a71c8a611fcd0a347bcc340c4951220b38ca799325b259178e397 |
| MD5 | 6a1471f1f479511a5fb12be20e7fc937 |
| BLAKE2b-256 | 74acdade781c39f33d48823960b31194e7f9113a3745581245ee4906a2f0908c |
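A downloaded source distribution can be checked against the SHA256 digest listed above. This is a minimal sketch using only the standard library; the function names are hypothetical, and the local file name in the usage line assumes the archive was saved under its published name.

```python
import hashlib

# SHA256 digest published for turbodiffusion-0.1.2.tar.gz in the table above.
EXPECTED_SHA256 = "293b5e7f1d3a71c8a611fcd0a347bcc340c4951220b38ca799325b259178e397"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_download("turbodiffusion-0.1.2.tar.gz")` returns `True` only if the local file matches the published digest.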