xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
📃 Paper | 🚀 Quick Start | 🎯 Supported DiTs | 📚 Dev Guide | 📈 Discussion

Table of Contents
- 🔥 Meet xDiT
- 📢 Updates
- 🎯 Supported DiTs
- 📈 Performance
- 🚀 QuickStart
- ✨ The xDiT's Secret Weapons
- 📚 Develop Guide
- 🧐 History and Looking for Contributions
- 📝 Cite Us
🔥 Meet xDiT
Diffusion Transformers (DiTs), pivotal in text-to-image and text-to-video models, are driving advancements in high-quality image and video generation. As input sequence lengths in DiTs escalate, the computational demand of the attention mechanism grows quadratically. Consequently, multi-GPU and multi-machine deployments are essential to maintain real-time performance in online services.
To meet the real-time demands of DiT applications, parallel inference is a must. xDiT is an inference engine designed for the large-scale parallel deployment of DiTs. It provides a suite of efficient parallel inference approaches for diffusion models:
- Sequence Parallelism: USP is a unified sequence-parallel approach combining DeepSpeed-Ulysses and Ring-Attention.
- PipeFusion: a patch-level pipeline parallelism that uses displaced patches, taking advantage of the characteristics of diffusion models.
- Data Parallelism: processes multiple prompts, or generates multiple images from a single prompt, in parallel.
- CFG Parallel, also known as Split Batch: activates when classifier-free guidance (CFG) is used, with a constant parallelism degree of 2.
The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware.
xDiT offers a set of APIs to adapt DiT models from huggingface/diffusers to a hybrid parallel implementation through simple wrappers. If the model you require is not available in the model zoo, developing it yourself is straightforward; please refer to our Dev Guide.
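As a rough illustration of this wrapper pattern, the sketch below follows the structure of the example scripts; the class and helper names (xFuserPixArtAlphaPipeline, xFuserArgs, FlexibleArgumentParser) are taken from the examples and should be treated as assumptions that may change across versions.

```python
# Sketch of the wrapper pattern (names follow the example scripts; treat the
# exact classes and signatures as assumptions, not a stable API).
import torch
from xfuser import xFuserArgs, xFuserPixArtAlphaPipeline
from xfuser.config import FlexibleArgumentParser

parser = FlexibleArgumentParser(description="xFuser Arguments")
args = xFuserArgs.add_cli_args(parser).parse_args()

# create_config() turns the CLI degrees (ulysses, ring, pipefusion, cfg, ...)
# into an engine configuration describing the hybrid-parallel layout.
engine_config, input_config = xFuserArgs.from_cli_args(args).create_config()

# The wrapper loads a huggingface/diffusers checkpoint and shards it
# according to the engine configuration.
pipe = xFuserPixArtAlphaPipeline.from_pretrained(
    pretrained_model_name_or_path=args.model,
    engine_config=engine_config,
    torch_dtype=torch.float16,
).to("cuda")

output = pipe(
    height=input_config.height,
    width=input_config.width,
    prompt=input_config.prompt,
    num_inference_steps=input_config.num_inference_steps,
)
```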
We have also implemented the following parallel strategies for reference:
- Tensor Parallelism
- DistriFusion
The communication and memory costs of these parallel approaches (excluding CFG and DP) in DiTs are detailed in the table below, where * denotes that communication can be overlapped with computation. As the table shows, PipeFusion and Sequence Parallelism achieve the lowest communication costs across different scales and hardware configurations, making them suitable foundational components for a hybrid approach.
$p$: number of pixels; $hs$: model hidden size; $L$: number of model layers; $P$: total model parameters; $N$: number of parallel devices; $M$: number of patch splits; $QO$: query and output parameter count; $KV$: KV activation parameter count; $A = Q = O = K = V$: equal parameters for attention, query, output, key, and value.
| | attn-KV | communication cost | param memory | activations memory | extra buffer memory |
|---|---|---|---|---|---|
| Tensor Parallel | fresh | $4O(p \times hs)L$ | $\frac{1}{N}P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| DistriFusion* | stale | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $2AL = (KV)L$ |
| Ring Sequence Parallel* | fresh | $2O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| Ulysses Sequence Parallel | fresh | $\frac{4}{N}O(p \times hs)L$ | $P$ | $\frac{2}{N}A = \frac{1}{N}QO$ | $\frac{2}{N}A = \frac{1}{N}KV$ |
| PipeFusion* | stale- | $2O(p \times hs)$ | $\frac{1}{N}P$ | $\frac{2}{M}A = \frac{1}{M}QO$ | $\frac{2L}{N}A = \frac{1}{N}(KV)L$ |
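To make the comparison concrete, the leading-order communication terms from the table can be evaluated numerically. The sketch below is a back-of-the-envelope estimate using the symbols defined above; the model shape is an illustrative assumption, and constant factors, overlap (*), and attn-KV freshness differences are ignored.

```python
# Back-of-the-envelope evaluation of the communication-cost column above.
# Model shape is an illustrative assumption; overlap and constants ignored.
p, hs, L, N = 1024 * 1024, 1152, 28, 8  # pixels, hidden size, layers, devices

costs = {
    "Tensor Parallel":           4 * (p * hs) * L,
    "DistriFusion":              2 * (p * hs) * L,
    "Ring Sequence Parallel":    2 * (p * hs) * L,
    "Ulysses Sequence Parallel": (4 / N) * (p * hs) * L,
    "PipeFusion":                2 * (p * hs),  # no factor of L
}
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:<27} {cost:.2e}")
```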
📢 Updates
- 🎉 August 9, 2024: Support Latte sequence parallel version. The inference scripts are examples/latte_example.
- 🎉 August 8, 2024: Support Flux sequence parallel version. The inference scripts are examples/flux_example.
- 🎉 August 2, 2024: Support Stable Diffusion 3 hybrid parallel version. The inference scripts are examples/sd3_example.
- 🎉 July 18, 2024: Support PixArt-Sigma and PixArt-alpha. The inference scripts are examples/pixartsigma_example.py and examples/pixartalpha_example.py.
- 🎉 July 17, 2024: Rename the project to xDiT. The project has evolved from a collection of parallel methods into a unified inference framework that supports hybrid parallelism for DiTs.
- 🎉 July 10, 2024: Support HunyuanDiT. The inference script is legacy/scripts/hunyuandit_example.py.
- 🎉 June 26, 2024: Support Stable Diffusion 3. The inference script is legacy/scripts/sd3_example.py.
- 🎉 May 24, 2024: PipeFusion is publicly released. It supports PixArt-alpha (legacy/scripts/pixart_example.py), DiT (legacy/scripts/ditxl_example.py), and SDXL (legacy/scripts/sdxl_example.py).
🎯 Supported DiTs
| Model Name | CFG | SP | PipeFusion |
|---|---|---|---|
| 🎬 Latte | ❌ | ✔️ | ❌ |
| 🔵 HunyuanDiT-v1.2-Diffusers | ✔️ | ✔️ | ✔️ |
| 🟠 Flux | NA | ✔️ | ❌ |
| 🔴 PixArt-Sigma | ✔️ | ✔️ | ✔️ |
| 🟢 PixArt-alpha | ✔️ | ✔️ | ✔️ |
| 🟣 Stable Diffusion 3 | ✔️ | ✔️ | ✔️ |
Supported by the legacy version only: DiT and SDXL.
📈 Performance
Here are the benchmark results for PixArt-alpha using the 20-step DPM solver as the scheduler across various image resolutions. To replicate these findings, please refer to the script at ./legacy/scripts/benchmark.sh.
TBD: update results with hybrid parallelism.
- Latency on 4xA100-80GB (PCIe)
- Latency on 8xL20-48GB (PCIe)
- Latency on 8xA100-80GB (NVLink)
- Latency on 4xT4-16GB (PCIe)
🚀 QuickStart
1. Install from pip
pip install xfuser
2. Install from source
2.1 Install yunchang for sequence parallelism.
Install yunchang from feifeibear/long-context-attention.
Please note that it depends on flash attention and has specific GPU model requirements. We recommend installing yunchang from source rather than using pip install yunchang==0.2.0.
2.2 Install xDiT
python setup.py install
3. Usage
We provide examples demonstrating how to run models with xDiT in the ./examples/ directory. You can easily modify the model type, model directory, and parallel options in examples/run.sh to run already supported DiT models.
bash examples/run.sh
To inspect the available options for the PixArt-alpha example, use the following command:
python ./examples/pixartalpha_example.py -h
...
xFuser Arguments
options:
-h, --help show this help message and exit
Model Options:
--model MODEL Name or path of the huggingface model to use.
--download-dir DOWNLOAD_DIR
Directory to download and load the weights, default to the default cache dir of huggingface.
--trust-remote-code Trust remote code from huggingface.
Runtime Options:
--warmup_steps WARMUP_STEPS
Warmup steps in generation.
--use_parallel_vae
--seed SEED Random seed for operations.
--output_type OUTPUT_TYPE
Output type of the pipeline.
Parallel Processing Options:
--do_classifier_free_guidance
--use_split_batch Use split batch in classifier_free_guidance. cfg_degree will be 2 if set
--data_parallel_degree DATA_PARALLEL_DEGREE
Data parallel degree.
--ulysses_degree ULYSSES_DEGREE
Ulysses sequence parallel degree. Used in attention layer.
--ring_degree RING_DEGREE
Ring sequence parallel degree. Used in attention layer.
--pipefusion_parallel_degree PIPEFUSION_PARALLEL_DEGREE
Pipefusion parallel degree. Indicates the number of pipeline stages.
--num_pipeline_patch NUM_PIPELINE_PATCH
Number of patches the feature map should be segmented in pipefusion parallel.
--attn_layer_num_for_pp [ATTN_LAYER_NUM_FOR_PP ...]
List representing the number of layers per stage of the pipeline in pipefusion parallel
--tensor_parallel_degree TENSOR_PARALLEL_DEGREE
Tensor parallel degree.
--split_scheme SPLIT_SCHEME
Split scheme for tensor parallel.
Input Options:
--height HEIGHT The height of image
--width WIDTH The width of image
--prompt [PROMPT ...]
Prompt for the model.
--no_use_resolution_binning
--negative_prompt [NEGATIVE_PROMPT ...]
Negative prompt for the model.
--num_inference_steps NUM_INFERENCE_STEPS
Number of inference steps.
Combining multiple parallelism techniques is essential for efficient scaling. It is important that the product of all parallel degrees matches the number of devices. For instance, you can combine CFG parallelism, PipeFusion, and sequence parallelism with the command below to generate an image of a cute dog through hybrid parallelism. Here ulysses_degree * pipefusion_parallel_degree * cfg_degree (use_split_batch) == number of devices == 8.
torchrun --nproc_per_node=8 \
examples/pixartalpha_example.py \
--model models/PixArt-XL-2-1024-MS \
--pipefusion_parallel_degree 2 \
--ulysses_degree 2 \
--num_inference_steps 20 \
--warmup_steps 0 \
--prompt "A small dog" \
--use_cfg_parallel
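Before launching, it can help to verify the degree product against the device count. Here is a minimal sketch of such a check; check_degrees is a hypothetical helper, not part of xDiT.

```python
# Minimal sanity check: the product of all parallel degrees must equal the
# device count given to torchrun via --nproc_per_node.
# Hypothetical helper for illustration, not part of xDiT.
def check_degrees(nproc, data=1, ulysses=1, ring=1, pipefusion=1,
                  tensor=1, use_cfg_parallel=False):
    cfg = 2 if use_cfg_parallel else 1  # CFG parallel has a constant degree of 2
    product = data * ulysses * ring * pipefusion * tensor * cfg
    assert product == nproc, f"degree product {product} != device count {nproc}"

# Matches the command above: 2 (ulysses) * 2 (pipefusion) * 2 (cfg) == 8 devices.
check_degrees(nproc=8, ulysses=2, pipefusion=2, use_cfg_parallel=True)
```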
⚠️ Applying PipeFusion requires setting warmup_steps (also required by DistriFusion), typically a small number compared with num_inference_steps.
The warmup steps impact the efficiency of PipeFusion because they cannot be executed in parallel and thus degrade to serial execution. We observed that a warmup of 0 had no effect on the PixArt model. Users can tune this value according to their specific tasks.
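A simplified latency model illustrates why warmup matters (an illustrative sketch, not a measured profile): warmup steps execute serially, while the remaining steps are pipelined across the PipeFusion stages.

```python
# Illustrative model only: warmup steps run serially; the remaining steps
# achieve an idealized pipeline speedup across pipefusion stages.
def estimated_latency(num_steps, warmup_steps, step_time, pp_degree):
    serial = warmup_steps * step_time
    pipelined = (num_steps - warmup_steps) * step_time / pp_degree
    return serial + pipelined

# 20 steps on 4 stages: even 2 warmup steps add 30% to the ideal latency.
print(estimated_latency(20, 0, 1.0, 4))  # 5.0
print(estimated_latency(20, 2, 1.0, 4))  # 6.5
```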
✨ The xDiT's Secret Weapons
The exceptional capabilities of xDiT stem from our innovative technologies.
1. PipeFusion
PipeFusion: Displaced Patch Pipeline Parallelism for Diffusion Models
2. USP: Unified Sequence Parallelism (see the sketch after this list)
USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
3. Hybrid Parallel
4. CFG Parallel
5. Parallel VAE
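For intuition on how USP composes the two sequence-parallel schemes, the conceptual sketch below factors the sequence-parallel ranks into an ulysses_degree × ring_degree grid: Ulysses all-to-all runs within a row, while Ring-Attention circulates KV blocks along a column. This is an illustration only; the actual implementation lives in feifeibear/long-context-attention.

```python
# Conceptual sketch of USP's 2D process grid (illustrative, not xDiT's code):
# sp_degree ranks factor into ulysses_degree x ring_degree. Ulysses all-to-all
# runs inside each row group; Ring-Attention passes KV along each column.
def usp_groups(sp_degree, ulysses_degree):
    assert sp_degree % ulysses_degree == 0
    ring_degree = sp_degree // ulysses_degree
    ulysses_groups = [list(range(r * ulysses_degree, (r + 1) * ulysses_degree))
                      for r in range(ring_degree)]
    ring_groups = [list(range(c, sp_degree, ulysses_degree))
                   for c in range(ulysses_degree)]
    return ulysses_groups, ring_groups

print(usp_groups(8, ulysses_degree=4))
# ([[0, 1, 2, 3], [4, 5, 6, 7]], [[0, 4], [1, 5], [2, 6], [3, 7]])
```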
📚 Develop Guide
The implementation and design of the xDiT framework.
🧐 History and Looking for Contributions
We conducted a major upgrade of this project in August 2024.
The latest APIs, located in the xfuser/ directory, support hybrid parallelism. They offer clearer and more structured code but currently support fewer models.
The legacy APIs, in the legacy/ directory, are limited to a single parallelism method at a time. They support a richer set of parallel methods, including PipeFusion, Sequence Parallel, DistriFusion, and Tensor Parallel. CFG Parallel can be combined with PipeFusion but not with the other parallel methods.
For models not yet supported by the latest APIs, you can run the examples in the legacy/scripts/ directory. If you wish to develop new features on a model or require hybrid parallelism, stay tuned for further project updates.
We also welcome developers to join and contribute more features and models to the project. Tell us which model you need in xDiT in discussions.
📝 Cite Us
@article{wang2024pipefusion,
title={PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models},
author={Jiannan Wang and Jiarui Fang and Jinzhe Pan and Aoyu Li and PengCheng Yang},
year={2024},
  eprint={2405.14430},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{fang2024unified,
title={USP: a Unified Sequence Parallelism Approach for Long Context Generative AI},
author={Fang, Jiarui and Zhao, Shangchun},
journal={arXiv preprint arXiv:2405.07719},
year={2024}
}