xfuser

xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters



🔥 Meet xDiT
📢 Updates
🎯 Supported DiTs
📈 Performance
- Flux.1
- HunyuanDiT
- SD3
- Pixart
- Latte
🚀 QuickStart
✨ xDiT's Arsenal
- Parallel Methods
- Compilation Acceleration
📚 Develop Guide
🚧 History and Looking for Contributions
📝 Cite Us

🔥 Meet xDiT

Diffusion Transformers (DiTs) are driving advances in high-quality image and video generation. As the input context length of DiTs grows, the computational cost of the attention mechanism grows quadratically. Consequently, multi-GPU and multi-machine deployments are essential to meet the real-time requirements of online services.
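
To make the quadratic growth concrete, the following sketch estimates the attention cost for a few image sizes. The token-count rule (8x VAE downsampling followed by 2x2 patches) and the hidden size of 1152 are illustrative assumptions, not values taken from xDiT.

```python
def num_tokens(height: int, width: int, vae_factor: int = 8, patch: int = 2) -> int:
    """Tokens seen by the DiT for one image (assumed latent and patch layout)."""
    side = vae_factor * patch
    return (height // side) * (width // side)


def attention_flops(tokens: int, hidden: int = 1152) -> int:
    """Approximate FLOPs of one self-attention: QK^T plus attention-times-V."""
    return 4 * tokens * tokens * hidden


for size in (1024, 2048, 4096):
    n = num_tokens(size, size)
    print(f"{size}px: {n} tokens, {attention_flops(n):.3e} attention FLOPs")
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies the attention cost by 16.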

To meet the real-time demands of DiT applications, parallel inference is a must. xDiT is an inference engine designed for the parallel deployment of DiTs at large scale. It provides a suite of efficient parallel approaches for diffusion models, as well as GPU kernel acceleration.

Sequence Parallelism: USP is a unified sequence parallel approach that combines DeepSpeed-Ulysses and Ring-Attention.
PipeFusion: a patch-level pipeline parallelism that uses displaced patches, taking advantage of the characteristics of diffusion models.
Data Parallel: processes multiple prompts, or generates multiple images from a single prompt, in parallel across images.
CFG Parallel, also known as Split Batch: activates when classifier-free guidance (CFG) is used, with a constant parallel degree of 2.

The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware.
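
Of the methods listed above, CFG Parallel is the simplest to illustrate without GPU code: the conditional and unconditional branches are independent, so they can run on two workers and be combined afterwards. In this sketch, `denoise` is a hypothetical stand-in for a DiT forward pass and threads stand in for devices; it is not xDiT's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def denoise(latent, conditioning):
    """Hypothetical stand-in for one DiT forward pass."""
    return [x * 0.5 + conditioning for x in latent]


def cfg_combine(uncond, cond, guidance_scale):
    """Standard classifier-free guidance combination of the two branches."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def cfg_step_serial(latent, scale):
    # Both branches run one after the other on the same worker.
    return cfg_combine(denoise(latent, 0.0), denoise(latent, 1.0), scale)


def cfg_step_parallel(latent, scale):
    # Each branch goes to its own worker, which is why the degree is fixed at 2.
    with ThreadPoolExecutor(max_workers=2) as pool:
        uncond = pool.submit(denoise, latent, 0.0)
        cond = pool.submit(denoise, latent, 1.0)
        return cfg_combine(uncond.result(), cond.result(), scale)
```

Both functions return the same values; only the placement of the two forward passes differs.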

As shown in the following picture, xDiT offers a set of APIs that adapt DiT models in huggingface/diffusers to a hybrid parallel implementation through simple wrappers. If the model you require is not available in the model zoo, developing it yourself is straightforward; please refer to our Dev Guide.

We have also implemented the following parallel strategies for reference:

Tensor Parallelism
DistriFusion

Optimization orthogonal to parallelization focuses on accelerating single GPU performance. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as torch.compile and onediff.

The overview of xDiT is shown as follows.

📢 Updates

⚙️August 30, 2024: Support for CogVideoX is in progress (WIP). The inference scripts are examples/latte_example.
🎉August 26, 2024: We apply torch.compile and the onediff nexfort backend to accelerate GPU kernels.
🎉August 9, 2024: Support Latte sequence parallel version. The inference scripts are examples/latte_example.
🎉August 8, 2024: Support Flux sequence parallel version. The inference scripts are examples/flux_example.
🎉August 2, 2024: Support Stable Diffusion 3 hybrid parallel version. The inference scripts are examples/sd3_example.
🎉July 18, 2024: Support PixArt-Sigma and PixArt-Alpha. The inference scripts are examples/pixartsigma_example.py, examples/pixartalpha_example.py.
🎉July 17, 2024: Renamed the project to xDiT. The project has evolved from a collection of parallel methods into a unified inference framework that supports hybrid parallelism for DiTs.
🎉July 10, 2024: Support HunyuanDiT. The inference script is legacy/scripts/hunyuandit_example.py.
🎉June 26, 2024: Support Stable Diffusion 3. The inference script is legacy/scripts/sd3_example.py.
🎉May 24, 2024: PipeFusion is publicly released. It supports PixArt-alpha (legacy/scripts/pixart_example.py), DiT (legacy/scripts/ditxl_example.py) and SDXL (legacy/scripts/sdxl_example.py).

🎯 Supported DiTs

Model Name	CFG	SP	PipeFusion
🎬 CogVideoX	❎	❎	❎
🎬 Latte	❎	✔️	❎
🔵 HunyuanDiT-v1.2-Diffusers	✔️	✔️	✔️
🟠 Flux	NA	✔️	❎
🔴 PixArt-Sigma	✔️	✔️	✔️
🟢 PixArt-alpha	✔️	✔️	✔️
🟠 Stable Diffusion 3	✔️	✔️	✔️

Supported by the legacy version only, including DistriFusion and Tensor Parallel as standalone parallel strategies:

🔴 DiT-XL

📈 Performance

🚀 QuickStart

1. Install from pip (current version)

pip install xfuser
# Or optionally, with flash_attn
pip install "xfuser[flash_attn]"

2. Install from source

pip install -e .
# Or optionally, with flash_attn
pip install -e ".[flash_attn]"

Note that we use two self-maintained packages: yunchang and DistVAE.

The flash_attn used for yunchang should be >= 2.6.0.

3. Usage

We provide examples demonstrating how to run models with xDiT in the ./examples/ directory. To run one of the already supported DiT models, modify the model type, model directory, and parallel options in examples/run.sh, then execute the script:

bash examples/run.sh

To inspect the available options for the PixArt-alpha example, use the following command:

python ./examples/pixartalpha_example.py -h

...

xFuser Arguments

options:
  -h, --help            show this help message and exit

Model Options:
  --model MODEL         Name or path of the huggingface model to use.
  --download-dir DOWNLOAD_DIR
                        Directory to download and load the weights, default to the default cache dir of huggingface.
  --trust-remote-code   Trust remote code from huggingface.

Runtime Options:
  --warmup_steps WARMUP_STEPS
                        Warmup steps in generation.
  --use_parallel_vae
  --use_torch_compile   Enable torch.compile to accelerate inference in a single card
  --seed SEED           Random seed for operations.
  --output_type OUTPUT_TYPE
                        Output type of the pipeline.
  --enable_sequential_cpu_offload
                        Offloading the weights to the CPU.

Parallel Processing Options:
  --use_cfg_parallel    Use split batch in classifier_free_guidance. cfg_degree will be 2 if set
  --data_parallel_degree DATA_PARALLEL_DEGREE
                        Data parallel degree.
  --ulysses_degree ULYSSES_DEGREE
                        Ulysses sequence parallel degree. Used in attention layer.
  --ring_degree RING_DEGREE
                        Ring sequence parallel degree. Used in attention layer.
  --pipefusion_parallel_degree PIPEFUSION_PARALLEL_DEGREE
                        Pipefusion parallel degree. Indicates the number of pipeline stages.
  --num_pipeline_patch NUM_PIPELINE_PATCH
                        Number of patches the feature map should be segmented in pipefusion parallel.
  --attn_layer_num_for_pp [ATTN_LAYER_NUM_FOR_PP ...]
                        List representing the number of layers per stage of the pipeline in pipefusion parallel
  --tensor_parallel_degree TENSOR_PARALLEL_DEGREE
                        Tensor parallel degree.
  --split_scheme SPLIT_SCHEME
                        Split scheme for tensor parallel.

Input Options:
  --height HEIGHT       The height of image
  --width WIDTH         The width of image
  --prompt [PROMPT ...]
                        Prompt for the model.
  --no_use_resolution_binning
  --negative_prompt [NEGATIVE_PROMPT ...]
                        Negative prompt for the model.
  --num_inference_steps NUM_INFERENCE_STEPS
                        Number of inference steps.
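
The --attn_layer_num_for_pp option takes one layer count per pipeline stage. As a sketch of how such a list could be produced, the hypothetical helper below spreads the layers as evenly as possible; whether xDiT uses this split by default is an assumption, not something stated above.

```python
def split_layers(num_layers: int, num_stages: int) -> list[int]:
    """Spread num_layers over num_stages, giving earlier stages any remainder."""
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per pipeline stage")
    base, extra = divmod(num_layers, num_stages)
    return [base + (1 if stage < extra else 0) for stage in range(num_stages)]


# Example: a 28-layer model on 3 pipeline stages.
layers_per_stage = split_layers(28, 3)
print(" ".join(str(n) for n in layers_per_stage))
```

The printed values are in the space-separated form that a list-valued command-line option expects.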

Combining multiple parallelism techniques is essential for scaling efficiently. The product of all parallel degrees must match the number of devices. For instance, the command below combines CFG, PipeFusion, and sequence parallelism to generate an image of a dog. Here ulysses_degree * pipefusion_parallel_degree * cfg_degree (2 when --use_cfg_parallel is set) == number of devices == 8.

torchrun --nproc_per_node=8 \
examples/pixartalpha_example.py \
--model models/PixArt-XL-2-1024-MS \
--pipefusion_parallel_degree 2 \
--ulysses_degree 2 \
--num_inference_steps 20 \
--warmup_steps 0 \
--prompt "A small dog" \
--use_cfg_parallel
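
The degree constraint can be checked before launching. The helper below is a hypothetical sketch, not xDiT's own validation; its parameter names mirror the command-line options, and tensor parallelism is left out because it is a standalone strategy.

```python
from math import prod


def check_parallel_config(
    world_size: int,
    *,
    use_cfg_parallel: bool = False,
    data_parallel_degree: int = 1,
    ulysses_degree: int = 1,
    ring_degree: int = 1,
    pipefusion_parallel_degree: int = 1,
) -> dict:
    """Return the per-method degrees, or raise if their product != world_size."""
    degrees = {
        "cfg": 2 if use_cfg_parallel else 1,
        "data": data_parallel_degree,
        "ulysses": ulysses_degree,
        "ring": ring_degree,
        "pipefusion": pipefusion_parallel_degree,
    }
    total = prod(degrees.values())
    if total != world_size:
        raise ValueError(
            f"product of degrees is {total}, but {world_size} devices were launched"
        )
    return degrees


# Mirrors the torchrun command: 2 (CFG) * 2 (Ulysses) * 2 (PipeFusion) == 8.
config = check_parallel_config(
    8, use_cfg_parallel=True, ulysses_degree=2, pipefusion_parallel_degree=2
)
```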

⚠️ PipeFusion requires setting warmup_steps (as does DistriFusion), typically to a small number compared with num_inference_steps. Warmup steps reduce the efficiency of PipeFusion because they cannot be executed in parallel and therefore fall back to serial execution. We observed that a warmup of 0 had no effect on the PixArt model. Users can tune this value for their specific tasks.

4. Launch an HTTP Service

Launching a Text-to-Image HTTP Service
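
The cost of warmup can be estimated with a simple model in which warmup steps run at single-device speed and the remaining steps are divided perfectly across pipeline stages. This is an idealized sketch that ignores communication and pipeline bubbles; the function name is hypothetical.

```python
def pipefusion_ideal_speedup(
    num_inference_steps: int, warmup_steps: int, num_stages: int
) -> float:
    """Upper bound on speedup when warmup steps are serial and the rest are parallel."""
    if not 0 <= warmup_steps <= num_inference_steps:
        raise ValueError("warmup_steps must lie between 0 and num_inference_steps")
    serial_time = warmup_steps
    parallel_time = (num_inference_steps - warmup_steps) / num_stages
    return num_inference_steps / (serial_time + parallel_time)


for warmup in (0, 1, 4):
    speedup = pipefusion_ideal_speedup(20, warmup, num_stages=4)
    print(f"warmup_steps={warmup}: at most {speedup:.2f}x")
```

Even a few warmup steps lower the bound noticeably, which is why a small value relative to num_inference_steps is recommended.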

5. Launch ComfyUI

Launching ComfyUI

✨ xDiT's Arsenal

The performance of xDiT is attributed to two key facets. First, it leverages parallelization techniques, including innovations such as USP, PipeFusion, and hybrid parallelism, to scale DiT inference across many devices.

Second, we employ compilation technologies to speed up execution on GPUs, integrating established solutions such as torch.compile and onediff.

1. Parallel Methods

As illustrated in the accompanying images, xDiT offers a comprehensive set of parallelization techniques. For the DiT backbone, the foundational methods (Data, USP, PipeFusion, and CFG parallel) operate in a hybrid fashion, while Tensor and DistriFusion parallel function independently. For the VAE module, xDiT offers a parallel implementation, DistVAE, designed to prevent out-of-memory (OOM) issues. The (xDiT) label marks the methods first proposed by us.

The table below details the communication and memory costs of the intra-image parallel methods in DiTs; CFG and DP are excluded because they are inter-image parallel. (* denotes that communication can be overlapped with computation.)

As we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost at different scales and hardware configurations, making them suitable foundational components for a hybrid approach.

𝒑: Number of pixels; 𝒉𝒔: Model hidden size; 𝑳: Number of model layers; 𝑷: Total model parameters; 𝑵: Number of parallel devices; 𝑴: Number of patch splits; 𝑸𝑶: Query and Output parameter count; 𝑲𝑽: KV Activation parameter count; 𝑨 = 𝑸 = 𝑶 = 𝑲 = 𝑽: Equal parameters for Attention, Query, Output, Key, and Value;

Method	attn-KV	communication cost	param memory	activations memory	extra buff memory
Tensor Parallel	fresh	$4O(p \times hs)L$	$\frac{1}{N}P$	$\frac{2}{N}A = \frac{1}{N}QO$	$\frac{2}{N}A = \frac{1}{N}KV$
DistriFusion*	stale	$2O(p \times hs)L$	$P$	$\frac{2}{N}A = \frac{1}{N}QO$	$2AL = (KV)L$
Ring Sequence Parallel*	fresh	$2O(p \times hs)L$	$P$	$\frac{2}{N}A = \frac{1}{N}QO$	$\frac{2}{N}A = \frac{1}{N}KV$
Ulysses Sequence Parallel	fresh	$\frac{4}{N}O(p \times hs)L$	$P$	$\frac{2}{N}A = \frac{1}{N}QO$	$\frac{2}{N}A = \frac{1}{N}KV$
PipeFusion*	stale-	$2O(p \times hs)$	$\frac{1}{N}P$	$\frac{2}{M}A = \frac{1}{M}QO$	$\frac{2L}{N}A = \frac{1}{N}(KV)L$
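
To compare the communication column numerically, the sketch below evaluates each formula from the table with the constant hidden by the O(·) notation taken as 1. That simplification, the function name, and the example sizes are assumptions for illustration only.

```python
def communication_cost(p: int, hs: int, L: int, N: int) -> dict:
    """Per-method communication volume in elements, following the table's formulas."""
    unit = p * hs  # O(p x hs), with the hidden constant taken as 1
    return {
        "Tensor Parallel": 4 * unit * L,
        "DistriFusion": 2 * unit * L,
        "Ring Sequence Parallel": 2 * unit * L,
        "Ulysses Sequence Parallel": 4 * unit * L / N,
        "PipeFusion": 2 * unit,
    }


# Example sizes (assumed): 4096 tokens, hidden size 1152, 28 layers, 8 devices.
costs = communication_cost(p=4096, hs=1152, L=28, N=8)
for method, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{method}: {cost / costs['PipeFusion']:.1f}x PipeFusion")
```

PipeFusion is the only entry without a factor of L, and Ulysses is the only one that shrinks as N grows, which matches the observation that these two methods suit different scales.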

1.1. PipeFusion

PipeFusion: Displaced Patch Pipeline Parallelism for Diffusion Models

1.2. USP: Unified Sequence Parallelism

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

1.3. Hybrid Parallel

Hybrid Parallelism

1.4. CFG Parallel

CFG Parallel

1.5. Parallel VAE

Patch Parallel VAE

Compilation Acceleration

We utilize two compilation acceleration techniques, torch.compile and onediff, to enhance runtime speed on GPUs. These compilation accelerations are used in conjunction with parallelization methods.

We employ the nexfort backend of onediff. Please install it before use:

pip install onediff
pip install -U nexfort

For usage instructions, refer to examples/run.sh. Simply append --use_torch_compile or --use_onediff to your command. Note that these options are mutually exclusive, and their performance varies across scenarios.
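
One way to enforce that the two flags are never combined is an argparse mutually exclusive group. The parser below is a hypothetical sketch; the text above does not say whether xDiT itself checks this in the same way.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that accepts at most one compilation backend."""
    parser = argparse.ArgumentParser(description="Compilation options (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--use_torch_compile", action="store_true")
    group.add_argument("--use_onediff", action="store_true")
    return parser


# Passing one flag is accepted; passing both makes argparse exit with an error.
args = build_parser().parse_args(["--use_torch_compile"])
```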

📚 Develop Guide

The implementation and design of the xDiT framework

Manual for adding new models

🚧 History and Looking for Contributions

We conducted a major upgrade of this project in August 2024.

The latest APIs are located in the xfuser/ directory and support hybrid parallelism. They offer clearer and more structured code but currently support fewer models.

The legacy APIs are in the legacy/ directory and are limited to a single parallel method at a time. They support a richer set of parallel methods, including PipeFusion, Sequence Parallel, DistriFusion, and Tensor Parallel. CFG Parallel can be combined with PipeFusion but not with other parallel methods.

For models not yet supported by the latest APIs, you can run the examples in the legacy/scripts/ directory. If you wish to develop new features on a model or require hybrid parallelism, stay tuned for further project updates.

We also welcome developers to join and contribute more features and models to the project. Tell us which model you need in xDiT in discussions.

📝 Cite Us

@article{wang2024pipefusion,
      title={PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models}, 
      author={Jiannan Wang and Jiarui Fang and Jinzhe Pan and Aoyu Li and PengCheng Yang},
      year={2024},
      eprint={2405.14430},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{fang2024unified,
      title={USP: a Unified Sequence Parallelism Approach for Long Context Generative AI},
      author={Fang, Jiarui and Zhao, Shangchun},
      journal={arXiv preprint arXiv:2405.07719},
      year={2024}
}

