
A Unified and Flexible Inference Engine with Hybrid Cache Acceleration and Parallelism for 🤗DiTs.


| Baseline | SCM S S* | SCM F D* | SCM U D* | +TS | +compile | +FP8* |
|---|---|---|---|---|---|---|
| 24.85s | 15.4s | 11.4s | 8.2s | 8.2s | 🎉7.1s | 🎉4.5s |

Scheme: DBCache + SCM (steps_computation_mask) + TS (TaylorSeer) + FP8*, L20x1, FLUX.1-Dev.
S*: static cache, D*: dynamic cache, S: Slow, F: Fast, U: Ultra Fast, TS: TaylorSeer, FP8*: FP8 DQ + Sage.

U*: Ulysses Attention, UAA: Ulysses Anything Attention, UAA*: UAA + Gloo, Device: NVIDIA L20
FLUX.1-Dev w/o CPU Offload, 28 steps; Qwen-Image w/ CPU Offload, 50 steps; Gloo: Extra All Gather w/ Gloo

| CP2 U* | CP2 UAA* | L20x1 | CP2 UAA* | CP2 U* | L20x1 | CP2 UAA* |
|---|---|---|---|---|---|---|
| FLUX, 13.87s | 🎉13.88s | 23.25s | 🎉13.75s | Qwen, 132s | 181s | 🎉133s |
| 1024x1024 | 1024x1024 | 1008x1008 | 1008x1008 | 1312x1312 | 1328x1328 | 1328x1328 |
| ✔️U* ✔️UAA | ✔️U* ✔️UAA | NO CP | ❌U* ✔️UAA | ✔️U* ✔️UAA | NO CP | ❌U* ✔️UAA |
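One plausible way to read the resolution rows above is sketched below. It rests on two assumptions that the source does not state: that the transformer sees one image token per 16x16 pixel patch (8x VAE downsampling followed by 2x2 patchify), and that plain Ulysses Attention needs the image-token count to divide evenly by the context-parallel degree, a restriction UAA would lift. The function names are hypothetical and are not part of cache-dit.

```python
def image_tokens(height: int, width: int, pixels_per_token: int = 16) -> int:
    """Image-token count, ASSUMING one token per 16x16 pixel patch."""
    if height % pixels_per_token or width % pixels_per_token:
        raise ValueError("resolution must be a multiple of the patch size")
    return (height // pixels_per_token) * (width // pixels_per_token)


def splits_evenly(num_tokens: int, cp_degree: int) -> bool:
    """True if every rank would receive the same number of tokens."""
    return num_tokens % cp_degree == 0


# Resolutions taken from the table; CP degree 2 as in the "CP2" columns.
for side in (1024, 1008, 1312, 1328):
    n = image_tokens(side, side)
    print(f"{side}x{side}: {n} tokens, even split over 2 ranks: {splits_evenly(n, 2)}")
```

Under these assumptions, 1024x1024 and 1312x1312 split evenly across 2 ranks while 1008x1008 and 1328x1328 do not, which lines up with where the table marks U* as unsupported.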

🔥Highlight

We are excited to announce that 🎉v1.1.0 of cache-dit has been released! It brings 🔥Context Parallelism and 🔥Tensor Parallelism to cache-dit, making it a Unified and Flexible Inference Engine for 🤗DiTs. Key features: Unified Cache APIs, Forward Pattern Matching, Block Adapter, DBCache, DBPrune, Cache CFG, TaylorSeer, SCM, Context Parallelism (w/ UAA), Tensor Parallelism and 🎉SOTA performance.

pip3 install -U cache-dit # Also, pip3 install git+https://github.com/huggingface/diffusers.git (latest)

You can install the stable release of cache-dit from PyPI, or the latest development version from GitHub. Then try ♥️ Cache Acceleration with just one line of code ~ ♥️

>>> import cache_dit
>>> from diffusers import DiffusionPipeline
>>> pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image") # Can be any diffusion pipeline
>>> cache_dit.enable_cache(pipe) # One-line code with default cache options.
>>> output = pipe(...) # Just call the pipe as normal.
>>> stats = cache_dit.summary(pipe) # Then, get the summary of cache acceleration stats.
>>> cache_dit.disable_cache(pipe) # Disable cache and run original pipe.
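When cached and uncached runs are compared often, the enable/disable pair above can be wrapped in a context manager. This is only a sketch: `cache_scope` is a hypothetical helper, not part of cache-dit, and it takes the enable and disable callables as arguments so that it assumes nothing beyond the two calls shown in the quick start.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def cache_scope(
    pipe: Any,
    enable: Callable[[Any], Any],
    disable: Callable[[Any], Any],
) -> Iterator[Any]:
    """Enable caching on entry and always disable it on exit (hypothetical helper)."""
    enable(pipe)
    try:
        yield pipe
    finally:
        # Runs even if the pipeline call raises, so the pipe is restored.
        disable(pipe)


# Intended usage with the calls from the quick start:
#   with cache_scope(pipe, cache_dit.enable_cache, cache_dit.disable_cache):
#       output = pipe(...)
```

Passing the callables in keeps the helper testable without loading a model.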

๐Ÿ“šCore Features

  • 🎉Full 🤗Diffusers Support: cache-dit now supports nearly all of Diffusers' DiT-based pipelines, including 30+ series and nearly 100 pipelines, such as FLUX.1, Qwen-Image, Qwen-Image-Lightning, Wan 2.1/2.2, HunyuanImage-2.1, HunyuanVideo, HiDream, AuraFlow, CogView3Plus, CogView4, CogVideoX, LTXVideo, ConsisID, SkyReelsV2, VisualCloze, PixArt, Chroma, Mochi, SD 3.5, DiT-XL, etc.
  • 🎉Extremely Easy to Use: In most cases, you only need one line of code: cache_dit.enable_cache(...). After calling this API, just use the pipeline as normal.
  • 🎉Easy New Model Integration: Features like Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, and Patch Functor make it highly functional and flexible. For example, we achieved 🎉Day 1 support for HunyuanImage-2.1 with a 1.7x speedup w/o precision loss, even before it was available in the Diffusers library.
  • 🎉State-of-the-Art Performance: Compared with algorithms including Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer and FoCa, cache-dit achieves SOTA performance on Clip Score with a 7.4x↑🎉 speedup!
  • 🎉Support for 4/8-Step Distilled Models: Surprisingly, cache-dit's DBCache works for extremely few-step distilled models, something many other methods fail to do.
  • 🎉Compatibility with Other Optimizations: Designed to work seamlessly with torch.compile, Quantization (torchao, 🔥nunchaku), CPU or Sequential Offloading, 🔥Context Parallelism, 🔥Tensor Parallelism, etc.
  • 🎉Hybrid Cache Acceleration: Now supports hybrid Block-wise Cache + Calibrator schemes (e.g., DBCache or DBPrune + TaylorSeerCalibrator). DBCache or DBPrune acts as the Indicator that decides when to cache, while the Calibrator decides how to cache. More mainstream cache acceleration algorithms (e.g., FoCa) will be supported in the future, along with additional benchmarks. Stay tuned for updates!
  • 🤗Diffusers Ecosystem Integration: 🔥cache-dit has joined the Diffusers community ecosystem as the first DiT-specific cache acceleration framework! See the Diffusers documentation for details.
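The Indicator/Calibrator split described in the hybrid cache item of the list above can be illustrated with a toy example. This is a conceptual sketch only, not cache-dit's implementation: the relative-L1 test, the `0.08` threshold, the first-order extrapolation and all function names are assumptions chosen for illustration.

```python
from typing import List, Sequence


def relative_l1_diff(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Total absolute change between two feature vectors, relative to the previous one."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev) or 1e-8  # avoid dividing by zero
    return num / den


def indicator_says_cache(
    prev: Sequence[float], curr: Sequence[float], threshold: float = 0.08
) -> bool:
    """'When to cache': reuse cached work if the features barely moved (assumed rule)."""
    return relative_l1_diff(prev, curr) < threshold


def calibrator_predict(history: List[List[float]]) -> List[float]:
    """'How to cache': first-order extrapolation from the last two stored outputs.

    With fewer than two entries there is no slope estimate, so fall back to plain reuse.
    """
    if len(history) < 2:
        return list(history[-1])
    older, newer = history[-2], history[-1]
    return [n + (n - o) for o, n in zip(older, newer)]


# Features changed by well under the threshold, so the step would be served from cache,
# and the cached output is extrapolated instead of being copied verbatim.
if indicator_says_cache([1.0, 2.0], [1.01, 2.01]):
    print(calibrator_predict([[1.0, 2.0], [2.0, 4.0]]))
```

The point of the split is that the two decisions are independent: the same indicator can be paired with plain reuse or with an extrapolating calibrator.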

The comparison between cache-dit and other algorithms shows that at speedup ratios (by TFLOPs) below 🎉4x, cache-dit achieves SOTA performance. Please refer to 📚Benchmarks for more details.

| Method | TFLOPs(↓) | SpeedUp(↑) | ImageReward(↑) | Clip Score(↑) |
|---|---|---|---|---|
| [FLUX.1-dev]: 50 steps | 3726.87 | 1.00× | 0.9898 | 32.404 |
| Chipmunk | 1505.87 | 2.47× | 0.9936 | 32.776 |
| FORA(N=3) | 1320.07 | 2.82× | 0.9776 | 32.266 |
| DBCache(S) | 1400.08 | 2.66× | 1.0065 | 32.838 |
| DuCa(N=5) | 978.76 | 3.80× | 0.9955 | 32.241 |
| TeaCache(l=0.8) | 892.35 | 4.17× | 0.8683 | 31.704 |
| TaylorSeer(N=4,O=2) | 1042.27 | 3.57× | 0.9857 | 32.413 |
| DBCache(S)+TS | 1153.05 | 3.23× | 1.0221 | 32.819 |
| DBCache(M)+TS | 944.75 | 3.94× | 1.0107 | 32.865 |
| FoCa(N=5) | 893.54 | 4.16× | 1.0029 | 32.948 |
| [FLUX.1-dev]: 22% steps | 818.29 | 4.55× | 0.8183 | 31.772 |
| TaylorSeer(N=7,O=2) | 670.44 | 5.54× | 0.9128 | 32.128 |
| FoCa(N=8) | 596.07 | 6.24× | 0.9502 | 32.706 |
| DBCache(F)+TS | 651.90 | 5.72× | 0.9526 | 32.568 |
| DBCache(U)+TS | 505.47 | 7.37× | 0.8645 | 32.719 |
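The SpeedUp column above appears to be the ratio of the 50-step baseline's TFLOPs to each method's TFLOPs. The short check below reproduces the DBCache rows under that assumption; `speedup` is a hypothetical helper, and rows for some other methods differ from this ratio by up to about 0.02, so treat it as an approximation of how the column was derived.

```python
BASELINE_TFLOPS = 3726.87  # FLUX.1-dev, 50 steps (from the table)


def speedup(method_tflops: float, baseline_tflops: float = BASELINE_TFLOPS) -> float:
    """Compute-based speedup: baseline cost divided by the method's cost."""
    if method_tflops <= 0:
        raise ValueError("TFLOPs must be positive")
    return baseline_tflops / method_tflops


# TFLOPs values copied from the table.
for name, tflops in [
    ("DBCache(S)", 1400.08),
    ("DBCache(F)+TS", 651.90),
    ("DBCache(U)+TS", 505.47),
]:
    print(f"{name}: {speedup(tflops):.2f}x")
# prints 2.66x, 5.72x and 7.37x, matching the table
```

Note that this is a reduction in compute, not a wall-clock measurement.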

🎉Surprisingly, cache-dit still works on extremely few-step distilled models such as Qwen-Image-Lightning. With the F16B16 config, the PSNR is 34.8 and the ImageReward is 1.26, so precision remains relatively high.

| Config | PSNR(↑) | Clip Score(↑) | ImageReward(↑) | TFLOPs(↓) | SpeedUp(↑) |
|---|---|---|---|---|---|
| [Full 4 steps] | INF | 35.5797 | 1.2630 | 274.33 | 1.00x |
| F24B24 | 36.3242 | 35.6224 | 1.2630 | 264.74 | 1.04x |
| F16B16 | 34.8163 | 35.6109 | 1.2614 | 244.25 | 1.12x |
| F12B12 | 33.8953 | 35.6535 | 1.2549 | 234.63 | 1.17x |
| F8B8 | 33.1374 | 35.7284 | 1.2517 | 224.29 | 1.22x |
| F1B0 | 31.8317 | 35.6651 | 1.2397 | 206.90 | 1.33x |
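For readers unfamiliar with the PSNR column: the source does not say how it was computed, but the standard definition is sketched below, assuming 8-bit pixel values (peak of 255). It also shows why the uncached baseline row reads INF, since identical outputs give a mean squared error of zero. The function name is hypothetical.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs, as in the baseline row
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([0, 10, 200], [0, 10, 200]))  # inf
print(psnr([0, 255], [0, 0]))            # about 3.01 dB
```

Higher is better: each config in the table is compared against the full 4-step output, so PSNR drops as more blocks are served from cache.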

🔥Supported DiTs

Tip: One model series may contain many pipelines. cache-dit applies optimizations at the Transformer level; thus, any pipeline that includes a supported transformer is already supported by cache-dit. ✅: known to work and officially supported; ✖️: not officially supported yet, but may be supported in the future; Q: 4-bit models w/ nunchaku + SVDQ W4A4.

| 📚Model | Cache | CP | TP | 📚Model | Cache | CP | TP |
|---|---|---|---|---|---|---|---|
| 🎉FLUX.1 | ✅ | ✅ | ✅ | 🎉FLUX.1 Q | ✅ | ✅ | ✖️ |
| 🎉FLUX.1-Fill | ✅ | ✅ | ✅ | 🎉FLUX.1-Fill Q | ✅ | ✅ | ✖️ |
| 🎉Qwen-Image | ✅ | ✅ | ✅ | 🎉Qwen-Image Q | ✅ | ✅ | ✖️ |
| 🎉Qwen...Edit | ✅ | ✅ | ✅ | 🎉Qwen...Edit Q | ✅ | ✅ | ✖️ |
| 🎉Qwen...Lightning | ✅ | ✅ | ✅ | 🎉Qwen...Light Q | ✅ | ✅ | ✖️ |
| 🎉Qwen...Control.. | ✅ | ✅ | ✅ | 🎉Qwen...E...Light Q | ✅ | ✅ | ✖️ |
| 🎉Wan 2.1 I2V/T2V | ✅ | ✅ | ✅ | 🎉Mochi | ✅ | ✖️ | ✅ |
| 🎉Wan 2.1 VACE | ✅ | ✅ | ✅ | 🎉HiDream | ✅ | ✖️ | ✖️ |
| 🎉Wan 2.2 I2V/T2V | ✅ | ✅ | ✅ | 🎉HunyuanDiT | ✅ | ✖️ | ✅ |
| 🎉HunyuanVideo | ✅ | ✅ | ✅ | 🎉Sana | ✅ | ✖️ | ✖️ |
| 🎉ChronoEdit | ✅ | ✅ | ✅ | 🎉Bria | ✅ | ✖️ | ✖️ |
| 🎉CogVideoX | ✅ | ✅ | ✅ | 🎉SkyReelsV2 | ✅ | ✅ | ✅ |
| 🎉CogVideoX 1.5 | ✅ | ✅ | ✅ | 🎉Lumina 1/2 | ✅ | ✖️ | ✖️ |
| 🎉CogView4 | ✅ | ✅ | ✅ | 🎉DiT-XL | ✅ | ✅ | ✖️ |
| 🎉CogView3Plus | ✅ | ✅ | ✅ | 🎉Allegro | ✅ | ✖️ | ✖️ |
| 🎉PixArt Sigma | ✅ | ✅ | ✅ | 🎉Cosmos | ✅ | ✖️ | ✖️ |
| 🎉PixArt Alpha | ✅ | ✅ | ✅ | 🎉OmniGen | ✅ | ✖️ | ✖️ |
| 🎉Chroma-HD | ✅ | ✅ | ✅ | 🎉EasyAnimate | ✅ | ✖️ | ✖️ |
| 🎉VisualCloze | ✅ | ✅ | ✅ | 🎉StableDiffusion3 | ✅ | ✖️ | ✖️ |
| 🎉HunyuanImage | ✅ | ✅ | ✅ | 🎉PRX T2I | ✅ | ✖️ | ✖️ |
| 🎉Kandinsky5 | ✅ | ✅ | ✅ | 🎉Amused | ✅ | ✖️ | ✖️ |
| 🎉LTXVideo | ✅ | ✅ | ✅ | 🎉AuraFlow | ✅ | ✖️ | ✖️ |
| 🎉ConsisID | ✅ | ✅ | ✅ | 🎉LongCatVideo | ✅ | ✖️ | ✖️ |
🔥Image/Video cases🔥

🎉Now, cache-dit covers almost all of Diffusers' DiT pipelines🎉
🔥Qwen-Image | Qwen-Image-Edit | Qwen-Image-Edit-Plus🔥
🔥FLUX.1 | Qwen-Image-Lightning 4/8 Steps | Wan 2.1 | Wan 2.2🔥
🔥HunyuanImage-2.1 | HunyuanVideo | HunyuanDiT | HiDream | AuraFlow🔥
🔥CogView3Plus | CogView4 | LTXVideo | CogVideoX | CogVideoX 1.5 | ConsisID🔥
🔥Cosmos | SkyReelsV2 | VisualCloze | OmniGen 1/2 | Lumina 1/2 | PixArt🔥
🔥Chroma | Sana | Allegro | Mochi | SD 3/3.5 | Amused | ... | DiT-XL🔥

🔥Wan2.2 MoE | +cache-dit:2.0x↑🎉 | HunyuanVideo | +cache-dit:2.1x↑🎉

🔥Qwen-Image | +cache-dit:1.8x↑🎉 | FLUX.1-dev | +cache-dit:2.1x↑🎉

🔥Qwen...Lightning | +cache-dit:1.14x↑🎉 | HunyuanImage | +cache-dit:1.7x↑🎉

🔥Qwen-Image-Edit | Input w/o Edit | Baseline | +cache-dit:1.6x↑🎉 | 1.9x↑🎉

🔥FLUX-Kontext-dev | Baseline | +cache-dit:1.3x↑🎉 | 1.7x↑🎉 | 2.0x↑🎉

🔥HiDream-I1 | +cache-dit:1.9x↑🎉 | CogView4 | +cache-dit:1.4x↑🎉 | 1.7x↑🎉

🔥CogView3 | +cache-dit:1.5x↑🎉 | 2.0x↑🎉 | Chroma1-HD | +cache-dit:1.9x↑🎉

🔥Mochi-1-preview | +cache-dit:1.8x↑🎉 | SkyReelsV2 | +cache-dit:1.6x↑🎉

🔥VisualCloze-512 | Model | Cloth | Baseline | +cache-dit:1.4x↑🎉 | 1.7x↑🎉

🔥LTX-Video-0.9.7 | +cache-dit:1.7x↑🎉 | CogVideoX1.5 | +cache-dit:2.0x↑🎉

🔥OmniGen-v1 | +cache-dit:1.5x↑🎉 | 3.3x↑🎉 | Lumina2 | +cache-dit:1.9x↑🎉

🔥Allegro | +cache-dit:1.36x↑🎉 | AuraFlow-v0.3 | +cache-dit:2.27x↑🎉

🔥Sana | +cache-dit:1.3x↑🎉 | 1.6x↑🎉 | PixArt-Sigma | +cache-dit:2.3x↑🎉

🔥PixArt-Alpha | +cache-dit:1.6x↑🎉 | 1.8x↑🎉 | SD 3.5 | +cache-dit:2.5x↑🎉

🔥Amused | +cache-dit:1.1x↑🎉 | 1.2x↑🎉 | DiT-XL-256 | +cache-dit:1.8x↑🎉
♥️ Please consider leaving a ⭐️ Star to support us ~ ♥️

📖Table of Contents

For more advanced features such as Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, Patch Functor, DBCache, DBPrune, TaylorSeer Calibrator, Hybrid Cache CFG, Context Parallelism and Tensor Parallelism, please refer to the 🎉User_Guide.md.

👋Contribute

How to contribute? Star ⭐️ this repo to support us or check CONTRIBUTE.md.

🎉Projects Using CacheDiT

Here is a curated list of open-source projects integrating CacheDiT, including popular repositories like jetson-containers, flux-fast, and sdnext. 🎉CacheDiT has been recommended by: Wan 2.2, Qwen-Image-Lightning, Qwen-Image, LongCat-Video, Kandinsky-5, LeMiCa, 🤗diffusers and HelloGitHub, among others.

©️Acknowledgements

Special thanks to vipshop's Computer Vision AI Team for supporting the documentation, testing, and production-level deployment of this project. We learned from the design of, and reused code from, the following projects: 🤗diffusers, ParaAttention, xDiT, TaylorSeer and LeMiCa.

©️Citations

@misc{cache-dit@2025,
  title={cache-dit: A Unified and Flexible Inference Engine with Hybrid Cache Acceleration and Parallelism for DiTs.},
  url={https://github.com/vipshop/cache-dit.git},
  note={Open-source software available at https://github.com/vipshop/cache-dit.git},
  author={DefTruth, vipshop.com},
  year={2025}
}


Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution


cache_dit-1.1.2-py3-none-any.whl (189.6 kB)

Uploaded Python 3

File details

Details for the file cache_dit-1.1.2-py3-none-any.whl.

File metadata

  • Download URL: cache_dit-1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 189.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for cache_dit-1.1.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5c61c257b5a8dda8d0ab9a35dea0bd454600e318a53b94be52b3f06ccdd05abc |
| MD5 | 8dff850a00883dd0bce24b03947e0fa9 |
| BLAKE2b-256 | 40963e99d7f067508771e2e2dd7aafc3637a0db1f2c356cdb7d5eaec4ea7a27e |
