Drop-in TaylorSeer/HiCache basis upgrade: training-free diffusion acceleration via Dynamic Mode Decomposition (Prony) exponential feature forecasting
HiCache++
A drop-in basis upgrade for TaylorSeer / HiCache: forecast cached diffusion features with a Dynamic Mode Decomposition (Prony) exponential basis instead of a polynomial — same schedule, same API, near-lossless at wider skip intervals.
Feature caches (TaylorSeer, HiCache) skip the network on most denoising steps and forecast the velocity from cached anchors — with a polynomial basis. But a diffusion feature trajectory solves a near-linear feature-ODE whose exact solution class is a sum of damped / oscillatory exponentials; polynomials only locally truncate that class and diverge under extrapolation, which is exactly why every polynomial cache caps out at a modest skip interval. HiCache++ swaps in the exponential basis — Dynamic Mode Decomposition (Prony) — and keeps quality at skip intervals where the polynomial collapses. One loop, no training, no model edits:

    import torch
    from hicache_pp import hicache_init, hicache_decide, hicache_update_derivatives
    from hicache_pp import dmd_update_snapshots, dmd_forecast_state  # the exponential forecaster

    state = hicache_init(num_steps=N, interval=5, first_enhance=4, backend="dmd", history=6)
    for i, t in enumerate(timesteps):
        if hicache_decide(state) == "forecast":
            v = dmd_forecast_state(state)  # skip the network: forecast the velocity
        else:
            v = model(x, t, ...)  # the expensive forward
            # refresh both caches with the freshly computed velocity
            hicache_update_derivatives(state, v.detach())
            dmd_update_snapshots(state, v.detach(), state["history"])
        state["step"] += 1
        x = scheduler.step(v, t, x)

If you already run TaylorSeer or HiCache, this is a basis swap, not a new pipeline: the
compute/skip schedule, warm-up and API stay identical — only the per-skip forecast formula
changes (backend="dmd", or backend="auto" to let a holdout test pick the basis per window).
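To picture what a fixed compute/skip schedule with warm-up looks like, here is a toy stand-in. It is not the library's `hicache_decide`: the function names and the rule itself (always compute the first `first_enhance` steps, then compute once every `interval` steps) are assumptions made purely for illustration.

```python
def toy_decide(step: int, interval: int, first_enhance: int) -> str:
    """Hypothetical schedule: warm-up computes, then one compute per `interval` steps."""
    if step < first_enhance:
        return "compute"
    # After warm-up, compute on every `interval`-th step and forecast in between.
    return "compute" if (step - first_enhance) % interval == 0 else "forecast"


def toy_schedule(num_steps: int, interval: int, first_enhance: int) -> list[str]:
    """Decision for every step of a denoising loop."""
    return [toy_decide(s, interval, first_enhance) for s in range(num_steps)]


schedule = toy_schedule(num_steps=20, interval=5, first_enhance=4)
computes = schedule.count("compute")
print(schedule)
print(f"network calls: {computes} of {len(schedule)} steps")
```

Under these assumed settings the network runs on 8 of 20 steps; the forecast basis only decides what is served on the remaining 12.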
DiT-XL/2 ImageNet FID-50k vs latency Pareto plot — in progress
(benchmarks/dit_imagenet/); plot lands here.
Headline so far: on Hunyuan3D-2.1 (baseline 0.91 F-score), the polynomial (Hermite) decays fast as the skip interval grows, 0.88 → 0.74 → 0.38 at interval 3 / 5 / 6, while the exponential degrades more slowly: 0.85 → 0.86 → 0.62. Hermite is slightly ahead at i3; beyond that the exponential lead grows with the skip, +0.13 at i5 and +0.24 at i6.
Name note. HiCache here refers to the diffusion feature-forecasting method (Hermite polynomial feature caching, arXiv:2508.16984), which HiCache++ upgrades. It is unrelated to SGLang / Mooncake's "HiCache", a hierarchical KV cache for LLM serving. Likewise, in this repo "DMD" always abbreviates Dynamic Mode Decomposition (Prony) — classical spectral estimation — and never Distribution Matching Distillation.
TL;DR
On a flow-matching / diffusion denoise loop you can skip the network on most steps and forecast the velocity from cached anchors. The state of the art (TaylorSeer, HiCache) forecasts with a polynomial basis (monomial / scaled-Hermite). But a diffusion feature trajectory is the solution of a near-linear feature-ODE whose exact solution class is a sum of (damped/oscillatory) exponentials — not polynomials. Polynomials diverge under extrapolation, which is exactly why every polynomial cache caps out at a modest skip.
HiCache++ forecasts with Dynamic Mode Decomposition (Prony) — DMD (Schmid 2010) is the
SVD-regularised generalisation of Prony's method (1795): identify the linear propagator
A from raw velocity snapshots (F_{t+1} ≈ A F_t), eigendecompose it once, and predict any
(fractional) horizon k by eigenvalue powers:
F_{t+k} ≈ Φ (λ**k ⊙ b), b = Φ⁺ F_t
It is exact on exponential trajectories (the solution class) — the property polynomials lack — so it holds quality at skip intervals where Hermite/Taylor drift.
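The forecast formula above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in `hicache_pp.dmd`: the helper names `dmd_fit`, `dmd_forecast` and `trajectory`, the synthetic signal (one damped oscillation plus one real decay) and the rank are all chosen here for the example.

```python
import numpy as np


def dmd_fit(snapshots: np.ndarray, rank: int):
    """Fit F_{t+1} ≈ A F_t from snapshot columns (shape: features x time)."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]   # SVD truncation acts as the regulariser
    B = Y @ Vh.conj().T / s                       # Y V Σ⁻¹
    lam, W = np.linalg.eig(U.conj().T @ B)        # eigenpairs of the reduced propagator
    return lam.astype(complex), B @ W             # eigenvalues λ and modes Φ


def dmd_forecast(lam, Phi, f_last, k: float) -> np.ndarray:
    """F_{t+k} ≈ Φ (λ**k ⊙ b) with b = Φ⁺ F_t; k may be fractional."""
    b = np.linalg.pinv(Phi) @ f_last
    return (Phi @ (lam ** k * b)).real


def trajectory(t, a_cos, a_sin, a_dec):
    """Assumed test signal: one damped oscillation plus one real decay."""
    r, w, d = 0.95, 0.4, 0.8
    return r ** t * (np.cos(w * t) * a_cos + np.sin(w * t) * a_sin) + d ** t * a_dec


rng = np.random.default_rng(0)
amps = rng.standard_normal((3, 16))
window = np.stack([trajectory(t, *amps) for t in range(6)], axis=1)  # 16 features x 6 snapshots

lam, Phi = dmd_fit(window, rank=3)
errors = {}
for k in (1.0, 2.5, 5.0):                          # 2.5 is a fractional horizon
    pred = dmd_forecast(lam, Phi, window[:, -1], k)
    true = trajectory(5 + k, *amps)
    errors[k] = float(np.linalg.norm(pred - true) / np.linalg.norm(true))
    print(f"horizon {k}: relative L2 error {errors[k]:.1e}")
```

Because the synthetic signal lies exactly in the exponential class, the error stays at round-off level for integer and fractional horizons alike. Real diffusion features are only approximately in that class, so this shows the mechanism rather than expected quality.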
How it compares
Every modern feature cache skips the network on most steps and forecasts the velocity; they differ in the basis used to extrapolate. The basis is what sets the skip ceiling, because a diffusion feature trajectory is (locally) a sum of exponentials, not a polynomial. HiCache++'s basis is the Dynamic Mode Decomposition (Prony) exponential:
| Method | Forecast basis | Exact on the feature-ODE class | Extrapolation | Max lossless skip* |
|---|---|---|---|---|
| TaylorSeer | monomial (Taylor) | ✗ | diverges | small |
| HiCache | scaled-Hermite | ✗ | drifts | interval‑3 |
| FoCa · Padé · Chebyshev | rational / orthogonal poly | ✗ | drifts | small–moderate |
| HiCache++ (this work) | exponential (DMD / Prony) | ✓ exact | bounded, correct asymptotics | interval‑5–6 |
*Measured on Hunyuan3D-2.1 / SAM3D-slat (see Results): lossless in the F1-vs-baseline sense on SAM3D-slat, near-lossless at best on Hunyuan3D-2.1. A polynomial basis is only a local truncation of the exponential, so it is accurate for a tiny skip and diverges as the horizon grows; the exponential basis is the exact solution class, so it stays accurate further out. The exponential forecaster also admits fractional horizons, so it can forecast sub-steps between compute steps.
Why exponentials (the math)
A diffusion/flow-matching sampler integrates dx/dt = v_θ(x, t). Across timesteps the
cached feature F_t (the CFG-combined velocity) evolves under a slowly-varying, near-linear
operator. The exact solution of a linear ODE Ḟ = M F is F_t = Σ_j a_j e^{μ_j t} — a sum
of exponentials with poles μ_j (damped if Re μ_j < 0, oscillatory if Im μ_j ≠ 0).
- Polynomial basis (Taylor monomials, Hermite): a local Taylor truncation of that exponential. Accurate for a tiny skip, diverges as the horizon grows → modest skip cap.
- Exponential basis, Dynamic Mode Decomposition (Prony): the exact function class. Fit the poles λ_j = e^{μ_j Δ} from snapshots and extrapolate with bounded, correct asymptotics.
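The first bullet can be checked numerically on a single mode. The sketch below compares one exponential e^{μt} with its order-2 Taylor truncation about t = 0; the pole value and the horizons are assumptions picked for illustration, not values taken from any model.

```python
import math
import numpy as np

MU = complex(-0.3, 0.5)   # assumed pole: damped (Re μ < 0) and oscillatory (Im μ ≠ 0)


def exact(t: float) -> float:
    """One real exponential mode, Re e^{μ t} = e^{-0.3 t} cos(0.5 t)."""
    return float(np.exp(MU * t).real)


def taylor(t: float, order: int) -> float:
    """Taylor polynomial of e^{μ t} about t = 0, truncated at `order`."""
    return sum((MU * t) ** n / math.factorial(n) for n in range(order + 1)).real


horizons = (0.5, 1.0, 2.0, 4.0, 8.0)
errors = [abs(taylor(h, order=2) - exact(h)) for h in horizons]
for h, err in zip(horizons, errors):
    print(f"horizon {h:>4}: exact {exact(h):+.4f}  order-2 Taylor {taylor(h, 2):+.4f}  error {err:.2e}")
```

The exact mode stays bounded while the truncation error grows with the horizon, which is the behaviour the skip ceiling argument relies on.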
The ≥4-snapshot floor. A real-valued trajectory spends two real degrees of freedom
on every complex pole (a conjugate pair r e^{±iω} → r^t cos ωt, r^t sin ωt). So one
oscillatory mode plus a single real mode already needs rank 3 to identify, i.e. 3 snapshot-pairs = 4 snapshots. With
only 2 pairs the fit aliases (empirically ~2e-1 error vs ~5e-9 at 3 pairs). Below the floor (or
across a non-uniform window) HiCache++ falls back to the Hermite forecast for warm-up.
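The floor can be reproduced on a synthetic trajectory with three poles (one conjugate pair and one real decay). This is a sketch under assumptions: the signal parameters, the plain least-squares propagator and the helper names are illustrative and are not taken from `hicache_pp`.

```python
import numpy as np


def trajectory(t, amps):
    """Assumed signal with three poles: a conjugate pair (r, ±w) and one real decay d."""
    r, w, d = 0.97, 0.9, 0.6
    a_cos, a_sin, a_dec = amps
    return r ** t * (np.cos(w * t) * a_cos + np.sin(w * t) * a_sin) + d ** t * a_dec


def forecast_error(num_snapshots: int, horizon: int, amps) -> float:
    """Fit A on uniformly spaced snapshots, then predict `horizon` steps past the last one."""
    F = np.stack([trajectory(t, amps) for t in range(num_snapshots)], axis=1)
    X, Y = F[:, :-1], F[:, 1:]                      # snapshot pairs (F_t, F_{t+1})
    A = Y @ np.linalg.pinv(X, rcond=1e-10)          # least-squares propagator, rank <= number of pairs
    pred = np.linalg.matrix_power(A, horizon) @ F[:, -1]
    true = trajectory(num_snapshots - 1 + horizon, amps)
    return float(np.linalg.norm(pred - true) / np.linalg.norm(true))


rng = np.random.default_rng(0)
amps = rng.standard_normal((3, 32))
err_3_snapshots = forecast_error(3, horizon=4, amps=amps)   # 2 pairs: rank 2, below the floor
err_4_snapshots = forecast_error(4, horizon=4, amps=amps)   # 3 pairs: rank 3, all poles identified
print(f"3 snapshots: {err_3_snapshots:.1e}   4 snapshots: {err_4_snapshots:.1e}")
```

With two pairs the fitted propagator has rank at most 2 and cannot carry three poles, so the forecast is visibly wrong; the third pair closes the gap to round-off level.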
Results (A/B, geometry-preserving)
All accelerators are training-free and geometry-preserving; the right A/B is how far the output drifts from the uncached/baseline geometry vs how much faster it runs. "DMD" below is the HiCache++ Dynamic Mode Decomposition (Prony) exponential basis.
Mechanism — controlled, no model
Forecasting H steps past an 8-step cached window on synthetic trajectories from the exact
feature-ODE class — three forecast bases, rel. L2 error (↓):
| basis | H=1 | H=2 | H=4 | H=6 | H=8 |
|---|---|---|---|---|---|
| TaylorSeer (polynomial) | 1.5e-2 | 8.0e-2 | 6.2e-1 | 2.3e0 | 6.5e0 |
| Padé / FoCa (rational) | 4.9e-2 | 1.1e-1 | 2.4e-1 | 5.3e-1 | 1.2e0 |
| HiCache++ (exponential) | 4.7e-9 | 1.4e-8 | 5.3e-8 | 1.2e-7 | 2.2e-7 |
The exponential basis is exact to numerical precision (4.7e-9 at H=1, rising to 2.2e-7 at H=8); the
polynomial diverges, and the rational (Padé / FoCa) improves on it from H=4 onward but still diverges.
Both trail the exponential by roughly 6.5 to 7.5 orders of magnitude, and under noise the rational
basis turns fragile (Froissart poles). That gap is
the skip ceiling. And when the dynamics are NOT clean — an abrupt regime switch inside the
cached window, where a whole-window exponential fit misfits — the holdout-selected
backend="auto" catches it every time (it backcasts the newest snapshot with both bases and
serves the winner): on the switch stress it picks the safe fallback in 120/120 windows and cuts
the long-horizon error ~3x vs a forced exponential fit (H=8: 3.1 vs 9.4 rel. error, with the
polynomial at 106), while on clean/drifting/noisy trajectories it picks the exponential basis
120/120 and matches it exactly. Full five-scenario tables:
benchmarks/MICROBENCH_RESULTS.md. Reproduce:
python benchmarks/forecast_microbench.py.
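The holdout idea behind backend="auto" can be sketched as follows: hold out the newest snapshot, predict it from the older ones with each basis, and serve the basis with the smaller error. Everything named here (`poly_predict`, `exp_predict`, `choose_backend`, the two synthetic windows) is a hypothetical stand-in; the library's actual selection rule and fallback may differ.

```python
import numpy as np


def poly_predict(history: np.ndarray, order: int = 2) -> np.ndarray:
    """One-step extrapolation with a per-feature polynomial fit (stand-in for Taylor/Hermite)."""
    n = history.shape[1]
    coeffs = np.polyfit(np.arange(n), history.T, order)       # shape (order + 1, features)
    return (np.vander([float(n)], order + 1) @ coeffs)[0]


def exp_predict(history: np.ndarray) -> np.ndarray:
    """One-step forecast with a least-squares linear propagator (stand-in for DMD)."""
    X, Y = history[:, :-1], history[:, 1:]
    A = Y @ np.linalg.pinv(X, rcond=1e-10)
    return A @ history[:, -1]


def choose_backend(snapshots: np.ndarray):
    """Hold out the newest snapshot, predict it with both bases, return the winner."""
    train, held_out = snapshots[:, :-1], snapshots[:, -1]
    errors = {
        "poly": float(np.linalg.norm(poly_predict(train) - held_out)),
        "exp": float(np.linalg.norm(exp_predict(train) - held_out)),
    }
    return min(errors, key=errors.get), errors


t = np.arange(6)
rng = np.random.default_rng(0)
a_cos, a_sin, a_dec = rng.standard_normal((3, 16))
# Clean exponential dynamics: a damped oscillation plus a real decay, 16 features.
oscillating = (0.95 ** t * (np.cos(0.4 * t) * a_cos[:, None] + np.sin(0.4 * t) * a_sin[:, None])
               + 0.8 ** t * a_dec[:, None])
# Quadratic drift in only 2 features: a 2x2 linear propagator cannot reproduce it.
drifting = np.array([[1.0], [0.5]]) + np.array([[0.3], [-0.2]]) * t + np.array([[0.05], [0.1]]) * t ** 2

winner_osc, _ = choose_backend(oscillating)
winner_drift, _ = choose_backend(drifting)
print("oscillating window ->", winner_osc)
print("drifting window    ->", winner_drift)
```

The selection costs one extra cheap fit per window and needs no knowledge of which regime the trajectory is in, since the held-out snapshot decides.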
DiT-XL/2 ImageNet — FID-50k / IS vs latency
In progress — the class-conditional ImageNet-256 sweep (FID-50k + Inception Score across
intervals, Hermite vs exponential) is running in benchmarks/dit_imagenet/;
the table and Pareto plot land here.
Hunyuan3D-2.1 (flat DiT velocities) — Toys4K F-score@0.05
Excludes ball_000 (a sphere — Go-ICP alignment is rotationally degenerate on it; two runs
otherwise agree to ±0.01). Speedup is solo / uncontended.
| interval | Hermite (HiCache) | DMD (HiCache++) | speedup |
|---|---|---|---|
| baseline (uncached) | 0.911 | 0.911 | 1.00× |
| i3 | 0.876 | 0.852 | 1.72× |
| i4 | 0.776 | 0.827 | 1.80× |
| i5 | 0.735 | 0.860 | 1.79× |
| i6 | 0.375 | 0.616 | ~2.0× |
The exponential basis degrades gracefully where Hermite collapses, and its lead grows with the interval. On the deployed Hunyuan3D-2-mini, it is exactly lossless at i5 (0.794 = baseline 0.794).
SAM3D (PyTree velocities, slat FlowMatching) — real weights, F1 vs baseline
| config | speedup | CD_vs_base | F1_vs_base |
|---|---|---|---|
| vanilla | 1.00× | 0.000 | 1.000 |
| HiCache i3 | 1.44× | 0.013 | 1.000 |
| DMD i5 | 1.47× | 0.013 | 1.000 |
| DMD i6 | 1.56× | 0.013 | 1.000 |
Both are geometry-lossless (F1=1.000); the exponential basis stays lossless to interval-6, where it gives the best speedup — past Hermite's lossless i3.
TRELLIS v1 (sparse-structure stage) — Toys4K F-score@0.05, n=31
Swapping only the SS forecast basis Hermite→exponential in faster-trellis (same
carved-hybrid schedule):
| variant | F@0.05 | speedup | vs vanilla |
|---|---|---|---|
| vanilla (uncached) | 0.839 | 1.00× | — |
| HiCache (Hermite) | 0.825 | 2.82× | −0.014 |
| HiCache++ (DMD) | 0.829 | 2.76× | −0.010 |
At the deployed ~interval-3 (2.8×), the exponential basis is the most lossless accelerator
(it beats Hermite by about +0.004 F-score, 0.829 vs 0.825, at near-matched speed, 2.76× vs 2.82×);
the margin widens at higher intervals. The same holds on TRELLIS.2-4B (v2): it ties Hermite at
the deployed interval and pulls +0.03–0.04 F-score ahead at intervals 3–4
(see hermit-trellis2-plus-plus).
Full tables: results/RESULTS.md.
Install / use
pip install hicache-pp
The one-loop snippet at the top is the whole integration for flat tensor velocities
(e.g. a DiT). For PyTree / structured velocities (e.g. SAM3D), use hicache_pp.tree —
the same API but tree-aware (hicache_forecast_tree, dmd_forecast_tree, plus tree
Adaptive-CFG). Backends:
- backend="hermite": the published HiCache scaled-Hermite polynomial (clean reimplementation).
- backend="dmd": the HiCache++ Dynamic Mode Decomposition (Prony) exponential basis.
- backend="auto": holdout selection. Per compute step, backcast the newest held-out snapshot with both bases and serve whichever demonstrably wins on the data at hand.
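For the PyTree case mentioned above, the general pattern is to apply a flat forecaster leaf by leaf over a nested structure. The sketch below is not `hicache_pp.tree`: `tree_map`, `linear_extrapolate` and `make_velocity` are hypothetical helpers, and the forecaster is a deliberately trivial placeholder (continuing the last difference) rather than the Hermite or exponential basis.

```python
import numpy as np


def tree_map(fn, *trees):
    """Apply `fn` leaf-wise across trees that share one nested dict/list/tuple structure."""
    first = trees[0]
    if isinstance(first, dict):
        return {key: tree_map(fn, *(t[key] for t in trees)) for key in first}
    if isinstance(first, (list, tuple)):
        return type(first)(tree_map(fn, *leaves) for leaves in zip(*trees))
    return fn(*trees)


def linear_extrapolate(older: np.ndarray, newer: np.ndarray) -> np.ndarray:
    """Placeholder flat forecaster: continue the last difference by one step."""
    return 2.0 * newer - older


def make_velocity(step: float):
    """Hypothetical structured velocity: a dict holding an array and a tuple of arrays."""
    return {"coords": np.full((2, 3), step), "feats": (np.full(4, 2.0 * step), np.full(1, -step))}


forecast = tree_map(linear_extrapolate, make_velocity(1.0), make_velocity(2.0))
print(forecast["coords"][0, 0], forecast["feats"][0][0], forecast["feats"][1][0])
```

The output keeps the structure of the inputs, so a sampler that consumes structured velocities can use the forecast without repacking.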
See integrations/ for the exact wiring into Hunyuan3D-2.1,
Hunyuan3D-2-mini, SAM3D and Fast-SAM3D, and
integrations/pr_drafts/ for prepared patches that add this
exponential basis to cache-dit, Hugging Face diffusers (TaylorSeerCacheConfig)
and Cache4Diffusion in each project's native conventions.
Tuning notes
- Hermite: lossless up to a modest interval (Hunyuan-2.1: i3/order-2). Higher order does not rescue bigger intervals — the polynomial ceiling.
- Exponential: push the interval further (i5–i6) for more skip while staying lossless.
- history is the snapshot window (5–6); it needs ≥4 uniformly-spaced snapshots before it engages (Hermite covers warm-up automatically).
- first_enhance always computes the first few steps (high curvature); keep it ≥ 3.
Tests
python -m hicache_pp.hermite # Hermite basis + schedule (CPU, no GPU/model)
python -m hicache_pp.dmd # exponential basis exact-on-exponential + ≥4-snapshot floor
python -m hicache_pp.tree # tree-aware Hermite + exponential + Adaptive-CFG
python tests/run_tests.py # all of the above
3D generator integrations (sibling repos)
The forecaster in this repo is model-agnostic; it has also been wired natively into a family
of 3D-generator forks. These are complementary accelerators, not competing solutions —
each speeds up a different base generator, and the + / ++ suffix is a method choice
(+ = HiCache Hermite polynomial, ++ = HiCache++ Dynamic Mode Decomposition (Prony)
exponential), not a rival product. Pick by (1) which base model you run, then (2) which
forecast basis you want:
| base generator | + = HiCache (Hermite) | ++ = HiCache++ (DMD) |
|---|---|---|
| Hunyuan3D-2.1 | hunyuan2.1-plus | hunyuan2.1-plus-plus |
| Hunyuan3D-2 mini | hunyuan2-plus | hunyuan2-plus-plus |
| SAM 3D Objects | sam3d-plus | sam3d-plus-plus |
| Fast-SAM3D | fastsam3d-plus | fastsam3d-plus-plus |
| TRELLIS (v1) | faster-trellis | faster-trellis-plus-plus |
| TRELLIS.2-4B (v2) | hermit-trellis2 | hermit-trellis2-plus-plus |
- + (HiCache / scaled-Hermite): the published polynomial velocity-forecast basis; conservative, reproduces the HiCache paper. Use it to deploy the established method.
- ++ (HiCache++ / exponential): our Dynamic Mode Decomposition (Prony) basis, with the same near-lossless quality at wider skip intervals, where the polynomial diverges. Use it when you push the cache interval for more speed.
- standalone / model-agnostic: hicache-plus-plus (this repo), the forecaster itself, to add exponential caching to your own diffusion/flow model.
- fast-trellis2 = the TaylorSeer baseline fork (the upstream "Fast" accel): the v2 reference point, not a HiCache variant.
Lineage & attribution
- TaylorSeer: feature caching with a monomial (Taylor) basis.
- HiCache (arXiv:2508.16984): the scaled-Hermite polynomial upgrade. hicache_pp.hermite is a clean reimplementation.
- HiCache++ (this work): the Dynamic Mode Decomposition (Prony) exponential forecaster (hicache_pp.dmd). DMD (Schmid 2010) / Prony (1795) / Matrix-Pencil (Hua–Sarkar 1990) are classical spectral estimation; their application to diffusion feature caching is, to our knowledge, new.
- Adaptive-CFG (Adaptive Guidance, arXiv:2312.12487): composable uncond-skip, included in the tree module.
Citation
If you use this library, please cite HiCache++ (this work) and the methods it builds on:
@misc{hicachepp2026,
title = {HiCache++: Training-free Diffusion Inference Acceleration via Exponential (DMD/Prony) Velocity Forecasting},
author = {Attri, Krishi},
year = {2026},
note = {https://github.com/Archerkattri/hicache-plus-plus}
}
@misc{hicache2025,
title = {HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial Feature Forecasting},
eprint = {2508.16984}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, year = {2025}
}
@misc{taylorseer2025,
title = {From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers},
eprint = {2503.06923}, archivePrefix = {arXiv}, year = {2025}
}
@article{schmid2010dmd,
title = {Dynamic mode decomposition of numerical and experimental data},
author = {Schmid, Peter J.},
journal = {Journal of Fluid Mechanics}, volume = {656}, pages = {5--28}, year = {2010}
}
@article{hua1990matrixpencil,
title = {Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise},
author = {Hua, Yingbo and Sarkar, Tapan K.},
journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
volume = {38}, number = {5}, pages = {814--824}, year = {1990}
}
@misc{adaptiveguidance2023,
title = {Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models},
eprint = {2312.12487}, archivePrefix = {arXiv}, year = {2023}
}