Drop-in TaylorSeer/HiCache basis upgrade: training-free diffusion acceleration via Dynamic Mode Decomposition (Prony) exponential feature forecasting
Project description
HiCache++
No single forecast basis wins across diffusion families. HiCache++ ships the exponential
(Dynamic Mode Decomposition / Prony) basis that wins on flow-matching 3D generators, a
corrected (sign-fixed) Hermite polynomial that dominates on DiT-class denoising, and a
training-free holdout selector (backend="auto") that solves regime switches but, per our
own pre-registered DiT A/B, does NOT solve the domain split. Pick the basis by model
family; the numbers for both families are below.
Feature caches (TaylorSeer, HiCache) skip the network on most denoising steps and forecast the cached features from recent compute-step anchors. The literature treats the forecast basis as a ladder to climb: monomial, Hermite, rational, Chebyshev, each pitched as a better basis. We built the natural endpoint of that ladder, the exponential solution class of the local feature-ODE fitted by Dynamic Mode Decomposition (Prony), benchmarked it across two diffusion families, and found the honest answer:
The domain split (read this first)
| workload | winning basis | evidence |
|---|---|---|
| Flow-matching 3D generators (Hunyuan3D-2.1 / 2-mini, SAM3D, TRELLIS v1/v2) | exponential (DMD) | +0.13 / +0.24 F-score over the deployed Hermite arm at intervals 5 / 6 on Hunyuan3D-2.1; exactly lossless at i5 on 2-mini; geometry-lossless (F1 = 1.000) through i6 at 1.56x on SAM3D |
| DiT-class denoising (DiT-XL/2, ImageNet-256, 250-step DDPM) | polynomial | corrected TaylorSeer is near-lossless (paired-noise FID drift 2.27 at 3.81x) and the corrected Hermite sweeps the interval ladder (3.54 / 6.46 / 10.74 at i4 / i6 / i8). The exponential basis drifts 5-9x more than the corrected Hermite at every interval tested |
Neither basis transfers, and on the Pareto front each family is owned outright: corrected TaylorSeer / Hermite dominate DiT at every interval, DMD dominates the flow-matching 3D family at every interval past i3. Any single-basis claim you have read (including our own earlier framing) is family-conditional.
The selection verdict, stated plainly. We built a training-free selector
(backend="auto") that backcasts a held-out snapshot with both bases per compute window
and serves the winner, in two modes (1-step and horizon-matched backcast). It is exact on
synthetic regime switches (120/120), harmless on the 3D generators, and wrong on DiT in
both modes: auto reads FID drift 18.11 in the pre-registered A/B (both modes,
identical) where the corrected Hermite sits at 3.54. The reason is characterized below:
on DiT's real features DMD's richer parameterization fits and backcasts the snapshot
history better while extrapolating forward worse, so in-window backcast fit carries no
signal about forward quality. Therefore:
- Pick the basis by model family (the table above). That is the recommendation.
- Use
backend="auto"only as a safety net for intra-run regime switches, which it demonstrably solves; do not expect it to discover the family-level winner. - Per-window basis selection from real feature trajectories is an open problem.
Quickstart
import torch
from hicache_pp import hicache_init, hicache_decide, hicache_update_derivatives
from hicache_pp import dmd_update_snapshots, dmd_forecast_state
# backend: pick by model family -- "dmd" for flow-matching 3D generators,
# "hermite" for DiT-class denoising, "auto" only as a regime-switch safety net.
state = hicache_init(num_steps=N, interval=5, first_enhance=4, backend="dmd", history=6)
for i, t in enumerate(timesteps):
if hicache_decide(state) == "forecast":
v = dmd_forecast_state(state) # skip the network, forecast the velocity
else:
v = model(x, t, ...) # the expensive forward
hicache_update_derivatives(state, v.detach())
dmd_update_snapshots(state, v.detach(), state["history"])
state["step"] += 1
x = scheduler.step(v, t, x)
If you already run TaylorSeer or HiCache this is a basis swap, not a new pipeline: the
compute/skip schedule, warm-up and API stay identical. Only the per-skip forecast formula
changes. Backends: "hermite" (corrected HiCache polynomial; pick this for DiT-class
denoising), "dmd" (exponential; pick this for flow-matching 3D generators), "auto"
(holdout selection; regime-switch safety net, see the verdict above).
Name note. HiCache here refers to the diffusion feature-forecasting method (Hermite polynomial feature caching, arXiv:2508.16984), which HiCache++ extends. It is unrelated to SGLang / Mooncake's "HiCache", a hierarchical KV cache for LLM serving. Likewise, in this repo "DMD" always abbreviates Dynamic Mode Decomposition (Prony), classical spectral estimation, and never Distribution Matching Distillation.
Why exponentials (the hypothesis, and where it holds)
A diffusion/flow-matching sampler integrates dx/dt = v_theta(x, t). If, across a short
window of steps, the cached feature F_t (the CFG-combined velocity) evolves under a
slowly varying near-linear operator, the trajectory locally solves a linear feature-ODE
F' = M F, whose exact solution class is a sum of damped/oscillatory exponentials
F_t = sum_j a_j e^{mu_j t}. Polynomials only locally truncate that class and diverge
under extrapolation; the exponential basis is exact on it.
DMD (Schmid 2010) is the SVD-regularised multivariate generalisation of Prony's method
(1795): identify the linear propagator A from raw velocity snapshots
(F_{t+1} ~ A F_t), eigendecompose once per compute window, and predict any (fractional)
horizon k by eigenvalue powers:
F_{t+k} ~ Phi (lambda**k * b), b = pinv(Phi) F_t
The >= 4-snapshot floor. A real-valued trajectory spends two real degrees of freedom
per complex pole (a conjugate pair r e^{+-iw} gives r^t cos wt, r^t sin wt), so a
single oscillatory mode already needs 3 snapshot pairs = 4 snapshots. Below the floor, or
across a non-uniform window, HiCache++ serves the Hermite forecast instead (warm-up is
covered automatically).
Where the hypothesis binds. Whether a real model's feature stream is in this class at
the horizons actually served is an empirical question per family. On flow-matching 3D
velocity streams it is (the exponential basis wins). On the 250-step DDPM DiT stream it is
not: the stream is locally so smooth that low-order polynomial truncation error is
negligible at the served horizons, while the exponential fit pays pole-estimation error
that compounds with the horizon. That is the domain split. And note that holdout
selection cannot see it from inside the window: DMD fits (and therefore backcasts) the
snapshot history better on DiT even while extrapolating forward worse, which is exactly
why auto picks the wrong arm there and why the family default is the recommendation.
Results
All accelerators here are training-free; the right A/B is how far the output drifts from the uncached baseline vs how much faster it runs. "DMD" below is the HiCache++ exponential basis.
Mechanism (controlled, no model)
Forecasting H steps past an 8-anchor window on synthetic trajectories from the exact
feature-ODE class, rel. L2 error (lower is better), all rows sign-correct:
| basis | H=1 | H=2 | H=4 | H=6 | H=8 |
|---|---|---|---|---|---|
| TaylorSeer (monomial) | 1.5e-2 | 8.0e-2 | 6.2e-1 | 2.3e0 | 6.5e0 |
| Pade / FoCa (rational) | 4.9e-2 | 1.1e-1 | 2.4e-1 | 5.3e-1 | 1.2e0 |
| HiCache (Hermite, fixed +k) | 1.5e-1 | 3.0e-1 | 6.3e-1 | 1.0e0 | 1.7e0 |
| HiCache++ (exponential) | 4.7e-9 | 1.4e-8 | 5.3e-8 | 1.2e-7 | 2.2e-7 |
On the class the exponential basis is exact and flat in H; every polynomial-family basis
diverges. Under 1% snapshot noise DMD's rank truncation contains the error at ~0.3 where
the monomial amplifies to 15. When the dynamics break the fit (an abrupt regime switch
inside the window), auto detects the misfit in 120/120 windows, serves the Hermite arm,
and contains the long-horizon error at 1.56 vs 9.40 for a forced exponential fit (the
plain monomial sits at 106); on clean/noisy/drifting trajectories it picks the exponential
arm 120/120 and matches it exactly. Full six-scenario tables for both holdout modes:
benchmarks/MICROBENCH_RESULTS.md. Reproduce:
python benchmarks/forecast_microbench.py.
Flow-matching 3D generators (where the exponential basis wins)
Hunyuan3D-2.1 (flat DiT velocities), Toys4K F-score@0.05 vs the uncached baseline
(0.911 = self-score). Excludes ball_000 (a sphere; Go-ICP alignment is rotationally
degenerate on it; other cells reproduce to +-0.01). The Hermite arm is the fork's deployed
(as-released, pre-sign-fix) forecaster; see the honesty note below. Speedup is solo.
| interval | Hermite arm (as released) | DMD (HiCache++) | speedup |
|---|---|---|---|
| baseline (uncached) | 0.911 | 0.911 | 1.00x |
| i3 | 0.876 | 0.852 | 1.72x |
| i4 | 0.776 | 0.827 | 1.80x |
| i5 | 0.735 | 0.860 | 1.79x |
| i6 | 0.375 | 0.616 | ~2.0x |
The exponential basis degrades gracefully where the polynomial arm collapses, and its lead grows with the interval. On the deployed Hunyuan3D-2-mini it is exactly lossless at i5 (0.794 = baseline 0.794).
SAM3D (PyTree velocities, slat FlowMatching), real weights, vs baseline:
| config | speedup | CD_vs_base | F1_vs_base |
|---|---|---|---|
| vanilla | 1.00x | 0.000 | 1.000 |
| HiCache i3 | 1.44x | 0.013 | 1.000 |
| DMD i5 | 1.47x | 0.013 | 1.000 |
| DMD i6 | 1.56x | 0.013 | 1.000 |
Geometry-lossless to interval 6, past the Hermite arm's lossless i3; the extra speed is entirely the wider lossless interval.
TRELLIS v1 (sparse-structure stage), Toys4K F-score@0.05, n=31, same carved-hybrid schedule, only the SS forecast basis swapped:
| variant | F@0.05 | speedup | vs vanilla |
|---|---|---|---|
| vanilla (uncached) | 0.839 | 1.00x | n/a |
| Hermite arm | 0.825 | 2.82x | -0.014 |
| HiCache++ (DMD) | 0.829 | 2.76x | -0.010 |
The same holds on TRELLIS.2-4B (v2): ties at the deployed interval, +0.03-0.04 F-score
at intervals 3-4
(see hermit-trellis2-plus-plus).
Full tables: results/RESULTS.md.
DiT-XL/2 ImageNet-256 (where the polynomial basis wins)
Paired-noise FID-10k ladder, 250-step DDPM, cfg 1.5: the per-step noise is re-seeded
identically across cells, so FID vs the uncached baseline measures pure cache-induced
drift (a lossless cache reads ~0; the interval-1 control reads -0.00). The ladder is
COMPLETE: both eras are labeled ("as released" = the pre-sign-fix near-reuse Hermite and
the auto cell measured alongside it; "corrected" = sign-fixed +k re-runs). FID values are
from the 10k runs; latencies marked * were re-timed post-eigencache at n=512 because
their FID runs predate the per-window eigendecomposition cache. Full protocol and ledger:
benchmarks/dit_imagenet/RESULTS_DIT.md.
| cell | speedup | FID drift (as released) | FID drift (corrected) | FID vs ImageNet-10k ref |
|---|---|---|---|---|
| baseline | 1.00x | 0.00 | 0.00 | 8.89 |
| taylor_i4 (TaylorSeer, +k) | 3.81x | n/a | 2.27 | 8.95 |
| hermite_i4 (HiCache, +k) | 3.79x | 10.57 | 3.54 | 9.60 |
| dmd_i4 (HiCache++) | 3.71x* | 18.02 | 18.02 | 21.47 |
| auto_i4 (1-step holdout) | 2.59x* | 18.08 | 18.11 | 21.57 |
| auto_i4 (horizon holdout) | 2.59x* | n/a | 18.11 | 21.57 |
| hermite_i6 (+k) | 5.46x | 28.06 | 6.46 | 11.61 |
| dmd_i6 | 5.31x* | 54.24 | 54.24 | 55.57 |
| hermite_i8 (+k) | 7.21x | 57.79 | 10.74 | 15.10 |
| dmd_i8 | 6.98x | 100.65 | 100.65 | 100.99 |
(Speedup shown for the corrected run where one exists; the as-released hermite cells ran at 4.02x / 5.80x / 7.66x. Absolute FID for as-released hermite/auto cells: 15.09 / 31.06 / 59.73 and 21.54.)
What this table says, honestly:
- Polynomial forecasting is near-lossless on DiT. The sign-correct TaylorSeer cell drifts 2.27 FID at 3.81x (absolute 8.95 vs the baseline's 8.89), and the corrected Hermite sweeps the ladder: 3.54 / 6.46 / 10.74 at i4 / i6 / i8. The corrected polynomial at i8 (7.2x) drifts less than the as-released one did at i4.
- The exponential basis loses here at every interval: 1.7-1.9x more drift than even
the as-released (near-reuse) control, 5-9x more than the corrected Hermite. Do not
deploy
backend="dmd"on DiT-class denoising; use the polynomial. - Holdout selection failed in the wild, in BOTH modes.
auto_i4tracks DMD (18.11 corrected, vs the corrected Hermite's 3.54) because the 1-gap backcast ranks DMD ahead.holdout="horizon"(backcast at the actual skip distance), which fixes the reconstructed inversion 20/20 on the microbench, reads 18.11 on DiT, identical: on real DiT features the horizon holdout ALSO serves the DMD arm. The microbench win did not transfer. Why: DMD's richer parameterization fits and backcasts the snapshot history better at any in-window distance while extrapolating forward worse, so backcast fit carries no signal about forward quality there. Recommendation: family defaults (above);autoonly as a regime-switch safety net; real-feature basis selection = open problem.
Not run: the FID-50k trio (~40 h GPU; under the paired-noise differential protocol it can only tighten an estimator bias that is identical across cells and cancels in the drift metric).
Sign-convention fix (2026-06-10), and why you should care even if you never use this repo. Versions up to and including v1.1.0 evaluated the Hermite basis at
x = -kinstead ofx = +k(k= steps past the newest anchor). One character, ours, introduced in porting; upstream TaylorSeer uses+k, but step-index conventions differ across caching codebases. Odd-order terms flipped sign, so the shipped Hermite forecast extrapolated backwards: strictly worse than plain reuse in every cell of the controlled suite. It survived every end-to-end benchmark because near-reuse fails safe (images render, F-scores degrade gracefully), and tuning concealed it (the buggy variant's best sigma is the one that mutes it toward reuse). The fixed (+k) Hermite beats reuse in 29 of 30 cells, is 1.45-2.47x more accurate than shipped, and keeps sigma = 0.5 optimal. The DMD andauto-selection paths were never affected (forward eigenvalue powers, a forward polynomial yardstick). Fixed in v1.2.0 (hicache_pp/hermite.pyandhicache_pp/tree.py) with closed-form directional regression tests (exact extrapolation of a linear series at sigma = sqrt(1/2)); we recommend the same one-line test to every caching codebase. Full shipped-vs-fixed-vs-reuse A/B:benchmarks/MICROBENCH_RESULTS.md. As-released numbers stay published in labeled columns next to corrected ones; the 3D-generator results were measured through the forks' deployed (as-released) Hermite arm and stand as measured.
Install / use
pip install hicache-pp
The one-loop snippet at the top is the whole integration for flat tensor velocities
(e.g. a DiT). For PyTree / structured velocities (e.g. SAM3D), use hicache_pp.tree,
the same API but tree-aware (hicache_forecast_tree, dmd_forecast_tree, plus tree
Adaptive-CFG). Backends:
backend="hermite": the published HiCache scaled-Hermite polynomial (clean, sign-correct reimplementation). The basis to pick for DiT-class denoising.backend="dmd": the exponential basis. Wins on flow-matching 3D generators; loses on DiT-class denoising (see the domain split above).backend="auto": holdout selection. Per compute step, backcast a held-out snapshot with both arms and serve whichever wins.holdout="1step"is the default;holdout="horizon"is the opt-in distance-matched test. Solves intra-run regime switches; does NOT recover the family-level winner on DiT (both modes serve the DMD arm there, FID drift 18.11 vs the corrected Hermite's 3.54). Use family defaults first; evidence inbenchmarks/MICROBENCH_RESULTS.mdandbenchmarks/dit_imagenet/RESULTS_DIT.md.
See integrations/ for the exact wiring into Hunyuan3D-2.1,
Hunyuan3D-2-mini, SAM3D and Fast-SAM3D, and
integrations/pr_drafts/ for prepared patches that add the
exponential basis to cache-dit, Hugging Face diffusers (TaylorSeerCacheConfig)
and Cache4Diffusion in each project's native conventions.
Tuning notes
- Pick the basis by family. Flow-matching 3D velocity streams:
"dmd", push the interval to i5-i6. DiT-class denoising:"hermite"(or a plain TaylorSeer); never"dmd", and do not rely on"auto"to save you (it serves the DMD arm there). - Hermite: lossless up to a modest interval (Hunyuan-2.1: i3/order-2). Higher order does not rescue bigger intervals.
- Exponential:
historyis the snapshot window (5-6); needs >= 4 uniformly spaced snapshots before it engages (Hermite covers warm-up automatically). first_enhancealways computes the first few steps (high curvature); keep it >= 3.- The DMD eigendecomposition is cached per compute window (refit exactly when a new
snapshot arrives), so the per-skip forecast cost is one
Phi @ (lambda**k * b): 2.5-3.1x cheaper per forecast than fit-per-call on CPU (benchmarks/eigencache_timing.py). On the DiT GPU re-time it cuts dmd_i4 from 566 to 483 ms/img (3.17x -> 3.71x) and dmd_i6 from 434 to 337 ms/img (4.13x -> 5.31x).
Tests
python -m hicache_pp.hermite # Hermite basis + schedule + sign-convention regressions
python -m hicache_pp.dmd # exponential basis + auto holdout modes + eigencache
python -m hicache_pp.tree # tree-aware Hermite + exponential + Adaptive-CFG + eigencache
python tests/run_tests.py # all of the above + DiT-harness tests (taylor sign,
# FID checkpoint/resume) + version sync; also
# pytest-discoverable
3D generator integrations (sibling repos)
The forecaster in this repo is model-agnostic; it has also been wired natively into a
family of 3D-generator forks. These are complementary accelerators, not competing
solutions: each speeds up a different base generator, and the + / ++ suffix is a
method choice (+ = HiCache Hermite polynomial, ++ = HiCache++ exponential), not a
rival product. Pick by (1) which base model you run, then (2) which forecast basis you
want:
| base generator | + = HiCache (Hermite) |
++ = HiCache++ (DMD) |
|---|---|---|
| Hunyuan3D-2.1 | hunyuan2.1-plus |
hunyuan2.1-plus-plus |
| Hunyuan3D-2 mini | hunyuan2-plus |
hunyuan2-plus-plus |
| SAM 3D Objects | sam3d-plus |
sam3d-plus-plus |
| Fast-SAM3D | fastsam3d-plus |
fastsam3d-plus-plus |
| TRELLIS (v1) | faster-trellis |
faster-trellis-plus-plus |
| TRELLIS.2-4B (v2) | hermit-trellis2 |
hermit-trellis2-plus-plus |
All of these are flow-matching 3D generators, i.e. the side of the domain split where the
exponential basis wins. fast-trellis2 is the TaylorSeer baseline fork (the upstream
"Fast" accel), the v2 reference point, not a HiCache variant.
Lineage & attribution
- TaylorSeer: feature caching with a monomial (Taylor) basis.
- HiCache (arXiv:2508.16984): the scaled-Hermite polynomial upgrade.
hicache_pp.hermiteis a clean reimplementation (sign-corrected as of v1.2). - HiCache++ (this work): the exponential (DMD/Prony) forecaster (
hicache_pp.dmd), the cross-family domain-split benchmark, and theautoholdout selector with both holdout modes. DMD (Schmid 2010) / Prony (1795) / Matrix-Pencil (Hua-Sarkar 1990) are classical spectral estimation; their application to diffusion feature caching is, to our knowledge, new. - Adaptive-CFG (Adaptive Guidance, arXiv:2312.12487): composable uncond-skip, included in the tree module.
Citation
If you use this library, please cite HiCache++ (this work) and the methods it builds on:
@misc{hicachepp2026,
title = {No Single Basis Wins: An Empirical Domain Split in Diffusion Feature
Forecasting, and a Training-Free Mechanism for Selecting the Basis},
author = {Attri, Krishi},
year = {2026},
note = {https://github.com/Archerkattri/hicache-plus-plus}
}
@misc{hicache2025,
title = {HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial Feature Forecasting},
eprint = {2508.16984}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, year = {2025}
}
@misc{taylorseer2025,
title = {From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers},
eprint = {2503.06923}, archivePrefix = {arXiv}, year = {2025}
}
@article{schmid2010dmd,
title = {Dynamic mode decomposition of numerical and experimental data},
author = {Schmid, Peter J.},
journal = {Journal of Fluid Mechanics}, volume = {656}, pages = {5--28}, year = {2010}
}
@article{hua1990matrixpencil,
title = {Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise},
author = {Hua, Yingbo and Sarkar, Tapan K.},
journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
volume = {38}, number = {5}, pages = {814--824}, year = {1990}
}
@misc{adaptiveguidance2023,
title = {Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models},
eprint = {2312.12487}, archivePrefix = {arXiv}, year = {2023}
}
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file hicache_pp-1.2.0.tar.gz.
File metadata
- Download URL: hicache_pp-1.2.0.tar.gz
- Upload date:
- Size: 44.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6951ee2afac9ee9448c6b91ef8822ba5712d2f3898474a892260b3d407c132c5
|
|
| MD5 |
e80783b8ae1c4a26f1420b9aea0623c1
|
|
| BLAKE2b-256 |
bab60b978bac88c29f6005f4dd0dcad72617c4c54889fb7090837a10af4d41e0
|
File details
Details for the file hicache_pp-1.2.0-py3-none-any.whl.
File metadata
- Download URL: hicache_pp-1.2.0-py3-none-any.whl
- Upload date:
- Size: 34.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e77164b1ff6e2f3060c1e47fffb6ee03039ddef9682a0ffb2963b21d69d10e07
|
|
| MD5 |
9a59d919059c651df290ce2b74403ee4
|
|
| BLAKE2b-256 |
9b8ac166b76af7fa437947d2ccc15fa89d9675e110e0ba608750403dccb25eed
|