Training-free diffusion inference acceleration via exponential (DMD/Prony) velocity forecasting

These details have not been verified by PyPI

Project links

Project description

HiCache++

Training-free diffusion inference acceleration by exponential velocity forecasting.

A drop-in upgrade to TaylorSeer / HiCache: replace the polynomial feature-cache basis with a Dynamic-Mode-Decomposition (Prony) exponential basis — exact on the class diffusion features actually live in, so it stays lossless at larger skip intervals than the polynomial.

training‑free PyTorch CPU tests license MIT python

When to use this repo

These repos are complementary accelerators, not competing solutions — each speeds up a different base generator, and the + / ++ suffix is a method choice, not a rival product. Pick by (1) which base model you run, then (2) which forecast basis you want:

base generator	`+` = HiCache (Hermite)	`++` = HiCache++ (DMD)
Hunyuan3D-2.1	`hunyuan2.1-plus`	`hunyuan2.1-plus-plus`
Hunyuan3D-2 mini	`hunyuan2-plus`	`hunyuan2-plus-plus`
SAM 3D Objects	`sam3d-plus`	`sam3d-plus-plus`
Fast-SAM3D	`fastsam3d-plus`	`fastsam3d-plus-plus`
DiT-XL/2 (ImageNet)	`dit-plus`	`dit-plus-plus`
TRELLIS (v1)	`faster-trellis`	`faster-trellis-plus-plus`
TRELLIS.2-4B (v2)	`hermit-trellis2`	`hermit-trellis2-plus-plus`

+ (HiCache / scaled-Hermite): the published polynomial velocity-forecast basis — conservative, reproduces the HiCache paper. Use it to deploy the established method.
++ (HiCache++ / DMD exponential): our Dynamic-Mode-Decomposition basis — the same near-lossless quality at wider skip intervals, where the polynomial diverges. Use it when you push the cache interval for more speed.
standalone / model-agnostic: hicache-plus-plus — the forecaster itself, to add DMD caching to your own diffusion/flow model.
fast-trellis2 = the TaylorSeer baseline fork (the upstream "Fast" accel) — the v2 reference point, not a HiCache variant.

This repo: hicache-plus-plus — the standalone HiCache++ forecaster (DMD/Prony exponential velocity cache) + the Hermite baseline — model-agnostic; the per-model integrations are the sibling repos above.

TL;DR

On a flow-matching / diffusion denoise loop you can skip the network on most steps and forecast the velocity from cached anchors. The state of the art (TaylorSeer, HiCache) forecasts with a polynomial basis (monomial / scaled-Hermite). But a diffusion feature trajectory is the solution of a near-linear feature-ODE whose exact solution class is a sum of (damped/oscillatory) exponentials — not polynomials. Polynomials diverge under extrapolation, which is exactly why every polynomial cache caps out at a modest skip.

HiCache++ forecasts with Dynamic Mode Decomposition (Schmid 2010) — the SVD-regularised generalisation of Prony's method (1795): identify the linear propagator A from raw velocity snapshots (F_{t+1} ≈ A F_t), eigendecompose it once, and predict any (fractional) horizon k by eigenvalue powers:

F_{t+k} ≈ Φ (λ**k ⊙ b),     b = Φ⁺ F_t

It is exact on exponential trajectories (the solution class) — the property polynomials lack — so it holds quality at skip intervals where Hermite/Taylor drift.

Headline: on Hunyuan3D-2.1, as the skip interval grows the polynomial (Hermite) decays fast — 0.88 → 0.74 → 0.38 at interval 3 / 5 / 6 — while the exponential (DMD) holds: 0.85 → 0.86 → 0.62 (baseline 0.91). DMD's lead grows with the skip — +0.13 at i5, +0.24 at i6 — the exponential basis is what extends the lossless skip range.

How it compares

Every modern feature cache skips the network on most steps and forecasts the velocity; they differ in the basis used to extrapolate. The basis is what sets the skip ceiling, because a diffusion feature trajectory is (locally) a sum of exponentials, not a polynomial:

Method	Forecast basis	Exact on the feature-ODE class	Extrapolation	Max lossless skip*
TaylorSeer	monomial (Taylor)	✗	diverges	small
HiCache	scaled-Hermite	✗	drifts	interval‑3
FoCa · Padé · Chebyshev	rational / orthogonal poly	✗	drifts	small–moderate
HiCache++ (this work)	exponential (DMD / Prony)	✓ exact	bounded, correct asymptotics	interval‑5–6

_{*measured on Hunyuan3D-2.1 / SAM3D-slat (see Results). A polynomial basis is only a
local truncation of the exponential, so it is accurate for a tiny skip and diverges as the
horizon grows; the exponential basis is the exact solution class, so it stays lossless
further out — and DMD admits fractional horizons, so it forecasts sub-steps between
compute steps exactly.}

Why exponentials (the math)

A diffusion/flow-matching sampler integrates dx/dt = v_θ(x, t). Across timesteps the cached feature F_t (the CFG-combined velocity) evolves under a slowly-varying, near-linear operator. The exact solution of a linear ODE Ḟ = M F is F_t = Σ_j a_j e^{μ_j t} — a sum of exponentials with poles μ_j (damped if Re μ_j < 0, oscillatory if Im μ_j ≠ 0).

Polynomial basis (Taylor monomials, Hermite): a local Taylor truncation of that exponential. Accurate for a tiny skip, diverges as the horizon grows → modest skip cap.
Exponential basis (DMD / Prony): the exact function class. Fit the poles λ_j = e^{μ_j Δ} from snapshots and extrapolate with bounded, correct asymptotics.

The ≥4-snapshot floor. A real-valued trajectory spends two real degrees of freedom on every complex pole (a conjugate pair r e^{±iω} → r^t cos ωt, r^t sin ωt). So even a single oscillatory mode needs rank 3 to identify, i.e. 3 snapshot-pairs = 4 snapshots. With only 2 pairs the fit aliases (empirically ~2e-1 error vs ~5e-9 at 3 pairs). Below the floor (or across a non-uniform window) HiCache++ falls back to the Hermite forecast for warm-up.

Results (A/B, geometry-preserving)

All accelerators are training-free and geometry-preserving; the right A/B is how far the output drifts from the uncached/baseline geometry vs how much faster it runs.

Mechanism — controlled, no model

Forecasting H steps past an 8-step cached window on synthetic trajectories from the exact feature-ODE class — three forecast bases, rel. L2 error (↓):

basis	H=1	H=2	H=4	H=6	H=8
TaylorSeer (polynomial)	1.5e-2	8.0e-2	6.2e-1	2.3e0	6.5e0
Padé / FoCa (rational)	4.9e-2	1.1e-1	2.4e-1	5.3e-1	1.2e0
HiCache++ (exponential)	4.7e-9	1.4e-8	5.3e-8	1.2e-7	2.2e-7

The exponential basis is exact (~1e-8, flat in H); the polynomial diverges, and the rational (Padé / FoCa) improves on it but still diverges — 6-to-9 orders of magnitude behind DMD, and under noise the rational basis turns fragile (Froissart poles). That gap is the skip ceiling. Reproduce: python benchmarks/forecast_microbench.py.

Hunyuan3D-2.1 (flat DiT velocities) — Toys4K F-score@0.05

Excludes ball_000 (a sphere — Go-ICP alignment is rotationally degenerate on it; two runs otherwise agree to ±0.01). Speedup is solo / uncontended.

interval	Hermite (HiCache)	DMD (HiCache++)	speedup
baseline (uncached)	0.911	0.911	1.00×
i3	0.876	0.852	1.72×
i4	0.776	0.827	1.80×
i5	0.735	0.860	1.79×
i6	0.375	0.616	~2.0×

DMD degrades gracefully where Hermite collapses, and its lead grows with the interval. On the deployed Hunyuan3D-2-mini, DMD is exactly lossless at i5 (0.794 = baseline 0.794).

SAM3D (PyTree velocities, slat FlowMatching) — real weights, F1 vs baseline

config	speedup	CD_vs_base	F1_vs_base
vanilla	1.00×	0.000	1.000
HiCache i3	1.44×	0.013	1.000
DMD i5	1.47×	0.013	1.000
DMD i6	1.56×	0.013	1.000

Both are geometry-lossless (F1=1.000); DMD stays lossless to interval-6, where it gives the best speedup — past Hermite's lossless i3.

Fast-SAM3D (SS-stage TaylorSeer)

Hermite ≈ Taylor (a wash): both run the same stride-3 schedule, so the basis swap doesn't change latency — TaylorSeer caching (the default) is what gives the ~3×, not the basis.

TRELLIS v1 (sparse-structure stage) — Toys4K F-score@0.05, n=31

Swapping only the SS forecast basis Hermite→DMD in faster-trellis (same carved-hybrid schedule):

variant	F@0.05	speedup	vs vanilla
vanilla (uncached)	0.839	1.00×	—
HiCache (Hermite)	0.825	2.82×	−0.014
HiCache++ (DMD)	0.829	2.76×	−0.010

At the deployed ~interval-3 (2.8×), DMD is the most lossless accelerator (beats Hermite by +0.005 at matched speed); the margin widens at higher intervals. The same holds on TRELLIS.2-4B (v2) — DMD ties Hermite at the deployed interval and pulls +0.03–0.04 F-score ahead at intervals 3–4 (see hermit-trellis2-plus-plus). (The DiT-XL/2 ImageNet FID-vs-latency table is still in progress.)

Install / use

import torch
from hicache_pp import hicache_init, hicache_decide, hicache_update_derivatives, hicache_forecast
from hicache_pp import dmd_update_snapshots, dmd_forecast_state   # the exponential forecaster

# in your denoise loop (flat tensor velocities):
state = hicache_init(num_steps=N, interval=5, first_enhance=4, backend="dmd", history=6)
for i, t in enumerate(timesteps):
    if hicache_decide(state) == "forecast":
        v = dmd_forecast_state(state)            # skip the network — forecast the velocity
        state["step"] += 1
    else:
        v = model(x, t, ...)                     # the expensive forward
        hicache_update_derivatives(state, v.detach())
        dmd_update_snapshots(state, v.detach(), state["history"])
        state["step"] += 1
    x = scheduler.step(v, t, x)

For PyTree / structured velocities (e.g. SAM3D), use hicache_pp.tree — the same API but tree-aware (hicache_forecast_tree, dmd_forecast_tree, plus tree Adaptive-CFG).

See integrations/ for the exact wiring into Hunyuan3D-2.1, Hunyuan3D-2-mini, SAM3D and Fast-SAM3D, benchmarks/ for the controlled forecast microbenchmark, and results/ for the full tables.

Tests

python -m hicache_pp.hermite     # Hermite basis + schedule (CPU, no GPU/model)
python -m hicache_pp.dmd         # DMD exact-on-exponential + ≥4-snapshot floor
python -m hicache_pp.tree        # tree-aware Hermite + DMD + Adaptive-CFG
python tests/run_tests.py        # all of the above

Lineage & attribution

TaylorSeer — feature caching with a monomial (Taylor) basis.
HiCache (arXiv:2508.16984) — the scaled-Hermite polynomial upgrade. hicache_pp.hermite is a clean reimplementation.
HiCache++ (this work) — the DMD/Prony exponential forecaster (hicache_pp.dmd). DMD (Schmid 2010) / Prony (1795) / Matrix-Pencil (Hua–Sarkar 1990) are classical spectral estimation; their application to diffusion feature caching is, to our knowledge, new.
Adaptive-CFG (Adaptive Guidance, arXiv:2312.12487) — composable uncond-skip, included in the tree module.

Citation

If you use this library, please cite HiCache++ (this work) and the methods it builds on:

@misc{hicachepp2026,
  title  = {HiCache++: Training-free Diffusion Inference Acceleration via Exponential (DMD/Prony) Velocity Forecasting},
  author = {Attri, Krishi},
  year   = {2026},
  note   = {https://github.com/Archerkattri/hicache-plus-plus}
}

@misc{hicache2025,
  title  = {HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial Feature Forecasting},
  eprint = {2508.16984}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, year = {2025}
}

@misc{taylorseer2025,
  title  = {From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers},
  eprint = {2503.06923}, archivePrefix = {arXiv}, year = {2025}
}

@article{schmid2010dmd,
  title   = {Dynamic mode decomposition of numerical and experimental data},
  author  = {Schmid, Peter J.},
  journal = {Journal of Fluid Mechanics}, volume = {656}, pages = {5--28}, year = {2010}
}

@article{hua1990matrixpencil,
  title   = {Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise},
  author  = {Hua, Yingbo and Sarkar, Tapan K.},
  journal = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
  volume  = {38}, number = {5}, pages = {814--824}, year = {1990}
}

@misc{adaptiveguidance2023,
  title  = {Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models},
  eprint = {2312.12487}, archivePrefix = {arXiv}, year = {2023}
}

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.2.0

Jun 10, 2026

1.1.0

Jun 10, 2026

1.0.0

Jun 7, 2026

This version

0.1.0

Jun 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hicache_pp-0.1.0.tar.gz (25.4 kB view details)

Uploaded Jun 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

hicache_pp-0.1.0-py3-none-any.whl (24.0 kB view details)

Uploaded Jun 7, 2026 Python 3

File details

Details for the file hicache_pp-0.1.0.tar.gz.

File metadata

Download URL: hicache_pp-0.1.0.tar.gz
Upload date: Jun 7, 2026
Size: 25.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for hicache_pp-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6937486e880e276736e4b225be91195f93fae0ba80d0e7911e9bc69408da3a81`
MD5	`1e17ce821431dc2dfe86c307177ffcb6`
BLAKE2b-256	`eba33dbefe9980c24019292f9796ea04ea10107d5339833c849f33570f3dc037`

See more details on using hashes here.

File details

Details for the file hicache_pp-0.1.0-py3-none-any.whl.

File metadata

Download URL: hicache_pp-0.1.0-py3-none-any.whl
Upload date: Jun 7, 2026
Size: 24.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for hicache_pp-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`504dbced98ee3df9769425b2158b576103d291e317c4cf48908245a4adbd262a`
MD5	`30fafcdfab51baf45bd2f17a954c99f9`
BLAKE2b-256	`7dae5b6d5a1ee9152f1046fc586afed2bbc16f297a60e8e031ee3307ca1c153f`

See more details on using hashes here.

hicache-pp 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

HiCache++

When to use this repo

TL;DR

How it compares

Why exponentials (the math)

Results (A/B, geometry-preserving)

Mechanism — controlled, no model

Hunyuan3D-2.1 (flat DiT velocities) — Toys4K F-score@0.05

SAM3D (PyTree velocities, slat FlowMatching) — real weights, F1 vs baseline

Fast-SAM3D (SS-stage TaylorSeer)

TRELLIS v1 (sparse-structure stage) — Toys4K F-score@0.05, n=31

Install / use

Tests

Lineage & attribution

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes