Drop-in throughput and memory optimisations for FAIR Hiera. 0.2: graph-safe MAE for torch.compile(reduce-overhead) — 2.37x on Hiera-Base GH200 over eager.

hiera-optim

Drop-in throughput optimisations for FAIR's Hiera and its MAE variant. Two lines for the layout fix, one extra flag for graph-safe torch.compile(reduce-overhead). Numerically equivalent within bf16 noise; weights preserved.

from hiera_optim import optimize
optimize(model)                                  # 1.3–1.9x: layout + gather
optimize(model, graph_safe=True)                 # 2.2–2.4x e2e once you also torch.compile

Results

GH200 (Hopper), bf16, full forward + backward, B=128 in-chans=8.

variant	eager	`optimize(model)`	`optimize(model, graph_safe=True)` + `compile(mode="reduce-overhead")`
Hiera-Tiny	1.00x	1.75x	2.20x
Hiera-Small	1.00x	1.46x	2.34x
Hiera-Base	1.00x	1.32x	2.37x

Hiera-Base step time: 112 ms → 85 ms → 47 ms. Same loss within 5e-3 rel diff in bf16; same gradient flow (worst grad RMS in tests: 4.4e-5).

RTX 4090, Hiera-Base, in-chans=8, B=8: eager 54.3 ms → 13.1 ms (4.13x) under graph-safe + reduce-overhead.

Multi-GPU (DDP) — GH200 4-GPU, B=128 / rank

variant	ms / step (4-GPU)	total samp / s	DDP overhead vs 1-GPU
Hiera-Base	52.3	9,787	+10% (all-reduce on 4× Hopper)
Hiera-Tiny	40.2	12,753	similar

Validated with static_graph=True on DDPStrategy, fused AdamW, and mem_efficient SDPA. Gradients agree across ranks (cross-rank rel diff 0.0).
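
As a sanity check on the table, aggregate throughput follows from step time as ranks × per-rank batch ÷ step time. The sketch below uses the rounded figures quoted above, so it lands within a fraction of a percent of the table rather than exactly on it. The function name is illustrative and not part of hiera_optim, and comparing against the 47 ms single-GPU Hiera-Base step assumes both numbers come from the same graph-safe configuration.

```python
def ddp_throughput(step_ms: float, per_rank_batch: int, world_size: int) -> float:
    """Samples per second summed over all ranks."""
    return world_size * per_rank_batch / (step_ms / 1000.0)


base = ddp_throughput(52.3, per_rank_batch=128, world_size=4)
tiny = ddp_throughput(40.2, per_rank_batch=128, world_size=4)

# Relative cost of the 4-GPU step against the 47 ms single-GPU Hiera-Base step.
overhead = 52.3 / 47.0 - 1.0

print(f"Hiera-Base: {base:,.0f} samp/s, Hiera-Tiny: {tiny:,.0f} samp/s")
print(f"DDP overhead on Hiera-Base: {overhead:.1%}")
```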

Variant × q_pool sweep — GH200 B=128 in-chans=8

	q_pool=1	q_pool=2	q_pool=3
Hiera-Tiny	2.35x	2.16x	3.17x
Hiera-Small	2.31x	2.27x	3.33x
Hiera-Base	2.25x	2.31x	3.12x

No regressions across the architecture sweep. q_pool=3 configs see >3× speedup.

Hiera-Large — GH200 in-chans=8 bf16

	eager	`optimize(model)`	graph_safe + reduce-overhead
Hiera-Large B=64	1.00x (140 ms)	1.40x	2.72x (52 ms)
Hiera-Large B=128	1.00x (169 ms)	1.47x	1.73x (98 ms)

Peak memory: 9.2 GiB (B=64), 16.8 GiB (B=128). Large is compute-heavy so the launch-amortization win is smaller at high B than for Hiera-Base, but still 1.7–2.7×.

1-D Hiera (launch-bound) — GH200 bf16, base, fwd+bwd

A genuine 1-D MAE (q_stride=(2,), 1024 samples → 64 tokens, 8 mask units) is in a different regime than 2-D: it's launch-bound (tiny per-channel GEMMs, thousands of small kernels), not attention-bound.

	eager	`optimize(model)` (layout)	graph_safe + reduce-overhead
B=128	1.00x (1931 samp/s, 0.9 GB)	0.92x (no-op)	8.46x (16,333 samp/s, 0.8 GB)
B=512	1.00x (7643 samp/s, 3.5 GB)	1.22x	3.12x (23,829 samp/s, 1.0 GB)

On 1-D the eager layout swap contributes little: 0.92x at B=128 (a slight regression) and 1.22x at B=512. Nearly all of the win comes from graph_safe=True + compile(mode="reduce-overhead"). That combination also cuts memory ~3.5× at B=512 (3.5→1.0 GB), which is what lets you push much larger batches. Loss is bit-identical to FAIR for a matched mask. For 1-D training, graph_safe + reduce-overhead is the recommended default; see CI_SPEEDUP.md.

Full Wave-1 layout-fix matrix: MATRIX_RESULTS.md. Changelog: CHANGELOG.md.

Install

pip install hiera-optim

From source:

git clone https://github.com/avocardio/hiera-optim.git
cd hiera-optim
pip install -e .

PyTorch >= 2.5, Triton >= 2.3. Recognises FAIR Hiera in-tree (models.hiera) or PyPI (hiera-transformer).

Usage — minimal (1.3-1.9x)

import torch
from hiera_optim import optimize
from hiera import mae_hiera_base_224

model = mae_hiera_base_224(pretrained=False, in_chans=8, input_size=(224, 224))
model = model.to("cuda", torch.bfloat16)         # match the device and dtype of x below
optimize(model)
model = torch.compile(model, mode="default", dynamic=False)

x = torch.randn(128, 8, 224, 224, device="cuda", dtype=torch.bfloat16)
loss, *_ = model(x, mask_ratio=0.6)
loss.backward()

optimize(model) does two things, in place:

Swap every MaskUnitAttention for a 4-D Q/K/V variant so PyTorch SDPA dispatches to FlashAttention / cuDNN-attn / mem-efficient instead of math. FAIR's original feeds SDPA a 5-D tensor that the fused kernels reject, and the resulting math fallback costs ~13x per call on Ada, ~6x on Hopper.
Swap x[mask.tile(...)] and x_dec[mask] = ... for explicit torch.gather / scatter_. Removes the indexing_backward_kernel and the aten::nonzero graph break.
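
To illustrate the second point, here is a toy comparison of boolean-mask indexing against torch.gather / scatter_ on a small tensor. It is a sketch of the indexing idea only, not the code hiera_optim installs; the shapes and variable names are made up for the example. The key difference is that the gather output shape is fixed by the index tensor, while the boolean-indexed shape depends on the mask contents.

```python
import torch

torch.manual_seed(0)
B, N, C, keep = 2, 8, 4, 3  # batch, tokens, channels, tokens kept per sample

x = torch.randn(B, N, C)

# Pick `keep` token indices per sample, sorted so the order matches boolean indexing.
keep_idx = torch.rand(B, N).argsort(dim=1)[:, :keep].sort(dim=1).values
mask = torch.zeros(B, N, dtype=torch.bool)
mask.scatter_(1, keep_idx, True)

# Boolean indexing: the number of selected rows is only known from the mask values.
picked_bool = x[mask].view(B, keep, C)

# Gather: the output shape is fixed by the shape of the index tensor.
idx = keep_idx.unsqueeze(-1).expand(-1, -1, C)
picked_gather = torch.gather(x, 1, idx)

# Scatter the kept tokens back into a zero-filled full-length sequence.
restored = torch.zeros_like(x).scatter_(1, idx, picked_gather)

print(torch.equal(picked_bool, picked_gather))
```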

Usage — graph-safe + `reduce-overhead` (2.2–2.4x on GH200)

graph_safe=True rewrites forward_loss and get_pixel_label_* with mask-weighted-mean reductions (no pred[mask] boolean indexing — the data-dependent shape would otherwise crash CUDA Graphs on replay) and auto-pins a CUDA-Graph-safe SDPA backend on Hopper (cuDNN-attention isn't graph-safe in PyTorch 2.9).

import torch
from hiera_optim import optimize, MAEStepInputs
from hiera import mae_hiera_base_224

model = mae_hiera_base_224(pretrained=False, in_chans=8, input_size=(224, 224))
model = model.to("cuda", torch.bfloat16)
optimize(model, graph_safe=True)
model = torch.compile(model, mode="reduce-overhead", dynamic=False)

# CUDA Graphs need stable input tensor addresses across iterations.
inputs = MAEStepInputs(
    model._orig_mod, batch_size=128, in_chans=8,
    input_size=(224, 224), mask_ratio=0.6,
    device="cuda", dtype=torch.bfloat16,
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

for batch in loader:
    inputs.x.copy_(batch.cuda(non_blocking=True))
    inputs.refresh_mask()
    out = model(inputs.x, mask_ratio=0.6,
                mask=inputs.mask, keep_idx=inputs.keep_idx)
    out[0].backward()
    opt.step(); opt.zero_grad()

MAEStepInputs handles the static-buffer pattern that CUDA Graphs require: the mask and keep_idx are sampled outside the captured region and copied into persistent tensors each step.
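
The static-buffer pattern itself can be sketched in a few lines of plain PyTorch. The class below is a hypothetical stand-in written for illustration (it is not MAEStepInputs, and its name, arguments, and mask convention are assumptions): buffers are allocated once, and every refresh overwrites their contents in place so the tensor addresses stay the same.

```python
import torch


class StaticMaskBuffers:
    """Hypothetical illustration of persistent mask / keep_idx buffers."""

    def __init__(self, batch_size, num_tokens, mask_ratio, device="cpu"):
        self.num_keep = int(num_tokens * (1 - mask_ratio))
        # Allocated once; later steps only overwrite the contents in place.
        self.mask = torch.zeros(batch_size, num_tokens, dtype=torch.bool, device=device)
        self.keep_idx = torch.zeros(batch_size, self.num_keep, dtype=torch.long, device=device)

    def refresh_mask(self):
        # Sample a fresh random subset outside any captured region.
        noise = torch.rand(self.mask.shape, device=self.mask.device)
        new_idx = noise.argsort(dim=1)[:, : self.num_keep].sort(dim=1).values
        # Copy into the persistent tensors so their addresses never change.
        self.keep_idx.copy_(new_idx)
        self.mask.fill_(False)
        self.mask.scatter_(1, self.keep_idx, True)  # True marks a kept token here


bufs = StaticMaskBuffers(batch_size=4, num_tokens=16, mask_ratio=0.6)
ptrs = (bufs.mask.data_ptr(), bufs.keep_idx.data_ptr())
bufs.refresh_mask()
bufs.refresh_mask()
print(ptrs == (bufs.mask.data_ptr(), bufs.keep_idx.data_ptr()))
```

Checking data_ptr() before and after a refresh is a cheap way to confirm that nothing in a training loop is silently re-allocating an input.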

With PyTorch Lightning (one-line training_step)

For Lightning users, MAEStaticInputsCallback owns the buffer lifecycle so the training_step stays clean. Install with pip install hiera-optim[lightning] or just pip install hiera-optim lightning.

import torch
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy
from hiera_optim import optimize, MAEStaticInputsCallback, feed_batch

# in LightningModule.__init__:
optimize(self.model, graph_safe=True)
self.model = torch.compile(self.model, mode="reduce-overhead", dynamic=False)

# in LightningModule.training_step:
def training_step(self, batch, batch_idx):
    inputs = self._mae_inputs                           # attached by callback
    feed_batch(inputs, batch, self._batch_key)          # 1-line buffer fill + mask refresh
    out = self.model(inputs.x, mask_ratio=self.mask_ratio,
                     mask=inputs.mask, keep_idx=inputs.keep_idx)
    loss = out[0]
    ...

# Trainer:
cb = MAEStaticInputsCallback(
    batch_size=B, in_chans=8, input_size=(224, 224),
    mask_ratio=0.6, dtype=torch.bfloat16, batch_key="eeg_raw",
)
trainer = Trainer(
    callbacks=[cb],
    strategy=DDPStrategy(static_graph=True),
    precision="bf16-mixed",
    devices=4,
)

The callback:

allocates static buffers on setup(stage="fit") (knows the device by then)
walks _orig_mod / .module to find the inner Hiera model
refreshes the mask on on_train_batch_start (so the captured CUDA Graph sees stable input addresses every step)
exposes the buffers as pl_module._mae_inputs for any callback or hook to read
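
The wrapper walk in the second item can be pictured with a small helper. This is a sketch under assumptions, not the callback's actual code: unwrap_model is a hypothetical name, and the two stand-in classes only mimic the attributes that torch.compile (_orig_mod) and DDP (.module) expose.

```python
def unwrap_model(module):
    """Follow `_orig_mod` / `module` attributes until neither is present."""
    while True:
        for attr in ("_orig_mod", "module"):
            inner = getattr(module, attr, None)
            if inner is not None and inner is not module:
                module = inner
                break
        else:
            # No wrapper attribute found: this is the innermost object.
            return module


class Inner:
    """Stand-in for the Hiera model."""


class CompiledWrapper:
    """Stand-in for a torch.compile wrapper."""

    def __init__(self, wrapped):
        self._orig_mod = wrapped


class DDPWrapper:
    """Stand-in for DistributedDataParallel."""

    def __init__(self, wrapped):
        self.module = wrapped


core = Inner()
wrapped = CompiledWrapper(DDPWrapper(core))
print(unwrap_model(wrapped) is core)
```

One caveat of this naive approach: a model that happens to own a submodule literally named module would be unwrapped one level too far, so a real implementation would likely also check the type of what it finds.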

Production checklist for `graph_safe + reduce-overhead`

	required	why
`PYTORCH_ALLOC_CONF=max_split_size_mb:512,expandable_segments:True`	yes at large B	Without this, the captured graph pool fragments and reduce-overhead OOMs at B=1024/GPU on Hiera-Base GH200 even though peak alloc is only 67 GiB of 95 GiB. On PyTorch releases that only read `PYTORCH_CUDA_ALLOC_CONF`, set that variable instead.
`drop_last=True` on dataloader	yes	a partial last batch triggers a ~50-second recompile (whole-graph re-capture for the new shape). drop_last keeps every step at the captured B.
`static_graph=True` on `DDPStrategy`	yes	DDP needs a stable comm pattern for the captured allreduce hooks.
static batch shape across steps	yes	feed `inputs.x.copy_(batch)` — never re-allocate.
in-chans / image size / mask ratio constant for the run	yes	changing any retriggers compile.
Muon / manual_optimization, `accumulate_grad_batches=1`	ok	validated on GH200 (loss matches FAIR ref within `6.2e-3` bf16).
Muon / manual_optimization, `accumulate_grad_batches > 1`	broken	under reduce-overhead in PyTorch 2.9: 4 workarounds tried (mark_step_begin / +sync / +clone / +del), all hit `accessing tensor output of CUDAGraphs that has been overwritten`. Use `accum=1` with a larger per-GPU batch instead.
gradient checkpointing under reduce-overhead	don't	`enable_stage_checkpointing` increases memory under CUDA Graphs (recompute path also captured into pool). Tested: stage-2 ckpt goes 67 → 77 GiB peak.
compile cache	persistent across runs if you set `TRITON_CACHE_DIR` and `TORCHINDUCTOR_CACHE_DIR` to a stable path. First step still pays the autotune cost.
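
The environment-variable rows of the checklist can be applied from Python, provided this runs before torch is imported. A minimal sketch: the variable names and the allocator string come from the table above, while the cache directory is a hypothetical choice (any path that persists between runs should do).

```python
import os
from pathlib import Path

# Hypothetical cache location; pick any directory that survives across runs.
cache_root = Path(os.path.expanduser("~")) / ".cache" / "hiera_optim_compile"

settings = {
    "PYTORCH_ALLOC_CONF": "max_split_size_mb:512,expandable_segments:True",
    "TRITON_CACHE_DIR": str(cache_root / "triton"),
    "TORCHINDUCTOR_CACHE_DIR": str(cache_root / "inductor"),
}

for key, value in settings.items():
    # setdefault keeps anything the job launcher has already exported.
    os.environ.setdefault(key, value)

# import torch  # only after the environment is configured
```

Exporting the same variables in the job script is equivalent and avoids any dependence on import order.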

Production scenarios on 4×GH200, effective batch 4096

The bench numbers above are single-GPU, bare PyTorch, AdamW. Real Lightning + Muon + DDP stacks pay extra shell overhead that the lab bench doesn't capture. Realistic per-GPU samp/s for the same effective-batch=4096 recipe:

stack	optimizer	per-GPU B	mode	samp/s/GPU
bare PT, single GPU (lab)	AdamW	1024	reduce-overhead + graph_safe	2839
bare PT, single GPU (lab)	Muon	1024	reduce-overhead + graph_safe	2221 (Muon is 78% of AdamW)
4×GH200 DDP, Lightning, `sync_dist=True` per step (current prod)	Muon	1024	reduce-overhead + graph_safe	~932
4×GH200 DDP, Lightning, `sync_dist=False` (one-line fix)	Muon	1024	reduce-overhead + graph_safe	~1500-1800
4×GH200 DDP, Lightning, no per-step log + AdamW	AdamW	1024	reduce-overhead + graph_safe	~2500

Lightning's sync_dist=True on the per-step loss log is a hidden ~50% throughput tax at this it/s, because it does an all-reduce on the loss scalar outside the captured CUDA Graph and serializes every step. The user-side fix is one line: change sync_dist=True → sync_dist=False (Lightning still aggregates at epoch end if on_epoch=True). See cluster/REALITY_CHECK.md for the full diagnosis.

If you want raw throughput in a Lightning + DDP stack, take the headline as 1.5× over the previous production baseline, not 2.31×.
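
The ratios behind these statements can be recomputed from the scenario table. This is only arithmetic on the approximate samp/s/GPU figures listed above (the dictionary keys are labels invented for this example), so treat the results as rough.

```python
per_gpu = {
    "lab_adamw": 2839,
    "lab_muon": 2221,
    "prod_sync_dist_true": 932,
    "prod_sync_dist_false": (1500, 1800),  # quoted as a range
    "prod_adamw_no_step_log": 2500,
}

world_size, per_gpu_batch = 4, 1024
effective_batch = world_size * per_gpu_batch

# Muon relative to AdamW on the single-GPU lab bench.
muon_vs_adamw = per_gpu["lab_muon"] / per_gpu["lab_adamw"]

# Current production stack relative to the matching lab number (both Muon).
prod_vs_lab = per_gpu["prod_sync_dist_true"] / per_gpu["lab_muon"]

# Gain from switching sync_dist off, as a (low, high) range.
low, high = per_gpu["prod_sync_dist_false"]
fix_gain = (low / per_gpu["prod_sync_dist_true"], high / per_gpu["prod_sync_dist_true"])

print(f"effective batch: {effective_batch}")
print(f"Muon / AdamW: {muon_vs_adamw:.2f}, prod / lab: {prod_vs_lab:.2f}")
print(f"sync_dist fix: {fix_gain[0]:.2f}x to {fix_gain[1]:.2f}x")
```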

Optional

from hiera_optim import optimize, enable_stage_checkpointing

optimize(model, sdpa_backend="auto")             # per-block SDPA hint
optimize(model, sdpa_backend="mem_efficient")    # pin one backend everywhere
enable_stage_checkpointing(model, stages=(2,))   # OOM lever

Flexibility

Validated across the Hiera matrix (86/86 tests). Equivalence to FAIR baseline holds for:

	tested
variants	Tiny / Small / Base / Large / Huge
q_pool	1 / 2 / 3
mask ratio	0.5 / 0.6 / 0.75
dtypes	fp32 / bf16
in_chans	1 / 3 / 8
input sizes	128 / 224
1-D / 2-D / 3-D MAE	all paths
production raw_reshape_224 (HieraMAEv5)	yes

graph_safe=True adds:

mask-weighted-mean loss (no bool indexing in label/loss)
Hopper-safe SDPA backend auto-pin (avoid cuDNN-attention under CUDA Graphs)
MAEStepInputs helper for the static-buffer pattern

It does not change the 0.1.0 layout / gather fixes — the new flag is opt-in. Default optimize(model) calls work exactly as before.
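
The mask-weighted-mean loss listed above can be illustrated on toy tensors. This is a sketch of the reduction only, assuming a per-token MSE and a boolean mask that marks the tokens contributing to the loss; it is not the rewritten forward_loss itself, and the variable names are made up. Both forms give the same value, but the weighted form never creates a tensor whose shape depends on the mask contents.

```python
import torch

torch.manual_seed(0)
B, N, C = 4, 16, 8  # batch, tokens, channels
pred = torch.randn(B, N, C, dtype=torch.float64)
label = torch.randn(B, N, C, dtype=torch.float64)
loss_mask = torch.rand(B, N) < 0.6  # True = token contributes to the loss

per_token = ((pred - label) ** 2).mean(dim=-1)  # shape (B, N)

# Boolean indexing: the intermediate has a data-dependent length.
loss_indexed = per_token[loss_mask].mean()

# Mask-weighted mean: every intermediate keeps the static (B, N) shape.
weights = loss_mask.to(per_token.dtype)
loss_weighted = (per_token * weights).sum() / weights.sum().clamp(min=1)

print(torch.allclose(loss_indexed, loss_weighted))
```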

GPU support

Architecture	SM	Layout fix	graph_safe + reduce-overhead
Ada (RTX 4090, L40)	SM89	Tested	Tested (4.13x on Base B=8)
Hopper (H100, GH200)	SM90	Tested	Tested (2.37x on Base B=128)
Ampere (A100)	SM80	Should work	Likely works (cuDNN-attn not used on Ampere)
Blackwell (B200)	SM100	Should work	Should work

Tests

pip install -e ".[test]"
pytest

86 tests cover all 5 Hiera variants × q_pool {1, 2, 3} × mask ratios × bf16/fp16/fp32 × 1D/2D/3D inputs × classification + MAE × graph-safe path.

License

MIT.

Release history

0.2.1 (this version): Jun 16, 2026
0.2.0: May 23, 2026
0.1.0: May 20, 2026

Download files

Source Distribution

hiera_optim-0.2.1.tar.gz (69.8 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

hiera_optim-0.2.1-py3-none-any.whl (61.7 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file hiera_optim-0.2.1.tar.gz.

File metadata

Download URL: hiera_optim-0.2.1.tar.gz
Upload date: Jun 16, 2026
Size: 69.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for hiera_optim-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`9605dfc177b0d5b834bbd4bd94b2d16ce75f61e86d0d3f3710f332678092e9c5`
MD5	`34fbf4cde5b25e159dc73c561701f8d2`
BLAKE2b-256	`bba9affaa43639a0e49b0477c082a196be97300c15bb57162e0705b9e39a3445`

File details

Details for the file hiera_optim-0.2.1-py3-none-any.whl.

File metadata

Download URL: hiera_optim-0.2.1-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 61.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for hiera_optim-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`96f599177fd97fcd9f8b25123a9005254d4c81465266d3fb8af6b7953a91aa02`
MD5	`9b365f368568ddded800ed8c4fcaf9c8`
BLAKE2b-256	`a94c8e5e2585546c8958b48297fdf0d682202c0e23f00d0b11282c94fcd52319`
