Smart model management for Hugging Face Diffusers pipelines
diffusers-mm
Smart model management for Hugging Face Diffusers pipelines. A drop-in replacement for enable_model_cpu_offload() and enable_group_offloading() that's size-aware, more configurable, and handles the edge cases diffusers doesn't.
Installation
uv add diffusers-mm
diffusers and accelerate are required. The library uses whatever versions are already installed rather than pinning specific ones.
Quick Start
import torch
from diffusers import LTX2Pipeline
from diffusers_mm import managed
pipe = LTX2Pipeline.from_pretrained("OzzyGT/LTX-2.3-Distilled", torch_dtype=torch.bfloat16)
pipe = managed(pipe) # auto strategy based on VRAM + component sizes
video, audio = pipe(prompt="A cat walking on a beach")
managed() mutates the pipeline in place (registers components, installs hooks, wraps __call__ with a device scope) and returns the same object with a .mm attribute exposing the underlying ModelManager.
Offload Strategies
managed() supports five strategies via the strategy= argument:
| Strategy | Description | When auto picks it |
|---|---|---|
| "auto" | Resolves to one of the below based on VRAM, RAM, and component sizes | Default |
| "no_offload" | All components stay on GPU | Pipeline weights × 1.5 fit in VRAM |
| "model_offload" | Components stream onto GPU one at a time via an accelerate hook chain | Largest component × 1.5 fits in VRAM |
| "block_pin" | Pins as many transformer blocks on GPU as VRAM allows; streams the rest via leaf-level group_offload | Largest component is too big for model_offload but has ≥ 8 repeated blocks |
| "group_offload" | Leaf-level streaming on every component (diffusers' apply_group_offloading with the fast defaults) | Fallback when nothing else fits |
Auto Resolution
When strategy="auto" (the default), the resolver looks at available VRAM and RAM (not total — so other processes on the GPU and host are accounted for) and the size of the registered components. The decision rule:
- If pipeline_weights × 1.5 ≤ available VRAM → no_offload.
- Else if largest_component × 1.5 ≤ available VRAM → model_offload.
- Else if the largest component has a discoverable nn.ModuleList of ≥ 8 repeated same-type blocks → block_pin.
- Otherwise → group_offload.
The 1.5× factor (AUTO_NO_OFFLOAD_FACTOR / AUTO_MODEL_OFFLOAD_FACTOR) is the activation budget — empirically validated for SDNQ int8 LTX-2.3 to give ~0.3 GiB margin above peak.
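The decision rule above can be sketched as a small pure function. This is an illustration of the documented rule, not the library's resolver: the function name, argument names, and the choice to pass sizes as plain numbers are assumptions made for the example.

```python
def resolve_strategy(
    pipeline_weights_gb: float,
    largest_component_gb: float,
    available_vram_gb: float,
    largest_component_blocks: int = 0,
    no_offload_factor: float = 1.5,
    model_offload_factor: float = 1.5,
    min_blocks: int = 8,
) -> str:
    """Hypothetical re-statement of the documented auto rule."""
    # Tier 1: everything (plus activation budget) fits on the GPU.
    if pipeline_weights_gb * no_offload_factor <= available_vram_gb:
        return "no_offload"
    # Tier 2: the largest single component (plus activation budget) fits.
    if largest_component_gb * model_offload_factor <= available_vram_gb:
        return "model_offload"
    # Tier 3: the largest component has enough repeated blocks to pin some.
    if largest_component_blocks >= min_blocks:
        return "block_pin"
    # Tier 4: stream everything.
    return "group_offload"
```

For example, with 24 GB available, a 20 GB pipeline whose largest component is 12 GB lands on model_offload, because 20 × 1.5 exceeds 24 while 12 × 1.5 does not.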
If no components are registered yet, the resolver falls back to a VRAM-only tier table:
| Available VRAM | Strategy |
|---|---|
| ≥ 20 GB | no_offload |
| ≥ 12 GB | model_offload |
| < 12 GB / non-CUDA | group_offload |
If pipeline weights exceed RAM × 0.85, a warning is logged — the workload likely won't fit on host memory regardless of strategy.
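A minimal sketch of the VRAM-only fallback and the RAM warning described above. The function names and the logger name are hypothetical; only the thresholds (20, 12, and 0.85) come from the text.

```python
import logging

logger = logging.getLogger("diffusers_mm_sketch")  # hypothetical logger name


def fallback_strategy(available_vram_gb: float, is_cuda: bool = True) -> str:
    """VRAM-only tier table, used when no components are registered yet."""
    if not is_cuda:
        return "group_offload"
    if available_vram_gb >= 20:
        return "no_offload"
    if available_vram_gb >= 12:
        return "model_offload"
    return "group_offload"


def exceeds_ram_headroom(
    pipeline_weights_gb: float,
    available_ram_gb: float,
    ram_headroom: float = 0.85,
) -> bool:
    """Return True (and log a warning) when weights exceed RAM × headroom."""
    too_big = pipeline_weights_gb > available_ram_gb * ram_headroom
    if too_big:
        logger.warning(
            "Pipeline weights (%.1f GB) exceed %.0f%% of available RAM (%.1f GB)",
            pipeline_weights_gb, ram_headroom * 100, available_ram_gb,
        )
    return too_big
```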
Block-pin tuning
The block_pin strategy fills the gap between model_offload (largest component must fit) and group_offload (everything streams, transformer pays transfer cost on every step). It pins as many transformer blocks as VRAM allows on the GPU permanently, and streams the rest via apply_group_offloading(offload_type="leaf_level").
The pin count is auto-budgeted per component:
pipe = managed(pipe) # auto-budget block_pin on whatever component is biggest
Override the per-component count explicitly when the auto budget is wrong for your workload (e.g. very high activation cost):
pipe.mm.set_block_pin_count("transformer", 30)
For long-video workloads the default working-set margin (6.5 GiB) can be undersized — set it on the manager via the ctor (see Tuning the auto resolver below):
pipe = managed(pipe, auto_block_pin_working_set_gb=12.0) # video at 768x512x121f
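To see why the working-set margin matters, here is a back-of-the-envelope sketch of a pin-count budget. The formula is an assumption for illustration (reserve the working set, then fill what is left with equally sized blocks); the library's actual budgeting may account for more than this, and the function name is hypothetical.

```python
def estimate_pin_count(
    available_vram_gib: float,
    block_size_gib: float,
    num_blocks: int,
    working_set_gib: float = 6.5,
) -> int:
    """Assumed budget: (available VRAM - working set) // per-block size."""
    if block_size_gib <= 0:
        raise ValueError("block_size_gib must be positive")
    budget = available_vram_gib - working_set_gib
    if budget <= 0:
        return 0  # nothing left to pin once the working set is reserved
    # Never report more pinned blocks than the component actually has.
    return min(num_blocks, int(budget // block_size_gib))
```

Under this assumption, a 24 GiB card with 0.5 GiB blocks pins 35 blocks at the default 6.5 GiB margin but only 24 when the margin is raised to 12 GiB, which is the trade-off behind bumping the margin for video.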
For block_pin to budget tightly, set the env var before starting Python:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Without it, allocator fragmentation can eat ~1-2 GiB and a careful budget can OOM. The strategy logs a warning if it's missing on apply.
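If exporting the variable in the shell is inconvenient, a commonly used alternative is to set it at the very top of the entry script. This sketch assumes nothing has imported torch or touched CUDA yet, since the allocator reads the setting when it initializes; the helper name is hypothetical.

```python
import os

# Assumption: this runs before `import torch` anywhere in the process.
# setdefault leaves an existing user-provided value untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def has_expandable_segments() -> bool:
    """Check whether the allocator config requests expandable segments."""
    conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    # The variable holds comma-separated key:value options.
    return any(part.strip() == "expandable_segments:True" for part in conf.split(","))
```

Setting it in the shell before starting Python, as described above, remains the safer route because it cannot be raced by an early import.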
Tuning the auto resolver
Every threshold the "auto" strategy uses internally is also a constructor argument on ModelManager (and a managed() kwarg). The defaults are tuned for image diffusion on consumer hardware — bump them for unusual workloads (high-activation video, very RAM-tight machines, etc.). All eight are keyword-only:
pipe = managed(
    pipe,
    # --- Strategy thresholds (1.5× ≈ activation budget) ---
    auto_no_offload_factor=1.5,  # pipeline_weights × this ≤ VRAM → no_offload
    auto_model_offload_factor=1.5,  # largest_component × this ≤ VRAM → model_offload
    auto_block_pin_min_blocks=8,  # min ModuleList size to pick block_pin over group_offload
    # --- RAM safety margins ---
    auto_ram_headroom=0.85,  # warn if pipeline_weights > RAM × this
    auto_low_cpu_mem_ram_headroom_gb=16.0,  # flip group_offload's low_cpu_mem=False if RAM ≥ weights + this
    # --- block_pin VRAM budget ---
    auto_block_pin_working_set_gb=6.5,  # reserved VRAM per component (Linux/macOS); bump for video
    auto_block_pin_working_set_windows_gb=8.5,  # same, Windows (allocator runs in fixed-segment mode)
    auto_block_pin_ram_evict_headroom_gb=4.0,  # auto-evict only if RAM ≥ evicted_subset + this
)
Each knob in detail:
| Knob | Default | Raise it when… | Lower it when… |
|---|---|---|---|
| auto_no_offload_factor | 1.5 | Activations are large relative to weights; pick a more conservative no_offload. | Activations are small, you want to stay GPU-resident more aggressively. |
| auto_model_offload_factor | 1.5 | Largest component runs into VRAM ceiling at peak; push toward block_pin / group_offload. | Same direction as above, just for the per-component tier. |
| auto_ram_headroom | 0.85 | RAM is tight and you want the "won't fit" warning to fire earlier. | RAM is plentiful and the warning is noisy. |
| auto_low_cpu_mem_ram_headroom_gb | 16.0 | RAM-constrained system, prefer staying in low-RAM mode (slower steady-state but stable). | RAM-rich system, bias toward low_cpu_mem=False for faster transfers. |
| auto_block_pin_working_set_gb | 6.5 | Long video at meaningful resolution (LTX-2.3 768×512×121f measures 10–14 GiB). | You've measured your specific workload below the default and want to pin more blocks. |
| auto_block_pin_working_set_windows_gb | 8.5 | Same as above, on Windows. | Same. |
| auto_block_pin_min_blocks | 8 | Your block list is small enough that per-block hook overhead dominates. | You want block_pin even for shallow transformers (rarely useful). |
| auto_block_pin_ram_evict_headroom_gb | 4.0 | Neighbors have unusually large host-side staging (large pinned buffers, big activations). | Neighbors are lightweight and you want eviction to fire more readily. |
These can also be set after construction by assigning the matching ALL_CAPS attribute on the manager — e.g. pipe.mm.AUTO_BLOCK_PIN_WORKING_SET_GB = 12.0. The ctor arg is shorthand for "do that at construction time."
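As one illustration of how a knob feeds a decision, the auto_low_cpu_mem_ram_headroom_gb comparison ("flip low_cpu_mem=False if RAM ≥ weights + this") can be written out as below. The function name is hypothetical and the sketch only restates the documented comparison.

```python
def pick_low_cpu_mem(
    available_ram_gb: float,
    weights_gb: float,
    headroom_gb: float = 16.0,
) -> bool:
    """Return the low_cpu_mem value: False once RAM >= weights + headroom."""
    # Stay in low-RAM mode unless there is room for the weights plus headroom.
    return available_ram_gb < weights_gb + headroom_gb
```

Raising the headroom keeps more machines in low-RAM mode; lowering it lets RAM-rich machines switch to the faster setting sooner.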
Usage Examples
Explicit strategy
pipe = managed(pipe, strategy="group_offload")
Group offload tuning
The two main knobs (defaults match the recommended fast config):
pipe = managed(
    pipe,
    strategy="group_offload",
    group_offload_use_stream=True,  # overlap transfers with compute
    group_offload_low_cpu_mem=True,  # defer pinning per-transfer (saves RAM)
)
Without low_cpu_mem_usage=True, a full pinned host copy of every weight is held for the entire inference (~2× host RAM). The low-CPU-memory option only applies together with streaming, and that pairing is enforced: low_cpu_mem is dropped from kwargs when use_stream=False.
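A sketch of the enforced pairing, building the keyword arguments that would be forwarded to diffusers' apply_group_offloading. The helper name is hypothetical; offload_type, use_stream, and low_cpu_mem_usage are the diffusers argument names, and the rest is an assumption about how the forwarding might look.

```python
from typing import Any, Dict


def build_group_offload_kwargs(
    use_stream: bool = True,
    low_cpu_mem: bool = True,
) -> Dict[str, Any]:
    """Assemble kwargs, dropping low_cpu_mem_usage when streaming is off."""
    kwargs: Dict[str, Any] = {
        "offload_type": "leaf_level",
        "use_stream": use_stream,
    }
    # low_cpu_mem_usage only has an effect with streamed transfers,
    # so it is omitted entirely when use_stream is False.
    if use_stream:
        kwargs["low_cpu_mem_usage"] = low_cpu_mem
    return kwargs
```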
Shared manager (multiple pipelines)
When you have multiple pipelines sharing components — e.g. an LTX-2 base and refiner sharing the same T5 and VAE — pass a single ModelManager to both managed() calls. The manager refcounts shared modules so they aren't re-hooked, and unregistering one pipeline doesn't pull components out from under the other:
from diffusers_mm import ModelManager, managed
mm = ModelManager(strategy="auto")
pipe1 = managed(pipe1, mm=mm, device="cuda")
pipe2 = managed(pipe2, mm=mm, device="cuda") # T5 + VAE shared, transformer separate
# Later, just unregister one — the other keeps working
mm.unregister_components(pipe1)
When mm= is passed, the strategy/group_offload kwargs are ignored (the manager owns its own configuration).
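The refcounting idea can be illustrated with a toy registry. This is not the library's implementation: the class, its method names, and the use of a set as a stand-in for "hooks installed" are all assumptions made to show why unregistering one pipeline leaves shared modules in place.

```python
from collections import defaultdict


class SharedRegistry:
    """Toy refcounted registry keyed by module identity."""

    def __init__(self):
        self._refcounts = defaultdict(int)   # id(module) -> number of owners
        self._by_source = defaultdict(dict)  # source -> {name: module}
        self._hooked = set()                 # stand-in for "hooks installed"

    def register(self, source, name, module):
        if self._by_source[source].get(name) is module:
            return  # this source already registered this exact module
        self._by_source[source][name] = module
        key = id(module)
        self._refcounts[key] += 1
        if self._refcounts[key] == 1:
            self._hooked.add(key)  # hook only on first registration

    def unregister(self, source):
        for module in self._by_source.pop(source, {}).values():
            key = id(module)
            self._refcounts[key] -= 1
            if self._refcounts[key] == 0:
                # Last owner is gone: safe to remove hooks.
                del self._refcounts[key]
                self._hooked.discard(key)

    def is_hooked(self, module) -> bool:
        return id(module) in self._hooked
```

A module shared by two sources stays hooked until both have unregistered, while a module owned by only one source is released immediately.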
Per-step strategy override
For decomposed pipelines (calling components individually) where the global strategy doesn't fit a specific step:
pipe = managed(pipe, strategy="group_offload")
# VAE is too granular for leaf-level hooks — temporarily switch to model_offload
with pipe.mm.use_components("vae", device="cuda", strategy_override="model_offload"):
    decoded = pipe.vae.decode(latents)
# Original group_offload hooks are restored automatically on exit
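The restore-on-exit behavior follows the usual try/finally context-manager pattern. The toy class below is a sketch of that pattern only; ToyManager, its _apply method, and the log list are hypothetical stand-ins for real hook installation.

```python
from contextlib import contextmanager


class ToyManager:
    """Toy manager that records which strategy is applied to which component."""

    def __init__(self, strategy: str):
        self.strategy = strategy
        self.log = []  # (component name, strategy) pairs, in order applied

    def _apply(self, name: str, strategy: str) -> None:
        self.log.append((name, strategy))  # stand-in for swapping hooks

    @contextmanager
    def use_components(self, *names: str, strategy_override: str = None):
        if strategy_override is None or strategy_override == self.strategy:
            yield  # nothing to swap
            return
        for name in names:
            self._apply(name, strategy_override)
        try:
            yield
        finally:
            # Runs on normal exit and on exceptions alike.
            for name in names:
                self._apply(name, self.strategy)
```

Because the restore lives in the finally block, the original strategy comes back even if the body raises.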
Standalone ModelManager
If you're not using a standard DiffusionPipeline (custom inference loop, decomposed graph), drive ModelManager directly:
import torch
from diffusers_mm import ModelManager
mm = ModelManager(strategy="auto")
mm.register_component("transformer", transformer)
mm.register_component("vae", vae)
mm.apply_offload_strategy("cuda")
with mm.use_components("transformer", device="cuda"):
    output = transformer(latents)
# Cross-pipeline component caching (load-or-reuse)
def load_my_transformer():
    return MyTransformer.from_pretrained(...)
transformer = mm.load_component(
    "transformer",
    identifier="/models/my-transformer",
    factory=load_my_transformer,
)
# A second call with the same identifier returns the cached module
# without invoking the factory.
mm.clear() # remove hooks, drop components, gc + empty_cache
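The load-or-reuse behavior of load_component can be pictured as a small keyed cache. This is a simplified sketch under assumptions: the class name is hypothetical, the key is a plain (name, identifier) tuple rather than whatever hash the library computes, and the lock is included only to mirror the RLock-guarded design listed in the comparison table.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class ComponentCache:
    """Toy load-or-reuse cache: the factory runs once per (name, identifier)."""

    def __init__(self):
        self._lock = threading.RLock()
        self._cache: Dict[Tuple[str, str], Any] = {}

    def load_component(self, name: str, identifier: str, factory: Callable[[], Any]) -> Any:
        key = (name, identifier)
        with self._lock:
            if key not in self._cache:
                self._cache[key] = factory()  # only invoked on a cache miss
            return self._cache[key]

    def clear(self) -> None:
        with self._lock:
            self._cache.clear()
```

A changed identifier counts as a different component, so the factory is invoked again for it.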
Re-apply hooks after LoRA
Loading LoRA adapters adds new submodules to the transformer; those new submodules won't have offload hooks unless re-applied:
transformer.load_lora_adapter(state_dict, adapter_name="my_lora")
pipe.mm.reapply_group_offload("transformer", device="cuda")
Standalone hook cleanup
Sometimes you need to strip diffusers' group-offload hooks from a module tree without going through the manager — e.g. before serializing or transferring weights. The library exposes a submodule-walking cleanup that fixes diffusers' remove_hook(recurse=True) bug (it misses submodules whose parent lacks a _diffusers_hook attribute):
from diffusers_mm import remove_offload_hooks
remove_offload_hooks(module) # idempotent; safe if no hooks installed
Debugging memory
record_memory_history is a context manager around torch.cuda.memory._record_memory_history that dumps a snapshot pickle on exit. No-op when CUDA is unavailable so it's safe to leave in CPU-only test runs:
with pipe.mm.record_memory_history("trace.pickle"):
    pipe(prompt="...")
# Visualize with:
# python -m torch.cuda._memory_viz trace_plot trace.pickle -o trace.html
# or upload to https://docs.pytorch.org/memory_viz
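For readers who want the same behavior outside the manager, a standalone equivalent might look like the sketch below. It is an approximation, not the library's code: the max_entries default is an assumption, and the guarded import exists only so the example also runs where PyTorch is not installed. It relies on PyTorch's private torch.cuda.memory._record_memory_history and _dump_snapshot functions, which may change between releases.

```python
from contextlib import contextmanager

try:
    import torch
except ImportError:  # keeps the sketch usable on machines without PyTorch
    torch = None


@contextmanager
def record_memory_history(path: str, max_entries: int = 100_000):
    """Record CUDA allocator history and dump a snapshot pickle on exit."""
    if torch is None or not torch.cuda.is_available():
        yield  # no-op on CPU-only machines
        return
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        yield
    finally:
        # Dump first, then stop recording, even if the body raised.
        torch.cuda.memory._dump_snapshot(path)
        torch.cuda.memory._record_memory_history(enabled=None)
```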
Comparison with Diffusers built-ins
| Feature | Diffusers | diffusers-mm |
|---|---|---|
| Model CPU offload | pipe.enable_model_cpu_offload() | managed(pipe, strategy="model_offload") |
| Group offload | pipe.enable_group_offloading(...) | managed(pipe, strategy="group_offload") (defaults match the fast config) |
| Block-level pinning | Not available | managed(pipe, strategy="block_pin") |
| Auto strategy | No | Yes, size-aware (looks at VRAM, RAM, and component sizes) |
| Per-step strategy override | No | mm.use_components(..., strategy_override=...) |
| Hook cleanup | remove_hook(recurse=True) misses nested submodules | remove_offload_hooks(module) walks all submodules |
| Hook restore after override | No | Automatic in use_components finally block |
| Re-apply after LoRA | Manual | mm.reapply_group_offload(name, device) |
| Shared components across pipelines | No tracking | Refcount + per-source registration |
| Thread safety | No | RLock-guarded |
| Component caching | No | Hash-keyed cache + load_component(identifier, factory) |
Development
make format # auto-format with ruff
make lint-fix # auto-fix lint issues
make check # CI-friendly: format-check + lint (no modifications)
make test # CPU-only unit tests (~2s)
make cov # coverage report (terminal)
make cov-html # coverage report (HTML, in htmlcov/)
Real-model GPU tests are opt-in (require a CUDA device + downloaded weights):
make test-envs-fast # strategy-decision tests with synthetic modules (fast)
make test-envs-real # real LTX-2.3 distilled inference under a 24 GiB VRAM cap
The real-env tests cap VRAM via a held dummy tensor (genuine cudaMalloc OOM if exceeded). For an additional kernel-enforced RAM cap, wrap the invocation in a cgroup:
systemd-run --user --scope -p MemoryMax=32G -p MemorySwapMax=0 make test-envs-real
License
Apache 2.0