Smart model management for Hugging Face Diffusers pipelines

diffusers-mm

Smart model management for Hugging Face Diffusers pipelines. A drop-in replacement for enable_model_cpu_offload() and enable_group_offloading() that's size-aware, more configurable, and handles the edge cases diffusers doesn't.

Installation

uv add diffusers-mm

diffusers and accelerate are required. The library uses whatever versions are already installed rather than pinning specific ones.

Quick Start

import torch
from diffusers import LTX2Pipeline
from diffusers_mm import managed

pipe = LTX2Pipeline.from_pretrained("OzzyGT/LTX-2.3-Distilled", torch_dtype=torch.bfloat16)
pipe = managed(pipe)  # auto strategy based on VRAM + component sizes — just works
video, audio = pipe(prompt="A cat walking on a beach")

managed() mutates the pipeline in place (registers components, installs hooks, wraps __call__ with a device scope) and returns the same object with a .mm attribute exposing the underlying ModelManager.

Offload Strategies

managed() supports five strategies via the strategy= argument:

| Strategy | Description | When auto picks it |
| --- | --- | --- |
| `"auto"` | Resolves to one of the below based on VRAM, RAM, and component sizes | Default |
| `"no_offload"` | All components stay on GPU | Pipeline weights × 1.5 fit in VRAM |
| `"model_offload"` | Components stream onto GPU one at a time via an accelerate hook chain | Largest component × 1.5 fits in VRAM |
| `"block_pin"` | Pins as many transformer blocks on GPU as VRAM allows; streams the rest via leaf-level group_offload | Largest component is too big for `model_offload` but has ≥ 8 repeated blocks |
| `"group_offload"` | Leaf-level streaming on every component (diffusers' `apply_group_offloading` with the fast defaults) | Fallback when nothing else fits |

Auto Resolution

When strategy="auto" (the default), the resolver looks at available VRAM and RAM (not total — so other processes on the GPU and host are accounted for) and the size of the registered components. The decision rule:

1. If pipeline_weights × 1.5 ≤ available VRAM → no_offload.
2. Else if largest_component × 1.5 ≤ available VRAM → model_offload.
3. Else if the largest component has a discoverable nn.ModuleList of ≥ 8 repeated same-type blocks → block_pin.
4. Otherwise → group_offload.

The 1.5× factor (AUTO_NO_OFFLOAD_FACTOR / AUTO_MODEL_OFFLOAD_FACTOR) is the activation budget: headroom on top of the raw weight size for activations. It was empirically validated on SDNQ int8 LTX-2.3, where it leaves a margin of about 0.3 GiB above peak usage.
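The decision rule above can be restated as a small pure function. This is a sketch under stated assumptions, not the library's resolver: the function name `resolve_strategy`, its parameters, the `repeated_blocks` mapping, and the constant name `MIN_REPEATED_BLOCKS` are hypothetical. Only the 1.5 factors, the 8-block threshold, and the strategy names come from the text.

```python
from typing import Mapping

AUTO_NO_OFFLOAD_FACTOR = 1.5     # factor named in the text above
AUTO_MODEL_OFFLOAD_FACTOR = 1.5  # factor named in the text above
MIN_REPEATED_BLOCKS = 8          # hypothetical name for the ">= 8 blocks" rule


def resolve_strategy(
    component_bytes: Mapping[str, int],
    available_vram_bytes: int,
    repeated_blocks: Mapping[str, int],
) -> str:
    """Hypothetical restatement of the documented rule (needs >= 1 component)."""
    total = sum(component_bytes.values())
    largest_name = max(component_bytes, key=lambda name: component_bytes[name])
    largest = component_bytes[largest_name]

    if total * AUTO_NO_OFFLOAD_FACTOR <= available_vram_bytes:
        return "no_offload"
    if largest * AUTO_MODEL_OFFLOAD_FACTOR <= available_vram_bytes:
        return "model_offload"
    if repeated_blocks.get(largest_name, 0) >= MIN_REPEATED_BLOCKS:
        return "block_pin"
    return "group_offload"


GiB = 1024**3
# Made-up component sizes, for illustration only.
sizes = {"transformer": 20 * GiB, "text_encoder": 8 * GiB, "vae": 1 * GiB}
blocks = {"transformer": 48}

for vram_gib in (48, 32, 24):
    print(vram_gib, resolve_strategy(sizes, vram_gib * GiB, blocks))
```

With these made-up sizes, 48 GiB resolves to no_offload, 32 GiB to model_offload, and 24 GiB to block_pin. Removing the entry from `blocks` turns the last case into group_offload.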

If no components are registered yet, the resolver falls back to a VRAM-only tier table:

| Available VRAM | Strategy |
| --- | --- |
| ≥ 20 GB | `no_offload` |
| ≥ 12 GB | `model_offload` |
| < 12 GB / non-CUDA | `group_offload` |

If pipeline weights exceed RAM × 0.85, a warning is logged — the workload likely won't fit on host memory regardless of strategy.
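The fallback tier table and the host-RAM warning can be sketched the same way. The function names, parameter names, and logger name below are hypothetical, and the text does not say whether the 0.85 check uses total or available RAM, so the sketch simply takes a RAM figure as input.

```python
import logging

logger = logging.getLogger("diffusers_mm_sketch")  # hypothetical logger name

RAM_WARN_FRACTION = 0.85  # fraction quoted in the text above


def fallback_strategy(available_vram_gb: float, is_cuda: bool = True) -> str:
    """VRAM-only tier table for the case where no components are registered."""
    if not is_cuda:
        return "group_offload"
    if available_vram_gb >= 20:
        return "no_offload"
    if available_vram_gb >= 12:
        return "model_offload"
    return "group_offload"


def weights_exceed_host_ram(pipeline_gb: float, ram_gb: float) -> bool:
    """Log a warning and return True when weights exceed 85% of the RAM figure."""
    too_big = pipeline_gb > ram_gb * RAM_WARN_FRACTION
    if too_big:
        logger.warning(
            "pipeline weights (%.1f GB) exceed %.0f%% of RAM (%.1f GB)",
            pipeline_gb, RAM_WARN_FRACTION * 100, ram_gb,
        )
    return too_big


print(fallback_strategy(24.0))
print(fallback_strategy(16.0))
print(fallback_strategy(24.0, is_cuda=False))
print(weights_exceed_host_ram(30.0, 32.0))
```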

Block-pin tuning

The block_pin strategy fills the gap between model_offload (largest component must fit) and group_offload (everything streams, transformer pays transfer cost on every step). It pins as many transformer blocks as VRAM allows on the GPU permanently, and streams the rest via apply_group_offloading(offload_type="leaf_level").

The pin count is auto-budgeted per component:

pipe = managed(pipe)  # auto-budget block_pin on whatever component is biggest

Override the per-component count explicitly when the auto budget is wrong for your workload (e.g. very high activation cost):

pipe.mm.set_block_pin_count("transformer", 30)

For long-video workloads the default working-set margin (AUTO_BLOCK_PIN_WORKING_SET_GB = 6.5 GiB) can be undersized — adjust on the manager:

pipe.mm.AUTO_BLOCK_PIN_WORKING_SET_GB = 12.0  # video at 768x512x121f
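The text does not spell out how the auto budget is computed. One plausible reading, sketched here as an assumption rather than the library's formula, is to subtract the working-set margin (and anything else resident on the GPU) from available VRAM and divide the remainder by the size of one block. The function name and all parameters are hypothetical.

```python
GiB = 1024**3


def block_pin_count(
    available_vram: int,
    block_bytes: int,
    num_blocks: int,
    working_set: int = int(6.5 * GiB),  # default margin quoted in the text
    other_resident: int = 0,            # hypothetical: other weights kept on GPU
) -> int:
    """Assumed budget: how many whole blocks fit after reserving the margin."""
    budget = available_vram - working_set - other_resident
    if budget <= 0 or block_bytes <= 0:
        return 0
    return min(num_blocks, budget // block_bytes)


block = GiB // 2  # made-up block size of 512 MiB

print(block_pin_count(24 * GiB, block, num_blocks=48))
print(block_pin_count(24 * GiB, block, num_blocks=48, working_set=12 * GiB))
```

Under this reading, raising the margin from 6.5 GiB to 12 GiB on a 24 GiB budget drops the pin count from 35 to 24 blocks, which is the trade the working-set knob is meant to expose.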

For block_pin to budget tightly, set the env var before starting Python:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Without it, allocator fragmentation can eat ~1-2 GiB and a careful budget can OOM. The strategy logs a warning if it's missing on apply.
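To check the setting from your own launcher script, the variable can be parsed as a comma-separated list of `key:value` options. The helper below is a hypothetical sketch, not the check the library performs.

```python
import os
from typing import Mapping, Optional


def expandable_segments_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if PYTORCH_CUDA_ALLOC_CONF contains expandable_segments:True."""
    env = os.environ if env is None else env
    conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    for option in conf.split(","):
        key, _, value = option.strip().partition(":")
        if key == "expandable_segments":
            return value.strip() == "True"
    return False


print(expandable_segments_enabled({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}))
print(expandable_segments_enabled({}))
```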

Usage Examples

Explicit strategy

pipe = managed(pipe, strategy="group_offload")

Group offload tuning

The two main knobs (defaults match the recommended fast config):

pipe = managed(
    pipe,
    strategy="group_offload",
    group_offload_use_stream=True,    # overlap transfers with compute
    group_offload_low_cpu_mem=True,   # defer pinning per-transfer (saves RAM)
)

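Without group_offload_low_cpu_mem=True (low_cpu_mem_usage=True in diffusers' apply_group_offloading), a full pinned host copy of every weight is held for the entire inference (~2× host RAM). The option only applies to streamed offloading, so the pairing is enforced: low_cpu_mem is dropped from the kwargs when use_stream=False.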
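The pairing described above amounts to a small kwargs filter. The helper name `build_group_offload_kwargs` is hypothetical. The dictionary keys are parameters of diffusers' `apply_group_offloading`, but how the library actually assembles them is an assumption.

```python
from typing import Any, Dict


def build_group_offload_kwargs(
    use_stream: bool = True, low_cpu_mem: bool = True
) -> Dict[str, Any]:
    """Assemble kwargs, keeping low_cpu_mem_usage only for streamed offloading."""
    kwargs: Dict[str, Any] = {"offload_type": "leaf_level", "use_stream": use_stream}
    if use_stream and low_cpu_mem:
        kwargs["low_cpu_mem_usage"] = True
    return kwargs


print(build_group_offload_kwargs())
print(build_group_offload_kwargs(use_stream=False, low_cpu_mem=True))
```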

Shared manager (multiple pipelines)

When you have multiple pipelines sharing components — e.g. an LTX-2 base and refiner sharing the same T5 and VAE — pass a single ModelManager to both managed() calls. The manager refcounts shared modules so they aren't re-hooked, and unregistering one pipeline doesn't pull components out from under the other:

from diffusers_mm import ModelManager, managed

mm = ModelManager(strategy="auto")
pipe1 = managed(pipe1, mm=mm, device="cuda")
pipe2 = managed(pipe2, mm=mm, device="cuda")  # T5 + VAE shared, transformer separate

# Later, just unregister one — the other keeps working
mm.unregister_components(pipe1)

When mm= is passed, the strategy/group_offload kwargs are ignored (the manager owns its own configuration).
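To make the refcounting idea concrete, here is a toy registry keyed by module identity. It is an illustration only: the class `SharedRegistry` and its methods are hypothetical and say nothing about how ModelManager stores its bookkeeping.

```python
from typing import Dict, Set


class SharedRegistry:
    """Toy refcount table: which owners (pipelines) reference which module."""

    def __init__(self) -> None:
        self._owners: Dict[int, Set[str]] = {}  # id(module) -> owner names

    def register(self, owner: str, module: object) -> bool:
        """Return True on first registration, i.e. when hooks would be installed."""
        owners = self._owners.setdefault(id(module), set())
        first = not owners
        owners.add(owner)
        return first

    def unregister(self, owner: str, module: object) -> bool:
        """Return True when no owner remains, i.e. when hooks could be removed."""
        owners = self._owners.get(id(module), set())
        owners.discard(owner)
        if not owners:
            self._owners.pop(id(module), None)
            return True
        return False


vae = object()  # stand-in for a shared module; kept alive so id() stays unique
registry = SharedRegistry()
print(registry.register("pipe1", vae))    # first owner: install hooks
print(registry.register("pipe2", vae))    # already hooked: skip
print(registry.unregister("pipe1", vae))  # pipe2 still uses it: keep hooks
print(registry.unregister("pipe2", vae))  # last owner gone: safe to clean up
```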

Per-step strategy override

For decomposed pipelines (calling components individually) where the global strategy doesn't fit a specific step:

pipe = managed(pipe, strategy="group_offload")

# VAE is too granular for leaf-level hooks — temporarily switch to model_offload
with pipe.mm.use_components("vae", device="cuda", strategy_override="model_offload"):
    decoded = pipe.vae.decode(latents)
# Original group_offload hooks are restored automatically on exit
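The restore-on-exit behavior is the usual try/finally pattern inside a context manager. The toy class below is a hypothetical stand-in that only records which strategy would be applied. It shows that the original strategy comes back even when the body raises.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional, Tuple


class ToyManager:
    """Hypothetical stand-in for a manager; it logs hook changes instead of making them."""

    def __init__(self, strategy: str) -> None:
        self.strategy = strategy
        self.log: List[Tuple[str, str]] = []

    def _apply(self, name: str, strategy: str) -> None:
        self.log.append((name, strategy))

    @contextmanager
    def use_components(
        self, name: str, strategy_override: Optional[str] = None
    ) -> Iterator[None]:
        original = self.strategy
        active = strategy_override or original
        if active != original:
            self._apply(name, active)
        try:
            yield
        finally:
            # Runs on normal exit and on exceptions alike.
            if active != original:
                self._apply(name, original)


mm = ToyManager("group_offload")
try:
    with mm.use_components("vae", strategy_override="model_offload"):
        raise RuntimeError("decode failed")
except RuntimeError:
    pass

print(mm.log)
```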

Standalone `ModelManager`

If you're not using a standard DiffusionPipeline (custom inference loop, decomposed graph), drive ModelManager directly:

import torch
from diffusers_mm import ModelManager

mm = ModelManager(strategy="auto")
mm.register_component("transformer", transformer)
mm.register_component("vae", vae)
mm.apply_offload_strategy("cuda")

with mm.use_components("transformer", device="cuda"):
    output = transformer(latents)

# Cross-pipeline component caching (load-or-reuse)
def load_my_transformer():
    return MyTransformer.from_pretrained(...)

transformer = mm.load_component(
    "transformer",
    identifier="/models/my-transformer",
    factory=load_my_transformer,
)
# A second call with the same identifier returns the cached module
# without invoking the factory.

mm.clear()  # remove hooks, drop components, gc + empty_cache
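The load-or-reuse behavior of load_component can be pictured as a dictionary lookup in front of the factory. This is a simplified sketch: the class name is hypothetical, and the plain `(name, identifier)` key stands in for whatever hash key the library actually uses.

```python
from typing import Any, Callable, Dict, Tuple


class ComponentCache:
    """Toy load-or-reuse cache; the factory runs only on a cache miss."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], Any] = {}

    def load_component(
        self, name: str, identifier: str, factory: Callable[[], Any]
    ) -> Any:
        key = (name, identifier)
        if key not in self._cache:
            self._cache[key] = factory()
        return self._cache[key]


calls = []


def load_my_transformer() -> object:
    calls.append("loaded")  # track how often the expensive load runs
    return object()


cache = ComponentCache()
first = cache.load_component("transformer", "/models/my-transformer", load_my_transformer)
second = cache.load_component("transformer", "/models/my-transformer", load_my_transformer)

print(first is second, len(calls))
```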

Re-apply hooks after LoRA

Loading LoRA adapters adds new submodules to the transformer; those new submodules won't have offload hooks unless re-applied:

transformer.load_lora_adapter(state_dict, adapter_name="my_lora")
pipe.mm.reapply_group_offload("transformer", device="cuda")

Standalone hook cleanup

Sometimes you need to strip diffusers' group-offload hooks from a module tree without going through the manager — e.g. before serializing or transferring weights. The library exposes a submodule-walking cleanup that fixes diffusers' remove_hook(recurse=True) bug (it misses submodules whose parent lacks a _diffusers_hook attribute):

from diffusers_mm import remove_offload_hooks

remove_offload_hooks(module)  # idempotent; safe if no hooks installed
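The failure mode described above can be reproduced on a toy tree. This sketch models neither diffusers' hook registry nor this library's implementation: `Node` and the `_hook` attribute are stand-ins for modules and `_diffusers_hook`. It only contrasts a recursion that stops at unhooked parents with a walk over every descendant.

```python
from typing import Iterable, Iterator, List


class Node:
    """Stand-in for a module; `_hook` plays the role of `_diffusers_hook`."""

    def __init__(self, name: str, children: Iterable["Node"] = (), hooked: bool = False) -> None:
        self.name = name
        self.children = list(children)
        if hooked:
            self._hook = f"hook:{name}"


def remove_stopping_at_unhooked(node: Node) -> None:
    """Recursion that only continues through hooked parents (the described bug)."""
    if not hasattr(node, "_hook"):
        return
    del node._hook
    for child in node.children:
        remove_stopping_at_unhooked(child)


def iter_nodes(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def remove_everywhere(node: Node) -> None:
    """Visit every descendant regardless of whether its parent is hooked."""
    for n in iter_nodes(node):
        if hasattr(n, "_hook"):
            del n._hook


def hooked_names(node: Node) -> List[str]:
    return [n.name for n in iter_nodes(node) if hasattr(n, "_hook")]


def build_tree() -> Node:
    leaf = Node("linear", hooked=True)
    block = Node("block", [leaf])  # parent without a hook
    return Node("root", [block], hooked=True)


tree_a = build_tree()
remove_stopping_at_unhooked(tree_a)
print(hooked_names(tree_a))  # ['linear'] is left behind

tree_b = build_tree()
remove_everywhere(tree_b)
print(hooked_names(tree_b))  # []
```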

Debugging memory

record_memory_history is a context manager around torch.cuda.memory._record_memory_history that dumps a snapshot pickle on exit. It is a no-op when CUDA is unavailable, so it's safe to leave in CPU-only test runs:

with pipe.mm.record_memory_history("trace.pickle"):
    pipe(prompt="...")
# Visualize with:
#   python -m torch.cuda._memory_viz trace_plot trace.pickle -o trace.html
# or upload to https://docs.pytorch.org/memory_viz
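For readers who want the same behavior outside the manager, a context manager along these lines would do it. This is a sketch, not the library's implementation: it relies on PyTorch's private `torch.cuda.memory._record_memory_history` and `_dump_snapshot` helpers, the `max_entries` default is an arbitrary choice, and the guarded import is only there so the snippet also runs where PyTorch is not installed.

```python
import contextlib
from typing import Iterator

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


@contextlib.contextmanager
def record_memory_history(path: str, max_entries: int = 100_000) -> Iterator[None]:
    """Record CUDA allocator history and dump a snapshot to `path` on exit."""
    if torch is None or not torch.cuda.is_available():
        yield  # no-op path for CPU-only environments
        return
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        yield
    finally:
        torch.cuda.memory._dump_snapshot(path)
        torch.cuda.memory._record_memory_history(enabled=None)  # stop recording


steps = []
with record_memory_history("sketch_trace.pickle"):
    steps.append("inference would run here")
print(steps)
```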

Comparison with Diffusers built-ins

| Feature | Diffusers | diffusers-mm |
| --- | --- | --- |
| Model CPU offload | `pipe.enable_model_cpu_offload()` | `managed(pipe, strategy="model_offload")` |
| Group offload | `pipe.enable_group_offloading(...)` | `managed(pipe, strategy="group_offload")` (defaults match the fast config) |
| Block-level pinning | Not available | `managed(pipe, strategy="block_pin")` |
| Auto strategy | No | Yes: size-aware (looks at VRAM, RAM, and component sizes) |
| Per-step strategy override | No | `mm.use_components(..., strategy_override=...)` |
| Hook cleanup | `remove_hook(recurse=True)` misses nested submodules | `remove_offload_hooks(module)` walks all submodules |
| Hook restore after override | No | Automatic in `use_components` `finally` block |
| Re-apply after LoRA | Manual | `mm.reapply_group_offload(name, device)` |
| Shared components across pipelines | No tracking | Refcount + per-source registration |
| Thread safety | No | RLock-guarded |
| Component caching | No | Hash-keyed cache + `load_component(identifier, factory)` |

Development

make format       # auto-format with ruff
make lint-fix     # auto-fix lint issues
make check        # CI-friendly: format-check + lint (no modifications)
make test         # CPU-only unit tests (~2s)
make cov          # coverage report (terminal)
make cov-html     # coverage report (HTML, in htmlcov/)

Real-model GPU tests are opt-in (require a CUDA device + downloaded weights):

make test-envs-fast   # strategy-decision tests with synthetic modules (fast)
make test-envs-real   # real LTX-2.3 distilled inference under a 24 GiB VRAM cap

The real-env tests cap VRAM via a held dummy tensor (genuine cudaMalloc OOM if exceeded). For an additional kernel-enforced RAM cap, wrap the invocation in a cgroup:

systemd-run --user --scope -p MemoryMax=32G -p MemorySwapMax=0 make test-envs-real

License

Apache 2.0

Release history

0.2.1: May 27, 2026
0.2.0: May 14, 2026
0.1.0: May 14, 2026 (this version)
