PyTorch helpers for cjm-substrate capabilities: GPU memory release, typed CUDA-OOM handling, and device selection.
cjm-substrate-torch-utils
Install
pip install cjm_substrate_torch_utils
Project Structure
nbs/
├── device.ipynb # Resolve a device spec ("auto" / "cpu" / "cuda" / "cuda:N") to a concrete torch device string.
├── memory.ipynb # Robust move-to-CPU + drop-references + gc + CUDA-cache cleanup for releasing models, factored out of the per-capability reimplementations.
└── oom.ipynb # Convert torch CUDA out-of-memory exceptions into the substrate's typed `CapabilityResourceError` (SG-47 Track B) so CR-7 reactive retry can evict and reload.
Total: 3 notebooks
Module Dependencies
graph LR
device["device<br/>Device resolution"]
memory["memory<br/>GPU model release"]
oom["oom<br/>CUDA OOM handling"]
No cross-module dependencies detected.
CLI Reference
No CLI commands found in this project.
Module Overview
Detailed documentation for each module in the project:
Device resolution (device.ipynb)
Resolve a device spec ("auto" / "cpu" / "cuda" / "cuda:N") to a concrete torch device string.
Import
from cjm_substrate_torch_utils.device import (
    resolve_torch_device
)
Functions
def resolve_torch_device(
    spec: str = "auto",  # Requested device: "auto", "cpu", "cuda", or "cuda:N"
) -> str:  # Concrete device string
    """
    Resolve a device spec to a concrete torch device string.

    `"auto"` resolves to `"cuda"` when CUDA is available, else `"cpu"`. Any
    explicit spec (`"cpu"`, `"cuda"`, `"cuda:0"`, ...) is returned unchanged.
    """
GPU model release (memory.ipynb)
Robust move-to-CPU + drop-references + gc + CUDA-cache cleanup for releasing models, factored out of the per-capability reimplementations.
Import
from cjm_substrate_torch_utils.memory import (
    release_model
)
Functions
def release_model(
    obj: Any,  # The capability instance holding the model attribute(s)
    model_attr_names: List[str],  # Names of the attributes to release, in release order
    device: str = "cuda",  # Device the model is on; gates the CUDA-specific cleanup
    *,
    logger: logging.Logger,  # Logger for best-effort failure reporting
) -> None:
    """
    Release one or more model objects: move to CPU, drop references, gc, free CUDA cache.

    For each name in `model_attr_names`, if `obj` has a non-None attribute:
    1. when on CUDA, best-effort `.to('cpu')` (frees GPU tensors; skipped for
       objects without a `.to` method, e.g. processors/tokenizers),
    2. `setattr(obj, name, None)` and drop the local reference.

    Then a single `gc.collect()` and, on CUDA, `empty_cache()` + `synchronize()`.

    Best-effort throughout: failures are logged and swallowed. Missing or
    already-None attributes are skipped, so the call is idempotent.
    """
CUDA OOM handling (oom.ipynb)
Convert torch CUDA out-of-memory exceptions into the substrate's typed `CapabilityResourceError` (SG-47 Track B) so CR-7 reactive retry can evict and reload.
Import
from cjm_substrate_torch_utils.oom import (
    cuda_oom_to_capability_resource_error
)
Functions
def cuda_oom_to_capability_resource_error(
    exc: BaseException,  # The caught CUDA OOM exception (e.g. torch.cuda.OutOfMemoryError)
    *,
    label: str,  # Context for the message, e.g. "loading model 'X'" or "inference"
    headroom_mb: float = 100.0,  # Best-effort margin added to `available` to estimate `needed`
) -> CapabilityResourceError:  # Typed error for the substrate's CR-7 reactive-retry path
    """
    Convert a CUDA out-of-memory exception into a substrate-typed `CapabilityResourceError`.

    SG-47 Track B: a capability's GPU inference / model-load site catches
    `torch.cuda.OutOfMemoryError` and re-raises the result of this helper so the
    substrate sees a typed resource error (evict + reload + retry via CR-7)
    instead of an opaque crash.

    `needed` is a best-effort estimate (`available + headroom_mb`): the true
    required VRAM is unknowable from the exception, and CR-7 triggers eviction
    regardless of magnitude, so an approximation above `available` is sufficient.

    The caller raises the returned error, preserving the original cause:

        try:
            model = Model.from_pretrained(repo_id, ...)
        except torch.cuda.OutOfMemoryError as e:
            raise cuda_oom_to_capability_resource_error(e, label=f"loading {repo_id!r}") from e
    """
File details
Details for the file cjm_substrate_torch_utils-0.0.11.tar.gz.
File metadata
- Download URL: cjm_substrate_torch_utils-0.0.11.tar.gz
- Upload date:
- Size: 9.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1056f2d7fa16526cdce5d038668ce787de3f8480817d69236de23c84acbec99b |
| MD5 | 08178717e6081bc6f9576996e16d614f |
| BLAKE2b-256 | 04b03d72c264e2dbc249a24e34df9acbb17a979a928089745b36c596f457805f |
File details
Details for the file cjm_substrate_torch_utils-0.0.11-py3-none-any.whl.
File metadata
- Download URL: cjm_substrate_torch_utils-0.0.11-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2520faa3a1ddfaa632cf964eea5740afeab1873a144be0b370b0e35056a0a10a |
| MD5 | 48c9577840f43c2ff64f313ce6812038 |
| BLAKE2b-256 | b58e66b3d6aef9b13d5d6e98670f0c33f4aed1510fd78e48c423bc93057f5923 |