
outofcuda 🚀

Zero-code CUDA memory & compute optimizer for PyTorch.
Install once. It works. No imports, no function calls, no config needed.



The problem

You're training a model, everything's going great, then:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.50 GiB.

Or your GPU is sitting at 40 % utilization because PyTorch's defaults were written for correctness, not speed.

outofcuda fixes both — automatically, the moment you install it.


Install

pip install outofcuda

That's it. The package ships a Python site hook (a .pth file) that runs at interpreter startup, before your first import torch, and arranges for every optimization listed below to be applied the moment torch is imported. You don't write a single line of code.


How it works

Python processes every .pth file in site-packages at interpreter startup. outofcuda installs outofcuda_hook.pth, which imports a tiny bootstrap module. The bootstrap attaches a meta-path watcher that fires once, right after torch is imported, and applies the full optimization suite:

Python starts
  └─ site processes outofcuda_hook.pth
       └─ _hook.py registers a meta-path watcher
            └─ torch is imported (by your code or a library)
                 └─ outofcuda.apply() runs automatically ✓

Everything is lazy — if torch is never imported, outofcuda does nothing. If CUDA is unavailable, it silently skips all GPU-specific steps.
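
For illustration, here is a minimal sketch of that mechanism, assuming hypothetical names (outofcuda's real bootstrap differs in detail): a meta-path finder that wraps torch's loader so a callback runs right after the module finishes executing.

import importlib.util
import sys

def apply_optimizations(torch_module):
    # stand-in for outofcuda.apply(); the real hook applies the suite below
    pass

class _TorchWatcher:
    """Fires the callback once, immediately after torch finishes importing."""

    def find_spec(self, name, path, target=None):
        if name != "torch":
            return None
        sys.meta_path.remove(self)               # fire once; avoid recursion
        spec = importlib.util.find_spec(name)    # delegate to the real finders
        if spec is None or spec.loader is None:
            return None
        real_exec = spec.loader.exec_module
        def exec_module(module):
            real_exec(module)                    # torch is now fully loaded
            apply_optimizations(module)
        spec.loader.exec_module = exec_module
        return spec

sys.meta_path.insert(0, _TorchWatcher())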


What gets optimized

1 · TF32 Matmul & cuDNN (Ampere+)

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32       = True

On RTX 30xx / A100 / H100, TF32 delivers up to 10× faster matrix multiplications with negligible precision loss for most models.

2 · cuDNN autotuner

torch.backends.cudnn.benchmark = True

cuDNN benchmarks the available convolution algorithms on the first batch and caches the fastest one for each input shape, a one-time cost that pays off across every subsequent forward pass as long as your input shapes stay fixed.

3 · CUDA memory allocator tuning

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,garbage_collection_threshold:0.8,expandable_segments:True

  • max_split_size_mb:128 — prevents the caching allocator from fragmenting large blocks, reducing OOM errors caused by fragmentation.
  • garbage_collection_threshold:0.8 — reclaims unused cached blocks once more than 80 % of the allowed memory is in use, instead of waiting for an expensive synchronize-and-free when an allocation fails.
  • expandable_segments:True — allows the allocator to grow segments on demand instead of reserving large chunks upfront.
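
The allocator reads this variable when CUDA is first initialized, so it has to be set before anything touches the GPU, which is exactly why a startup hook works. Done by hand it would look like this (same values as above):

import os

# must be set before torch initializes CUDA
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "max_split_size_mb:128,garbage_collection_threshold:0.8,expandable_segments:True",
)

import torch  # the caching allocator picks the config up on first CUDA use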

4 · Per-device memory fraction

torch.cuda.set_per_process_memory_fraction(0.95, device=i)

Prevents a single process from reserving 100 % of VRAM, leaving headroom for the driver, NCCL buffers, and peer processes.
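
Since the call is per device, applying it everywhere is a short loop; a sketch of the equivalent:

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        # cap this process at 95 % of each device's total VRAM
        torch.cuda.set_per_process_memory_fraction(0.95, device=i)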

5 · Scaled Dot-Product Attention backends

torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)

Enables all three SDPA backends — PyTorch picks Flash Attention when possible (O(N) memory vs O(N²)), falling back to memory-efficient or math attention automatically.
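
With the backends enabled, any call to torch.nn.functional.scaled_dot_product_attention benefits with no further changes. For example (shapes here are arbitrary):

import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# dispatches to Flash Attention when shape, dtype, and hardware allow it
out = F.scaled_dot_product_attention(q, k, v)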

6 · Automatic bfloat16 on Ampere+

When the GPU supports it (compute capability ≥ 8.0), outofcuda switches the default AMP dtype from float16 to bfloat16, giving better numerical range without gradient underflow.
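
The check amounts to two lines of capability sniffing (a sketch, not outofcuda's exact code):

import torch

major, _ = torch.cuda.get_device_capability()  # e.g. (8, 0) on A100
amp_dtype = torch.bfloat16 if major >= 8 else torch.float16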

7 · OOM recovery hook

A sys.excepthook wrapper catches an unhandled torch.cuda.OutOfMemoryError and immediately clears the CUDA cache and runs Python GC, so GPU memory is released promptly instead of staying pinned while the process shuts down.
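
A minimal sketch of how such a hook can be wired up (the real implementation presumably chains to any previously installed excepthook, as this one does):

import gc
import sys

import torch

_previous_hook = sys.excepthook

def _oom_excepthook(exc_type, exc, tb):
    if issubclass(exc_type, torch.cuda.OutOfMemoryError):
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # then release cached CUDA blocks
    _previous_hook(exc_type, exc, tb)

sys.excepthook = _oom_excepthook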


Optional: use the Python API for more control

Even though you don't need to write code, outofcuda exposes a full API for power users.

AMP context manager

import outofcuda

with outofcuda.autocast_context():
    logits = model(input_ids)
    loss = criterion(logits, labels)

Automatically picks bfloat16 on Ampere+ and float16 on older cards.
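
Under the hood this is plausibly just torch.autocast with the dtype chosen by compute capability; a hypothetical sketch:

import torch

def autocast_context_sketch():
    # approximation of outofcuda.autocast_context(), not its actual source
    major, _ = torch.cuda.get_device_capability()
    dtype = torch.bfloat16 if major >= 8 else torch.float16
    return torch.autocast(device_type="cuda", dtype=dtype)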

Optimize a model

model = outofcuda.compile_model(model)

Converts to channels-last memory layout (better tensor-core utilization) and, if OUTOFCUDA_COMPILE=1 is set, wraps with torch.compile(mode="reduce-overhead").
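
In plain PyTorch terms that is roughly the following (a hypothetical sketch, not the library's source):

import os

import torch

def compile_model_sketch(model: torch.nn.Module) -> torch.nn.Module:
    # channels-last improves tensor-core utilization for convolutional nets
    model = model.to(memory_format=torch.channels_last)
    if os.environ.get("OUTOFCUDA_COMPILE", "0") == "1":
        model = torch.compile(model, mode="reduce-overhead")
    return model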

Smart DataLoader

loader = outofcuda.smart_dataloader(
    dataset,
    batch_size=64,
    shuffle=True,
)

Pre-configured with pin_memory=True, num_workers=4, prefetch_factor=2, and persistent_workers=True — the settings most people forget to set.
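
It is roughly equivalent to spelling those settings out on a plain DataLoader:

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,          # faster host-to-device copies
    num_workers=4,            # load samples in parallel processes
    prefetch_factor=2,        # batches queued ahead per worker
    persistent_workers=True,  # keep workers alive across epochs
)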

Memory report

print(outofcuda.memory_report())
# {
#   "cuda:0": {
#     "name": "NVIDIA A100-SXM4-80GB",
#     "total_gb": 80.0,
#     "allocated_gb": 12.4,
#     "reserved_gb": 14.1,
#     "free_gb": 65.9,
#     "peak_allocated_gb": 18.7
#   }
# }

Clear CUDA cache

outofcuda.clear_cache()   # empty_cache() + gc.collect()

VRAM watchdog

from outofcuda import MemoryMonitor

with MemoryMonitor(threshold=0.85, interval=2.0):
    train(model, loader)
# auto-clears cache whenever VRAM > 85 %
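
Internally, a watchdog like this amounts to a polling loop on a background thread; a hypothetical sketch of the idea:

import gc
import threading

import torch

def _watch(threshold: float, interval: float, stop: threading.Event) -> None:
    total = torch.cuda.get_device_properties(0).total_memory
    while not stop.wait(interval):  # wake up every `interval` seconds
        if torch.cuda.memory_reserved(0) / total > threshold:
            gc.collect()
            torch.cuda.empty_cache()  # release cached, unused blocks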

All-in-one class

opt = outofcuda.CudaOptimizer()

model  = opt.prepare(model)          # channels-last + compile
loader = opt.dataloader(dataset)     # smart DataLoader
with opt.autocast():                 # AMP
    loss = model(x)
opt.clear()                          # cache flush
print(opt.report())                  # VRAM stats

Environment variables

Every knob is tunable via env vars — no code changes needed.

Variable                        Default           Description
OUTOFCUDA_DISABLE               0                 Set to 1 to fully disable
OUTOFCUDA_VERBOSE               0                 Print optimization report on startup
OUTOFCUDA_TF32                  1                 Enable TF32 for matmul / cuDNN
OUTOFCUDA_CUDNN_BENCHMARK       1                 Enable cuDNN autotuner
OUTOFCUDA_CUDNN_DETERMINISTIC   0                 Force deterministic cuDNN
OUTOFCUDA_CHANNELS_LAST         1                 Prefer channels-last layout
OUTOFCUDA_COMPILE               0                 Use torch.compile() in compile_model()
OUTOFCUDA_COMPILE_MODE          reduce-overhead   torch.compile mode
OUTOFCUDA_COMPILE_FULLGRAPH     0                 fullgraph=True for compile
OUTOFCUDA_AMP_DTYPE             float16           AMP dtype (float16 / bfloat16)
OUTOFCUDA_MEMORY_FRACTION       0.95              Max VRAM fraction per device
OUTOFCUDA_ALLOC_CONF            (see above)       PYTORCH_CUDA_ALLOC_CONF string
OUTOFCUDA_GC_COLLECT            1                 Run gc.collect() on cache clear
OUTOFCUDA_FLASH_ATTN            1                 Enable Flash Attention SDPA backend
OUTOFCUDA_MEM_EFF_ATTN          1                 Enable memory-efficient SDPA backend
OUTOFCUDA_MATH_ATTN             1                 Enable math SDPA backend
OUTOFCUDA_PIN_MEMORY            1                 pin_memory for smart DataLoader
OUTOFCUDA_NUM_WORKERS           4                 num_workers for smart DataLoader
OUTOFCUDA_PREFETCH_FACTOR       2                 prefetch_factor for smart DataLoader

Example — disable everything except the allocator fix:

OUTOFCUDA_TF32=0 \
OUTOFCUDA_CUDNN_BENCHMARK=0 \
OUTOFCUDA_CHANNELS_LAST=0 \
python train.py

Expected gains

Results vary by GPU generation, model architecture, and batch size. Here are typical observations on an A100-80 GB:

Optimization       Speedup / Saving
TF32 matmul        2–10× on dense layers
cuDNN benchmark    5–20 % on conv-heavy models
Flash Attention    2–4× less VRAM, 1.5–3× faster attention
channels-last      10–30 % faster on CNNs
AMP bfloat16       1.5–2× throughput, ~50 % less VRAM
Allocator tuning   Fewer OOM crashes, less fragmentation

Requirements

  • Python ≥ 3.9
  • PyTorch ≥ 2.0 (optional — outofcuda installs without torch)
  • CUDA ≥ 11.8 (optional — CPU-only is safe)

No other dependencies.


Design philosophy

The best optimization is the one that happens without you thinking about it.

outofcuda is deliberately a zero-footprint library:

  • No monkey-patching of torch internals
  • No global state mutation beyond documented CUDA / cuDNN flags
  • No background threads unless you explicitly start MemoryMonitor
  • Safe to import in tests, notebooks, scripts, and production services
  • Can be switched off entirely with OUTOFCUDA_DISABLE=1

Contributing

git clone https://github.com/outofcuda/outofcuda
cd outofcuda
pip install -e ".[dev]"
pytest

Issues and PRs welcome.


License

MIT © outofcuda contributors
