outofcuda 🚀
Zero-code CUDA memory & compute optimizer for PyTorch.
Install once. It works. No imports, no function calls, no config needed.
The problem
You're training a model, everything's going great, then:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.50 GiB.
Or your GPU is sitting at 40 % utilization because PyTorch's defaults were written for correctness, not speed.
outofcuda fixes both — automatically, the moment you install it.
Install
pip install outofcuda
That's it. The library installs a Python site-hook (.pth file) that fires
before your first import torch, applying every optimization listed below.
You don't write a single line of code.
How it works
Python processes every .pth file in site-packages at interpreter startup.
outofcuda installs outofcuda_hook.pth, which imports a tiny bootstrap
module. The bootstrap attaches a meta-path watcher that fires once, right
after torch is imported, and applies the full optimization suite:
Python starts
└─ site processes outofcuda_hook.pth
└─ _hook.py registers a meta-path watcher
└─ torch is imported (by your code or a library)
└─ outofcuda.apply() runs automatically ✓
Everything is lazy — if torch is never imported, outofcuda does nothing. If CUDA is unavailable, it silently skips all GPU-specific steps.
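For the curious, the post-import hook pattern described above can be sketched like this. This is a minimal illustration with made-up names (`_PostImportFinder`, `_apply`), not outofcuda's actual internals:

```python
import importlib.abc
import importlib.util
import sys

class _PostImportFinder(importlib.abc.MetaPathFinder):
    """Runs a callback immediately after `torch` finishes importing."""

    _busy = False

    def find_spec(self, fullname, path=None, target=None):
        if fullname != "torch" or self._busy:
            return None
        self._busy = True
        try:
            # Ask the remaining finders for torch's real spec.
            spec = importlib.util.find_spec(fullname)
        finally:
            self._busy = False
        if spec is None or spec.loader is None:
            return None
        real_exec = spec.loader.exec_module

        def exec_module(module):
            real_exec(module)           # torch's normal import runs first
            sys.meta_path.remove(self)  # fire exactly once
            _apply(module)

        spec.loader.exec_module = exec_module
        return spec

def _apply(torch):
    if torch.cuda.is_available():
        torch.backends.cuda.matmul.allow_tf32 = True  # ...and the rest

sys.meta_path.insert(0, _PostImportFinder())
```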
What gets optimized
1 · TF32 Matmul & cuDNN (Ampere+)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
On RTX 30xx / A100 / H100, TF32 delivers up to 10× faster matrix multiplications with negligible precision loss for most models.
2 · cuDNN autotuner
torch.backends.cudnn.benchmark = True
cuDNN runs a short benchmark on the first batch to select the fastest convolution algorithm for your input shapes, a one-time cost that pays off across every subsequent forward pass, provided those shapes stay fixed. With highly variable input sizes the autotuner keeps re-benchmarking and can actually slow things down.
3 · CUDA memory allocator tuning
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,garbage_collection_threshold:0.8,expandable_segments:True
- `max_split_size_mb:128` prevents the caching allocator from fragmenting large blocks, reducing OOM errors caused by fragmentation.
- `garbage_collection_threshold:0.8` triggers internal GC when 80 % of reserved memory is in use, proactively recovering dead tensors.
- `expandable_segments:True` allows the allocator to grow segments on demand instead of reserving large chunks upfront.
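If you wanted the same behavior without outofcuda, the variable has to be exported before PyTorch initializes its CUDA allocator, which is exactly why the import hook fires so early. A hand-rolled equivalent:

```python
import os

# Must be set before torch initializes the CUDA caching allocator,
# i.e. before the first tensor lands on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:128,"
    "garbage_collection_threshold:0.8,"
    "expandable_segments:True"
)

import torch
x = torch.zeros(1, device="cuda")  # allocator now uses the tuned config
```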
4 · Per-device memory fraction
for i in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.95, device=i)
Prevents a single process from reserving 100 % of VRAM, leaving headroom for the driver, NCCL buffers, and peer processes.
5 · Scaled Dot-Product Attention backends
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
Enables all three SDPA backends — PyTorch picks Flash Attention when possible (O(N) memory vs O(N²)), falling back to memory-efficient or math attention automatically.
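Any code that routes attention through PyTorch's built-in entry point picks up the backend selection automatically. For example:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Dispatches to Flash / memory-efficient / math attention depending on
# which backends are enabled and on the input shapes and dtypes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```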
6 · Automatic bfloat16 on Ampere+
When the GPU supports it (compute capability ≥ 8.0), outofcuda switches
the default AMP dtype from float16 → bfloat16, giving better numerical
range without gradient underflow.
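The capability check behind this is straightforward. A sketch of the idea (the function name is illustrative, not outofcuda's API):

```python
import torch

def pick_amp_dtype() -> torch.dtype:
    """bfloat16 on Ampere+ (compute capability >= 8.0), else float16."""
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 8:
            return torch.bfloat16
    return torch.float16
```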
7 · OOM recovery hook
A sys.excepthook wrapper catches torch.cuda.OutOfMemoryError and
immediately clears the CUDA cache + runs Python GC, giving your process a
fighting chance before the kernel kills it.
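Conceptually the wrapper looks like this (a simplified sketch, not the shipped code):

```python
import gc
import sys
import torch

_previous_hook = sys.excepthook

def _oom_excepthook(exc_type, exc, tb):
    if issubclass(exc_type, torch.cuda.OutOfMemoryError):
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
    _previous_hook(exc_type, exc, tb)

sys.excepthook = _oom_excepthook
```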
Optional: use the Python API for more control
Even though you don't need to write code, outofcuda exposes a full API for power users.
AMP context manager
import outofcuda

with outofcuda.autocast_context():
    logits = model(input_ids)
    loss = criterion(logits, labels)
Automatically picks bfloat16 on Ampere+ and float16 on older cards.
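Under the hood this is presumably a thin wrapper around torch.autocast; the plain-PyTorch equivalent would be:

```python
import torch

# bfloat16 on Ampere+ (assumes CUDA is available), float16 otherwise.
dtype = (torch.bfloat16
         if torch.cuda.get_device_capability()[0] >= 8
         else torch.float16)

with torch.autocast(device_type="cuda", dtype=dtype):
    logits = model(input_ids)
    loss = criterion(logits, labels)
```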
Optimize a model
model = outofcuda.compile_model(model)
Converts to channels-last memory layout (better tensor-core utilization) and,
if OUTOFCUDA_COMPILE=1 is set, wraps with torch.compile(mode="reduce-overhead").
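In plain PyTorch the same two steps look roughly like this (assuming a model where channels-last applies, e.g. a CNN):

```python
import os
import torch

# Channels-last improves tensor-core utilization for conv-heavy models.
model = model.to(memory_format=torch.channels_last)

# Opt-in compilation, mirroring the OUTOFCUDA_COMPILE switch.
if os.environ.get("OUTOFCUDA_COMPILE") == "1":
    model = torch.compile(model, mode="reduce-overhead")
```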
Smart DataLoader
loader = outofcuda.smart_dataloader(
    dataset,
    batch_size=64,
    shuffle=True,
)
Pre-configured with pin_memory=True, num_workers=4, prefetch_factor=2,
and persistent_workers=True — the settings most people forget to set.
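For reference, the equivalent hand-written DataLoader with those defaults would be:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,          # faster host-to-device copies
    num_workers=4,            # parallel preprocessing
    prefetch_factor=2,        # batches queued per worker
    persistent_workers=True,  # keep workers alive across epochs
)
```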
Memory report
print(outofcuda.memory_report())
# {
#   "cuda:0": {
#     "name": "NVIDIA A100-SXM4-80GB",
#     "total_gb": 80.0,
#     "allocated_gb": 12.4,
#     "reserved_gb": 14.1,
#     "free_gb": 65.9,
#     "peak_allocated_gb": 18.7
#   }
# }
Clear CUDA cache
outofcuda.clear_cache() # empty_cache() + gc.collect()
VRAM watchdog
from outofcuda import MemoryMonitor

with MemoryMonitor(threshold=0.85, interval=2.0):
    train(model, loader)
# auto-clears cache whenever VRAM > 85 %
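A watchdog like this can be approximated with a daemon thread that polls the allocator. A rough sketch (not the shipped class):

```python
import threading
import torch

def vram_watchdog(stop: threading.Event, threshold: float = 0.85,
                  interval: float = 2.0) -> None:
    """Flush the CUDA cache whenever reserved VRAM crosses the threshold."""
    total = torch.cuda.get_device_properties(0).total_memory
    while not stop.wait(interval):
        if torch.cuda.memory_reserved(0) / total > threshold:
            torch.cuda.empty_cache()

stop = threading.Event()
threading.Thread(target=vram_watchdog, args=(stop,), daemon=True).start()
# ... training loop ...
stop.set()  # shut the watchdog down when done
```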
All-in-one class
opt = outofcuda.CudaOptimizer()

model = opt.prepare(model)        # channels-last + compile
loader = opt.dataloader(dataset)  # smart DataLoader

with opt.autocast():              # AMP
    loss = model(x)

opt.clear()                       # cache flush
print(opt.report())               # VRAM stats
Environment variables
Every knob is tunable via env vars — no code changes needed.
| Variable | Default | Description |
|---|---|---|
| `OUTOFCUDA_DISABLE` | `0` | Set to `1` to fully disable |
| `OUTOFCUDA_VERBOSE` | `0` | Print optimization report on startup |
| `OUTOFCUDA_TF32` | `1` | Enable TF32 for matmul / cuDNN |
| `OUTOFCUDA_CUDNN_BENCHMARK` | `1` | Enable cuDNN autotuner |
| `OUTOFCUDA_CUDNN_DETERMINISTIC` | `0` | Force deterministic cuDNN |
| `OUTOFCUDA_CHANNELS_LAST` | `1` | Prefer channels-last layout |
| `OUTOFCUDA_COMPILE` | `0` | Use `torch.compile()` in `compile_model()` |
| `OUTOFCUDA_COMPILE_MODE` | `reduce-overhead` | `torch.compile` mode |
| `OUTOFCUDA_COMPILE_FULLGRAPH` | `0` | `fullgraph=True` for compile |
| `OUTOFCUDA_AMP_DTYPE` | `float16` | AMP dtype (`float16` / `bfloat16`) |
| `OUTOFCUDA_MEMORY_FRACTION` | `0.95` | Max VRAM fraction per device |
| `OUTOFCUDA_ALLOC_CONF` | (see above) | `PYTORCH_CUDA_ALLOC_CONF` string |
| `OUTOFCUDA_GC_COLLECT` | `1` | Run `gc.collect()` on cache clear |
| `OUTOFCUDA_FLASH_ATTN` | `1` | Enable Flash Attention SDPA backend |
| `OUTOFCUDA_MEM_EFF_ATTN` | `1` | Enable memory-efficient SDPA backend |
| `OUTOFCUDA_MATH_ATTN` | `1` | Enable math SDPA backend |
| `OUTOFCUDA_PIN_MEMORY` | `1` | `pin_memory` for smart DataLoader |
| `OUTOFCUDA_NUM_WORKERS` | `4` | `num_workers` for smart DataLoader |
| `OUTOFCUDA_PREFETCH_FACTOR` | `2` | `prefetch_factor` for smart DataLoader |
Example — disable everything except the allocator fix:
OUTOFCUDA_TF32=0 \
OUTOFCUDA_CUDNN_BENCHMARK=0 \
OUTOFCUDA_CHANNELS_LAST=0 \
python train.py
Expected gains
Results vary by GPU generation, model architecture, and batch size. Here are typical observations on an A100-80 GB:
| Optimization | Speedup / Saving |
|---|---|
| TF32 matmul | 2–10× on dense layers |
| cuDNN benchmark | 5–20 % on conv-heavy models |
| Flash Attention | 2–4× less VRAM, 1.5–3× faster attention |
| channels-last | 10–30 % faster on CNNs |
| AMP bfloat16 | 1.5–2× throughput, ~50 % less VRAM |
| Allocator tuning | Fewer OOM crashes, less fragmentation |
Requirements
- Python ≥ 3.9
- PyTorch ≥ 2.0 (optional — outofcuda installs without torch)
- CUDA ≥ 11.8 (optional — CPU-only is safe)
No other dependencies.
Design philosophy
The best optimization is the one that happens without you thinking about it.
outofcuda is deliberately a zero-footprint library:
- No monkey-patching of `torch` internals
- No global state mutation beyond the documented CUDA / cuDNN flags
- No background threads unless you explicitly start `MemoryMonitor`
- Safe to import in tests, notebooks, scripts, and production services
- Fully disableable with `OUTOFCUDA_DISABLE=1`
Contributing
git clone https://github.com/outofcuda/outofcuda
cd outofcuda
pip install -e ".[dev]"
pytest
Issues and PRs welcome.
License
MIT © outofcuda contributors