outofcuda 🚀
Zero-code CUDA memory & compute optimizer for PyTorch.
Install once. It works. No imports, no function calls, no config needed.
The problem
You're training a model, everything's going great, then:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.50 GiB.
Or your GPU is sitting at 40 % utilization because PyTorch's defaults were written for correctness, not speed.
outofcuda fixes both — automatically, the moment you install it.
Install
pip install outofcuda
That's it. The library installs a Python site-hook (.pth file) that fires
before your first import torch, applying every optimization listed below.
You don't write a single line of code.
How it works
Python processes every .pth file in site-packages at interpreter startup.
outofcuda installs outofcuda_hook.pth, which imports a tiny bootstrap
module. The bootstrap attaches a meta-path watcher that fires once, right
after torch is imported, and applies the full optimization suite:
Python starts
└─ site processes outofcuda_hook.pth
└─ _hook.py registers a meta-path watcher
└─ torch is imported (by your code or a library)
└─ outofcuda.apply() runs automatically ✓
Everything is lazy — if torch is never imported, outofcuda does nothing. If CUDA is unavailable, it silently skips all GPU-specific steps.
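For the curious, the post-import hook pattern described above can be sketched like this. This is a minimal illustration with made-up names (`_PostImportFinder`, `_apply`), not outofcuda's actual internals:

```python
import importlib.abc
import importlib.util
import sys

class _PostImportFinder(importlib.abc.MetaPathFinder):
    """Runs a callback immediately after `torch` finishes importing."""

    _busy = False

    def find_spec(self, fullname, path=None, target=None):
        if fullname != "torch" or self._busy:
            return None
        self._busy = True
        try:
            # Ask the remaining finders for torch's real spec.
            spec = importlib.util.find_spec(fullname)
        finally:
            self._busy = False
        if spec is None or spec.loader is None:
            return None
        real_exec = spec.loader.exec_module

        def exec_module(module):
            real_exec(module)           # torch's normal import runs first
            sys.meta_path.remove(self)  # fire exactly once
            _apply(module)

        spec.loader.exec_module = exec_module
        return spec

def _apply(torch):
    if torch.cuda.is_available():
        torch.backends.cuda.matmul.allow_tf32 = True  # ...and the rest

sys.meta_path.insert(0, _PostImportFinder())
```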
What gets optimized
1 · TF32 Matmul & cuDNN (Ampere+)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
On RTX 30xx / A100 / H100, TF32 delivers up to 10× faster matrix multiplications with negligible precision loss for most models.
2 · cuDNN autotuner
torch.backends.cudnn.benchmark = True
cuDNN runs a short benchmark on the first batch to select the fastest convolution algorithm for your input shapes, a one-time cost that pays off across every subsequent forward pass, provided those shapes stay fixed. With highly variable input sizes the autotuner keeps re-benchmarking and can actually slow things down.
3 · CUDA memory allocator tuning
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,garbage_collection_threshold:0.8,expandable_segments:True
- `max_split_size_mb:128` prevents the caching allocator from fragmenting large blocks, reducing OOM errors caused by fragmentation.
- `garbage_collection_threshold:0.8` triggers internal GC when 80 % of reserved memory is in use, proactively recovering dead tensors.
- `expandable_segments:True` allows the allocator to grow segments on demand instead of reserving large chunks upfront.
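If you wanted the same behavior without outofcuda, the variable has to be exported before PyTorch initializes its CUDA allocator, which is exactly why the import hook fires so early. A hand-rolled equivalent:

```python
import os

# Must be set before torch initializes the CUDA caching allocator,
# i.e. before the first tensor lands on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:128,"
    "garbage_collection_threshold:0.8,"
    "expandable_segments:True"
)

import torch
x = torch.zeros(1, device="cuda")  # allocator now uses the tuned config
```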
4 · Per-device memory fraction
for i in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.95, device=i)
Prevents a single process from reserving 100 % of VRAM, leaving headroom for the driver, NCCL buffers, and peer processes.
5 · Scaled Dot-Product Attention backends
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
Enables all three SDPA backends — PyTorch picks Flash Attention when possible (O(N) memory vs O(N²)), falling back to memory-efficient or math attention automatically.
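Any code that routes attention through PyTorch's built-in entry point picks up the backend selection automatically. For example:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Dispatches to Flash / memory-efficient / math attention depending on
# which backends are enabled and on the input shapes and dtypes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```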
6 · Automatic bfloat16 on Ampere+
When the GPU supports it (compute capability ≥ 8.0), outofcuda switches
the default AMP dtype from float16 → bfloat16, giving better numerical
range without gradient underflow.
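The capability check behind this is straightforward. A sketch of the idea (the function name is illustrative, not outofcuda's API):

```python
import torch

def pick_amp_dtype() -> torch.dtype:
    """bfloat16 on Ampere+ (compute capability >= 8.0), else float16."""
    if torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 8:
            return torch.bfloat16
    return torch.float16
```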
7 · OOM recovery hook
A sys.excepthook wrapper catches torch.cuda.OutOfMemoryError and
immediately clears the CUDA cache + runs Python GC, giving your process a
fighting chance before the kernel kills it.
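Conceptually the wrapper looks like this (a simplified sketch, not the shipped code):

```python
import gc
import sys
import torch

_previous_hook = sys.excepthook

def _oom_excepthook(exc_type, exc, tb):
    if issubclass(exc_type, torch.cuda.OutOfMemoryError):
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
    _previous_hook(exc_type, exc, tb)

sys.excepthook = _oom_excepthook
```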
Optional: use the Python API for more control
Even though you don't need to write code, outofcuda exposes a full API for power users.
AMP context manager
import outofcuda

with outofcuda.autocast_context():
    logits = model(input_ids)
    loss = criterion(logits, labels)
Automatically picks bfloat16 on Ampere+ and float16 on older cards.
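Under the hood this is presumably a thin wrapper around torch.autocast; the plain-PyTorch equivalent would be:

```python
import torch

# bfloat16 on Ampere+ (assumes CUDA is available), float16 otherwise.
dtype = (torch.bfloat16
         if torch.cuda.get_device_capability()[0] >= 8
         else torch.float16)

with torch.autocast(device_type="cuda", dtype=dtype):
    logits = model(input_ids)
    loss = criterion(logits, labels)
```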
Optimize a model
model = outofcuda.compile_model(model)
Converts to channels-last memory layout (better tensor-core utilization) and,
if OUTOFCUDA_COMPILE=1 is set, wraps with torch.compile(mode="reduce-overhead").
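In plain PyTorch the same two steps look roughly like this (assuming a model where channels-last applies, e.g. a CNN):

```python
import os
import torch

# Channels-last improves tensor-core utilization for conv-heavy models.
model = model.to(memory_format=torch.channels_last)

# Opt-in compilation, mirroring the OUTOFCUDA_COMPILE switch.
if os.environ.get("OUTOFCUDA_COMPILE") == "1":
    model = torch.compile(model, mode="reduce-overhead")
```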
Smart DataLoader
loader = outofcuda.smart_dataloader(
    dataset,
    batch_size=64,
    shuffle=True,
)
Pre-configured with pin_memory=True, num_workers=4, prefetch_factor=2,
and persistent_workers=True — the settings most people forget to set.
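For reference, the equivalent hand-written DataLoader with those defaults would be:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,          # faster host-to-device copies
    num_workers=4,            # parallel preprocessing
    prefetch_factor=2,        # batches queued per worker
    persistent_workers=True,  # keep workers alive across epochs
)
```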
Memory report
print(outofcuda.memory_report())
# {
#   "cuda:0": {
#     "name": "NVIDIA A100-SXM4-80GB",
#     "total_gb": 80.0,
#     "allocated_gb": 12.4,
#     "reserved_gb": 14.1,
#     "free_gb": 65.9,
#     "peak_allocated_gb": 18.7
#   }
# }
Clear CUDA cache
outofcuda.clear_cache() # empty_cache() + gc.collect()
VRAM watchdog
from outofcuda import MemoryMonitor

with MemoryMonitor(threshold=0.85, interval=2.0):
    train(model, loader)
# auto-clears cache whenever VRAM > 85 %
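A watchdog like this can be approximated with a daemon thread that polls the allocator. A rough sketch (not the shipped class):

```python
import threading
import torch

def vram_watchdog(stop: threading.Event, threshold: float = 0.85,
                  interval: float = 2.0) -> None:
    """Flush the CUDA cache whenever reserved VRAM crosses the threshold."""
    total = torch.cuda.get_device_properties(0).total_memory
    while not stop.wait(interval):
        if torch.cuda.memory_reserved(0) / total > threshold:
            torch.cuda.empty_cache()

stop = threading.Event()
threading.Thread(target=vram_watchdog, args=(stop,), daemon=True).start()
# ... training loop ...
stop.set()  # shut the watchdog down when done
```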
All-in-one class
opt = outofcuda.CudaOptimizer()

model = opt.prepare(model)        # channels-last + compile
loader = opt.dataloader(dataset)  # smart DataLoader

with opt.autocast():              # AMP
    loss = model(x)

opt.clear()                       # cache flush
print(opt.report())               # VRAM stats
Environment variables
Every knob is tunable via env vars — no code changes needed.
| Variable | Default | Description |
|---|---|---|
| `OUTOFCUDA_DISABLE` | `0` | Set to `1` to fully disable |
| `OUTOFCUDA_VERBOSE` | `0` | Print optimization report on startup |
| `OUTOFCUDA_TF32` | `1` | Enable TF32 for matmul / cuDNN |
| `OUTOFCUDA_CUDNN_BENCHMARK` | `1` | Enable cuDNN autotuner |
| `OUTOFCUDA_CUDNN_DETERMINISTIC` | `0` | Force deterministic cuDNN |
| `OUTOFCUDA_CHANNELS_LAST` | `1` | Prefer channels-last layout |
| `OUTOFCUDA_COMPILE` | `0` | Use `torch.compile()` in `compile_model()` |
| `OUTOFCUDA_COMPILE_MODE` | `reduce-overhead` | `torch.compile` mode |
| `OUTOFCUDA_COMPILE_FULLGRAPH` | `0` | `fullgraph=True` for compile |
| `OUTOFCUDA_AMP_DTYPE` | `float16` | AMP dtype (`float16` / `bfloat16`) |
| `OUTOFCUDA_MEMORY_FRACTION` | `0.95` | Max VRAM fraction per device |
| `OUTOFCUDA_ALLOC_CONF` | (see above) | `PYTORCH_CUDA_ALLOC_CONF` string |
| `OUTOFCUDA_GC_COLLECT` | `1` | Run `gc.collect()` on cache clear |
| `OUTOFCUDA_FLASH_ATTN` | `1` | Enable Flash Attention SDPA backend |
| `OUTOFCUDA_MEM_EFF_ATTN` | `1` | Enable memory-efficient SDPA backend |
| `OUTOFCUDA_MATH_ATTN` | `1` | Enable math SDPA backend |
| `OUTOFCUDA_PIN_MEMORY` | `1` | `pin_memory` for smart DataLoader |
| `OUTOFCUDA_NUM_WORKERS` | `4` | `num_workers` for smart DataLoader |
| `OUTOFCUDA_PREFETCH_FACTOR` | `2` | `prefetch_factor` for smart DataLoader |
Example — disable everything except the allocator fix:
OUTOFCUDA_TF32=0 \
OUTOFCUDA_CUDNN_BENCHMARK=0 \
OUTOFCUDA_CHANNELS_LAST=0 \
python train.py
Expected gains
Results vary by GPU generation, model architecture, and batch size. Here are typical observations on an A100-80 GB:
| Optimization | Speedup / Saving |
|---|---|
| TF32 matmul | 2–10× on dense layers |
| cuDNN benchmark | 5–20 % on conv-heavy models |
| Flash Attention | 2–4× less VRAM, 1.5–3× faster attention |
| channels-last | 10–30 % faster on CNNs |
| AMP bfloat16 | 1.5–2× throughput, ~50 % less VRAM |
| Allocator tuning | Fewer OOM crashes, less fragmentation |
Requirements
- Python ≥ 3.9
- PyTorch ≥ 2.0 (optional — outofcuda installs without torch)
- CUDA ≥ 11.8 (optional — CPU-only is safe)
No other dependencies.
Design philosophy
The best optimization is the one that happens without you thinking about it.
outofcuda is deliberately a zero-footprint library:
- No monkey-patching of `torch` internals
- No global state mutation beyond the documented CUDA / cuDNN flags
- No background threads unless you explicitly start `MemoryMonitor`
- Safe to import in tests, notebooks, scripts, and production services
- Fully disableable with `OUTOFCUDA_DISABLE=1`
Contributing
git clone https://github.com/outofcuda/outofcuda
cd outofcuda
pip install -e ".[dev]"
pytest
Issues and PRs welcome.
License
MIT © outofcuda contributors