OverflowML
Run AI models larger than your GPU. One line of code.
OverflowML auto-detects your hardware (NVIDIA, Apple Silicon, AMD, CPU) and applies the optimal memory strategy to load and run models that don't fit in VRAM. No manual configuration needed.
```python
import overflowml

pipe = load_your_model()  # 40GB model, 24GB GPU? No problem.
overflowml.optimize_pipeline(pipe, model_size_gb=40)
result = pipe(prompt)     # Just works.
```
The Problem
AI models are getting bigger. A single image generation model can be 40GB+. LLMs regularly hit 70GB-400GB. But most GPUs have 8-24GB of VRAM.
The current solutions are painful:
- Manual offloading — you need to know which PyTorch function to call, which flags work together, and which combinations crash
- Quantization footguns — FP8 is incompatible with CPU offload on Windows. Attention slicing crashes with sequential offload. INT4 needs specific libraries.
- Trial and error — every hardware/model/framework combo has different gotchas
OverflowML handles all of this automatically.
How It Works
```text
Model: 40GB (BF16)           Your GPU: 24GB VRAM
     │                                 │
     │   OverflowML detects mismatch   │
     │                                 │
┌────▼─────────────────────────────────▼────┐
│   Strategy: Sequential CPU Offload        │
│   Move 1 layer (~1GB) to GPU at a time,   │
│   compute, move back.                     │
│   Peak VRAM: ~3GB                         │
│   System RAM used: ~40GB                  │
│   Speed: 33s/image (RTX 5090)             │
└───────────────────────────────────────────┘
```
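Under the hood, sequential offload keeps the full model in system RAM and streams one layer at a time through the GPU. A rough sketch of the mechanism in plain PyTorch (illustrative only, not OverflowML's internals):

```python
import torch

def sequential_forward(layers, x, device="cuda"):
    """Stream layers through the GPU one at a time (sketch)."""
    for layer in layers:
        layer.to(device)         # copy this layer's weights (~1GB) into VRAM
        x = layer(x.to(device))  # compute on the GPU
        layer.to("cpu")          # evict the layer so the next one fits
    return x
```

Diffusers exposes this pattern as `enable_sequential_cpu_offload`, implemented with Accelerate hooks rather than an explicit loop, which is why peak VRAM stays near 3GB regardless of model size.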
Strategy Decision Tree
| Model vs VRAM | Strategy | Peak VRAM | Speed |
|---|---|---|---|
| Model fits with 15% headroom | Direct GPU load | Full | Fastest |
| FP8 model fits | FP8 quantization | ~55% of model | Fast |
| Components fit individually | Model CPU offload | ~70% of model | Medium |
| Nothing fits | Sequential CPU offload | ~3GB | Slower but works |
| Not enough RAM either | INT4 quantization + sequential | ~3GB | Slowest |
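In code, this cascade reads top to bottom as a series of capacity checks. A minimal sketch of the selection logic, with illustrative names and cutoffs taken from the table above (not the library's actual API):

```python
def pick_strategy(model_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Illustrative cascade mirroring the decision tree above."""
    if model_gb * 1.15 <= vram_gb:   # fits with 15% headroom
        return "direct_gpu"
    if model_gb * 0.55 <= vram_gb:   # FP8 shrinks the footprint to ~55%
        return "fp8_quantization"
    if model_gb * 0.70 <= vram_gb:   # rough proxy for "components fit individually"
        return "model_cpu_offload"
    if model_gb <= ram_gb:           # stream layers from system RAM
        return "sequential_cpu_offload"
    return "int4_plus_sequential"    # last resort: quantize, then stream
```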
Apple Silicon (Unified Memory)
On Macs, CPU and GPU share the same memory pool — there's nothing to "offload." OverflowML detects this and skips offloading entirely. If the model fits in ~75% of your RAM, it loads directly. If not, quantization is recommended.
| Mac | Unified Memory | Largest Model (4-bit) |
|---|---|---|
| M4 Max | 128GB | ~80B params |
| M2 Ultra | 192GB | ~120B params |
| M3 Ultra | 512GB | ~670B params |
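A minimal sketch of that fit check, assuming `psutil` for the RAM query (the function name and logic are illustrative, not part of OverflowML's API):

```python
import psutil
import torch

def fits_in_unified_memory(model_size_gb: float) -> bool:
    """True if the model fits in ~75% of a Mac's shared memory pool."""
    total_gb = psutil.virtual_memory().total / 1e9
    return torch.backends.mps.is_available() and model_size_gb <= 0.75 * total_gb
```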
Installation
```bash
pip install overflowml

# With diffusers support:
pip install overflowml[diffusers]

# With quantization:
pip install overflowml[all]
```
Usage
Diffusers Pipeline (Recommended)
```python
import torch
import overflowml
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)

# One line — auto-detects hardware, picks optimal strategy
strategy = overflowml.optimize_pipeline(pipe, model_size_gb=24)
print(strategy.summary())

result = pipe("a sunset over mountains", num_inference_steps=20)
```
Batch Generation with Memory Guard
```python
from overflowml import MemoryGuard

guard = MemoryGuard(threshold=0.7)  # cleanup at 70% VRAM usage

for i, prompt in enumerate(prompts):
    with guard:  # auto-cleans VRAM between iterations
        result = pipe(prompt)
        result.images[0].save(f"output_{i}.png")
```
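Conceptually, the guard samples VRAM pressure on exit and releases PyTorch's cached allocations once it crosses the threshold. A sketch of the idea, not MemoryGuard's actual implementation:

```python
import gc
import torch

class VRAMGuard:
    """Free cached VRAM once reserved memory crosses a threshold (sketch)."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        total = torch.cuda.get_device_properties(0).total_memory
        if torch.cuda.memory_reserved() / total > self.threshold:
            gc.collect()              # drop dangling Python references first
            torch.cuda.empty_cache()  # then return cached blocks to the driver
        return False                  # never swallow exceptions
```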
CLI — Hardware Detection
```console
$ overflowml detect

=== OverflowML Hardware Detection ===
Accelerator: cuda
GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
System RAM: 194GB
Overflow capacity: 178GB (total effective: 210GB)
BF16: yes | FP8: yes

$ overflowml plan 40

=== Strategy for 40GB model ===
Offload: sequential_cpu
Dtype: bfloat16
GC cleanup: enabled (threshold 70%)
Estimated peak VRAM: 3.0GB
→ Sequential offload: 1 layer at a time (~3GB VRAM), model lives in 194GB RAM
WARNING: FP8 incompatible with CPU offload on Windows
WARNING: Do NOT enable attention_slicing with sequential offload
```
Standalone Model
```python
import overflowml

model = load_my_transformer()
strategy = overflowml.optimize_model(model, model_size_gb=14)
```
Proven Results
Built and battle-tested on a real production pipeline:
| Metric | Before OverflowML | After |
|---|---|---|
| Time per step | 530s (VRAM thrashing) | 6.7s |
| Images generated | 0/30 (crashes) | 30/30 |
| Total time | Impossible | 16.4 minutes |
| Peak VRAM | 32GB (thrashing) | 3GB |
| Reliability | Crashes after 3 images | Zero failures |
40GB model on RTX 5090 (32GB VRAM) + 194GB RAM, sequential offload, Lightning LoRA 4-step
Known Incompatibilities
These are automatically handled by OverflowML's strategy engine:
| Combination | Issue | OverflowML Fix |
|---|---|---|
| FP8 + CPU offload (Windows) | Float8Tensor can't move between devices | Skips FP8, uses BF16 |
| attention_slicing + sequential offload | CUDA illegal memory access | Never enables both |
| enable_model_cpu_offload + 40GB transformer | Transformer exceeds VRAM | Uses sequential offload instead |
| expandable_segments on Windows WDDM | Not supported | Gracefully ignored |
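The first row, for example, reduces to a simple platform guard. A hypothetical sketch of that kind of check (illustrative, not OverflowML's code):

```python
import platform
import torch

def safe_dtype(wants_fp8: bool, offload_enabled: bool) -> torch.dtype:
    """Fall back to BF16 where FP8 tensors can't cross devices."""
    if wants_fp8 and offload_enabled and platform.system() == "Windows":
        return torch.bfloat16
    return torch.float8_e4m3fn if wants_fp8 else torch.bfloat16
```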
Architecture
```text
overflowml/
├── detect.py    — Hardware detection (CUDA, MPS, MLX, ROCm, CPU)
├── strategy.py  — Strategy engine (picks optimal offload + quantization)
├── optimize.py  — Applies strategy to pipelines and models
└── cli.py       — Command-line interface
```
Cross-Platform Support
| Platform | Accelerator | Status |
|---|---|---|
| Windows + NVIDIA | CUDA | Production-ready |
| Linux + NVIDIA | CUDA | Production-ready |
| macOS + Apple Silicon | MPS / MLX | Detection ready, optimization in progress |
| Linux + AMD | ROCm | Planned |
| CPU-only | CPU | Fallback always works |
License
MIT