Skip to main content

Run AI models larger than your GPU. Auto-detects hardware and applies optimal memory strategy.

Project description

OverflowML

Tests PyPI Python License: MIT

Run AI models larger than your GPU. One line of code.

OverflowML auto-detects your hardware (NVIDIA, Apple Silicon, AMD, CPU) and applies the optimal memory strategy to load and run models that don't fit in VRAM. No manual configuration needed.

import overflowml

pipe = load_your_model()  # 40GB model, 24GB GPU? No problem.
overflowml.optimize_pipeline(pipe, model_size_gb=40)
result = pipe(prompt)     # Just works.

The Problem

AI models are getting bigger. A single image generation model can be 40GB+. LLMs regularly hit 70GB-400GB. But most GPUs have 8-24GB of VRAM.

The current solutions are painful:

  • Manual offloading — you need to know which PyTorch function to call, which flags work together, and which combinations crash
  • Quantization footguns — FP8 is incompatible with CPU offload on Windows. Attention slicing crashes with sequential offload. INT4 needs specific libraries.
  • Trial and error — every hardware/model/framework combo has different gotchas

OverflowML handles all of this automatically.

How It Works

Model: 40GB (BF16)          Your GPU: 24GB VRAM
         │                           │
    OverflowML detects mismatch      │
         │                           │
    ┌────▼────────────────────────────▼────┐
    │  Strategy: Sequential CPU Offload    │
    │  Move 1 layer (~1GB) to GPU at a    │
    │  time, compute, move back.          │
    │  Peak VRAM: ~3GB                     │
    │  System RAM used: ~40GB              │
    │  Speed: 33s/image (RTX 5090)        │
    └──────────────────────────────────────┘

Strategy Decision Tree

Model vs VRAM Strategy Peak VRAM Speed
Model fits with 15% headroom Direct GPU load Full Fastest
FP8 model fits FP8 quantization ~55% of model Fast
Components fit individually Model CPU offload ~70% of model Medium
Nothing fits Sequential CPU offload ~3GB Slower but works
Not enough RAM either INT4 quantization + sequential ~3GB Slowest

Apple Silicon (Unified Memory)

On Macs, CPU and GPU share the same memory pool — there's nothing to "offload." OverflowML detects this and skips offloading entirely. If the model fits in ~75% of your RAM, it loads directly. If not, quantization is recommended.

Mac Unified Memory Largest Model (4-bit)
M4 Max 128GB ~80B params
M3/M4 Ultra 192GB ~120B params
M3 Ultra 512GB 670B params

Installation

pip install overflowml

# With diffusers support:
pip install overflowml[diffusers]

# With quantization:
pip install overflowml[all]

Verify installation:

python -c "import overflowml; print(overflowml.__version__)"
overflowml detect

Usage

Diffusers Pipeline (Recommended)

import torch
import overflowml
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)

# One line — auto-detects hardware, picks optimal strategy
strategy = overflowml.optimize_pipeline(pipe, model_size_gb=24)
print(strategy.summary())

result = pipe("a sunset over mountains", num_inference_steps=20)

Batch Generation with Memory Guard

from overflowml import MemoryGuard

guard = MemoryGuard(threshold=0.7)  # cleanup at 70% VRAM usage

for prompt in prompts:
    with guard:  # auto-cleans VRAM between iterations
        result = pipe(prompt)
        result.images[0].save(f"output.png")

CLI — Hardware Detection & Planning

$ overflowml detect
=== OverflowML Hardware Detection ===
Accelerator: cuda
GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
System RAM: 194GB
BF16: yes | FP8: yes

$ overflowml plan 40 --compare
Hardware: NVIDIA GeForce RTX 5090 (32GB VRAM), 194GB RAM

=== Viable Strategies ===
#  Speed     Strategy              Est VRAM  Quality Risk
1  fastest   fp16 FP8              25.3GB    minimal       <- recommended
2  medium    fp16 model cpu        28.0GB    none
3  slow      fp16 sequential cpu   3.0GB     none

Rejected:
  fp16 direct load: exceeds VRAM (46.0GB > 32GB)

=== Reasoning ===
  fp16 weight footprint: 40.0 GB
  Detected GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
  FP8 reduces model to ~22GB  fits in VRAM
  Known traps handled:
    - FP8 + CPU offload crashes on Windows  FP8 only used without offload
    - attention_slicing disabled with sequential offload
    - expandable_segments disabled (Windows WDDM)

Environment Diagnostics

$ overflowml doctor
=== OverflowML Doctor ===
Environment
  Python: 3.11.8
  Torch: 2.6.0+cu124
Hardware
  GPU: NVIDIA RTX 5090 (32GB)
  System RAM: 196GB
Checks
  [PASS] PyTorch 2.6.0 (CUDA 12.4)
  [PASS] GPU: NVIDIA RTX 5090 (32GB)
  [WARN] torchao not installed
         Fix: pip install overflowml[quantize]

Standalone Model

import overflowml

model = load_my_transformer()
strategy = overflowml.optimize_model(model, model_size_gb=14)

Proven Results

Built and battle-tested on a real production pipeline:

Metric Before OverflowML After
Time per step 530s (VRAM thrashing) 6.7s
Images generated 0/30 (crashes) 30/30
Total time Impossible 16.4 minutes
Peak VRAM 32GB (thrashing) 3GB
Reliability Crashes after 3 images Zero failures

40GB model on RTX 5090 (32GB VRAM) + 194GB RAM, sequential offload, Lightning LoRA 4-step

MoE Expert Offload — Real Benchmarks

OverflowML's MoE strategy enables running 120B+ parameter models on consumer GPUs by keeping shared layers on GPU and swapping experts from RAM:

Model Total Params Active VRAM Used RAM Used Tokens/s Strategy
Nemotron 3 Super 120B 12B 29GB (32% GPU) 63GB (68% CPU) 5.9 t/s Expert offload Q4
Nemotron 3 Nano 30B 3.6B 24GB (100% GPU) 0GB 228 t/s Full GPU

RTX 5090 (32GB VRAM) + 196GB RAM, Ollama, Q4_K_M quantization

Optimization Sweep — Finding the Optimal GPU/CPU Split

We tested 12 configurations to find the best strategy for Nemotron 3 Super on a single RTX 5090:

Config Context GPU Layers CPU/GPU Split Tokens/s
Default (auto) 32K auto 68%/32% 5.9
Reduced context 8K auto 68%/32% 5.6
Minimal context 2K auto 68%/32% 5.8
Fewer GPU layers 8K 20 77%/23% 5.3
More GPU layers 8K 40 54%/46% 1.3
All GPU (forced) 8K 99 0%/100% 0.9
32 threads + batch 1024 8K auto 68%/32% 3.0

Key findings:

  • Ollama's auto-detected 68%/32% CPU/GPU split is optimal
  • Forcing more onto GPU hurts — 100% GPU = 6.5x slower (VRAM thrashing on 86GB model in 32GB)
  • Context size barely matters — bottleneck is PCIe expert swapping, not KV cache
  • Model can't fit in 32GB at any quantization (minimum GGUF is 52.7GB due to unquantizable Mamba2 SSM state)
  • The hard limit is PCIe 5.0 bandwidth (~50 GB/s practical) for expert transfers
$ overflowml plan 120 --moe 120 12 128 8

=== Strategy for 120GB model ===
MoE: 120B total, 12B active, 128 experts (8 active)
Quantization: int4
Offload: expert_offload
Estimated peak VRAM: 14.8GB
  - MoE expert offload + INT4: 36GB total in RAM, 15GB active on GPU

Known Incompatibilities

These are automatically handled by OverflowML's strategy engine:

Combination Issue OverflowML Fix
FP8 + CPU offload (Windows) Float8Tensor can't move between devices Skips FP8, uses BF16
attention_slicing + sequential offload CUDA illegal memory access Never enables both
enable_model_cpu_offload + 40GB transformer Transformer exceeds VRAM Uses sequential offload instead
expandable_segments on Windows WDDM Not supported Gracefully ignored

Architecture

overflowml/
├── detect.py      — Hardware detection (CUDA, MPS, MLX, ROCm, CPU)
├── strategy.py    — Strategy engine (picks optimal offload + quantization)
├── optimize.py    — Applies strategy to pipelines and models
└── cli.py         — Command-line interface

Cross-Platform Support

Platform Accelerator Status
Windows + NVIDIA CUDA Production-ready
Linux + NVIDIA CUDA Production-ready
macOS + Apple Silicon MPS / MLX Detection ready, optimization in progress
Linux + AMD ROCm Planned
CPU-only CPU Fallback always works

Troubleshooting

Hardware not detected / wrong accelerator Run overflowml detect — if it shows CPU when you have a GPU, install the correct PyTorch build for your platform: pytorch.org/get-started.

FP8 errors on Windows Expected. FP8 is incompatible with CPU offload on Windows — OverflowML automatically falls back to BF16. No action needed.

OOM despite using sequential offload Your system RAM may be insufficient. Try overflowml plan <size> --no-quantize to see minimum RAM requirements. INT4 quantization (pip install overflowml[all]) reduces the RAM footprint by ~4x.

torchao import error with FP8 Install: pip install torchao>=0.5 or use pip install overflowml[quantize].

accelerate not found with enable_model_cpu_offload Install: pip install accelerate>=0.30 or use pip install overflowml[diffusers].

torch.compile skipped (Triton not installed) Not an error — OverflowML skips compilation silently. On Windows, Triton requires WSL2. On Linux, install with pip install triton.

Slow generation with sequential offload Sequential offload is intentionally slow (layers move one-at-a-time). Check overflowml plan <size> — if your RAM is large enough, it may suggest a faster strategy like model_cpu_offload.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

overflowml-0.6.0.tar.gz (52.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

overflowml-0.6.0-py3-none-any.whl (42.0 kB view details)

Uploaded Python 3

File details

Details for the file overflowml-0.6.0.tar.gz.

File metadata

  • Download URL: overflowml-0.6.0.tar.gz
  • Upload date:
  • Size: 52.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for overflowml-0.6.0.tar.gz
Algorithm Hash digest
SHA256 4f78d3b93afe71b1baf52598aef2949ce6095887129f00bd882d6fbba10723ed
MD5 187d92be7834f9f9636b5fc40e35e2e1
BLAKE2b-256 2f0679fff4bcf65c0ec24a9ff53a9dc0de881a4df0afb3e1bf37c62786511906

See more details on using hashes here.

File details

Details for the file overflowml-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: overflowml-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 42.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for overflowml-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ab0aae914fa31ee05f4defbc4ce4e0fd9766c8b2f0aa9d76a74f34ba8b9037b4
MD5 affd609a8a7d434ab157a00038396ab5
BLAKE2b-256 8859cab02ab2b655626afe4116e6e015cac940d138c395cf2f6dd79ad05df9e8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page