
OverflowML

Run AI models larger than your GPU. One line of code.

OverflowML auto-detects your hardware (NVIDIA, Apple Silicon, AMD, CPU) and applies the optimal memory strategy to load and run models that don't fit in VRAM. No manual configuration needed.

import overflowml

pipe = load_your_model()  # 40GB model, 24GB GPU? No problem.
overflowml.optimize_pipeline(pipe, model_size_gb=40)
result = pipe(prompt)     # Just works.

The Problem

AI models are getting bigger. A single image generation model can be 40GB+. LLMs regularly hit 70GB-400GB. But most GPUs have 8-24GB of VRAM.

The current solutions are painful:

  • Manual offloading — you need to know which PyTorch function to call, which flags work together, and which combinations crash
  • Quantization footguns — FP8 is incompatible with CPU offload on Windows. Attention slicing crashes with sequential offload. INT4 needs specific libraries.
  • Trial and error — every hardware/model/framework combo has different gotchas

OverflowML handles all of this automatically.

How It Works

Model: 40GB (BF16)          Your GPU: 24GB VRAM
         │                           │
    OverflowML detects mismatch      │
         │                           │
    ┌────▼───────────────────────────▼─────┐
    │  Strategy: Sequential CPU Offload    │
    │  Move 1 layer (~1GB) to GPU at a     │
    │  time, compute, move back.           │
    │  Peak VRAM: ~3GB                     │
    │  System RAM used: ~40GB              │
    │  Speed: 33s/image (RTX 5090)         │
    └──────────────────────────────────────┘
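The diagram boils down to a move-compute-move loop over layers. A torch-free sketch (the function name and the 2GB activation figure are illustrative assumptions, not OverflowML internals) shows why peak VRAM stays near one layer's size regardless of total model size:

```python
def run_sequentially(layers_gb, activation_gb=2.0):
    """Simulate per-layer sequential offload: only one layer (plus
    activations) is ever resident on the GPU at a time, so peak VRAM
    is roughly max layer size + activation size."""
    peak = 0.0
    for layer_gb in layers_gb:
        resident = layer_gb + activation_gb  # layer moved in, compute runs
        peak = max(peak, resident)           # track worst-case residency
        # layer is moved back to system RAM here; only activations remain
    return peak

# 40 layers of ~1GB each -> peak stays near 3GB, not 40GB
print(run_sequentially([1.0] * 40))
```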

Strategy Decision Tree

| Model vs VRAM | Strategy | Peak VRAM | Speed |
|---|---|---|---|
| Model fits with 15% headroom | Direct GPU load | Full | Fastest |
| FP8 model fits | FP8 quantization | ~55% of model | Fast |
| Components fit individually | Model CPU offload | ~70% of model | Medium |
| Nothing fits | Sequential CPU offload | ~3GB | Slower but works |
| Not enough RAM either | INT4 quantization + sequential | ~3GB | Slowest |
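The table reads as a cascade of fit checks. A simplified sketch of that cascade (the function `pick_strategy` and its exact thresholds are illustrative, not OverflowML's real engine):

```python
def pick_strategy(model_gb, vram_gb, ram_gb, fp8_ok=True):
    """Illustrative version of the decision table above."""
    if model_gb * 1.15 <= vram_gb:            # fits with 15% headroom
        return "direct_gpu"
    if fp8_ok and model_gb * 0.55 <= vram_gb:  # FP8 cuts footprint to ~55%
        return "fp8_quantization"
    if model_gb * 0.70 <= vram_gb:            # proxy: components fit individually
        return "model_cpu_offload"
    if model_gb <= ram_gb:                    # whole model fits in system RAM
        return "sequential_cpu_offload"
    return "int4_plus_sequential"             # last resort

# 40GB model, 24GB VRAM, FP8 ruled out (e.g. Windows + CPU offload):
print(pick_strategy(40, 24, 194, fp8_ok=False))  # sequential_cpu_offload
```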

Apple Silicon (Unified Memory)

On Macs, CPU and GPU share the same memory pool — there's nothing to "offload." OverflowML detects this and skips offloading entirely. If the model fits in ~75% of your RAM, it loads directly. If not, quantization is recommended.

| Mac | Unified Memory | Largest Model (4-bit) |
|---|---|---|
| M4 Max | 128GB | ~80B params |
| M3/M4 Ultra | 192GB | ~120B params |
| M3 Ultra | 512GB | 670B params |
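The ~75% rule from the paragraph above can be sketched as a single budget check (a hypothetical helper, not OverflowML's API):

```python
def unified_memory_plan(model_gb, unified_memory_gb, headroom=0.75):
    """On Apple Silicon there is nothing to offload: either the model
    fits in ~75% of unified memory, or quantization is the only lever.
    Illustrative sketch only."""
    budget = unified_memory_gb * headroom
    return "load_directly" if model_gb <= budget else "quantize"

print(unified_memory_plan(70, 128))  # 70 <= 96  -> load_directly
print(unified_memory_plan(70, 64))   # 70 > 48   -> quantize
```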

Installation

pip install overflowml

# With diffusers support:
pip install overflowml[diffusers]

# With quantization:
pip install overflowml[all]

Usage

Diffusers Pipeline (Recommended)

import torch
import overflowml
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)

# One line — auto-detects hardware, picks optimal strategy
strategy = overflowml.optimize_pipeline(pipe, model_size_gb=24)
print(strategy.summary())

result = pipe("a sunset over mountains", num_inference_steps=20)

Batch Generation with Memory Guard

from overflowml import MemoryGuard

guard = MemoryGuard(threshold=0.7)  # cleanup at 70% VRAM usage

for i, prompt in enumerate(prompts):
    with guard:  # auto-cleans VRAM between iterations
        result = pipe(prompt)
        result.images[0].save(f"output_{i}.png")
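Conceptually, such a guard checks memory pressure on exit from the `with` block and frees caches when a threshold is crossed. A torch-free sketch (class name and injectable hooks are assumptions for illustration; a real CUDA version would probe `torch.cuda.memory_allocated()` and call `torch.cuda.empty_cache()`):

```python
import gc

class SimpleMemoryGuard:
    """Illustrative stand-in for a threshold-based cleanup context manager.
    usage_fn returns the fraction of VRAM in use (0..1); cleanup_fn frees
    cached memory. Both are injectable so this sketch runs without a GPU."""

    def __init__(self, threshold, usage_fn, cleanup_fn):
        self.threshold = threshold
        self.usage_fn = usage_fn
        self.cleanup_fn = cleanup_fn
        self.cleanups = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.usage_fn() > self.threshold:
            gc.collect()          # drop Python references first
            self.cleanup_fn()     # then release the cached memory
            self.cleanups += 1
        return False              # never swallow exceptions

# Simulated usage: VRAM at 90%, cleanup drops it to 10%
usage = {"frac": 0.9}
guard = SimpleMemoryGuard(0.7, lambda: usage["frac"],
                          lambda: usage.update(frac=0.1))
with guard:
    pass  # work happens here
print(guard.cleanups)  # 1
```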

CLI — Hardware Detection

$ overflowml detect

=== OverflowML Hardware Detection ===
Accelerator: cuda
GPU: NVIDIA GeForce RTX 5090 (32GB VRAM)
System RAM: 194GB
Overflow capacity: 178GB (total effective: 210GB)
BF16: yes | FP8: yes

$ overflowml plan 40

=== Strategy for 40GB model ===
Offload: sequential_cpu
Dtype: bfloat16
GC cleanup: enabled (threshold 70%)
Estimated peak VRAM: 3.0GB
   Sequential offload: 1 layer at a time (~3GB VRAM), model lives in 194GB RAM
WARNING: FP8 incompatible with CPU offload on Windows
WARNING: Do NOT enable attention_slicing with sequential offload

Standalone Model

import overflowml

model = load_my_transformer()
strategy = overflowml.optimize_model(model, model_size_gb=14)

Proven Results

Built and battle-tested on a real production pipeline:

| Metric | Before OverflowML | After |
|---|---|---|
| Time per step | 530s (VRAM thrashing) | 6.7s |
| Images generated | 0/30 (crashes) | 30/30 |
| Total time | Impossible | 16.4 minutes |
| Peak VRAM | 32GB (thrashing) | 3GB |
| Reliability | Crashes after 3 images | Zero failures |

40GB model on RTX 5090 (32GB VRAM) + 194GB RAM, sequential offload, Lightning LoRA 4-step

Known Incompatibilities

These are automatically handled by OverflowML's strategy engine:

| Combination | Issue | OverflowML Fix |
|---|---|---|
| FP8 + CPU offload (Windows) | Float8Tensor can't move between devices | Skips FP8, uses BF16 |
| attention_slicing + sequential offload | CUDA illegal memory access | Never enables both |
| enable_model_cpu_offload + 40GB transformer | Transformer exceeds VRAM | Uses sequential offload instead |
| expandable_segments on Windows WDDM | Not supported | Gracefully ignored |
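A strategy engine can encode such a table as data and sanitize the flag set before applying anything. A minimal sketch, assuming a hypothetical rule format (names like `sanitize` and `BAD_COMBOS` are not OverflowML's API):

```python
# Each rule: (set of flags that conflict, platform restriction or None, fix)
BAD_COMBOS = [
    ({"fp8", "cpu_offload"}, "windows", "use_bf16"),
    ({"attention_slicing", "sequential_offload"}, None, "drop_attention_slicing"),
]

def sanitize(flags, platform):
    """Rewrite a requested flag set so no known-bad combination survives."""
    flags = set(flags)
    for combo, plat, fix in BAD_COMBOS:
        if combo <= flags and (plat is None or plat == platform):
            if fix == "use_bf16":
                flags.discard("fp8")
                flags.add("bf16")
            elif fix == "drop_attention_slicing":
                flags.discard("attention_slicing")
    return flags

print(sorted(sanitize({"fp8", "cpu_offload"}, "windows")))  # ['bf16', 'cpu_offload']
```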

Architecture

overflowml/
├── detect.py      — Hardware detection (CUDA, MPS, MLX, ROCm, CPU)
├── strategy.py    — Strategy engine (picks optimal offload + quantization)
├── optimize.py    — Applies strategy to pipelines and models
└── cli.py         — Command-line interface

Cross-Platform Support

| Platform | Accelerator | Status |
|---|---|---|
| Windows + NVIDIA | CUDA | Production-ready |
| Linux + NVIDIA | CUDA | Production-ready |
| macOS + Apple Silicon | MPS / MLX | Detection ready, optimization in progress |
| Linux + AMD | ROCm | Planned |
| CPU-only | CPU | Fallback always works |

License

MIT
