Agentic profiler & tuner for ML training workloads — finds the DataLoader config your GPU was waiting for.
loadtune 🚀
An agentic profiler and tuner for ML workloads.
loadtune is a deterministic, hardware-aware AI agent that autonomously runs micro-experiments to find the optimal system configuration for your ML pipeline, ensuring your GPUs are never sitting idle.
The Problem: GPU Starvation
GPUs are incredibly fast, but they often sit idle waiting for the CPU to decode and augment data.
Tuning PyTorch DataLoaders (num_workers, pin_memory, prefetch_factor, thread constraints) is tedious, poorly documented, and hardware-dependent. Guessing these settings often leads to silent GPU starvation, where expensive instances (like A100s) can waste 80% of their time waiting for data.
loadtune replaces guesswork with empirical measurement. It profiles your code, detects the exact bottleneck (Input-Bound vs Compute-Bound), and tunes the hardware mechanics to maximize samples per second.
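The Input-Bound vs Compute-Bound distinction can be illustrated by timing how long the training loop waits for each batch versus how long it spends in the step itself. The sketch below is not loadtune's profiler; `measure_data_wait`, `classify`, and the 0.5 threshold are hypothetical names and values chosen for illustration.

```python
import time
from typing import Any, Callable, Iterable


def measure_data_wait(
    batches: Iterable[Any],
    step_fn: Callable[[Any], None],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the fraction of wall time spent waiting for the next batch."""
    wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)  # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = clock()
        # On a GPU, step_fn must synchronize the device before returning,
        # otherwise asynchronous kernels make the step look free.
        step_fn(batch)
        t2 = clock()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


def classify(data_wait_fraction: float, threshold: float = 0.5) -> str:
    """Label the workload; the threshold is an assumed cut-off, not loadtune's."""
    return "input-bound" if data_wait_fraction >= threshold else "compute-bound"
```

The `clock` parameter is injectable so the measurement logic can be checked with a fake clock instead of real sleeps.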
System Parameters vs. Hyperparameters
Tools like W&B Sweeps perform Hyperparameter Optimization (HPO) to maximize model accuracy over days of training.
loadtune solves the other half: System Parameter Optimization. It sweeps hardware mechanics (num_workers, pin_memory) to maximize throughput in under 2 minutes.
The Workflow: Run loadtune once to lock in the fastest data pipeline, then plug that optimized config into your overnight W&B sweep so every run finishes faster (up to 5x on heavily input-bound workloads).
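To make the idea of a system-parameter sweep concrete, here is a minimal sketch that tries every combination of two knobs and keeps the one with the highest measured throughput. It is a plain exhaustive grid, which is an assumption made for brevity; loadtune itself chooses trials agentically, and `sweep` and the `measure` callback are hypothetical names.

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple

Config = Dict[str, object]


def sweep(
    measure: Callable[[Config], float],
    num_workers_options: Iterable[int] = (0, 2, 4, 8),
    pin_memory_options: Iterable[bool] = (False, True),
) -> Tuple[Config, float]:
    """Try every knob combination and return (best_config, best_throughput)."""
    best_config = None
    best_throughput = float("-inf")
    for workers, pin in itertools.product(num_workers_options, pin_memory_options):
        config = {"num_workers": workers, "pin_memory": pin}
        # `measure` runs a short trial and reports samples per second.
        throughput = measure(config)
        if throughput > best_throughput:  # ties keep the earlier, cheaper config
            best_config, best_throughput = config, throughput
    if best_config is None:
        raise ValueError("no configurations to try")
    return best_config, best_throughput
```

In practice `measure` would build a DataLoader with the given settings and time a few dozen steps; here it is left as a callback so the selection logic stays independent of any framework.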
Currently Supported Hardware
loadtune is hardware-aware and adjusts its heuristics based on your accelerator:
- NVIDIA GPUs (CUDA): Full support. Automatically tracks GPU memory utilization (--auto-batch), tests asynchronous transfers (pin_memory, non_blocking), and handles CUDA OOMs.
- Apple Silicon (MPS): Full pipeline tuning. loadtune recognizes the unified memory architecture (e.g., skips pin_memory as it's a no-op on Mac) and accurately synchronizes the MPS stream for honest compute timings.
- CPUs: Full pipeline tuning. Automatically limits torch.set_num_threads to prevent contention between DataLoader workers and the main process.
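The hardware-aware behavior described in this list can be pictured as a small planning function that decides which knobs are worth testing on each device. This is only a sketch under stated assumptions: `plan_knobs` is a hypothetical helper, and the "one core per worker" thread budget is an illustrative rule, not loadtune's actual heuristic.

```python
from typing import Dict


def plan_knobs(device_type: str, cpu_count: int, num_workers: int) -> Dict[str, object]:
    """Return which knob values to test on a device (illustrative rules only)."""
    if cpu_count < 1 or num_workers < 0:
        raise ValueError("cpu_count must be >= 1 and num_workers >= 0")
    # Pinned memory and non_blocking copies only help when host and device
    # memory are separate, so they are tested on CUDA and skipped elsewhere.
    is_cuda = device_type == "cuda"
    # Assumed budget: reserve one core per DataLoader worker and always keep
    # at least one thread for the main process.
    main_threads = max(1, cpu_count - num_workers)
    return {
        "pin_memory_options": [False, True] if is_cuda else [False],
        "non_blocking_options": [False, True] if is_cuda else [False],
        "main_process_threads": main_threads,
    }
```

The `main_process_threads` value is the kind of number that would be passed to `torch.set_num_threads` before a trial starts.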
Real-World Results
loadtune autonomously found these speedups in under 2 minutes of tuning:
- NVIDIA A100 (Food101 Vision): 90% data-wait (Input-bound). Scaled workers and pinned memory. 5.65x speedup (147 → 830 samples/s).
- Colab T4 (Lightning CNN): 95.7% data-wait (Highly input-bound). Handled framework overhead perfectly. 4.25x speedup (1,477 → 6,279 samples/s).
- Colab T4 (HuggingFace DistilBERT): 5.9% data-wait (Compute-bound). Recognized the edge case and applied a mild nudge (workers=2, non_blocking) for a free 1.06x speedup without wasting time testing massive worker counts.
- Apple M2 Pro (Synthetic Vision): Unified memory constraints. 2.11x speedup (199 → 421 samples/s).
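One way to read these numbers is through a simple upper bound. Assuming data-wait and compute do not overlap in the baseline and that tuning leaves compute time unchanged, removing all data-wait from a run with wait fraction w gives a speedup of at most 1 / (1 - w). The helper below is an illustrative calculation, not part of loadtune.

```python
def max_speedup_from_data_wait(data_wait_fraction: float) -> float:
    """Upper bound on speedup if all data-wait time were eliminated."""
    if not 0.0 <= data_wait_fraction < 1.0:
        raise ValueError("data_wait_fraction must be in [0, 1)")
    return 1.0 / (1.0 - data_wait_fraction)


# Data-wait fractions taken from the results listed above.
for name, wait in [
    ("A100 Food101", 0.90),
    ("T4 Lightning CNN", 0.957),
    ("T4 DistilBERT", 0.059),
]:
    print(f"{name}: at most {max_speedup_from_data_wait(wait):.2f}x")
```

Under these assumptions the DistilBERT run could gain at most about 1.06x, which is why a mild nudge is all that is worth trying there, while the two input-bound runs have bounds of roughly 10x and 23x, comfortably above the 5.65x and 4.25x that were measured.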
Getting Started
1. Setup
pip install loadtune
Optional dependencies for framework integrations:
pip install "loadtune[lightning]" # PyTorch Lightning
pip install "loadtune[nlp]" # HuggingFace Transformers
pip install "loadtune[all]" # Everything
2. How to Run (Three Scenarios)
Scenario A: PyTorch Lightning & HuggingFace (Zero Boilerplate)
If you use a high-level framework, loadtune extracts the components for you.
PyTorch Lightning:
# examples/my_lightning.py
from loadtune import from_lightning
# ... define module and datamodule ...
def get_workload():
    return from_lightning(my_lightning_module, datamodule=my_datamodule, batch_size=64)
HuggingFace Transformers:
# examples/my_hf.py
from loadtune import from_hf_trainer
# ... define model and dataset ...
def get_workload():
    return from_hf_trainer(model, dataset, tokenizer=tokenizer, batch_size=32)
Run via CLI using fast-mode (in-process trials to avoid framework import overhead):
loadtune tune examples/my_lightning.py --fast
Scenario B: Native PyTorch (Workload API)
If you write custom PyTorch loops, define a Workload dataclass that tells loadtune how to build your dataset and model and how to execute a single training step. See examples/synthetic_bottleneck.py for a full example.
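The shape of such a workload description can be sketched with a stand-in dataclass. Everything below is hypothetical: `SketchWorkload`, its field names, and `run_steps` are invented for illustration and are not loadtune's actual Workload class, whose real fields are shown in examples/synthetic_bottleneck.py.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class SketchWorkload:
    """Hypothetical stand-in; NOT loadtune's real Workload class."""

    build_dataset: Callable[[], Sequence[Any]]  # returns an indexable dataset
    build_model: Callable[[], Any]              # returns the model object
    train_step: Callable[[Any, Any], float]     # (model, batch) -> loss


def run_steps(workload: SketchWorkload, batch_size: int, steps: int) -> List[float]:
    """Build the pieces once, then run a fixed number of training steps."""
    dataset = workload.build_dataset()
    if len(dataset) == 0:
        raise ValueError("dataset is empty")
    model = workload.build_model()
    losses = []
    for i in range(steps):
        # Wrap around the dataset so short datasets still yield enough steps.
        start = (i * batch_size) % len(dataset)
        batch = dataset[start:start + batch_size]
        losses.append(workload.train_step(model, batch))
    return losses
```

Keeping the three pieces as separate callables is what lets a tuner rebuild the data pipeline with different settings while reusing the same model and step logic.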
Scenario C: The Python API (Notebooks & CI)
You can profile and tune directly from Python scripts without using the CLI:
from loadtune import tune
from loadtune.workload import load_workload
# Load and tune a workload autonomously
workload = load_workload("examples/my_workload.py")
result = tune(workload, steps=50, max_trials=6, auto_batch=True)
print(f"Best Config: {result.best.knobs.label()} — {result.speedup:.2f}x baseline")
print(result.diagnosis)
Advanced Features
- Auto-Batching (--auto-batch): If you are compute-bound but your GPU memory utilization is low, loadtune autonomously proposes batch-size doubling until you hit ~80% VRAM utilization. Catches OOMs gracefully.
- Auto-Apply (--apply): Generates a loadtune_apply.py code snippet containing the best configuration found so you can easily import it into your project.
- Fast Mode (--fast): Runs trials in-process instead of spawning fresh subprocesses. Drastically reduces trial startup overhead for massive models.
- Loss Parity Check: Dynamically verifies that semantics-changing configurations (like precision or batch size) don't break mathematical convergence.
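The batch-size doubling described under Auto-Batching can be sketched as a loop that keeps doubling while a probe reports memory utilization below the target, and falls back to the last size that fit when a trial runs out of memory. `grow_batch_size` and the `probe` callback are hypothetical names, and using `MemoryError` as the out-of-memory signal is an assumption; a real probe would translate the CUDA out-of-memory error into it.

```python
from typing import Callable, Optional


def grow_batch_size(
    probe: Callable[[int], float],
    start: int,
    target_utilization: float = 0.80,
    max_batch_size: int = 65536,
) -> int:
    """Double the batch size while peak memory utilization stays below target.

    `probe(batch_size)` runs a short trial and returns peak memory utilization
    in [0, 1], or raises MemoryError when the trial does not fit.
    """
    if not 1 <= start <= max_batch_size:
        raise ValueError("start must be between 1 and max_batch_size")
    best: Optional[int] = None
    candidate = start
    while candidate <= max_batch_size:
        try:
            utilization = probe(candidate)
        except MemoryError:
            break  # out of memory: keep the last size that fit
        best = candidate
        if utilization >= target_utilization:
            break  # close enough to the target, stop growing
        candidate *= 2
    if best is None:
        raise MemoryError("even the starting batch size does not fit")
    return best
```

Because a larger batch changes training semantics, a result from a loop like this is the kind of configuration the Loss Parity Check is meant to verify.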
Next Steps: The Vision
loadtune is evolving into the definitive Agentic SRE (Site Reliability Engineer) for Machine Learning, split into two core disciplines:
1. loadtune train (Currently Complete)
Goal: Optimizing GPU utilization during R&D and model training.
- ✅ Phase 1: Input-pipeline tuning & worker sweeping.
- ✅ Phase 2: Cloud GPU & asynchronous memory evaluation.
- ✅ Phase 3: GPU memory profiling & automatic batch-scaling.
- ✅ Phase 4: Framework adapters (Lightning, HuggingFace).
2. loadtune serve (Upcoming Phase 5)
Goal: Optimizing server costs in production inference workloads.
- The Challenge: Maximize throughput (Requests/sec) without violating strict latency SLAs (e.g., p99 latency < 100ms).
- The Strategy: Agentic tuning of inference engines (vLLM, Triton, TorchServe).
- The Knobs: Autonomously tuning dynamic batching windows, KV-cache block sizes, quantization precision, and maximum concurrency limits based on synthetic HTTP traffic profiling.
License
MIT