Agentic profiler & tuner for ML training workloads — finds the DataLoader config your GPU was waiting for.

loadtune 🚀

An agentic profiler and tuner for ML workloads.

loadtune is a deterministic, hardware-aware AI agent that autonomously runs micro-experiments to find the optimal system configuration for your ML pipeline, ensuring your GPUs are never sitting idle.

The Problem: GPU Starvation

GPUs are incredibly fast, but they often sit idle waiting for the CPU to decode and augment data. Tuning PyTorch DataLoaders (num_workers, pin_memory, prefetch_factor, thread constraints) is tedious, poorly documented, and hardware-dependent. Guessing these settings often leads to silent GPU starvation, where expensive instances (like A100s) can waste 80% of their time waiting for data.
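
The DataLoader knobs named above (num_workers, pin_memory, prefetch_factor) are keyword arguments of torch.utils.data.DataLoader. As a minimal sketch, the helper below builds such a kwargs dict while respecting one constraint that is easy to trip over: prefetch_factor and persistent_workers only apply when num_workers is greater than 0. The helper name build_loader_kwargs and the example values are illustrative assumptions, not part of loadtune.

```python
from typing import Any, Dict


def build_loader_kwargs(num_workers: int, pin_memory: bool,
                        prefetch_factor: int = 2) -> Dict[str, Any]:
    """Return keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; the values are examples,
    not recommendations.
    """
    kwargs: Dict[str, Any] = {"num_workers": num_workers, "pin_memory": pin_memory}
    if num_workers > 0:
        # These two options configure worker processes, so they are only
        # passed when worker processes exist.
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True
    return kwargs


untuned = build_loader_kwargs(num_workers=0, pin_memory=False)
candidate = build_loader_kwargs(num_workers=8, pin_memory=True, prefetch_factor=4)
# With PyTorch installed: DataLoader(dataset, batch_size=64, **candidate)
```

Which of these two configurations is faster depends on the machine, which is why the values have to be measured rather than guessed.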

loadtune replaces guesswork with empirical measurement. It profiles your code, detects the exact bottleneck (Input-Bound vs Compute-Bound), and tunes the hardware mechanics to maximize samples per second.
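
To make the Input-Bound vs Compute-Bound distinction concrete, here is a minimal sketch of how a data-wait fraction can be measured around any batch iterator. It is not loadtune's profiler: the function names, the injectable clock, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import time
from typing import Any, Callable, Iterable


def data_wait_fraction(batches: Iterable[Any], step_fn: Callable[[Any], Any],
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Fraction of loop time spent waiting for the next batch."""
    wait = compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)  # time spent here is data-wait
        except StopIteration:
            break
        t1 = clock()
        # On an accelerator, step_fn must block until the device finishes,
        # otherwise compute time is under-counted.
        step_fn(batch)
        t2 = clock()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


def classify(fraction: float, threshold: float = 0.5) -> str:
    """Label a run; the 0.5 threshold is an assumed cut-off."""
    return "input-bound" if fraction >= threshold else "compute-bound"


# Deterministic demo with a scripted clock: each batch takes 9 time units
# to arrive and 1 unit to process.
ticks = iter([0.0, 9.0, 10.0, 10.0, 19.0, 20.0, 20.0])
demo_fraction = data_wait_fraction([1, 2], lambda b: None, clock=lambda: next(ticks))
demo_label = classify(demo_fraction)
```

In real use the default clock (time.perf_counter) replaces the scripted one, and batches is the DataLoader itself.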

System Parameters vs. Hyperparameters

Tools like W&B Sweeps perform Hyperparameter Optimization (HPO) to maximize model accuracy over days of training.

loadtune solves the other half: System Parameter Optimization. It sweeps hardware mechanics (num_workers, pin_memory) to maximize throughput in 2 minutes.
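
A system-parameter sweep can be pictured as a search over a small grid in which every point is scored by measured throughput. The sketch below is the simplest exhaustive version and is not loadtune's search strategy; grid_sweep and fake_measure are hypothetical names, and the throughput numbers are made up so the example runs without a GPU.

```python
import itertools
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, object]


def grid_sweep(grid: Dict[str, List[object]],
               measure: Callable[[Config], float]) -> Tuple[Optional[Config], float]:
    """Score every combination in `grid`; return the highest-throughput config."""
    names = list(grid)
    best_cfg: Optional[Config] = None
    best_rate = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        rate = measure(cfg)  # samples per second for this configuration
        if rate > best_rate:
            best_cfg, best_rate = cfg, rate
    return best_cfg, best_rate


def fake_measure(cfg: Config) -> float:
    """Stand-in for timing a few training steps (made-up numbers)."""
    rate = 100.0 * min(int(cfg["num_workers"]) + 1, 5)  # saturates at 4 workers
    return rate * (1.1 if cfg["pin_memory"] else 1.0)


best, rate = grid_sweep(
    {"num_workers": [0, 2, 4, 8], "pin_memory": [False, True]},
    fake_measure,
)
```

Because ties keep the earlier candidate, the sweep settles on the smallest worker count that reaches peak throughput in this toy model.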

The Workflow: Run loadtune once to lock in the fastest data pipeline, then plug that optimized config into your overnight W&B sweep so every run finishes faster (up to 5.65x in the results below).

Currently Supported Hardware

loadtune is hardware-aware and adjusts its heuristics based on your accelerator:

  • NVIDIA GPUs (CUDA): Full support. Automatically tracks GPU memory utilization (--auto-batch), tests asynchronous transfers (pin_memory, non_blocking), and handles CUDA OOMs.
  • Apple Silicon (MPS): Full pipeline tuning. loadtune recognizes the unified memory architecture (e.g., skips pin_memory as it's a no-op on Mac) and accurately synchronizes the MPS stream for honest compute timings.
  • CPUs: Full pipeline tuning. Automatically limits torch.set_num_threads to prevent contention between DataLoader workers and the main process.
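
The hardware-specific behavior above can be sketched as two small decisions: which pin_memory values are worth testing on a given device, and how many threads to leave for the main process once DataLoader workers are running. Both functions are hypothetical illustrations. In particular, the "cores minus workers" rule is an assumed heuristic, not loadtune's documented formula.

```python
import os
from typing import List, Optional


def pin_memory_candidates(device_type: str) -> List[bool]:
    """Values of pin_memory worth testing on a device type."""
    if device_type == "cuda":
        return [False, True]
    # On MPS the text above describes pin_memory as a no-op; treating
    # CPU-only runs the same way is an assumption of this sketch.
    return [False]


def main_process_threads(num_workers: int, cpu_count: Optional[int] = None) -> int:
    """Thread budget for the main process (assumed heuristic)."""
    cores = cpu_count or os.cpu_count() or 1
    # Leave one core per worker process, but never drop below one thread.
    return max(1, cores - num_workers)


cuda_values = pin_memory_candidates("cuda")
mps_values = pin_memory_candidates("mps")
threads = main_process_threads(num_workers=6, cpu_count=8)
# With PyTorch installed, the budget could be applied via torch.set_num_threads(threads).
```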

Real-World Results

loadtune autonomously found these speedups in under 2 minutes of tuning:

  • NVIDIA A100 (Food101 Vision): 90% data-wait (Input-bound). Scaled workers and pinned memory. 5.65x speedup (147 → 830 samples/s).
  • Colab T4 (Lightning CNN): 95.7% data-wait (Highly input-bound). Handled framework overhead perfectly. 4.25x speedup (1,477 → 6,279 samples/s).
  • Colab T4 (HuggingFace DistilBERT): 5.9% data-wait (Compute-bound). Recognized the edge case and applied a mild nudge (workers=2, non_blocking) for a free 1.06x speedup without wasting time testing massive worker counts.
  • Apple M2 Pro (Synthetic Vision): Unified memory constraints. 2.11x speedup (199 → 421 samples/s).
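
One way to sanity-check these numbers is an Amdahl-style bound. If a fraction w of each step is data-wait, and tuning hides all of it without changing compute time, the best possible speedup is 1 / (1 - w). The sketch below applies that bound to the three runs that report a data-wait percentage. It assumes the percentages are fractions of baseline step time, and it does not apply when the batch size changes.

```python
def max_speedup_from_hiding_wait(data_wait: float) -> float:
    """Upper bound on speedup if all data-wait disappears (compute unchanged)."""
    if not 0.0 <= data_wait < 1.0:
        raise ValueError("data_wait must be in [0, 1)")
    return 1.0 / (1.0 - data_wait)


# (data-wait fraction, reported speedup) taken from the list above.
reported = {
    "A100 Food101": (0.90, 5.65),
    "T4 Lightning CNN": (0.957, 4.25),
    "T4 DistilBERT": (0.059, 1.06),
}

bounds = {name: max_speedup_from_hiding_wait(wait) for name, (wait, _) in reported.items()}

for name, (wait, measured) in reported.items():
    print(f"{name}: measured {measured:.2f}x, bound {bounds[name]:.2f}x")
```

Under these assumptions the compute-bound DistilBERT run has a ceiling of roughly 1.06x, so its small gain is close to everything the input pipeline could offer. The two input-bound runs still sit below their ceilings.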

Getting Started

1. Setup

pip install loadtune

Optional dependencies for framework integrations:

pip install "loadtune[lightning]"  # PyTorch Lightning
pip install "loadtune[nlp]"        # HuggingFace Transformers
pip install "loadtune[all]"        # Everything

2. How to Run (Three Scenarios)

Scenario A: PyTorch Lightning & HuggingFace (Zero Boilerplate)

If you use a high-level framework, loadtune extracts the components for you.

PyTorch Lightning:

# examples/my_lightning.py
from loadtune import from_lightning

# ... define module and datamodule ...
def get_workload():
    return from_lightning(my_lightning_module, datamodule=my_datamodule, batch_size=64)

HuggingFace Transformers:

# examples/my_hf.py
from loadtune import from_hf_trainer

# ... define model and dataset ...
def get_workload():
    return from_hf_trainer(model, dataset, tokenizer=tokenizer, batch_size=32)

Run from the CLI in fast mode (--fast), which runs trials in-process to avoid framework import overhead:

loadtune tune examples/my_lightning.py --fast

Scenario B: Native PyTorch (Workload API)

If you write custom PyTorch loops, define a Workload dataclass that tells loadtune how to build your dataset and model and how to execute a single training step. See examples/synthetic_bottleneck.py for a full example.
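
The general shape of such a workload is three callables: one that builds the dataset, one that builds the model, and one that runs a training step. The sketch below is a standalone stand-in and deliberately does not import from loadtune. SketchWorkload, its field names, and run_steps are hypothetical, and the real Workload fields are whatever examples/synthetic_bottleneck.py shows.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Any, Callable, Iterable, List


@dataclass
class SketchWorkload:
    """Hypothetical stand-in; NOT loadtune's Workload class."""
    build_dataset: Callable[[], Iterable[Any]]
    build_model: Callable[[], Any]
    train_step: Callable[[Any, Any], float]  # (model, batch) -> loss


def run_steps(workload: SketchWorkload, steps: int) -> List[float]:
    """Build everything once, then run a fixed number of training steps."""
    dataset = workload.build_dataset()
    model = workload.build_model()
    return [workload.train_step(model, batch) for batch in islice(dataset, steps)]


# Toy workload in pure Python so the sketch runs without PyTorch.
toy = SketchWorkload(
    build_dataset=lambda: [1.0, 2.0, 3.0],
    build_model=lambda: {"weight": 2.0},
    train_step=lambda model, batch: (model["weight"] * batch - batch) ** 2,
)
losses = run_steps(toy, steps=2)
```

Keeping construction behind callables lets a tuner rebuild the pipeline for each trial with different settings.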

Scenario C: The Python API (Notebooks & CI)

You can profile and tune directly from Python scripts without using the CLI:

from loadtune import tune
from loadtune.workload import load_workload

# Load and tune a workload autonomously
workload = load_workload("examples/my_workload.py")
result = tune(workload, steps=50, max_trials=6, auto_batch=True)

print(f"Best Config: {result.best.knobs.label()} ({result.speedup:.2f}x baseline)")
print(result.diagnosis)

Advanced Features

  • Auto-Batching (--auto-batch): If you are compute-bound but your GPU memory utilization is low, loadtune autonomously proposes batch-size doubling until you hit ~80% VRAM utilization. Catches OOMs gracefully.
  • Auto-Apply (--apply): Generates a loadtune_apply.py code snippet containing the best configuration found so you can easily import it into your project.
  • Fast Mode (--fast): Runs trials in-process instead of spawning fresh subprocesses. Drastically reduces trial startup overhead for massive models.
  • Loss Parity Check: Dynamically verifies that semantics-changing configurations (like precision or batch size) don't break mathematical convergence.
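
The auto-batching and loss-parity ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not loadtune's implementation: probe is a hypothetical callable that returns memory utilization for a batch size and raises MemoryError on OOM, and the parity check compares mean loss with an assumed 5% relative tolerance.

```python
import math
from statistics import fmean
from typing import Callable, Sequence


def scale_batch_size(start: int, probe: Callable[[int], float],
                     target: float = 0.80, max_doublings: int = 8) -> int:
    """Double the batch size until utilization reaches `target` or OOM occurs."""
    batch_size = start
    utilization = probe(batch_size)
    for _ in range(max_doublings):
        if utilization >= target:
            break
        candidate = batch_size * 2
        try:
            candidate_utilization = probe(candidate)
        except MemoryError:
            break  # keep the last batch size that fit
        batch_size, utilization = candidate, candidate_utilization
    return batch_size


def losses_match(baseline: Sequence[float], candidate: Sequence[float],
                 rel_tol: float = 0.05) -> bool:
    """Assumed parity rule: mean losses agree within a relative tolerance."""
    return math.isclose(fmean(baseline), fmean(candidate), rel_tol=rel_tol)


def fake_probe(batch_size: int) -> float:
    """Made-up memory model: OOM above 300 samples, linear utilization below."""
    if batch_size > 300:
        raise MemoryError("simulated out-of-memory")
    return batch_size / 1000


chosen = scale_batch_size(64, fake_probe)
parity_ok = losses_match([1.0, 0.8, 0.6], [1.02, 0.79, 0.61])
```

A real probe would run a few training steps on the device and translate the framework's out-of-memory error into the exception the loop expects.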

Next Steps: The Vision

loadtune is evolving into the definitive Agentic SRE (Site Reliability Engineer) for Machine Learning, split into two core disciplines:

1. loadtune train (Currently Complete)

Goal: Optimizing GPU utilization during R&D and model training.

  • ✅ Phase 1: Input-pipeline tuning & worker sweeping.
  • ✅ Phase 2: Cloud GPU & asynchronous memory evaluation.
  • ✅ Phase 3: GPU memory profiling & automatic batch-scaling.
  • ✅ Phase 4: Framework adapters (Lightning, HuggingFace).

2. loadtune serve (Upcoming Phase 5)

Goal: Optimizing server costs in production inference workloads.

  • The Challenge: Maximize throughput (Requests/sec) without violating strict latency SLAs (e.g., p99 latency < 100ms).
  • The Strategy: Agentic tuning of inference engines (vLLM, Triton, TorchServe).
  • The Knobs: Autonomously tuning dynamic batching windows, KV-cache block sizes, quantization precision, and maximum concurrency limits based on synthetic HTTP traffic profiling.
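
The objective described above is a constrained maximization: pick the configuration with the highest throughput among those whose p99 latency stays under the SLA. The sketch below shows that selection rule only. The configuration names and measurements are invented, and because loadtune serve is not yet released, nothing here reflects its actual interface.

```python
from typing import Dict, List, Optional, Tuple


def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def pick_config(trials: Dict[str, Tuple[float, List[float]]],
                sla_ms: float = 100.0) -> Optional[str]:
    """Highest requests/sec among configs whose p99 latency is under the SLA."""
    best_name: Optional[str] = None
    best_rps = 0.0
    for name, (requests_per_sec, latencies_ms) in trials.items():
        if p99(latencies_ms) < sla_ms and requests_per_sec > best_rps:
            best_name, best_rps = name, requests_per_sec
    return best_name


# Invented measurements: (requests/sec, per-request latencies in ms).
trials = {
    "batch_window_2ms": (400.0, [20.0] * 99 + [60.0]),
    "batch_window_5ms": (650.0, [30.0] * 99 + [95.0]),
    "batch_window_10ms": (900.0, [40.0] * 98 + [150.0] * 2),
}
winner = pick_config(trials)
```

The widest batching window has the best raw throughput here but violates the latency limit, so the middle setting wins.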

License

MIT
