Agentic profiler & tuner for ML training workloads — finds the DataLoader config your GPU was waiting for.

loadtune 🚀

An agentic profiler and tuner for ML workloads.

loadtune is a deterministic, hardware-aware AI agent that autonomously runs micro-experiments to find the fastest system configuration it can measure for your ML pipeline, so your GPUs spend less time sitting idle.

The Problem: GPU Starvation

GPUs are incredibly fast, but they often sit idle waiting for the CPU to decode and augment data. Tuning PyTorch DataLoaders (num_workers, pin_memory, prefetch_factor, thread constraints) is tedious and hardware-dependent, and practical guidance is scarce. Guessing these settings often leads to silent GPU starvation, where expensive instances (like A100s) can spend 80% or more of their time waiting for data.

loadtune replaces guesswork with empirical measurement. It profiles your code, detects the exact bottleneck (Input-Bound vs Compute-Bound), and tunes the hardware mechanics to maximize samples per second.
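The input-bound versus compute-bound distinction comes down to where wall-clock time goes in each training step. The following is a minimal sketch of that measurement in plain Python, not loadtune's implementation. The names `measure_data_wait` and `classify`, the 50% threshold, and the sleep-based stand-in loader are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_data_wait(batches: Iterable, train_step: Callable[[object], None]) -> Tuple[float, float]:
    """Return (seconds blocked waiting for batches, seconds spent in train_step)."""
    data_s = 0.0
    compute_s = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        # With a real GPU step, synchronize the device (for example
        # torch.cuda.synchronize()) before reading the clock, because
        # kernels are launched asynchronously.
        train_step(batch)
        t2 = time.perf_counter()
        data_s += t1 - t0
        compute_s += t2 - t1
    return data_s, compute_s


def classify(data_s: float, compute_s: float, threshold: float = 0.5) -> Tuple[float, str]:
    """Label a run by its data-wait fraction. The threshold is an assumed value."""
    total = data_s + compute_s
    fraction = data_s / total if total > 0 else 0.0
    label = "input-bound" if fraction >= threshold else "compute-bound"
    return fraction, label


def slow_loader(n_batches: int, delay_s: float):
    """Stand-in for a starved DataLoader: each batch takes delay_s to arrive."""
    for i in range(n_batches):
        time.sleep(delay_s)
        yield i


def cheap_step(batch) -> None:
    """Stand-in for a training step that is much cheaper than loading."""
    sum(range(1000))


data_s, compute_s = measure_data_wait(slow_loader(5, 0.02), cheap_step)
frac, label = classify(data_s, compute_s)
print(f"data-wait {frac:.0%} -> {label}")
```

A high data-wait fraction suggests that loader settings are worth tuning; a low one suggests the model itself is the limit.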

System Parameters vs. Hyperparameters

Tools like W&B Sweeps perform Hyperparameter Optimization (HPO) to maximize model accuracy over days of training.

loadtune solves the other half: System Parameter Optimization. It sweeps hardware mechanics (num_workers, pin_memory) to maximize throughput in 2 minutes.

The Workflow: Run loadtune once to lock in the fastest data pipeline it finds, then plug that config into your overnight W&B sweep so every run benefits from the higher throughput (up to roughly 5x on input-bound workloads; see Real-World Results).
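To make the idea of sweeping system parameters concrete, here is a small sketch of an exhaustive sweep over two knobs. It is a simplification: loadtune's own search strategy is not described here, and `sweep`, `toy_throughput`, and the numbers inside the toy model are hypothetical stand-ins for real timed trials.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def sweep(grid: Dict[str, list], measure: Callable[[dict], float]) -> Tuple[dict, float, List[Tuple[float, dict]]]:
    """Measure every combination in `grid`; return the best config, its score, and all results."""
    keys = list(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((measure(config), config))
    best_throughput, best_config = max(results, key=lambda r: r[0])
    return best_config, best_throughput, results


def toy_throughput(config: dict) -> float:
    """Synthetic samples/s model (assumed): workers help up to 4, then add overhead."""
    workers = config["num_workers"]
    gain = min(workers, 4) * 120.0 - max(workers - 4, 0) * 10.0
    pin_bonus = 50.0 if config["pin_memory"] else 0.0
    return 150.0 + gain + pin_bonus


grid = {"num_workers": [0, 2, 4, 8], "pin_memory": [False, True]}
best_config, best_tp, results = sweep(grid, toy_throughput)
baseline = toy_throughput({"num_workers": 0, "pin_memory": False})
print(best_config, f"{best_tp / baseline:.2f}x over baseline")
```

In a real run, `measure` would time a fixed number of training steps with the given settings rather than evaluate a formula.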

Currently Supported Hardware

loadtune is hardware-aware and adjusts its heuristics based on your accelerator:

NVIDIA GPUs (CUDA): Full support. Automatically tracks GPU memory utilization (--auto-batch), tests asynchronous transfers (pin_memory, non_blocking), and handles CUDA OOMs.
Apple Silicon (MPS): Full pipeline tuning. loadtune recognizes the unified memory architecture (e.g., skips pin_memory as it's a no-op on Mac) and accurately synchronizes the MPS stream for honest compute timings.
CPUs: Full pipeline tuning. Automatically limits torch.set_num_threads to prevent contention between DataLoader workers and the main process.
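The CPU case is essentially a core-budgeting problem: DataLoader workers and the main process's intra-op threads compete for the same cores. The sketch below shows one simple budgeting rule. The rule itself (one core per worker, the remainder for the main process) and the function name `main_process_threads` are assumptions, not loadtune's documented heuristic.

```python
import os
from typing import Optional


def main_process_threads(num_workers: int, total_cores: Optional[int] = None) -> int:
    """Threads left for the main process after reserving one core per worker.

    Always returns at least 1 so the main process can still make progress.
    """
    cores = total_cores or os.cpu_count() or 1
    return max(1, cores - num_workers)


# Example budget on an 8-core machine with 4 DataLoader workers.
threads = main_process_threads(num_workers=4, total_cores=8)
print(threads)

# In a PyTorch script, the result would be passed to the real call
# torch.set_num_threads(threads) before the training loop starts.
```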

Real-World Results

loadtune autonomously found these speedups in under 2 minutes of tuning:

NVIDIA A100 (Food101 Vision): 90% data-wait (Input-bound). Scaled workers and pinned memory. 5.65x speedup (147 → 830 samples/s).
Colab T4 (Lightning CNN): 95.7% data-wait (Highly input-bound). Handled framework overhead perfectly. 4.25x speedup (1,477 → 6,279 samples/s).
Colab T4 (HuggingFace DistilBERT): 5.9% data-wait (Compute-bound). Recognized the edge case and applied a mild nudge (workers=2, non_blocking) for a free 1.06x speedup without wasting time testing massive worker counts.
Apple M2 Pro (Synthetic Vision): Unified memory constraints. 2.11x speedup (199 → 421 samples/s).
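The figures above can be sanity-checked with a little arithmetic. Speedup is simply the ratio of throughputs, and the data-wait fraction gives an idealized ceiling on what pipeline tuning alone can deliver: if all data wait were hidden and compute time stayed the same, the step time would shrink by that fraction. The helper names below are illustrative, and the ceiling is an assumption-laden bound, not a claim about how loadtune reasons.

```python
def speedup(before_samples_per_s: float, after_samples_per_s: float) -> float:
    """Ratio of tuned throughput to baseline throughput."""
    return after_samples_per_s / before_samples_per_s


def max_speedup_if_wait_hidden(data_wait_fraction: float) -> float:
    """Upper bound on speedup if every second of data wait were eliminated.

    Assumes compute time per step is unchanged (same batch size, same model).
    """
    if not 0.0 <= data_wait_fraction < 1.0:
        raise ValueError("data_wait_fraction must be in [0, 1)")
    return 1.0 / (1.0 - data_wait_fraction)


print(f"{speedup(147, 830):.2f}x")                    # A100 result: 5.65x
print(f"{speedup(1477, 6279):.2f}x")                  # T4 Lightning result: 4.25x
print(f"{max_speedup_if_wait_hidden(0.90):.2f}x")     # ceiling at 90% data-wait
print(f"{max_speedup_if_wait_hidden(0.059):.2f}x")    # ceiling at 5.9% data-wait
```

Under these assumptions, the 5.9% data-wait case has a ceiling of about 1.06x, which is consistent with the mild speedup reported for the compute-bound DistilBERT run.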

Getting Started

1. Setup

pip install loadtune

Optional dependencies for framework integrations:

pip install "loadtune[lightning]"  # PyTorch Lightning
pip install "loadtune[nlp]"        # HuggingFace Transformers
pip install "loadtune[all]"        # Everything

2. How to Run (Three Scenarios)

Scenario A: PyTorch Lightning & HuggingFace (Zero Boilerplate)

If you use a high-level framework, loadtune extracts the components for you.

PyTorch Lightning:

# examples/my_lightning.py
from loadtune import from_lightning

# ... define module and datamodule ...
def get_workload():
    return from_lightning(my_lightning_module, datamodule=my_datamodule, batch_size=64)

HuggingFace Transformers:

# examples/my_hf.py
from loadtune import from_hf_trainer

# ... define model and dataset ...
def get_workload():
    return from_hf_trainer(model, dataset, tokenizer=tokenizer, batch_size=32)

Run via CLI using fast-mode (in-process trials to avoid framework import overhead):

loadtune tune examples/my_lightning.py --fast

Scenario B: Native PyTorch (`Workload` API)

If you write custom PyTorch loops, define a Workload dataclass that tells loadtune how to build your dataset and model and how to execute a single training step. See examples/synthetic_bottleneck.py for a full example.

Scenario C: The Python API (Notebooks & CI)

You can profile and tune directly from Python scripts without using the CLI:

from loadtune import tune
from loadtune.workload import load_workload

# Load and tune a workload autonomously
workload = load_workload("examples/my_workload.py")
result = tune(workload, steps=50, max_trials=6, auto_batch=True)

print(f"Best Config: {result.best.knobs.label()} — {result.speedup:.2f}x baseline")
print(result.diagnosis)

Advanced Features

Auto-Batching (--auto-batch): If you are compute-bound but your GPU memory utilization is low, loadtune autonomously proposes batch-size doubling until you hit ~80% VRAM utilization. Catches OOMs gracefully.
Auto-Apply (--apply): Generates a loadtune_apply.py code snippet containing the best configuration found so you can easily import it into your project.
Fast Mode (--fast): Runs trials in-process instead of spawning fresh subprocesses. Drastically reduces trial startup overhead for massive models.
Loss Parity Check: Dynamically verifies that semantics-changing configurations (like precision or batch size) don't break mathematical convergence.
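The batch-size doubling described under Auto-Batching can be sketched as a short loop. This is an illustration under stated assumptions, not loadtune's code: `auto_batch`, `toy_memory_model`, and the `max_batch` cap are hypothetical, and Python's built-in `MemoryError` stands in for a CUDA out-of-memory error.

```python
from typing import Callable, Optional


def auto_batch(
    start: int,
    measure_utilization: Callable[[int], float],
    target: float = 0.80,
    max_batch: int = 4096,
) -> Optional[int]:
    """Double the batch size until memory utilization reaches `target`.

    Returns the largest batch size that ran without an out-of-memory error,
    or None if even `start` did not fit.
    """
    best = None
    batch = start
    while batch <= max_batch:
        try:
            utilization = measure_utilization(batch)
        except MemoryError:
            break  # keep the last batch size that fit
        best = batch
        if utilization >= target:
            break  # close enough to the target; stop growing
        batch *= 2
    return best


def toy_memory_model(batch: int) -> float:
    """Assumed linear memory model: 10% fixed cost plus 0.5% per sample."""
    utilization = 0.10 + 0.005 * batch
    if utilization > 1.0:
        raise MemoryError(f"batch {batch} does not fit")
    return utilization


print(auto_batch(32, toy_memory_model))
```

With this toy model, doubling from 32 stops at 128: that size sits just under the target, and the next doubling would not fit.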

Next Steps: The Vision

loadtune is evolving into the definitive Agentic SRE (Site Reliability Engineer) for Machine Learning, split into two core disciplines:

1. `loadtune train` (Currently Complete)

Goal: Optimizing GPU utilization during R&D and model training.

✅ Phase 1: Input-pipeline tuning & worker sweeping.
✅ Phase 2: Cloud GPU & asynchronous memory evaluation.
✅ Phase 3: GPU memory profiling & automatic batch-scaling.
✅ Phase 4: Framework adapters (Lightning, HuggingFace).

2. `loadtune serve` (Upcoming Phase 5)

Goal: Optimizing server costs in production inference workloads.

The Challenge: Maximize throughput (Requests/sec) without violating strict latency SLAs (e.g., p99 latency < 100ms).
The Strategy: Agentic tuning of inference engines (vLLM, Triton, TorchServe).
The Knobs: Autonomously tuning dynamic batching windows, KV-cache block sizes, quantization precision, and maximum concurrency limits based on synthetic HTTP traffic profiling.
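The serving challenge is a constrained optimization: maximize requests per second subject to a latency limit. Since this feature is described as upcoming, the sketch below is purely illustrative. The trial records, the names `p99` and `pick_config`, and the nearest-rank percentile method are all assumptions.

```python
import math
from typing import List, Optional


def p99(latencies_ms: List[float]) -> float:
    """99th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def pick_config(trials: List[dict], sla_ms: float = 100.0) -> Optional[dict]:
    """Return the highest-throughput trial whose p99 latency is under the SLA."""
    within_sla = [t for t in trials if p99(t["latencies_ms"]) < sla_ms]
    if not within_sla:
        return None
    return max(within_sla, key=lambda t: t["rps"])


# Synthetic trial results (made-up numbers for illustration only).
trials = [
    {"name": "small-batch", "rps": 100.0, "latencies_ms": [20.0] * 100},
    {"name": "medium-batch", "rps": 200.0, "latencies_ms": [60.0] * 100},
    {"name": "large-batch", "rps": 300.0, "latencies_ms": [40.0] * 98 + [150.0] * 2},
]

chosen = pick_config(trials)
print(chosen["name"] if chosen else "no config meets the SLA")
```

The large-batch trial has the best throughput, but its tail latency breaks the 100 ms limit, so the medium-batch trial is selected.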

License

MIT

Project details

This version

0.3.0

Jun 26, 2026

Download files

Source Distribution

loadtune-0.3.0.tar.gz (894.1 kB)

Uploaded Jun 26, 2026 Source

Built Distribution

loadtune-0.3.0-py3-none-any.whl (35.1 kB)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file loadtune-0.3.0.tar.gz.

File metadata

Download URL: loadtune-0.3.0.tar.gz
Upload date: Jun 26, 2026
Size: 894.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for loadtune-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`44ba4883685bd7020b998aa9ace839b62eccc428a5b06de30f0666ef2c7b08b7`
MD5	`6f5b6572c9b48885a3cead85a52044d9`
BLAKE2b-256	`1758dd462fcdeef1a4c34deff4edef8b99a580a1bf802e3c20c77b74915cd196`

File details

Details for the file loadtune-0.3.0-py3-none-any.whl.

File metadata

Download URL: loadtune-0.3.0-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 35.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for loadtune-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0004d93f0b03a69e6ef267f50edafcc061649fdd1b76690a0e9593c27061d00`
MD5	`177113630b892b3562cc81f01b0cc44c`
BLAKE2b-256	`960eb89c5accf9879ca7a6eebc6f35bd49731619f6adb67e22bbb5d35cb837b0`
