
PyTorch training accelerator: AMP, torch.compile, fused kernels, EMA, SWA, schedulers, memory tools, DDP — all in one drop-in wrapper.

qlqoqrqa ⚡

A serious PyTorch training accelerator. Drop in one wrapper, eliminate common bottlenecks.


Install

pip install qlqoqrqa

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0


Quickstart

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from qlqoqrqa import Accelerator, AcceleratorConfig

model     = nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader    = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)  # X, y: your training tensors

cfg = AcceleratorConfig(
    dtype              = "bf16",   # BF16 AMP on Ampere+, FP16+scaler elsewhere
    compile            = True,     # torch.compile with Triton kernel fusion
    grad_accum_steps   = 4,        # effective batch_size = 256 × 4 = 1024
    gradient_checkpointing = False,
    fused_optimizer    = True,
    turbo_dataloader   = True,
    max_grad_norm      = 1.0,
)
acc = Accelerator(model, optimizer, loader, config=cfg)

for epoch in range(10):
    for batch in acc.dataloader:
        loss = acc.step(batch, forward_fn=lambda m, b: F.cross_entropy(m(b[0]), b[1]))
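For orientation, each acc.step() call stands in for the usual hand-written AMP step: autocast forward, loss scaled down for accumulation, clipping, and the optimizer update. A rough plain-PyTorch sketch of the same loop (illustrative only, assuming a CUDA device and the bf16 / grad_accum_steps=4 settings above; not the library's actual code):

for i, (xb, yb) in enumerate(loader):
    xb, yb = xb.cuda(non_blocking=True), yb.cuda(non_blocking=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = F.cross_entropy(model(xb), yb) / 4          # divide by grad_accum_steps
    loss.backward()
    if (i + 1) % 4 == 0:                                   # step once per 4 micro-batches
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)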

Optimization techniques

Technique                    What it does
BF16 / FP16 AMP              Half-precision forward and backward; FP16 uses loss scaling, BF16 doesn't need it
torch.compile                Triton-generated fused GPU kernels via max-autotune
Fused optimizer              fused=True Adam/SGD: one CUDA kernel instead of a Python loop per parameter
Turbo DataLoader             pin_memory, persistent_workers, CUDA stream prefetch
zero_grad(set_to_none=True)  Avoids memset overhead on gradient buffers
channels_last                Contiguous NHWC layout — free speedup on conv-heavy nets
TF32                         TensorFloat-32 (19-bit) matmuls on Ampere+ with FP32 accumulation
cuDNN benchmark              Auto-selects the fastest conv algorithm per input shape
Gradient accumulation        Simulates large batches without extra VRAM
Gradient checkpointing       Recomputes activations during backward, trading compute for a large cut in peak VRAM
Gradient clipping            Stabilises training and prevents loss spikes
DDP (multi-GPU)              Auto-detected via the RANK env var (torchrun)
EMA weights                  Exponential moving average of weights — better generalisation
SWA                          Stochastic Weight Averaging — flatter minima
Warmup + cosine LR           Standard LR schedule for transformers
Early stopping               Stops when the validation metric stops improving
Memory tracker               Measures peak VRAM usage per phase
Auto batch size              Binary-searches the largest batch that fits in VRAM
Activation checkpointing     Per-module gradient checkpointing for arbitrary architectures
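Several of these entries correspond to standard PyTorch switches. For reference, the usual one-liners look like this (plain PyTorch, not qlqoqrqa's internals):

import torch

torch.backends.cuda.matmul.allow_tf32 = True         # TF32 matmuls on Ampere+
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True                 # pick fastest conv algorithm per input shape

model = model.to(memory_format=torch.channels_last)   # NHWC layout for conv-heavy nets
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)  # single fused CUDA kernel
optimizer.zero_grad(set_to_none=True)                 # skip the memset on gradient buffers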

High-level Trainer

from qlqoqrqa import Trainer, EMA, get_scheduler

ema = EMA(model, decay=0.9999)
scheduler = get_scheduler("cosine", optimizer, warmup_steps=500, total_steps=10000)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    train_loader=train_loader,
    val_loader=val_loader,
    forward_fn=lambda m, b: m(b["input_ids"]),
    loss_fn=lambda out, b: F.cross_entropy(out, b["labels"]),
    metric_fn=lambda out, b: {"loss": F.cross_entropy(out, b["labels"]).item()},
    scheduler=scheduler,
    ema=ema,
    epochs=20,
    early_stopping_patience=5,
    checkpoint_path="checkpoints/best.pt",
)
history = trainer.fit()

Optimizers

from qlqoqrqa import FusedAdamW, Lion

# foreach-vectorized AdamW
opt = FusedAdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Lion — about half the optimizer state of Adam (a single momentum buffer), sign-only updates
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
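"Sign-only" refers to the Lion update rule (Chen et al., 2023): the parameter step uses only the sign of an interpolation between the momentum and the current gradient, and a single momentum buffer is kept per parameter. A minimal per-tensor sketch of one Lion step (illustrative, not the packaged implementation):

import torch

@torch.no_grad()
def lion_step(p, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # update direction: sign of the interpolation between momentum and current grad
    update = torch.sign(m.mul(beta1).add(grad, alpha=1 - beta1))
    p.add_(update + weight_decay * p, alpha=-lr)
    # momentum buffer is the only optimizer state kept
    m.mul_(beta2).add_(grad, alpha=1 - beta2)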

Memory tools

from qlqoqrqa import MemoryTracker, find_batch_size, apply_activation_checkpointing, empty_cache

# Find the biggest batch that fits in VRAM
bs = find_batch_size(model, sample_input_fn=lambda n: torch.randn(n, 3, 224, 224))

# Track peak VRAM usage
tracker = MemoryTracker()
tracker.start()
train_one_epoch(...)
print(tracker.stop())
# {'peak_mb': 8192.0, 'delta_mb': 7680.0, 'utilization_pct': 80.0, ...}

# Apply activation checkpointing to specific layer types
from mymodel import TransformerBlock
model = apply_activation_checkpointing(model, module_types=(TransformerBlock,))

# Release memory between train and eval
empty_cache()
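find_batch_size works by probing: run a forward and backward pass at a candidate size, catch the CUDA out-of-memory error, and binary-search for the largest size that survives. A simplified sketch of that idea (a hypothetical helper written against plain PyTorch; the library's search details may differ):

import torch

def probe_batch_size(model, sample_input_fn, low=1, high=1024):
    """Largest batch whose forward+backward fits in VRAM (assumes model is on the GPU)."""
    best = low
    while low <= high:
        mid = (low + high) // 2
        try:
            torch.cuda.empty_cache()
            out = model(sample_input_fn(mid).cuda())
            out.sum().backward()                 # include backward-pass memory in the probe
            model.zero_grad(set_to_none=True)
            best, low = mid, mid + 1             # fits: try bigger
        except torch.cuda.OutOfMemoryError:
            high = mid - 1                       # OOM: try smaller
    return best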

Benchmark

from qlqoqrqa import benchmark, compare_speedup

baseline  = benchmark(lambda: model_eager(x),   batch_size=64, n_runs=100)
compiled  = benchmark(lambda: model_compiled(x), batch_size=64, n_runs=100)
compare_speedup(baseline, compiled)
────────────────────────────────────────────────────────
  Metric                      Baseline       qlqoqrqa
────────────────────────────────────────────────────────
  Mean latency (ms)             124.50           5.20
  Throughput (samp/s)              513          12307
  Speedup                         1.00x          24.0x
────────────────────────────────────────────────────────

Profiler

from qlqoqrqa.profiler.trace import profile_training

with profile_training(active_steps=20, output_path="trace.json") as prof:
    for i, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        prof.step()

prof.print_top(20)   # operator-level bottleneck table
# Open trace.json in https://ui.perfetto.dev
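If you prefer the raw torch.profiler API, the hand-rolled equivalent of the snippet above looks roughly like this (standard PyTorch, shown for comparison only):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        if i == 19:                              # profile 20 steps
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")           # open in ui.perfetto.dev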

LR Schedulers

from qlqoqrqa import get_scheduler

sched = get_scheduler("cosine", optimizer, warmup_steps=200, total_steps=5000)
sched = get_scheduler("linear", optimizer, warmup_steps=100, total_steps=5000)
sched = get_scheduler("onecycle", optimizer, total_steps=5000, max_lr=1e-3)
sched = get_scheduler("constant", optimizer)

EMA & SWA

from qlqoqrqa import EMA, SWA

# EMA — call after every optimizer.step()
ema = EMA(model, decay=0.9999)
ema.update()
with ema.average_parameters():
    val_loss = evaluate(model, val_loader)

# SWA — averages weights across epochs
swa = SWA(model, swa_start_epoch=7, swa_lr=5e-4)
swa.attach_optimizer(optimizer)
for epoch in range(10):
    train(...)
    swa.update(epoch)
swa.finalize(train_loader)
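Both are weight averages under the hood: EMA keeps a running exponential average of the parameters after each step, while SWA keeps an equal average of snapshots taken from swa_start_epoch onward (finalize(train_loader) presumably recomputes BatchNorm statistics, the standard last step of SWA). Written out as a sketch:

import torch

decay = 0.9999
ema_params = [p.detach().clone() for p in model.parameters()]    # shadow copy of the weights

@torch.no_grad()
def ema_update():
    # theta_ema <- decay * theta_ema + (1 - decay) * theta, after every optimizer.step()
    for p_ema, p in zip(ema_params, model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# SWA instead keeps a running equal average, updated once per epoch after swa_start_epoch:
#   theta_swa <- (n * theta_swa + theta) / (n + 1)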

Multi-GPU (torchrun)

torchrun --nproc_per_node=4 train.py

qlqoqrqa auto-detects RANK / LOCAL_RANK and wraps the model in DDP. No code changes needed.
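For reference, the boilerplate this replaces is the standard torchrun pattern (plain PyTorch, shown so it is clear what "no code changes" is saving you):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
model = DDP(model.cuda(), device_ids=[local_rank])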


AcceleratorConfig reference

AcceleratorConfig(
    dtype                  = "auto",        # "auto"|"bf16"|"fp16"|"fp32"
    compile                = True,
    compile_mode           = "max-autotune",
    grad_accum_steps       = 1,
    max_grad_norm          = 1.0,           # 0 = disabled
    gradient_checkpointing = False,
    channels_last          = True,
    turbo_dataloader       = True,
    num_workers            = -1,            # -1 = auto
    prefetch_factor        = 4,
    tf32                   = True,
    cudnn_benchmark        = True,
    distributed            = False,
    fused_optimizer        = True,
    verbose                = True,
)
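dtype="auto" follows the rule stated in the Quickstart comment: BF16 where the hardware supports it, FP16 with a gradient scaler elsewhere, FP32 on CPU. A plausible selection rule, written out (an assumption about the heuristic, not the library's exact code):

import torch

def pick_dtype():
    if not torch.cuda.is_available():
        return torch.float32                      # CPU: stay in FP32
    if torch.cuda.is_bf16_supported():
        return torch.bfloat16                     # Ampere+: BF16, no loss scaling needed
    return torch.float16                          # older GPUs: FP16 + GradScaler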

Publish to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT
