# qlqoqrqa ⚡

A serious PyTorch training accelerator: AMP, torch.compile, fused kernels, EMA, SWA, schedulers, memory tools, DDP. Drop in one wrapper, eliminate common bottlenecks.
## Install

```bash
pip install qlqoqrqa
```

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0

## Quickstart

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from qlqoqrqa import Accelerator, AcceleratorConfig

model = nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

cfg = AcceleratorConfig(
    dtype="bf16",                 # BF16 AMP on Ampere+, FP16 + loss scaler elsewhere
    compile=True,                 # torch.compile with Triton kernel fusion
    grad_accum_steps=4,           # effective batch size = 256 × 4 = 1024
    gradient_checkpointing=False,
    fused_optimizer=True,
    turbo_dataloader=True,
    max_grad_norm=1.0,
)
acc = Accelerator(model, optimizer, loader, config=cfg)

for epoch in range(10):
    for batch in acc.dataloader:
        loss = acc.step(batch, forward_fn=lambda m, b: F.cross_entropy(m(b[0]), b[1]))
```
## Optimization techniques
| Technique | What it does |
|---|---|
| BF16 / FP16 AMP | Half-precision forward + backward; FP16 uses loss scaling, BF16 doesn't need it |
| torch.compile | Triton-generated fused GPU kernels via max-autotune |
| Fused optimizer | fused=True Adam/SGD: one CUDA kernel instead of a Python loop per param |
| Turbo DataLoader | pin_memory, persistent_workers, CUDA stream prefetch |
| set_to_none zero_grad | Avoids memset overhead on gradient buffers |
| channels_last | Contiguous NHWC layout — free speedup on conv-heavy nets |
| TF32 | TF32 matmuls on Ampere+ (19-bit inputs: 10-bit mantissa, FP32 range) with FP32 accumulation |
| cuDNN benchmark | Auto-selects fastest conv algorithm per input shape |
| Gradient accumulation | Simulates large batches without extra VRAM |
| Gradient checkpointing | Recomputes activations during backward, trading extra compute for a large cut in peak activation VRAM |
| Gradient clipping | Stabilises training, prevents loss spikes |
| DDP (multi-GPU) | Auto-detected via RANK env var (torchrun) |
| EMA weights | Exponential moving average — better generalisation |
| SWA | Stochastic Weight Averaging — flatter minima |
| Warmup + cosine LR | Standard LR schedule for transformers |
| Early stopping | Stops when validation stops improving |
| Memory tracker | Measures peak VRAM usage per phase |
| Auto batch size | Binary-searches the largest batch that fits in VRAM |
| Activation checkpointing | Per-module gradient checkpointing for arbitrary architectures |
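Most of these are standard PyTorch recipes wired together. As a point of reference, here is what one AMP + gradient-accumulation + clipping step looks like in plain PyTorch. This is a minimal sketch of the recipe, not this library's actual internals; `model`, `optimizer`, and `loader` are assumed to exist:

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()    # FP16 needs loss scaling; BF16 does not
GRAD_ACCUM, MAX_NORM = 4, 1.0

optimizer.zero_grad(set_to_none=True)   # set_to_none avoids a memset per buffer
for i, (x, y) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(x.cuda()), y.cuda())
    scaler.scale(loss / GRAD_ACCUM).backward()   # average over micro-batches
    if (i + 1) % GRAD_ACCUM == 0:
        scaler.unscale_(optimizer)               # clip in true (unscaled) units
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_NORM)
        scaler.step(optimizer)                   # skips the step on inf/nan grads
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```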
## High-level Trainer

```python
from qlqoqrqa import Trainer, EMA, get_scheduler

ema = EMA(model, decay=0.9999)
scheduler = get_scheduler("cosine", optimizer, warmup_steps=500, total_steps=10000)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    train_loader=train_loader,
    val_loader=val_loader,
    forward_fn=lambda m, b: m(b["input_ids"]),
    loss_fn=lambda out, b: F.cross_entropy(out, b["labels"]),
    metric_fn=lambda out, b: {"loss": F.cross_entropy(out, b["labels"]).item()},
    scheduler=scheduler,
    ema=ema,
    epochs=20,
    early_stopping_patience=5,
    checkpoint_path="checkpoints/best.pt",
)
history = trainer.fit()
```
## Optimizers

```python
from qlqoqrqa import FusedAdamW, Lion

# foreach-vectorized AdamW
opt = FusedAdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Lion: roughly half the optimizer state of Adam
# (one momentum buffer instead of two), sign-based updates
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
```
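Lion's update rule is small enough to show in full. The sketch below is the published algorithm (Chen et al., 2023), not necessarily this package's exact implementation; note the single momentum buffer where Adam keeps two state tensors:

```python
import torch

@torch.no_grad()
def lion_update(param, grad, momentum, lr=1e-4, betas=(0.9, 0.99), wd=0.01):
    """One Lion step: take the sign of an interpolated momentum."""
    b1, b2 = betas
    update = momentum.mul(b1).add(grad, alpha=1 - b1).sign_()  # sign(b1*m + (1-b1)*g)
    param.mul_(1 - lr * wd)                      # decoupled weight decay
    param.add_(update, alpha=-lr)                # θ ← θ - lr * sign(...)
    momentum.mul_(b2).add_(grad, alpha=1 - b2)   # momentum itself uses beta2
```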
## Memory tools

```python
from qlqoqrqa import MemoryTracker, find_batch_size, apply_activation_checkpointing, empty_cache

# Find the biggest batch that fits in VRAM
bs = find_batch_size(model, sample_input_fn=lambda n: torch.randn(n, 3, 224, 224))

# Track peak VRAM usage
tracker = MemoryTracker()
tracker.start()
train_one_epoch(...)
print(tracker.stop())
# {'peak_mb': 8192.0, 'delta_mb': 7680.0, 'utilization_pct': 80.0, ...}

# Apply activation checkpointing to specific layer types
from mymodel import TransformerBlock
model = apply_activation_checkpointing(model, module_types=(TransformerBlock,))

# Release cached GPU memory between train and eval
empty_cache()
```
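`find_batch_size` is described above as a binary search over what fits in VRAM. A minimal standalone sketch of that idea, treating CUDA OOM as the failure signal (the helper's real signature and internals may differ; `make_batch(n)` is a hypothetical input factory):

```python
import torch

def probe_batch_size(model, make_batch, low=1, high=4096):
    """Bisect the largest batch size whose forward + backward fits in VRAM."""
    def fits(n):
        try:
            model.zero_grad(set_to_none=True)
            out = model(make_batch(n))
            out.sum().backward()                 # include backward-pass memory
            return True
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()             # release the failed allocation
            return False

    best = 0
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best, low = mid, mid + 1
        else:
            high = mid - 1
    return best
```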
## Benchmark

```python
from qlqoqrqa import benchmark, compare_speedup

baseline = benchmark(lambda: model_eager(x), batch_size=64, n_runs=100)
compiled = benchmark(lambda: model_compiled(x), batch_size=64, n_runs=100)
compare_speedup(baseline, compiled)
```

```text
────────────────────────────────────────────────────────
Metric                        Baseline        qlqoqrqa
────────────────────────────────────────────────────────
Mean latency (ms)               124.50            5.20
Throughput (samp/s)                513           12307
Speedup                          1.00x           24.0x
────────────────────────────────────────────────────────
```
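If you benchmark GPU code yourself, remember that CUDA kernels launch asynchronously: latency must be measured with CUDA events (or an explicit `torch.cuda.synchronize()`), after a warmup so that `torch.compile` autotuning is excluded. A minimal sketch:

```python
import torch

def time_cuda(fn, n_warmup=10, n_runs=100):
    """Mean latency in ms, measured with CUDA events after a warmup."""
    for _ in range(n_warmup):                 # warm up caches, autotuning, etc.
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_runs):
        fn()
    end.record()
    torch.cuda.synchronize()                  # wait for all queued kernels
    return start.elapsed_time(end) / n_runs   # elapsed_time returns milliseconds
```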
## Profiler

```python
from qlqoqrqa.profiler.trace import profile_training

with profile_training(active_steps=20, output_path="trace.json") as prof:
    for i, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        prof.step()

prof.print_top(20)  # operator-level bottleneck table
# Open trace.json in https://ui.perfetto.dev
```
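`profile_training` presumably sits on top of `torch.profiler`; if you prefer the raw API, the equivalent loop looks roughly like this and produces the same Perfetto-compatible `trace.json` (`loader`, `acc`, and `forward_fn` as in the Quickstart):

```python
import torch.profiler as tp

# Skip 2 steps, warm up for 2, then record 20 active steps.
with tp.profile(
    activities=[tp.ProfilerActivity.CPU, tp.ProfilerActivity.CUDA],
    schedule=tp.schedule(wait=2, warmup=2, active=20),
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
) as prof:
    for step, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        prof.step()                           # advance the wait/warmup/active schedule
        if step >= 23:                        # 2 + 2 + 20 steps completed
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```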
## LR Schedulers

```python
from qlqoqrqa import get_scheduler

sched = get_scheduler("cosine", optimizer, warmup_steps=200, total_steps=5000)
sched = get_scheduler("linear", optimizer, warmup_steps=100, total_steps=5000)
sched = get_scheduler("onecycle", optimizer, total_steps=5000, max_lr=1e-3)
sched = get_scheduler("constant", optimizer)
```
EMA & SWA
from qlqoqrqa import EMA, SWA
# EMA — call after every optimizer.step()
ema = EMA(model, decay=0.9999)
ema.update()
with ema.average_parameters():
val_loss = evaluate(model, val_loader)
# SWA — averages weights across epochs
swa = SWA(model, swa_start_epoch=7, swa_lr=5e-4)
swa.attach_optimizer(optimizer)
for epoch in range(10):
train(...)
swa.update(epoch)
swa.finalize(train_loader)
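The EMA update itself is one line of math per parameter, `ema = decay * ema + (1 - decay) * param`. A minimal standalone version, in case you want the technique outside this wrapper:

```python
import copy
import torch

class SimpleEMA:
    """Shadow copy updated as ema = decay * ema + (1 - decay) * param."""

    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)   # s ← s + (1-d)·(p - s) = d·s + (1-d)·p
```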
## Multi-GPU (torchrun)

```bash
torchrun --nproc_per_node=4 train.py
```

qlqoqrqa auto-detects RANK / LOCAL_RANK and wraps the model in DDP. No code changes needed.
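For reference, torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, so auto-detection amounts to something like the sketch below (this is the standard PyTorch pattern, not necessarily the wrapper's exact code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

if "RANK" in os.environ and int(os.environ.get("WORLD_SIZE", "1")) > 1:
    dist.init_process_group(backend="nccl")       # reads rank/world size from env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)             # one GPU per process
    model = DDP(model.to(local_rank), device_ids=[local_rank])
```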
## AcceleratorConfig reference

```python
AcceleratorConfig(
    dtype="auto",                  # "auto" | "bf16" | "fp16" | "fp32"
    compile=True,
    compile_mode="max-autotune",
    grad_accum_steps=1,
    max_grad_norm=1.0,             # 0 = disabled
    gradient_checkpointing=False,
    channels_last=True,
    turbo_dataloader=True,
    num_workers=-1,                # -1 = auto
    prefetch_factor=4,
    tf32=True,
    cudnn_benchmark=True,
    distributed=False,
    fused_optimizer=True,
    verbose=True,
)
```
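The exact policy behind `dtype="auto"` is the library's, but the common heuristic for that setting is: BF16 where the hardware supports it, FP16 with a GradScaler on older CUDA GPUs, FP32 on CPU. As a sketch of that heuristic:

```python
import torch

def resolve_dtype():
    """Common 'auto' heuristic: bf16 if supported, else fp16 on CUDA, else fp32."""
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():   # Ampere (SM80) and newer
            return torch.bfloat16
        return torch.float16                 # needs a GradScaler for stability
    return torch.float32
```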
## Publish to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```
## License

MIT