# qlqoqrqa ⚡

A serious PyTorch training accelerator: AMP, torch.compile, fused kernels, EMA, SWA, schedulers, memory tools, DDP. Drop in one wrapper, eliminate common bottlenecks.
## Install

```bash
pip install qlqoqrqa
```

Requirements: Python ≥ 3.9, PyTorch ≥ 2.0

## Quickstart

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from qlqoqrqa import Accelerator, AcceleratorConfig

model = nn.TransformerEncoder(...)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

cfg = AcceleratorConfig(
    dtype="bf16",                 # BF16 AMP on Ampere+, FP16 + loss scaler elsewhere
    compile=True,                 # torch.compile with Triton kernel fusion
    grad_accum_steps=4,           # effective batch size = 256 × 4 = 1024
    gradient_checkpointing=False,
    fused_optimizer=True,
    turbo_dataloader=True,
    max_grad_norm=1.0,
)
acc = Accelerator(model, optimizer, loader, config=cfg)

for epoch in range(10):
    for batch in acc.dataloader:
        loss = acc.step(batch, forward_fn=lambda m, b: F.cross_entropy(m(b[0]), b[1]))
```
## Optimization techniques
| Technique | What it does |
|---|---|
| BF16 / FP16 AMP | Half-precision forward + backward; FP16 uses loss scaling, BF16 doesn't need it |
| torch.compile | Triton-generated fused GPU kernels via max-autotune |
| Fused optimizer | fused=True Adam/SGD: one CUDA kernel instead of a Python loop per param |
| Turbo DataLoader | pin_memory, persistent_workers, CUDA stream prefetch |
| set_to_none zero_grad | Avoids memset overhead on gradient buffers |
| channels_last | Contiguous NHWC layout — free speedup on conv-heavy nets |
| TF32 | TF32 matmuls on Ampere+ (19-bit inputs: 10-bit mantissa, FP32 range) with FP32 accumulation |
| cuDNN benchmark | Auto-selects fastest conv algorithm per input shape |
| Gradient accumulation | Simulates large batches without extra VRAM |
| Gradient checkpointing | Recomputes activations during backward, trading extra compute for a large cut in peak activation VRAM |
| Gradient clipping | Stabilises training, prevents loss spikes |
| DDP (multi-GPU) | Auto-detected via RANK env var (torchrun) |
| EMA weights | Exponential moving average — better generalisation |
| SWA | Stochastic Weight Averaging — flatter minima |
| Warmup + cosine LR | Standard LR schedule for transformers |
| Early stopping | Stops when validation stops improving |
| Memory tracker | Measures peak VRAM usage per phase |
| Auto batch size | Binary-searches the largest batch that fits in VRAM |
| Activation checkpointing | Per-module gradient checkpointing for arbitrary architectures |
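Most of these are standard PyTorch recipes wired together. As a point of reference, here is what one AMP + gradient-accumulation + clipping step looks like in plain PyTorch. This is a minimal sketch of the recipe, not this library's actual internals; `model`, `optimizer`, and `loader` are assumed to exist:

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()    # FP16 needs loss scaling; BF16 does not
GRAD_ACCUM, MAX_NORM = 4, 1.0

optimizer.zero_grad(set_to_none=True)   # set_to_none avoids a memset per buffer
for i, (x, y) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(x.cuda()), y.cuda())
    scaler.scale(loss / GRAD_ACCUM).backward()   # average over micro-batches
    if (i + 1) % GRAD_ACCUM == 0:
        scaler.unscale_(optimizer)               # clip in true (unscaled) units
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_NORM)
        scaler.step(optimizer)                   # skips the step on inf/nan grads
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```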
## High-level Trainer

```python
from qlqoqrqa import Trainer, EMA, get_scheduler

ema = EMA(model, decay=0.9999)
scheduler = get_scheduler("cosine", optimizer, warmup_steps=500, total_steps=10000)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    train_loader=train_loader,
    val_loader=val_loader,
    forward_fn=lambda m, b: m(b["input_ids"]),
    loss_fn=lambda out, b: F.cross_entropy(out, b["labels"]),
    metric_fn=lambda out, b: {"loss": F.cross_entropy(out, b["labels"]).item()},
    scheduler=scheduler,
    ema=ema,
    epochs=20,
    early_stopping_patience=5,
    checkpoint_path="checkpoints/best.pt",
)
history = trainer.fit()
```
## Optimizers

```python
from qlqoqrqa import FusedAdamW, Lion

# foreach-vectorized AdamW
opt = FusedAdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Lion: roughly half the optimizer state of Adam
# (one momentum buffer instead of two), sign-based updates
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
```
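Lion's update rule is small enough to show in full. The sketch below is the published algorithm (Chen et al., 2023), not necessarily this package's exact implementation; note the single momentum buffer where Adam keeps two state tensors:

```python
import torch

@torch.no_grad()
def lion_update(param, grad, momentum, lr=1e-4, betas=(0.9, 0.99), wd=0.01):
    """One Lion step: take the sign of an interpolated momentum."""
    b1, b2 = betas
    update = momentum.mul(b1).add(grad, alpha=1 - b1).sign_()  # sign(b1*m + (1-b1)*g)
    param.mul_(1 - lr * wd)                      # decoupled weight decay
    param.add_(update, alpha=-lr)                # θ ← θ - lr * sign(...)
    momentum.mul_(b2).add_(grad, alpha=1 - b2)   # momentum itself uses beta2
```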
## Memory tools

```python
from qlqoqrqa import MemoryTracker, find_batch_size, apply_activation_checkpointing, empty_cache

# Find the biggest batch that fits in VRAM
bs = find_batch_size(model, sample_input_fn=lambda n: torch.randn(n, 3, 224, 224))

# Track peak VRAM usage
tracker = MemoryTracker()
tracker.start()
train_one_epoch(...)
print(tracker.stop())
# {'peak_mb': 8192.0, 'delta_mb': 7680.0, 'utilization_pct': 80.0, ...}

# Apply activation checkpointing to specific layer types
from mymodel import TransformerBlock
model = apply_activation_checkpointing(model, module_types=(TransformerBlock,))

# Release cached GPU memory between train and eval
empty_cache()
```
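`find_batch_size` is described above as a binary search over what fits in VRAM. A minimal standalone sketch of that idea, treating CUDA OOM as the failure signal (the helper's real signature and internals may differ; `make_batch(n)` is a hypothetical input factory):

```python
import torch

def probe_batch_size(model, make_batch, low=1, high=4096):
    """Bisect the largest batch size whose forward + backward fits in VRAM."""
    def fits(n):
        try:
            model.zero_grad(set_to_none=True)
            out = model(make_batch(n))
            out.sum().backward()                 # include backward-pass memory
            return True
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()             # release the failed allocation
            return False

    best = 0
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best, low = mid, mid + 1
        else:
            high = mid - 1
    return best
```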
## Benchmark

```python
from qlqoqrqa import benchmark, compare_speedup

baseline = benchmark(lambda: model_eager(x), batch_size=64, n_runs=100)
compiled = benchmark(lambda: model_compiled(x), batch_size=64, n_runs=100)
compare_speedup(baseline, compiled)
```

```text
────────────────────────────────────────────────────────
Metric                        Baseline        qlqoqrqa
────────────────────────────────────────────────────────
Mean latency (ms)               124.50            5.20
Throughput (samp/s)                513           12307
Speedup                          1.00x           24.0x
────────────────────────────────────────────────────────
```
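If you benchmark GPU code yourself, remember that CUDA kernels launch asynchronously: latency must be measured with CUDA events (or an explicit `torch.cuda.synchronize()`), after a warmup so that `torch.compile` autotuning is excluded. A minimal sketch:

```python
import torch

def time_cuda(fn, n_warmup=10, n_runs=100):
    """Mean latency in ms, measured with CUDA events after a warmup."""
    for _ in range(n_warmup):                 # warm up caches, autotuning, etc.
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_runs):
        fn()
    end.record()
    torch.cuda.synchronize()                  # wait for all queued kernels
    return start.elapsed_time(end) / n_runs   # elapsed_time returns milliseconds
```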
## Profiler

```python
from qlqoqrqa.profiler.trace import profile_training

with profile_training(active_steps=20, output_path="trace.json") as prof:
    for i, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        prof.step()

prof.print_top(20)  # operator-level bottleneck table
# Open trace.json in https://ui.perfetto.dev
```
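`profile_training` presumably sits on top of `torch.profiler`; if you prefer the raw API, the equivalent loop looks roughly like this and produces the same Perfetto-compatible `trace.json` (`loader`, `acc`, and `forward_fn` as in the Quickstart):

```python
import torch.profiler as tp

# Skip 2 steps, warm up for 2, then record 20 active steps.
with tp.profile(
    activities=[tp.ProfilerActivity.CPU, tp.ProfilerActivity.CUDA],
    schedule=tp.schedule(wait=2, warmup=2, active=20),
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
) as prof:
    for step, batch in enumerate(loader):
        loss = acc.step(batch, forward_fn)
        prof.step()                           # advance the wait/warmup/active schedule
        if step >= 23:                        # 2 + 2 + 20 steps completed
            break

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```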
## LR Schedulers

```python
from qlqoqrqa import get_scheduler

sched = get_scheduler("cosine", optimizer, warmup_steps=200, total_steps=5000)
sched = get_scheduler("linear", optimizer, warmup_steps=100, total_steps=5000)
sched = get_scheduler("onecycle", optimizer, total_steps=5000, max_lr=1e-3)
sched = get_scheduler("constant", optimizer)
```
EMA & SWA
from qlqoqrqa import EMA, SWA
# EMA — call after every optimizer.step()
ema = EMA(model, decay=0.9999)
ema.update()
with ema.average_parameters():
val_loss = evaluate(model, val_loader)
# SWA — averages weights across epochs
swa = SWA(model, swa_start_epoch=7, swa_lr=5e-4)
swa.attach_optimizer(optimizer)
for epoch in range(10):
train(...)
swa.update(epoch)
swa.finalize(train_loader)
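The EMA update itself is one line of math per parameter, `ema = decay * ema + (1 - decay) * param`. A minimal standalone version, in case you want the technique outside this wrapper:

```python
import copy
import torch

class SimpleEMA:
    """Shadow copy updated as ema = decay * ema + (1 - decay) * param."""

    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)   # s ← s + (1-d)·(p - s) = d·s + (1-d)·p
```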
## Multi-GPU (torchrun)

```bash
torchrun --nproc_per_node=4 train.py
```

qlqoqrqa auto-detects RANK / LOCAL_RANK and wraps the model in DDP. No code changes needed.
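For reference, torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, so auto-detection amounts to something like the sketch below (this is the standard PyTorch pattern, not necessarily the wrapper's exact code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

if "RANK" in os.environ and int(os.environ.get("WORLD_SIZE", "1")) > 1:
    dist.init_process_group(backend="nccl")       # reads rank/world size from env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)             # one GPU per process
    model = DDP(model.to(local_rank), device_ids=[local_rank])
```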
## AcceleratorConfig reference

```python
AcceleratorConfig(
    dtype="auto",                  # "auto" | "bf16" | "fp16" | "fp32"
    compile=True,
    compile_mode="max-autotune",
    grad_accum_steps=1,
    max_grad_norm=1.0,             # 0 = disabled
    gradient_checkpointing=False,
    channels_last=True,
    turbo_dataloader=True,
    num_workers=-1,                # -1 = auto
    prefetch_factor=4,
    tf32=True,
    cudnn_benchmark=True,
    distributed=False,
    fused_optimizer=True,
    verbose=True,
)
```
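The exact policy behind `dtype="auto"` is the library's, but the common heuristic for that setting is: BF16 where the hardware supports it, FP16 with a GradScaler on older CUDA GPUs, FP32 on CPU. As a sketch of that heuristic:

```python
import torch

def resolve_dtype():
    """Common 'auto' heuristic: bf16 if supported, else fp16 on CUDA, else fp32."""
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():   # Ampere (SM80) and newer
            return torch.bfloat16
        return torch.float16                 # needs a GradScaler for stability
    return torch.float32
```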
## Publish to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```
## License

MIT