Thermoclaw

Thermoclaw catches training failures before they waste GPU hours.

A lightweight diagnostic layer for any PyTorch optimiser — per-layer collapse detection, entropy decomposition, and plain-English recommendations. One line wraps AdamW, SGD, or anything else, and you get real-time alerts when your training is about to collapse, plateau, or diverge, with layer-level explanations of why.

Works alongside W&B, TensorBoard, and the HuggingFace Trainer: they show you what your loss curve is doing; Thermoclaw tells you why it looks the way it does.
It found the problem. Acting on it worked.
We trained GPT-2 small (124M parameters) from scratch on WikiText-103 with SGD + momentum and weight_decay=5.0. At step 19, CollapseDetector flagged a HIGH-confidence weight-decay collapse:
```
[HIGH] Weight decay is eroding 2 embedding layers: param norms dropped
36% during training. Reduce weight_decay.
```
We branched at that point — one arm continued unchanged, the other reduced weight decay to 0.01.
| Run | Final PPL (600 steps post-branch) |
|---|---|
| Unmodified (wd=5.0 throughout) | 50,257 — model completely dead, outputting uniform noise |
| Thermoclaw intervention (wd→0.01) | 1,377 |
| Improvement | 36× lower perplexity |
Replicated across 3 seeds (42, 137, 2024). The unmodified arm locks at ppl = vocab_size — zero information. The intervention arm learns.
No hyperparameter search. No manual inspection. One warning, one change. The warning is causal, not correlational.
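The fix itself was a one-line change to the optimiser's param groups. A minimal sketch of what acting on the warning looks like (illustrative, assuming the `detector` from the CollapseDetector example below — not the exact script we ran):

```python
# When the detector reports a HIGH-confidence weight-decay collapse,
# drop weight_decay in every param group and keep training from the same step.
if any(rec.startswith('[HIGH]') for rec in detector.get_recommendations()):
    for pg in optimiser.param_groups:
        pg['weight_decay'] = 0.01   # was 5.0 in the unmodified arm
```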
Install
```bash
pip install thermoclaw        # core (PyTorch + NumPy only)
pip install thermoclaw[viz]   # + Matplotlib dashboards
pip install thermoclaw[hf]    # + HuggingFace Trainer callback
pip install thermoclaw[all]   # everything
```
Quick Start
Observe any optimiser
```python
import torch
from thermoclaw import Observer, diagnose

model = YourModel()
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
observer = Observer(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    optimiser.zero_grad()

report = diagnose(observer)
print(report)
observer.plot_dashboard(save_path='dashboard.png')
```
HuggingFace Trainer (one line)
```python
from transformers import Trainer
from thermoclaw.integrations.huggingface import ThermoclawCallback

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[ThermoclawCallback()],   # ← that's it
)
trainer.train()

# Dashboard PNG, CSV, and diagnostic report saved to output_dir automatically
# Thermodynamic metrics logged to W&B / TensorBoard automatically
```
See BLOG_HF.md for a full walkthrough.
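Where the thermodynamic metrics end up follows the Trainer's normal reporting setup. A minimal sketch using standard `transformers` arguments (nothing here is Thermoclaw-specific):

```python
from transformers import TrainingArguments

# Thermoclaw writes its dashboard/CSV/report into output_dir and logs its
# metrics through whichever backends the Trainer already reports to.
training_args = TrainingArguments(
    output_dir='out',
    report_to=['wandb'],   # or ['tensorboard']
    logging_steps=10,
)
```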
Catch weight-decay collapse in real time
```python
from thermoclaw import CollapseDetector, make_param_groups

# Per-layer groups give full resolution (recommended)
groups = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
detector = CollapseDetector(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    detector.step()          # call before zero_grad
    optimiser.zero_grad()

    if detector.is_collapsing:
        for pg in optimiser.param_groups:
            pg['weight_decay'] *= 0.1

recs = detector.get_recommendations()
# → ["[HIGH] Weight decay collapse in 5 mlp layers: grad/param ratio
#     dropped 4.2× from early to late training. Reduce weight_decay."]
```
`is_collapsing` fires as soon as any HIGH or MEDIUM signal is confirmed. See Known Issues for AdamW behaviour at typical weight-decay values.
Decompose entropy into productive vs overhead
```python
from thermoclaw import Observer, EntropySplit, diagnose

observer = Observer(model, optimiser)
splitter = EntropySplit(model, optimiser, observer)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    splitter.step()          # decomposes entropy each step
    optimiser.zero_grad()

report = diagnose(observer, splitter)
print(report)
# Example output:
# [HIGH] Weight decay is the dominant entropy source for 12 attention
# layers (mean R_ie=4.2). Consider reducing weight_decay for attention
# layers by 4-8×, or excluding them.

splitter.plot_entropy_split(save_path='entropy_split.png')
```
Thermodynamically-aware LR schedule (drop-in)
```python
from thermoclaw import ThermoScheduler

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = ThermoScheduler(optimiser, total_steps=10000)

for step, batch in enumerate(loader):
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    scheduler.step()         # replaces cosine_scheduler.step()
    optimiser.zero_grad()
```
What am I looking at?
You've run Thermoclaw and have numbers. Here is what they mean in plain English.
Entropy ratio r_l = σ / σ*
The ratio of how much thermodynamic work layer l is doing now versus its own historical baseline.
| Value | Meaning | What to do |
|---|---|---|
| r ≈ 1.0 | Layer is at equilibrium — learning at a steady, sustainable rate | Nothing |
| r < 0.85 | Under-trained — this layer is doing less work than its baseline | Consider a higher LR or check for gradient starvation |
| r > 1.15 | Over-trained — this layer is overheating | Consider a lower LR, more weight decay, or gradient clipping |
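For intuition, the quantity behind this table can be sketched directly from σ = η‖g‖² (see "What Thermoclaw measures" below). A minimal sketch, called after `loss.backward()`; the running-mean baseline for σ* is an assumption, not necessarily how Thermoclaw computes it:

```python
import torch

def entropy_ratios(model, lr, baselines, beta=0.99):
    """Illustrative only: r = sigma / sigma*, with sigma = lr * ||grad||^2
    per layer and sigma* tracked as an exponential moving average."""
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        sigma = lr * p.grad.detach().pow(2).sum().item()
        base = baselines.get(name, sigma)
        baselines[name] = beta * base + (1 - beta) * sigma
        ratios[name] = sigma / max(baselines[name], 1e-12)
    return ratios
```

Values near 1.0 correspond to the equilibrium row above.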
Dispersion D = Var(r_l) across layers
How uniformly your model is learning across all layers simultaneously.
| Value | Meaning |
|---|---|
| D < 0.05 | Clean — all layers are coordinated, gradients are coherent |
| D ≈ 0.1–0.4 | Some inter-layer tension — often a sign of noisy labels or an aggressive LR |
| D > 0.5 | Significant fragmentation — some layers are updating aggressively whilst others stall |
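Dispersion is simply the variance of those per-layer ratios. Continuing the illustrative sketch above (not Thermoclaw internals):

```python
import statistics

def dispersion(ratios):
    """D = Var(r_l) across layers — high values mean fragmented training."""
    values = list(ratios.values())
    return statistics.pvariance(values) if len(values) > 1 else 0.0
```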
Internal/external entropy ratio R_ie = d_iS / d_eS
The single most actionable number. It tells you how much of the optimiser's energy budget is going on overhead (weight decay, momentum friction, noise) versus productive gradient descent.
| Value | Meaning | What to do |
|---|---|---|
| R_ie < 1 | Most entropy is productive — training is efficient | Nothing |
| R_ie 1–2 | Modest overhead — normal for most runs | Monitor |
| R_ie 2–5 | Warning — overhead is dominating. Check weight decay and momentum | Reduce weight_decay, lower β₁, or clip gradients |
| R_ie > 5 | Critical — the optimiser is mostly generating heat. Training may plateau or collapse | Intervene immediately |
Gradient coherence ρ = cos(g_t, g_{t-1})
How consistent the gradient direction is from step to step.
| Value | Meaning |
|---|---|
| ρ > 0.3 | Coherent — training is stable, loss is likely decreasing smoothly |
| ρ ≈ 0 | Incoherent — gradients are effectively a random walk. Check batch size, LR, and data ordering |
| ρ < -0.1 | Oscillating — gradients are reversing direction. Reduce LR or increase batch size |
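The coherence number is just the cosine between this step's and the previous step's flattened gradients. A minimal sketch (illustrative, not Thermoclaw internals):

```python
import torch

def gradient_coherence(model, prev_flat=None):
    """rho = cos(g_t, g_{t-1}) over all parameters, flattened into one vector."""
    flat = torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])
    rho = None
    if prev_flat is not None:
        rho = torch.nn.functional.cosine_similarity(flat, prev_flat, dim=0).item()
    return rho, flat   # keep `flat` around to pass in at the next step
```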
Non-obvious: high momentum (β₁ → 1) increases ρ by smoothing consecutive gradient vectors. A high ρ reading does not necessarily mean training is healthy if the equilibrium fraction is low. Use both together.
Equilibrium fraction eq_frac
The fraction of recent steps where layers were operating at or near their equilibrium entropy ratio (r ≈ 1). Think of it as a "steady-state score".
| Value | Meaning |
|---|---|
| eq_frac > 0.5 | More than half of steps are at equilibrium — stable training |
| eq_frac < 0.2 | Training is rarely at equilibrium — unstable. Check LR schedule and weight decay |
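A rough sketch of the idea, assuming you track the mean per-layer ratio each step; the ±0.15 band mirrors the 0.85/1.15 thresholds above, and the exact tolerance Thermoclaw uses is an assumption here:

```python
def equilibrium_fraction(ratio_history, tol=0.15):
    """Fraction of recent steps whose mean entropy ratio stayed within
    [1 - tol, 1 + tol] — a rough 'steady-state score'."""
    if not ratio_history:
        return 0.0
    hits = sum(1 for r in ratio_history if abs(r - 1.0) <= tol)
    return hits / len(ratio_history)
```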
CollapseDetector confidence levels
| Level | Trigger | Action |
|---|---|---|
| [HIGH] | Grad/param ratio dropped >2× AND confirmed across multiple layers | Act immediately |
| [MEDIUM] | Clear signal, moderate severity | Investigate |
| [LOW] | Signal present, multiple contributing sources | Informational |
What Thermoclaw measures
| Quantity | Symbol | What it means |
|---|---|---|
| Entropy production | σ_l = η‖g‖² | How much thermodynamic work each layer is doing |
| Entropy ratio | r_l = σ/σ* | 1.0 = equilibrium. <0.85 = under-trained. >1.15 = over-trained |
| External entropy | d_eS | Entropy that reduces loss — productive learning |
| Internal entropy | d_iS | Entropy from weight decay, momentum, noise — overhead |
| Internal/external entropy ratio | R_ie = d_iS/d_eS | The diagnostic ratio. >2 = warning, >5 = critical |
| Grad/param ratio | ‖g‖/‖θ‖ | CollapseDetector signal. A falling trend signals weight-decay erosion |
| Dispersion | D = Var(r_l) | Inter-layer training uniformity |
| Gradient alignment | ρ = cos(g_t, g_{t-1}) | Step coherence. Negative = oscillation |
| Parameter distance | E = ‖θ−θ₀‖² | How far weights have moved from initialisation |
The d_iS / d_eS decomposition
Standard training observes total loss and calls it a day. But total entropy production σ conflates two fundamentally different thermodynamic flows:
- d_eS (external) — gradient-driven parameter updates that reduce loss. This is productive work.
- d_iS (internal) — entropy from weight decay, momentum friction, and stochastic noise. This is heat.
When d_iS >> d_eS, the optimiser is spending most of its entropy budget on overhead. Thermoclaw decomposes d_iS further:
- d_iS_wd — weight-decay contribution
- d_iS_momentum — momentum friction
- d_iS_noise — stochastic gradient noise
This tells you exactly which layers, at which step, are wasting compute — and why.
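As a rough illustration of the split for plain SGD with weight decay (using the same η‖·‖² convention as σ; momentum and noise terms are omitted, and the exact weighting Thermoclaw uses may differ):

```python
import torch

def entropy_split_sgd(model, lr, weight_decay):
    """Illustrative split: gradient term as productive flow (d_eS),
    weight-decay term as overhead (d_iS_wd)."""
    d_e, d_i_wd = 0.0, 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        d_e += lr * p.grad.detach().pow(2).sum().item()
        d_i_wd += lr * (weight_decay * p.detach()).pow(2).sum().item()
    r_ie = d_i_wd / max(d_e, 1e-12)
    return d_e, d_i_wd, r_ie
```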
Per-layer parameter groups
For full per-layer resolution, use make_param_groups:
```python
import torch
from thermoclaw import Observer, make_param_groups

groups = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
observer = Observer(model, optimiser)
```
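Under the hood this relies on the standard PyTorch param-group mechanism. A rough manual equivalent in vanilla PyTorch (one group per parameter tensor; not necessarily the grouping `make_param_groups` produces):

```python
import torch

# One param group per parameter tensor, so diagnostics and interventions
# can target individual layers rather than the whole model at once.
groups = [
    {'params': [p], 'lr': 3e-4, 'weight_decay': 0.01}
    for p in model.parameters() if p.requires_grad
]
optimiser = torch.optim.AdamW(groups)
```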
Confidence scoring
Recommendations are conservative. Thermoclaw only flags issues where the physics signal is unambiguous.
- [HIGH] — Single dominant source (>60% of d_iS), R_ie > 5, consistent across regions. Safe to act on.
- [MEDIUM] — Clear signal but moderate R_ie (2–5). Worth investigating.
- [LOW] — Signal present but multiple sources contribute. Informational only.
Wrong recommendations that sound authoritative destroy trust faster than no recommendations at all.
Validated
Three-tier validation on H100 80 GB (Pythia-410M, WikiText-103, bfloat16):
| Tier | Test | Result |
|---|---|---|
| T1: Analytical | σ, ρ, d_iS_wd, d_eS+d_iS=σ, E, D — 8 ground-truth checks | 8/8 PASS |
| T2A: High LR | lr=3e-2 → R_ie=1.6×10²⁰, eq=0.014, flagged HIGH | PASS |
| T2B: High WD | wd=5.0 → unhealthy, flagged MEDIUM | PASS |
| T2C: Over-damped | β₁=0.999 → ρ=0.63 (vs baseline 0.39), eq=0.15, flagged HIGH | PASS |
| T2D: Baseline | lr=3e-4 / wd=0.01 / β₁=0.9 → no collapse or WD pathology flagged | PASS |
| T3: Intervention | CollapseDetector fires step 19 HIGH (SGD wd=5.0), PPL gap +48,880 vs dead arm (3/3 seeds) | PASS |
Origin
Thermoclaw's thermodynamic framework comes from the EPTO (Entropy-Production Targeted Optimisation) research project. The key insight: neural network training is a non-equilibrium thermodynamic process, and the quantities that matter — entropy production, entropy ratios, equilibrium fraction — can be measured for any optimiser, not just EPTO.
Licence
Apache 2.0