
Thermoclaw

Thermoclaw catches training failures before they waste GPU hours.

A lightweight diagnostic layer for any PyTorch optimiser. One line wraps AdamW, SGD, or anything else — and you get real-time alerts when your training is about to collapse, plateau, or diverge, with layer-level explanations of why.

Works alongside W&B, TensorBoard, and the HuggingFace Trainer: those tools show what your loss curve is doing; Thermoclaw tells you why it looks the way it does.


It found the problem. Acting on it worked.

We trained GPT-2 small (124M parameters) from scratch on WikiText-103 with SGD + momentum and weight_decay=5.0. At step 19, CollapseDetector flagged a HIGH-confidence weight-decay collapse:

[HIGH] Weight decay is eroding 2 embedding layers: param norms dropped
36% during training. Reduce weight_decay.

We branched at that point — one arm continued unchanged, the other reduced weight decay to 0.01.

Run                                    Final PPL (600 steps post-branch)
Unmodified (wd=5.0 throughout)         50,257 — model completely dead, outputting uniform noise
Thermoclaw intervention (wd → 0.01)    1,377
Improvement                            36× lower perplexity

Replicated across 3 seeds (42, 137, 2024). The unmodified arm locks at ppl = vocab_size — zero information. The intervention arm learns.

No hyperparameter search. No manual inspection. One warning, one change. The warning is causal, not correlational.


Install

pip install thermoclaw                    # core (PyTorch + NumPy only)
pip install thermoclaw[viz]               # + Matplotlib dashboards
pip install thermoclaw[hf]                # + HuggingFace Trainer callback
pip install thermoclaw[all]               # everything

Quick Start

Observe any optimiser

import torch
from thermoclaw import Observer, diagnose

model     = YourModel()
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
observer  = Observer(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    optimiser.zero_grad()

report = diagnose(observer)
print(report)
observer.plot_dashboard(save_path='dashboard.png')

HuggingFace Trainer (one line)

from transformers import Trainer
from thermoclaw.integrations.huggingface import ThermoclawCallback

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[ThermoclawCallback()],      # ← that's it
)
trainer.train()
# Dashboard PNG, CSV, and diagnostic report saved to output_dir automatically
# Thermodynamic metrics logged to W&B / TensorBoard automatically

See BLOG_HF.md for a full walkthrough.

Catch weight-decay collapse in real time

import torch
from thermoclaw import CollapseDetector, make_param_groups

# Per-layer groups give full resolution (recommended)
groups    = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
detector  = CollapseDetector(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    detector.step()             # call before zero_grad
    optimiser.zero_grad()

    if detector.is_collapsing:
        for pg in optimiser.param_groups:
            pg['weight_decay'] *= 0.1

recs = detector.get_recommendations()
# → ["[HIGH] Weight decay collapse in 5 mlp layers: grad/param ratio
#     dropped 4.2× from early to late training. Reduce weight_decay."]

is_collapsing fires as soon as any HIGH or MEDIUM signal is confirmed. See Known Issues for AdamW behaviour at typical weight-decay values.

Decompose entropy into productive vs overhead

import torch
from thermoclaw import Observer, EntropySplit, diagnose

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
observer  = Observer(model, optimiser)
splitter  = EntropySplit(model, optimiser, observer)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    splitter.step()             # decomposes entropy each step
    optimiser.zero_grad()

report = diagnose(observer, splitter)
print(report)
# Example output:
#   [HIGH] Weight decay is the dominant entropy source for 12 attention
#   layers (mean R_ie=4.2). Consider reducing weight_decay for attention
#   layers by 4-8×, or excluding them.

splitter.plot_entropy_split(save_path='entropy_split.png')

Thermodynamically-aware LR schedule (drop-in)

import torch
from thermoclaw import ThermoScheduler

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = ThermoScheduler(optimiser, total_steps=10000)

for step, batch in enumerate(loader):
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    scheduler.step()            # replaces cosine_scheduler.step()
    optimiser.zero_grad()

What am I looking at?

You've run Thermoclaw and have numbers. Here is what they mean in plain English.

Entropy ratio r_l = σ / σ*

The ratio of how much thermodynamic work layer l is doing now versus its own historical baseline.

Value      Meaning                                                            What to do
r ≈ 1.0    Layer is at equilibrium — learning at a steady, sustainable rate   Nothing
r < 0.85   Under-trained — this layer is doing less work than its baseline    Consider a higher LR or check for gradient starvation
r > 1.15   Over-trained — this layer is overheating                           Consider a lower LR, more weight decay, or gradient clipping
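
For intuition, a minimal first-principles sketch of this quantity follows. This is not Thermoclaw's internal implementation: the EMA baseline and the helper name entropy_ratio are our own illustration, using σ_l = η‖g_l‖² from the measurement table below.

import torch

sigma_star = {}  # per-layer running baseline for sigma*

def entropy_ratio(name, param, lr, decay=0.99):
    """r_l = sigma / sigma*, where sigma_l = lr * ||g_l||^2 and sigma*
    is an exponential moving average of the layer's own history."""
    if param.grad is None:
        return None
    sigma = lr * param.grad.pow(2).sum().item()
    star = sigma_star.get(name, sigma)
    sigma_star[name] = decay * star + (1 - decay) * sigma
    return sigma / max(star, 1e-12)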

Dispersion D = Var(r_l) across layers

How uniformly your model is learning across all layers simultaneously.

Value         Meaning
D < 0.05      Clean — all layers are coordinated, gradients are coherent
D ≈ 0.1–0.4   Some inter-layer tension — often a sign of noisy labels or an aggressive LR
D > 0.5       Significant fragmentation — some layers are updating aggressively whilst others stall
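
Continuing the sketch above (again illustrative, not the library's code), dispersion is just the variance of the per-layer ratios at a given step:

import statistics

def dispersion(ratios):
    """D = Var(r_l) across layers at one step."""
    return statistics.pvariance(ratios) if len(ratios) > 1 else 0.0

# After loss.backward(): one ratio per layer, then the spread across them.
ratios = [r for name, p in model.named_parameters()
          if (r := entropy_ratio(name, p, lr=3e-4)) is not None]
D = dispersion(ratios)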

Internal/external entropy ratio R_ie = d_iS / d_eS

The single most actionable number. It tells you how much of the optimiser's energy budget is going to overhead (weight decay, momentum friction, noise) versus productive gradient descent.

Value      Meaning                                                                                What to do
R_ie < 1   Most entropy is productive — training is efficient                                     Nothing
R_ie 1–2   Modest overhead — normal for most runs                                                 Monitor
R_ie 2–5   Warning — overhead is dominating; check weight decay and momentum                      Reduce weight_decay, lower β₁, or clip gradients
R_ie > 5   Critical — the optimiser is mostly generating heat; training may plateau or collapse   Intervene immediately

Gradient coherence ρ = cos(g_t, g_{t-1})

How consistent the gradient direction is from step to step.

Value      Meaning
ρ > 0.3    Coherent — training is stable, loss is likely decreasing smoothly
ρ ≈ 0      Incoherent — gradients are effectively a random walk; check batch size, LR, and data ordering
ρ < −0.1   Oscillating — gradients are reversing direction; reduce LR or increase batch size

Non-obvious: high momentum (β₁ → 1) increases ρ by smoothing consecutive gradient vectors. A high ρ reading does not necessarily mean training is healthy if the equilibrium fraction is low. Use both together.
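
If you want to track this yourself, a sketch follows; call it after loss.backward() and before zero_grad(). It is an illustration, not Thermoclaw's code:

import torch

_prev_grad = None  # flattened gradient from the previous step

def gradient_coherence(model):
    """rho = cos(g_t, g_{t-1}) over the concatenated gradient vector."""
    global _prev_grad
    g = torch.cat([p.grad.flatten() for p in model.parameters()
                   if p.grad is not None])
    rho = None
    if _prev_grad is not None and _prev_grad.numel() == g.numel():
        rho = torch.nn.functional.cosine_similarity(g, _prev_grad, dim=0).item()
    _prev_grad = g.detach().clone()
    return rho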

Equilibrium fraction eq_frac

The fraction of recent steps where layers were operating at or near their equilibrium entropy ratio (r ≈ 1). Think of it as a "steady-state score".

Value            Meaning
eq_frac > 0.5    More than half of recent steps are at equilibrium — stable training
eq_frac < 0.2    Training is rarely at equilibrium — unstable; check LR schedule and weight decay
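
A sketch of the bookkeeping, with the tolerance chosen to match the r < 0.85 / r > 1.15 bands above; the window size and exact tolerance Thermoclaw uses are not documented here, so treat both as assumptions:

def equilibrium_fraction(ratio_history, tol=0.15, window=100):
    """Fraction of the last `window` steps with |r - 1| <= tol."""
    recent = ratio_history[-window:]
    if not recent:
        return 0.0
    return sum(abs(r - 1.0) <= tol for r in recent) / len(recent)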

CollapseDetector confidence levels

Level      Trigger                                                              Action
[HIGH]     Grad/param ratio dropped >2× AND confirmed across multiple layers    Act immediately
[MEDIUM]   Clear signal, moderate severity                                      Investigate
[LOW]      Signal present, multiple contributing sources                        Informational

What Thermoclaw measures

Quantity              Symbol                   What it means
Entropy production    σ_l = η‖g‖²              How much thermodynamic work each layer is doing
Entropy ratio         r_l = σ/σ*               1.0 = equilibrium; <0.85 = under-trained; >1.15 = over-trained
External entropy      d_eS                     Entropy that reduces loss — productive learning
Internal entropy      d_iS                     Entropy from weight decay, momentum, noise — overhead
Internal/external     R_ie = d_iS/d_eS         The diagnostic ratio; >2 = warning, >5 = critical
Grad/param ratio      ‖g‖/‖θ‖                  CollapseDetector signal; a falling trend signals weight-decay erosion
Dispersion            D = Var(r_l)             Inter-layer training uniformity
Gradient alignment    ρ = cos(g_t, g_{t−1})    Step coherence; negative = oscillation
Parameter distance    E = ‖θ−θ₀‖²              How far weights have moved from initialisation

The d_iS / d_eS decomposition

Standard training observes total loss and calls it a day. But total entropy production σ conflates two fundamentally different thermodynamic flows:

  • d_eS (external) — gradient-driven parameter updates that reduce loss. This is productive work.
  • d_iS (internal) — entropy from weight decay, momentum friction, and stochastic noise. This is heat.

When d_iS >> d_eS, the optimiser is spending most of its entropy budget on overhead. Thermoclaw decomposes d_iS further:

  • d_iS_wd — weight-decay contribution
  • d_iS_momentum — momentum friction
  • d_iS_noise — stochastic gradient noise

This tells you exactly which layers, at which step, are wasting compute — and why.
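
To make the split concrete, here is a rough per-step decomposition for plain SGD with weight decay, where the update is -η(g + λθ): the gradient term drives d_eS and the decay term drives d_iS_wd. This illustrates the idea only; it is not Thermoclaw's estimator, which also attributes the momentum and noise terms:

import torch

def entropy_split_sgd(model, lr, weight_decay):
    """update = -lr * (g + wd*theta), so attribute:
      d_eS    ~ lr * ||g||^2           (productive, gradient-driven)
      d_iS_wd ~ lr * ||wd * theta||^2  (weight-decay overhead)."""
    d_e = d_i_wd = 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        d_e    += lr * p.grad.pow(2).sum().item()
        d_i_wd += lr * (weight_decay * p.detach()).pow(2).sum().item()
    return d_e, d_i_wd, d_i_wd / max(d_e, 1e-12)  # third value: wd-only R_ie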


Per-layer parameter groups

For full per-layer resolution, use make_param_groups:

from thermoclaw import make_param_groups

groups    = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
observer  = Observer(model, optimiser)
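
To see what per-layer grouping looks like structurally, a hand-rolled equivalent is sketched below. It is illustrative only; make_param_groups' actual grouping and group keys may differ:

import torch

# One group per parameter tensor, tagged with its name. PyTorch
# optimisers preserve extra param-group keys such as 'name'.
groups = [
    {'params': [p], 'lr': 3e-4, 'weight_decay': 0.01, 'name': name}
    for name, p in model.named_parameters() if p.requires_grad
]
optimiser = torch.optim.AdamW(groups)

# Named groups make surgical interventions easy, e.g. reducing weight
# decay only for groups whose name suggests an attention layer:
for pg in optimiser.param_groups:
    if 'attn' in pg.get('name', ''):
        pg['weight_decay'] *= 0.25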

Confidence scoring

Recommendations are conservative. Thermoclaw only flags issues where the physics signal is unambiguous.

  • [HIGH] — Single dominant source (>60% of d_iS), R_ie > 5, consistent across regions. Safe to act on.
  • [MEDIUM] — Clear signal but moderate R_ie (2–5). Worth investigating.
  • [LOW] — Signal present but multiple sources contribute. Informational only.

Wrong recommendations that sound authoritative destroy trust faster than no recommendations at all.


Validated

Three-tier validation on H100 80 GB (Pythia-410M, WikiText-103, bfloat16):

Tier               Test                                                                                           Result
T1: Analytical     σ, ρ, d_iS_wd, d_eS+d_iS=σ, E, D — 8 ground-truth checks                                       8/8 PASS
T2A: High LR       lr=3e-2 → R_ie=1.6×10²⁰, eq=0.014, flagged HIGH                                                PASS
T2B: High WD       wd=5.0 → unhealthy, flagged MEDIUM                                                             PASS
T2C: Over-damped   β₁=0.999 → ρ=0.63 (vs baseline 0.39), eq=0.15, flagged HIGH                                    PASS
T2D: Baseline      lr=3e-4 / wd=0.01 / β₁=0.9 → no collapse or WD pathology flagged                               PASS
T3: Intervention   CollapseDetector fires at step 19 with HIGH (SGD wd=5.0); PPL gap +48,880 vs dead arm (3/3 seeds)   PASS

Origin

Thermoclaw's thermodynamic framework comes from the EPTO (Entropy-Production Targeted Optimisation) research project. The key insight: neural network training is a non-equilibrium thermodynamic process, and the quantities that matter — entropy production, entropy ratios, equilibrium fraction — can be measured for any optimiser, not just EPTO.


Licence

Apache 2.0
