Thermoclaw

Thermoclaw catches training failures before they waste GPU hours.

A lightweight diagnostic layer for any PyTorch optimiser — per-layer collapse detection, entropy decomposition, and plain-English recommendations. One line wraps AdamW, SGD, or anything else, and you get real-time alerts when your training is about to collapse, plateau, or diverge, with layer-level explanations of why.

Works alongside W&B, TensorBoard, and the HuggingFace Trainer: they show you what your loss curve is doing; Thermoclaw tells you why it looks the way it does.
It found the problem. Acting on it worked.
We trained GPT-2 small (124M parameters) from scratch on WikiText-103 with SGD + momentum and weight_decay=5.0. At step 19, CollapseDetector flagged a HIGH-confidence weight-decay collapse:
```
[HIGH] Weight decay is eroding 2 embedding layers: param norms dropped
36% during training. Reduce weight_decay.
```
We branched at that point — one arm continued unchanged, the other reduced weight decay to 0.01.
| Run | Final PPL (600 steps post-branch) |
|---|---|
| Unmodified (wd=5.0 throughout) | 50,257 — model completely dead, outputting uniform noise |
| Thermoclaw intervention (wd→0.01) | 1,377 |
| Improvement | 36× lower perplexity |
Replicated across 3 seeds (42, 137, 2024). The unmodified arm locks at ppl = vocab_size — zero information. The intervention arm learns.
No hyperparameter search. No manual inspection. One warning, one change. The warning is causal, not correlational.
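The fix itself was a one-line change to the optimiser's param groups. A minimal sketch of what acting on the warning looks like (illustrative, assuming the `detector` from the CollapseDetector example below — not the exact script we ran):

```python
# When the detector reports a HIGH-confidence weight-decay collapse,
# drop weight_decay in every param group and keep training from the same step.
if any(rec.startswith('[HIGH]') for rec in detector.get_recommendations()):
    for pg in optimiser.param_groups:
        pg['weight_decay'] = 0.01   # was 5.0 in the unmodified arm
```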
Install
```bash
pip install thermoclaw        # core (PyTorch + NumPy only)
pip install thermoclaw[viz]   # + Matplotlib dashboards
pip install thermoclaw[hf]    # + HuggingFace Trainer callback
pip install thermoclaw[all]   # everything
```
Quick Start
Observe any optimiser
```python
import torch
from thermoclaw import Observer, diagnose

model = YourModel()
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
observer = Observer(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    optimiser.zero_grad()

report = diagnose(observer)
print(report)
observer.plot_dashboard(save_path='dashboard.png')
```
HuggingFace Trainer (one line)
```python
from transformers import Trainer
from thermoclaw.integrations.huggingface import ThermoclawCallback

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[ThermoclawCallback()],   # ← that's it
)
trainer.train()

# Dashboard PNG, CSV, and diagnostic report saved to output_dir automatically
# Thermodynamic metrics logged to W&B / TensorBoard automatically
```
See BLOG_HF.md for a full walkthrough.
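Where the thermodynamic metrics end up follows the Trainer's normal reporting setup. A minimal sketch using standard `transformers` arguments (nothing here is Thermoclaw-specific):

```python
from transformers import TrainingArguments

# Thermoclaw writes its dashboard/CSV/report into output_dir and logs its
# metrics through whichever backends the Trainer already reports to.
training_args = TrainingArguments(
    output_dir='out',
    report_to=['wandb'],   # or ['tensorboard']
    logging_steps=10,
)
```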
Catch weight-decay collapse in real time
```python
from thermoclaw import CollapseDetector, make_param_groups

# Per-layer groups give full resolution (recommended)
groups = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
detector = CollapseDetector(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    detector.step()          # call before zero_grad
    optimiser.zero_grad()

    if detector.is_collapsing:
        for pg in optimiser.param_groups:
            pg['weight_decay'] *= 0.1

recs = detector.get_recommendations()
# → ["[HIGH] Weight decay collapse in 5 mlp layers: grad/param ratio
#     dropped 4.2× from early to late training. Reduce weight_decay."]
```
`is_collapsing` fires as soon as any HIGH or MEDIUM signal is confirmed. See Known Issues for AdamW behaviour at typical weight-decay values.
Decompose entropy into productive vs overhead
```python
from thermoclaw import Observer, EntropySplit, diagnose

observer = Observer(model, optimiser)
splitter = EntropySplit(model, optimiser, observer)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    splitter.step()          # decomposes entropy each step
    optimiser.zero_grad()

report = diagnose(observer, splitter)
print(report)
# Example output:
# [HIGH] Weight decay is the dominant entropy source for 12 attention
# layers (mean R_ie=4.2). Consider reducing weight_decay for attention
# layers by 4-8×, or excluding them.

splitter.plot_entropy_split(save_path='entropy_split.png')
```
Thermodynamically-aware LR schedule (drop-in)
```python
from thermoclaw import ThermoScheduler

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = ThermoScheduler(optimiser, total_steps=10000)

for step, batch in enumerate(loader):
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    scheduler.step()         # replaces cosine_scheduler.step()
    optimiser.zero_grad()
```
What am I looking at?
You've run Thermoclaw and have numbers. Here is what they mean in plain English.
Entropy ratio r_l = σ / σ*
The ratio of how much thermodynamic work layer l is doing now versus its own historical baseline.
| Value | Meaning | What to do |
|---|---|---|
| r ≈ 1.0 | Layer is at equilibrium — learning at a steady, sustainable rate | Nothing |
| r < 0.85 | Under-trained — this layer is doing less work than its baseline | Consider a higher LR or check for gradient starvation |
| r > 1.15 | Over-trained — this layer is overheating | Consider a lower LR, more weight decay, or gradient clipping |
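For intuition, the quantity behind this table can be sketched directly from σ = η‖g‖² (see "What Thermoclaw measures" below). A minimal sketch, called after `loss.backward()`; the running-mean baseline for σ* is an assumption, not necessarily how Thermoclaw computes it:

```python
import torch

def entropy_ratios(model, lr, baselines, beta=0.99):
    """Illustrative only: r = sigma / sigma*, with sigma = lr * ||grad||^2
    per layer and sigma* tracked as an exponential moving average."""
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        sigma = lr * p.grad.detach().pow(2).sum().item()
        base = baselines.get(name, sigma)
        baselines[name] = beta * base + (1 - beta) * sigma
        ratios[name] = sigma / max(baselines[name], 1e-12)
    return ratios
```

Values near 1.0 correspond to the equilibrium row above.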
Dispersion D = Var(r_l) across layers
How uniformly your model is learning across all layers simultaneously.
| Value | Meaning |
|---|---|
| D < 0.05 | Clean — all layers are coordinated, gradients are coherent |
| D ≈ 0.1–0.4 | Some inter-layer tension — often a sign of noisy labels or an aggressive LR |
| D > 0.5 | Significant fragmentation — some layers are updating aggressively whilst others stall |
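Dispersion is simply the variance of those per-layer ratios. Continuing the illustrative sketch above (not Thermoclaw internals):

```python
import statistics

def dispersion(ratios):
    """D = Var(r_l) across layers — high values mean fragmented training."""
    values = list(ratios.values())
    return statistics.pvariance(values) if len(values) > 1 else 0.0
```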
Internal/external entropy ratio R_ie = d_iS / d_eS
The single most actionable number. It tells you how much of the optimiser's energy budget is going on overhead (weight decay, momentum friction, noise) versus productive gradient descent.
| Value | Meaning | What to do |
|---|---|---|
| R_ie < 1 | Most entropy is productive — training is efficient | Nothing |
| R_ie 1–2 | Modest overhead — normal for most runs | Monitor |
| R_ie 2–5 | Warning — overhead is dominating. Check weight decay and momentum | Reduce weight_decay, lower β₁, or clip gradients |
| R_ie > 5 | Critical — the optimiser is mostly generating heat. Training may plateau or collapse | Intervene immediately |
Gradient coherence ρ = cos(g_t, g_{t-1})
How consistent the gradient direction is from step to step.
| Value | Meaning |
|---|---|
| ρ > 0.3 | Coherent — training is stable, loss is likely decreasing smoothly |
| ρ ≈ 0 | Incoherent — gradients are effectively a random walk. Check batch size, LR, and data ordering |
| ρ < -0.1 | Oscillating — gradients are reversing direction. Reduce LR or increase batch size |
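The coherence number is just the cosine between this step's and the previous step's flattened gradients. A minimal sketch (illustrative, not Thermoclaw internals):

```python
import torch

def gradient_coherence(model, prev_flat=None):
    """rho = cos(g_t, g_{t-1}) over all parameters, flattened into one vector."""
    flat = torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])
    rho = None
    if prev_flat is not None:
        rho = torch.nn.functional.cosine_similarity(flat, prev_flat, dim=0).item()
    return rho, flat   # keep `flat` around to pass in at the next step
```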
Non-obvious: high momentum (β₁ → 1) increases ρ by smoothing consecutive gradient vectors. A high ρ reading does not necessarily mean training is healthy if the equilibrium fraction is low. Use both together.
Equilibrium fraction eq_frac
The fraction of recent steps where layers were operating at or near their equilibrium entropy ratio (r ≈ 1). Think of it as a "steady-state score".
| Value | Meaning |
|---|---|
| eq_frac > 0.5 | More than half of steps are at equilibrium — stable training |
| eq_frac < 0.2 | Training is rarely at equilibrium — unstable. Check LR schedule and weight decay |
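A rough sketch of the idea, assuming you track the mean per-layer ratio each step; the ±0.15 band mirrors the 0.85/1.15 thresholds above, and the exact tolerance Thermoclaw uses is an assumption here:

```python
def equilibrium_fraction(ratio_history, tol=0.15):
    """Fraction of recent steps whose mean entropy ratio stayed within
    [1 - tol, 1 + tol] — a rough 'steady-state score'."""
    if not ratio_history:
        return 0.0
    hits = sum(1 for r in ratio_history if abs(r - 1.0) <= tol)
    return hits / len(ratio_history)
```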
CollapseDetector confidence levels
| Level | Trigger | Action |
|---|---|---|
| [HIGH] | Grad/param ratio dropped >2× AND confirmed across multiple layers | Act immediately |
| [MEDIUM] | Clear signal, moderate severity | Investigate |
| [LOW] | Signal present, multiple contributing sources | Informational |
What Thermoclaw measures
| Quantity | Symbol | What it means |
|---|---|---|
| Entropy production | σ_l = η‖g‖² | How much thermodynamic work each layer is doing |
| Entropy ratio | r_l = σ/σ* | 1.0 = equilibrium. <0.85 = under-trained. >1.15 = over-trained |
| External entropy | d_eS | Entropy that reduces loss — productive learning |
| Internal entropy | d_iS | Entropy from weight decay, momentum, noise — overhead |
| Internal/external entropy ratio | R_ie = d_iS/d_eS | The diagnostic ratio. >2 = warning, >5 = critical |
| Grad/param ratio | ‖g‖/‖θ‖ | CollapseDetector signal. A falling trend signals weight-decay erosion |
| Dispersion | D = Var(r_l) | Inter-layer training uniformity |
| Gradient alignment | ρ = cos(g_t, g_{t-1}) | Step coherence. Negative = oscillation |
| Parameter distance | E = ‖θ−θ₀‖² | How far weights have moved from initialisation |
The d_iS / d_eS decomposition
Standard training observes total loss and calls it a day. But total entropy production σ conflates two fundamentally different thermodynamic flows:
- d_eS (external) — gradient-driven parameter updates that reduce loss. This is productive work.
- d_iS (internal) — entropy from weight decay, momentum friction, and stochastic noise. This is heat.
When d_iS >> d_eS, the optimiser is spending most of its entropy budget on overhead. Thermoclaw decomposes d_iS further:
- d_iS_wd — weight-decay contribution
- d_iS_momentum — momentum friction
- d_iS_noise — stochastic gradient noise
This tells you exactly which layers, at which step, are wasting compute — and why.
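As a rough illustration of the split for plain SGD with weight decay (using the same η‖·‖² convention as σ; momentum and noise terms are omitted, and the exact weighting Thermoclaw uses may differ):

```python
import torch

def entropy_split_sgd(model, lr, weight_decay):
    """Illustrative split: gradient term as productive flow (d_eS),
    weight-decay term as overhead (d_iS_wd)."""
    d_e, d_i_wd = 0.0, 0.0
    for p in model.parameters():
        if p.grad is None:
            continue
        d_e += lr * p.grad.detach().pow(2).sum().item()
        d_i_wd += lr * (weight_decay * p.detach()).pow(2).sum().item()
    r_ie = d_i_wd / max(d_e, 1e-12)
    return d_e, d_i_wd, r_ie
```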
Per-layer parameter groups
For full per-layer resolution, use make_param_groups:
```python
import torch
from thermoclaw import Observer, make_param_groups

groups = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
observer = Observer(model, optimiser)
```
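Under the hood this relies on the standard PyTorch param-group mechanism. A rough manual equivalent in vanilla PyTorch (one group per parameter tensor; not necessarily the grouping `make_param_groups` produces):

```python
import torch

# One param group per parameter tensor, so diagnostics and interventions
# can target individual layers rather than the whole model at once.
groups = [
    {'params': [p], 'lr': 3e-4, 'weight_decay': 0.01}
    for p in model.parameters() if p.requires_grad
]
optimiser = torch.optim.AdamW(groups)
```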
Confidence scoring
Recommendations are conservative. Thermoclaw only flags issues where the physics signal is unambiguous.
- [HIGH] — Single dominant source (>60% of d_iS), R_ie > 5, consistent across regions. Safe to act on.
- [MEDIUM] — Clear signal but moderate R_ie (2–5). Worth investigating.
- [LOW] — Signal present but multiple sources contribute. Informational only.
Wrong recommendations that sound authoritative destroy trust faster than no recommendations at all.
Validated
Three-tier validation on H100 80 GB (Pythia-410M, WikiText-103, bfloat16):
| Tier | Test | Result |
|---|---|---|
| T1: Analytical | σ, ρ, d_iS_wd, d_eS+d_iS=σ, E, D — 8 ground-truth checks | 8/8 PASS |
| T2A: High LR | lr=3e-2 → R_ie=1.6×10²⁰, eq=0.014, flagged HIGH | PASS |
| T2B: High WD | wd=5.0 → unhealthy, flagged MEDIUM | PASS |
| T2C: Over-damped | β₁=0.999 → ρ=0.63 (vs baseline 0.39), eq=0.15, flagged HIGH | PASS |
| T2D: Baseline | lr=3e-4 / wd=0.01 / β₁=0.9 → no collapse or WD pathology flagged | PASS |
| T3: Intervention | CollapseDetector fires step 19 HIGH (SGD wd=5.0), PPL gap +48,880 vs dead arm (3/3 seeds) | PASS |
Origin
Thermoclaw's thermodynamic framework comes from the EPTO (Entropy-Production Targeted Optimisation) research project. The key insight: neural network training is a non-equilibrium thermodynamic process, and the quantities that matter — entropy production, entropy ratios, equilibrium fraction — can be measured for any optimiser, not just EPTO.
Licence
Apache 2.0