trainscope
Post-mortem debugger for LLM training loss spikes.
When a spike hits, you usually know that it happened but not why. trainscope records per-layer gradients, weight distributions, and activation kurtosis at every step, then lets you scrub back through the event in a browser UI.
Install
pip install -e .
Dependencies: torch, pyarrow, fastapi, uvicorn, click, numpy.
Quickstart
from trainscope import TrainScope
from trainscope.core.config import TrainScopeConfig

scope = TrainScope(model, optimizer, config=TrainScopeConfig()).attach()

for step, batch in enumerate(dataloader):
    loss = forward_and_backward(batch)
    optimizer.step()
    spike = scope.step(loss.item(), batch_index=step)
    if spike:
        print(f"Spike at step {spike['step']}, z={spike['z_score']:.2f}")

scope.writer.close()
scope.detach()
Then open the UI:
trainscope ui --run ./trainscope_runs/<run-name>
What gets recorded
Per step (global)
- Train loss, global grad norm (pre- and post-clip), learning rate
- Adam second-moment (v) norm, an indicator of stale optimizer state
- Step time, batch index
Per step, per layer
- Gradient L2 norm
- Weight L2 norm
- Activation mean / std / max-abs / kurtosis — kurtosis is the earliest spike signal
- NaN/Inf ratio in gradients
- 16-bin weight histogram
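As an illustration of the activation statistics listed above, here is a minimal pure-Python sketch of mean, std, max-abs, and kurtosis over a flat list of values. It is not trainscope's implementation: the function name `activation_stats` is hypothetical, and it assumes non-excess kurtosis (a Gaussian gives about 3), which may differ from what the library reports.

```python
import math


def activation_stats(values):
    """Return mean, std, max-abs, and kurtosis for a flat list of activations."""
    n = len(values)
    mean = sum(values) / n
    # Population variance (divide by n), matching the plain moment definition.
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    max_abs = max(abs(v) for v in values)
    # Non-excess kurtosis: fourth central moment over squared variance.
    # Assumption: trainscope may instead report excess kurtosis (this minus 3).
    m4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = m4 / (var ** 2) if var > 0 else float("nan")
    return {"mean": mean, "std": std, "max_abs": max_abs, "kurtosis": kurtosis}


calm = [-1.0, 1.0] * 50             # well-behaved, symmetric activations
spiky = [-1.0, 1.0] * 50 + [40.0]   # the same values plus one large outlier

stats_calm = activation_stats(calm)
stats_spiky = activation_stats(spiky)
print(stats_calm["kurtosis"], stats_spiky["kurtosis"])
```

A single outlier barely moves the mean but raises the fourth moment sharply, which is why kurtosis can react before the loss does.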
On spike
- Full snapshot of the surrounding window (configurable before/after)
- Per-layer data for the same window
- RNG state at the spike step (for exact replay)
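Spike detection is described in this README as a z-score against a rolling-window baseline. The sketch below shows one way such a detector could work. It is an assumption-laden illustration, not the library's code: the class name `RollingZScoreDetector` and the `window` and `min_history` parameters are hypothetical, and only the 3.5 threshold and the `step` / `z_score` keys come from this document.

```python
import math
from collections import deque


class RollingZScoreDetector:
    """Flag a loss as a spike when it sits far above the recent baseline."""

    def __init__(self, threshold=3.5, window=100, min_history=10):
        self.threshold = threshold
        self.history = deque(maxlen=window)  # rolling baseline of recent losses
        self.min_history = min_history       # hypothetical warm-up length

    def update(self, step, loss):
        """Return {"step", "z_score"} for a spike, otherwise None."""
        spike = None
        if len(self.history) >= self.min_history:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = math.sqrt(var)
            if std > 0:
                z = (loss - mean) / std
                if z > self.threshold:
                    spike = {"step": step, "z_score": z}
        # Keep spikes out of the baseline so one event does not mask the next.
        if spike is None:
            self.history.append(loss)
        return spike


# Synthetic losses with a x50 spike injected at step 50.
losses = [2.0 + 0.01 * (i % 5) for i in range(60)]
losses[50] *= 50

detector = RollingZScoreDetector()
spikes = [s for i, loss in enumerate(losses) if (s := detector.update(i, loss))]
print([s["step"] for s in spikes])
```

Excluding flagged steps from the baseline is a design choice of this sketch; a detector that folds them in would inflate the standard deviation and become less sensitive right after an event.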
Overhead
Measured on CPU with a 2-layer GPT-2 (144 parameters). GPU overhead is ~3–8× lower.
| Config | CPU overhead | GPU overhead |
|---|---|---|
| Default (hist/50, act/5) | ~55% | ~4% |
| + activation_layer_filter=["attn","mlp"] | ~38% | ~2% |
| Minimal (hist/50, act/50, filter) | ~18% | ~1% |
CPU measured on 2-layer mini-GPT (144 params), Apple M2. GPU measured on the same model with CUDA. Results will differ on larger models — histogram cost scales with parameter count, activation cost scales with layer count × sequence length.
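Because these numbers depend on the model, it can be worth measuring overhead on your own setup. The following is a rough sketch of how a relative overhead figure like the ones above could be computed; the helper names `time_steps` and `overhead_percent` and the stand-in step functions are hypothetical and not part of trainscope.

```python
import time


def time_steps(step_fn, n_steps=200, warmup=20):
    """Average wall-clock seconds per call of step_fn, after a warm-up."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps


def overhead_percent(baseline_seconds, instrumented_seconds):
    """Extra time of the instrumented step, relative to the baseline step."""
    return 100.0 * (instrumented_seconds - baseline_seconds) / baseline_seconds


# Stand-ins for "one training step" without and with instrumentation.
def plain_step():
    sum(i * i for i in range(2000))


def instrumented_step():
    plain_step()
    sum(i * i for i in range(500))  # pretend recording cost


base = time_steps(plain_step)
inst = time_steps(instrumented_step)
print(f"overhead: {overhead_percent(base, inst):.1f}%")
```

On a GPU, wall-clock timing is only meaningful if the device is synchronized (for example with `torch.cuda.synchronize()`) before each clock read, since CUDA kernels launch asynchronously.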
UI
Four views, one command:
| View | What it shows |
|---|---|
| Timeline | Loss + grad norm, top-8 layers by grad variance |
| Layer Drill-down | Kurtosis / grad norm / weight norm per layer; histogram scrubber |
| Diff View | KL divergence of weight distributions between any two steps |
| Spike Inspector | Per-spike window: loss+grad timeline and layer kurtosis/grad breakdown |
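To make the Diff View's metric concrete, here is a minimal sketch of a KL divergence between two 16-bin weight histograms. It is illustrative only: the bin range, the `eps` smoothing, and the function names are assumptions, and the UI's actual computation may differ.

```python
import math


def histogram(values, bins=16, lo=-1.0, hi=1.0):
    """Count values into equal-width bins; out-of-range values go to edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-8):
    """KL(P || Q) in nats, with eps added to every bin to avoid log(0)."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


weights_a = [i / 50 - 1 for i in range(100)]   # spread over [-1, 1)
weights_b = [0.5 * w for w in weights_a]       # same shape, half the scale

hist_a = histogram(weights_a)
hist_b = histogram(weights_b)
print(kl_divergence(hist_a, hist_a), kl_divergence(hist_a, hist_b))
```

KL divergence is asymmetric, so comparing step A to step B generally gives a different value than comparing B to A.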
The UI works immediately after pip install — a built-in fallback HTML with Plotly CDN is served when the React build is absent. For the full React build:
cd frontend && npm install && npm run build
CLI
# Open UI for a completed or in-progress run
trainscope ui --run ./trainscope_runs/run_20250516_143022 [--host 127.0.0.1] [--port 7007]
# Generate replay_config.json (does NOT resume training automatically)
trainscope replay --checkpoint ./checkpoints/step_4400.pt --skip-batches 4521,4522,4523 [--resume]
To actually skip batches, use SkippingDataLoader in your training script:
import json

from trainscope.replay import SkippingDataLoader

with open("replay_config.json") as f:
    cfg = json.load(f)

loader = SkippingDataLoader(original_loader, skip_batches=cfg["skip_batches"])
for batch in loader:
    ...
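For intuition, a batch-skipping wrapper can be very small. The class below is a hypothetical stand-in named `SkippingLoader`, not the `SkippingDataLoader` shipped with trainscope. It assumes that `skip_batches` holds zero-based batch indices in iteration order, matching the `batch_index` passed to `scope.step`.

```python
class SkippingLoader:
    """Wrap any iterable of batches and drop the ones at the given indices."""

    def __init__(self, loader, skip_batches):
        self.loader = loader
        self.skip = set(skip_batches)  # set membership keeps lookups O(1)

    def __iter__(self):
        for index, batch in enumerate(self.loader):
            if index in self.skip:
                continue  # skipped batches are consumed but never yielded
            yield batch


batches = ["b0", "b1", "b2", "b3", "b4", "b5"]
kept = list(SkippingLoader(batches, skip_batches=[1, 3]))
print(kept)
```

Note that the underlying loader still produces the skipped batches, so any shuffling or RNG consumption stays aligned with the original run.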
Configuration
TrainScopeConfig(
    run_dir="./trainscope_runs",              # output root
    spike_threshold=3.5,                      # z-score threshold (rolling window baseline)
    full_resolution_window=500,               # last N steps at full resolution
    decimation_factor=10,                     # older steps: keep every Nth
    spike_window_before=50,                   # steps before spike to save (≤ full_resolution_window)
    spike_window_after=10,                    # steps after spike to save
    histogram_every_n_steps=50,               # weight histograms are expensive; sample them
    activation_metrics_every_n_steps=5,       # kurtosis sampling; always captured at spike
    activation_layer_filter=["attn", "mlp"],  # None = all leaf layers
    stop_on_spike=False,                      # raise StopTraining on detection
    trace_every_n_steps=1,                    # subsample for very large models
    rank=None,                                # DDP rank → adds _rank{N} suffix to run dir
)
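The interaction between `full_resolution_window` and `decimation_factor` can be sketched as a simple retention rule. This is one plausible reading of the comments above rather than the library's actual logic: the helper names are hypothetical, and it assumes decimation keeps steps whose index is a multiple of the factor.

```python
def keep_step(step, latest_step, full_resolution_window=500, decimation_factor=10):
    """Decide whether a recorded step survives retention."""
    age = latest_step - step
    if age < full_resolution_window:
        return True  # recent steps are kept at full resolution
    # Assumption: older steps are kept only on multiples of the factor.
    return step % decimation_factor == 0


def should_sample(step, every_n):
    """Every-N sampling, as used for histograms and activation metrics."""
    return step % every_n == 0


latest = 1000
kept = [s for s in range(latest + 1) if keep_step(s, latest)]
print(len(kept))
```

With the defaults, steps 501 to 1000 are all kept and steps 0 to 500 are thinned to every tenth step.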
Demo
python examples/gpt2_spike_demo.py
Trains a 2-layer mini-GPT, injects a ×50 loss spike at step 50, and shows trainscope detecting it. Run trainscope ui on the output directory to explore the event.
Storage layout
trainscope_runs/<run-name>/
  meta.json                        model config + trainscope config
  global.arrow                     step-level scalars (Arrow IPC)
  layers/<param-name>.arrow        per-layer metrics
  spikes/spike_step_<N>.arrow      global window around spike N
  spikes/spike_step_<N>_layers/    per-layer data for that window
  rng_states/step_<N>.pkl          RNG state for replay
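The idea behind the pickled RNG state can be shown with the standard library alone. This sketch captures and restores only Python's `random` state; what trainscope actually stores in the `.pkl` file is not specified here, and the function names are hypothetical.

```python
import pickle
import random


def capture_rng_state():
    """Serialize the current Python RNG state to bytes."""
    return pickle.dumps({"python": random.getstate()})


def restore_rng_state(blob):
    """Restore a state previously produced by capture_rng_state."""
    state = pickle.loads(blob)
    random.setstate(state["python"])


random.seed(0)
blob = capture_rng_state()
first = [random.random() for _ in range(3)]

restore_rng_state(blob)
replayed = [random.random() for _ in range(3)]
print(first == replayed)
```

A real training run would typically also need the PyTorch CPU and CUDA generator states (`torch.get_rng_state()`, `torch.cuda.get_rng_state_all()`) and NumPy's state for an exact replay. Only unpickle files you created yourself, since `pickle.loads` can execute arbitrary code.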
Estimated storage: ~10 MB/step at full resolution. Rolling 500-step window → ~5 GB max for a 1B-param model. Spike windows are small.
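The estimate above is straightforward arithmetic, which makes it easy to re-run for other window sizes. A tiny sketch, assuming decimal units (1 GB = 1000 MB) and a hypothetical helper name:

```python
def window_storage_gb(mb_per_step=10.0, window_steps=500):
    """Upper bound on full-resolution storage for the rolling window."""
    return mb_per_step * window_steps / 1000.0


default_gb = window_storage_gb()                     # 10 MB x 500 steps
smaller_gb = window_storage_gb(window_steps=200)     # a shorter window
print(default_gb, smaller_gb)
```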
Publishing
CI runs on every push to main and every PR (pytest + ruff, Python 3.11 + 3.12, Vite build).
To publish a release to PyPI:
- Set up Trusted Publishing on PyPI for this repo (environment name: pypi).
- Tag and push:
git tag v0.1.0 && git push origin v0.1.0
The publish workflow builds the React frontend, bundles it into the wheel, and uploads via OIDC — no API token needed.