trainscope

Post-mortem debugger for LLM training loss spikes.

When a spike hits, you usually know that it happened but not why. trainscope records per-layer gradients, weight distributions, and activation kurtosis at every step, then lets you scrub back through the event in a browser UI.

Install

pip install trainscope    # released package from PyPI
pip install -e .          # editable install from a source checkout

Dependencies: torch, pyarrow, fastapi, uvicorn, click, numpy.

Quickstart

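from trainscope import TrainScope
from trainscope.core.config import TrainScopeConfig

# model, optimizer, dataloader, and forward_and_backward come from your own training code.
scope = TrainScope(model, optimizer, config=TrainScopeConfig()).attach()

for step, batch in enumerate(dataloader):
    loss = forward_and_backward(batch)  # forward pass + loss.backward(); returns the loss tensor
    optimizer.step()

    # Record this step; the return value is truthy only when a spike was detected.
    spike = scope.step(loss.item(), batch_index=step)
    if spike:
        print(f"Spike at step {spike['step']}, z={spike['z_score']:.2f}")

# Close the writer, then detach from the model.
scope.writer.close()
scope.detach()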

Then open the UI:

trainscope ui --run ./trainscope_runs/<run-name>

What gets recorded

Per step (global)

  • Train loss, global grad norm (pre- and post-clip), learning rate
  • Adam second-moment (v) norm, an indicator of stale optimizer state
  • Step time, batch index
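
As an illustration of the pre- and post-clip global grad norm, here is a minimal pure-Python sketch. It is not trainscope's code: the function names are hypothetical, gradients are plain nested lists rather than tensors, and the clipping rule is the simplified form of global-norm clipping (one shared scale factor).

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every layer as one flat vector."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor so the global norm is at most max_norm."""
    pre = global_grad_norm(grads)
    scale = min(1.0, max_norm / pre) if pre > 0 else 1.0
    clipped = [[g * scale for g in layer] for layer in grads]
    return clipped, pre, global_grad_norm(clipped)

# Two hypothetical "layers" whose combined norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, pre_clip, post_clip = clip_by_global_norm(grads, max_norm=1.0)
print(pre_clip, post_clip)
```

A large gap between the two values means clipping was active on that step, which is why logging both is useful around a spike.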

Per step, per layer

  • Gradient L2 norm
  • Weight L2 norm
  • Activation mean / std / max-abs / kurtosis — kurtosis is the earliest spike signal
  • NaN/Inf ratio in gradients
  • 16-bin weight histogram
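
The activation statistics listed above can be sketched in a few lines of pure Python. This is an illustrative example, not trainscope's implementation: `activation_stats` is a hypothetical name, and the sketch assumes non-excess kurtosis (the fourth standardized moment, about 3.0 for a Gaussian). Which convention trainscope uses is not stated here.

```python
import math

def activation_stats(values):
    """Return mean, std, max-abs, and non-excess kurtosis of a flat list of floats."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    std = math.sqrt(var)
    max_abs = max(abs(v) for v in values)
    # Fourth central moment divided by variance squared; grows with heavy tails.
    m4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = m4 / (var ** 2) if var > 0 else float("nan")
    return {"mean": mean, "std": std, "max_abs": max_abs, "kurtosis": kurtosis}

calm = [-1.0, 1.0] * 50          # well-behaved activations
outlier = calm[:-1] + [25.0]     # same values with one large activation

calm_stats = activation_stats(calm)
outlier_stats = activation_stats(outlier)
print(calm_stats["kurtosis"], outlier_stats["kurtosis"])
```

A single outlier activation moves kurtosis far more than it moves the mean or std, which is the property that makes it a sensitive early signal.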

On spike

  • Full snapshot of the surrounding window (configurable before/after)
  • Per-layer data for the same window
  • RNG state at the spike step (for exact replay)
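
To show why a saved RNG state allows exact replay, here is a small sketch using only Python's built-in `random` module. The helper names are hypothetical, and a real training run would also need the states of the other generators it uses (for example the framework's CPU and GPU generators). Which of those trainscope stores is not specified here.

```python
import pickle
import random

def capture_rng_state():
    """Serialize the state of Python's built-in RNG to bytes."""
    return pickle.dumps(random.getstate())

def restore_rng_state(blob):
    """Restore a state produced by capture_rng_state()."""
    random.setstate(pickle.loads(blob))

random.seed(0)
blob = capture_rng_state()                    # state "at the spike step"
first = [random.random() for _ in range(3)]

restore_rng_state(blob)                       # rewind to the saved state
replayed = [random.random() for _ in range(3)]
print(first == replayed)
```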

Overhead

Measured on a 2-layer mini-GPT (144 parameters). In the table below, GPU overhead is roughly 14–19× lower than CPU overhead.

Config                                      CPU overhead    GPU overhead
Default (hist/50, act/5)                    ~55%            ~4%
+ activation_layer_filter=["attn","mlp"]    ~38%            ~2%
Minimal (hist/50, act/50, filter)           ~18%            ~1%

CPU measured on 2-layer mini-GPT (144 params), Apple M2. GPU measured on the same model with CUDA. Results will differ on larger models — histogram cost scales with parameter count, activation cost scales with layer count × sequence length.
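
The "hist/50, act/5" shorthand refers to the sampling intervals histogram_every_n_steps and activation_metrics_every_n_steps. The sketch below shows one way such a schedule could decide what to collect on a given step. It is a guess at the logic, not trainscope's code: `metrics_due` and the metric labels are hypothetical, and it assumes gradient and weight norms are collected on every step.

```python
def metrics_due(step, histogram_every=50, activation_every=5, is_spike=False):
    """Return the (hypothetical) list of per-layer metrics to collect at this step."""
    due = ["grad_norm", "weight_norm"]              # assumed cheap enough for every step
    if is_spike or step % activation_every == 0:
        due.append("activation_stats")              # sampled, but always taken at a spike
    if step % histogram_every == 0:
        due.append("weight_histogram")              # most expensive, sampled least often
    return due

# Count how often each expensive metric runs over 100 steps.
schedule = [metrics_due(s) for s in range(100)]
activation_steps = sum("activation_stats" in m for m in schedule)
histogram_steps = sum("weight_histogram" in m for m in schedule)
print(activation_steps, histogram_steps)
```

Raising either interval reduces how many steps pay the extra cost, which matches the direction of the "Minimal" row in the table.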

UI

Four views, one command:

View                What it shows
Timeline            Loss + grad norm, top-8 layers by grad variance
Layer Drill-down    Kurtosis / grad norm / weight norm per layer; histogram scrubber
Diff View           KL divergence of weight distributions between any two steps
Spike Inspector     Per-spike window: loss+grad timeline and layer kurtosis/grad breakdown
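
For readers unfamiliar with the Diff View metric, the following sketch computes a KL divergence between two 16-bin weight histograms in pure Python. It is only an illustration: the function names, the shared bin range, and the epsilon smoothing for empty bins are assumptions, not details taken from trainscope.

```python
import math

def histogram(values, lo, hi, bins=16):
    """Count values into equal-width bins over [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats between two histograms, with eps smoothing for empty bins."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl

weights_a = [i / 100 - 0.5 for i in range(100)]   # weights at an earlier step
weights_b = [w * 3 for w in weights_a]            # same layer after the weights spread out
lo, hi = -1.5, 1.5                                # both histograms must share one bin range

same = kl_divergence(histogram(weights_a, lo, hi), histogram(weights_a, lo, hi))
drift = kl_divergence(histogram(weights_a, lo, hi), histogram(weights_b, lo, hi))
print(same, drift)
```

Identical distributions give zero, and the value grows as the weight distribution at one step diverges from the other.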

The UI works immediately after pip install — a built-in fallback HTML with Plotly CDN is served when the React build is absent. For the full React build:

cd frontend && npm install && npm run build

CLI

# Open UI for a completed or in-progress run
trainscope ui --run ./trainscope_runs/run_20250516_143022 [--host 127.0.0.1] [--port 7007]

# Generate replay_config.json (does NOT resume training automatically)
trainscope replay --checkpoint ./checkpoints/step_4400.pt --skip-batches 4521,4522,4523 [--resume]

To actually skip batches, use SkippingDataLoader in your training script:

from trainscope.replay import SkippingDataLoader
import json

with open("replay_config.json") as f:
    cfg = json.load(f)

loader = SkippingDataLoader(original_loader, skip_batches=cfg["skip_batches"])
for batch in loader:
    ...
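
Conceptually, skipping batches amounts to wrapping the loader and dropping selected indices. The class below is a hypothetical stand-in written for illustration, not the source of SkippingDataLoader. It assumes zero-based batch indices counted from the start of iteration, which may differ from how trainscope numbers batches.

```python
class SkippingLoaderSketch:
    """Hypothetical stand-in: iterate a loader, dropping batches by zero-based index."""

    def __init__(self, loader, skip_batches):
        self.loader = loader
        self.skip = set(skip_batches)   # set lookup keeps the per-batch check O(1)

    def __iter__(self):
        for index, batch in enumerate(self.loader):
            if index in self.skip:
                continue                # the batch is still fetched, just never yielded
            yield batch

# Any iterable works as the underlying loader in this sketch.
batches = ["b0", "b1", "b2", "b3", "b4"]
kept = list(SkippingLoaderSketch(batches, skip_batches=[1, 3]))
print(kept)
```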

Configuration

TrainScopeConfig(
    run_dir="./trainscope_runs",               # output root
    spike_threshold=3.5,                       # z-score threshold (rolling window baseline)
    full_resolution_window=500,                # last N steps at full resolution
    decimation_factor=10,                      # older steps: keep every Nth
    spike_window_before=50,                    # steps before spike to save (≤ full_resolution_window)
    spike_window_after=10,                     # steps after spike to save
    histogram_every_n_steps=50,                # weight histograms are expensive; sample them
    activation_metrics_every_n_steps=5,        # kurtosis sampling; always captured at spike
    activation_layer_filter=["attn", "mlp"],   # None = all leaf layers
    stop_on_spike=False,                       # raise StopTraining on detection
    trace_every_n_steps=1,                     # subsample for very large models
    rank=None,                                 # DDP rank → adds _rank{N} suffix to run dir
)
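
To make spike_threshold concrete, here is a minimal sketch of z-score detection against a rolling baseline. It is an assumption about the general technique rather than trainscope's detector: the class name, the 100-step window, the `min_history` warm-up, and the choice to keep detected spikes out of the baseline are all hypothetical.

```python
import math
from collections import deque

class RollingZScoreDetector:
    """Flag a loss value whose z-score against recent history exceeds a threshold."""

    def __init__(self, threshold=3.5, window=100, min_history=10):
        self.threshold = threshold
        self.history = deque(maxlen=window)   # rolling baseline of recent losses
        self.min_history = min_history

    def update(self, step, loss):
        spike = None
        if len(self.history) >= self.min_history:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = math.sqrt(var)
            if std > 0:
                z = (loss - mean) / std
                if z > self.threshold:
                    spike = {"step": step, "z_score": z}
        if spike is None:
            self.history.append(loss)         # keep spikes out of the baseline
        return spike

detector = RollingZScoreDetector()
losses = [2.0 + 0.01 * (i % 5) for i in range(50)] + [100.0]   # flat loss, then a jump
events = [e for step, loss in enumerate(losses) if (e := detector.update(step, loss))]
print(events[0]["step"])
```

A lower threshold catches smaller excursions at the cost of more false positives on noisy loss curves.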

Demo

python examples/gpt2_spike_demo.py

Trains a 2-layer mini-GPT, injects a ×50 loss spike at step 50, and shows trainscope detecting it. Run trainscope ui on the output directory to explore the event.

Storage layout

trainscope_runs/<run-name>/
    meta.json                          model config + trainscope config
    global.arrow                       step-level scalars (Arrow IPC)
    layers/<param-name>.arrow          per-layer metrics
    spikes/spike_step_<N>.arrow        global window around spike N
    spikes/spike_step_<N>_layers/      per-layer data for that window
    rng_states/step_<N>.pkl            RNG state for replay

Estimated storage for a 1B-parameter model: ~10 MB/step at full resolution, so the rolling 500-step window holds at most ~5 GB (500 × 10 MB). Spike windows are small.
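
The rolling window and decimation settings imply a retention rule along these lines. The sketch is a guess at the policy based on the config comments (full_resolution_window and decimation_factor): `keep_step` is a hypothetical helper, and keeping older steps where step % N == 0 is an assumption about how "every Nth" is chosen.

```python
def keep_step(step, latest_step, full_resolution_window=500, decimation_factor=10):
    """Return True if `step` is still stored once training has reached `latest_step`."""
    age = latest_step - step
    if age < full_resolution_window:
        return True                          # recent steps: keep everything
    return step % decimation_factor == 0     # older steps: keep every Nth

latest = 2000
kept = [s for s in range(latest + 1) if keep_step(s, latest)]
recent = [s for s in kept if latest - s < 500]
print(len(kept), len(recent))
```

Under these assumptions only the 500 most recent steps are held at full resolution, which is what bounds the storage estimate above.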

Publishing

CI runs on every push to main and every PR (pytest + ruff, Python 3.11 + 3.12, Vite build).

To publish a release to PyPI:

  1. Set up Trusted Publishing on PyPI for this repo (environment name: pypi).
  2. Tag and push: git tag v0.1.0 && git push origin v0.1.0

The publish workflow builds the React frontend, bundles it into the wheel, and uploads via OIDC — no API token needed.
