Post-mortem debugger for LLM training loss spikes

Maintainers

kaelvalen

Project description

trainscope

Post-mortem debugger for LLM training loss spikes.

When a spike hits, you usually know that it happened but not why. trainscope records per-layer gradients, weight distributions, and activation kurtosis at every step, then lets you scrub back through the event in a browser UI.

Install

pip install trainscope

From a source checkout, use an editable install instead:

pip install -e .

Dependencies: torch, pyarrow, fastapi, uvicorn, click, numpy.

Quickstart

from trainscope import TrainScope
from trainscope.core.config import TrainScopeConfig

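# model, optimizer, dataloader, and forward_and_backward are your own training objects
scope = TrainScope(model, optimizer, config=TrainScopeConfig()).attach()

for step, batch in enumerate(dataloader):
    loss = forward_and_backward(batch)  # forward pass plus loss.backward()
    optimizer.step()

    # Returns a spike record when the loss is flagged, otherwise a falsy value
    spike = scope.step(loss.item(), batch_index=step)
    if spike:
        print(f"Spike at step {spike['step']}, z={spike['z_score']:.2f}")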

scope.writer.close()
scope.detach()
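
The z value printed above comes from comparing each loss against a rolling baseline. The following is a minimal sketch of that idea, not trainscope's actual detector: the class name `RollingZScoreDetector`, the `min_history` warm-up, and the choice to keep spikes out of the baseline are all assumptions made for illustration.

```python
from collections import deque
import math


class RollingZScoreDetector:
    """Flags a loss whose z-score against a rolling baseline exceeds a threshold."""

    def __init__(self, window=100, threshold=3.5, min_history=10):
        self.history = deque(maxlen=window)  # most recent non-spike losses
        self.threshold = threshold
        self.min_history = min_history       # warm-up before any step can be flagged

    def update(self, step, loss):
        """Return a dict describing the spike, or None if the step looks normal."""
        spike = None
        if len(self.history) >= self.min_history:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            z = (loss - mean) / (math.sqrt(var) + 1e-12)
            if z > self.threshold:
                spike = {"step": step, "z_score": z}
        if spike is None:
            # Only normal steps update the baseline, so one spike does not inflate it.
            self.history.append(loss)
        return spike


detector = RollingZScoreDetector(window=50, threshold=3.5)
losses = [2.0 + 0.01 * (i % 5) for i in range(30)] + [9.0]
spikes = [s for i, x in enumerate(losses) if (s := detector.update(i, x))]
print(spikes[0]["step"])
```

Only upward deviations are flagged here, since a sudden drop in loss is rarely the failure being debugged.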

Then open the UI:

trainscope ui --run ./trainscope_runs/<run-name>

What gets recorded

Per step (global)

Train loss, global grad norm (pre- and post-clip), learning rate
Adam second-moment (v) norm, an indicator of a stale second-moment estimate
Step time, batch index
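
For readers unfamiliar with the pre- and post-clip distinction, here is a small pure-Python sketch of a global gradient norm and norm-based clipping. It illustrates the quantities only; how trainscope reads them from the optimizer is not specified here, and the helper names are hypothetical. In PyTorch the pre-clip value is what `torch.nn.utils.clip_grad_norm_` returns.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every gradient value in every parameter."""
    return math.sqrt(sum(g * g for param in grads for g in param))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so the global norm is at most max_norm."""
    pre_clip = global_grad_norm(grads)
    scale = min(1.0, max_norm / (pre_clip + eps))
    clipped = [[g * scale for g in param] for param in grads]
    return clipped, pre_clip, global_grad_norm(clipped)


grads = [[3.0, 0.0], [0.0, 4.0]]  # two toy "parameters"
clipped, pre_clip, post_clip = clip_by_global_norm(grads, max_norm=1.0)
print(f"pre-clip {pre_clip:.3f}, post-clip {post_clip:.3f}")
```

A large gap between the two values on a given step means clipping was active, which is useful context when reading a spike window.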

Per step, per layer

Gradient L2 norm
Weight L2 norm
Activation mean / std / max-abs / kurtosis (kurtosis is typically the earliest spike signal)
NaN/Inf ratio in gradients
16-bin weight histogram
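
To make the per-layer statistics concrete, the sketch below computes them for a flat list of values. It is an illustration under assumptions, not trainscope's code: it uses population moments and non-excess kurtosis (a normal distribution scores about 3), and equal-width bins between the minimum and maximum. The library may define either differently.

```python
import math


def activation_stats(values):
    """Mean, std, max-abs, and (non-excess) kurtosis of a flat list of activations."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    kurtosis = fourth / (var ** 2) if var > 0 else float("nan")
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "max_abs": max(abs(v) for v in values),
        "kurtosis": kurtosis,
    }


def histogram(values, bins=16):
    """Equal-width bin counts between the smallest and largest value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # all-equal input falls into bin 0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


calm = [-1.0, 1.0] * 50
with_outlier = calm + [40.0]  # a single extreme activation
print(activation_stats(calm)["kurtosis"], activation_stats(with_outlier)["kurtosis"])
```

One outlier barely moves the mean but raises kurtosis sharply, which is why a heavy-tail statistic can react before the loss does.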

On spike

Full snapshot of the surrounding window (configurable before/after)
Per-layer data for the same window
RNG state at the spike step (for exact replay)
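
Saving RNG state is what makes a replay reproduce the same random draws. The sketch below shows the capture-and-restore pattern with Python's built-in `random` module and `pickle`; the helper names are hypothetical, and which generators trainscope actually saves is not stated here. A training script would typically also need the PyTorch and NumPy equivalents, such as `torch.get_rng_state()` and `numpy.random.get_state()`.

```python
import pickle
import random


def capture_rng_state():
    """Serialize the current state of Python's global RNG."""
    return pickle.dumps(random.getstate())


def restore_rng_state(blob):
    """Restore a state previously returned by capture_rng_state()."""
    random.setstate(pickle.loads(blob))


random.seed(0)
saved = capture_rng_state()                      # taken "at the spike step"
first_run = [random.random() for _ in range(3)]

restore_rng_state(saved)                         # rewind before replaying
replayed = [random.random() for _ in range(3)]
print(first_run == replayed)
```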

Overhead

Measured on a 2-layer mini-GPT (144 parameters). GPU overhead is more than 10× lower than CPU overhead in every configuration below.

Config	CPU overhead	GPU overhead
Default (`hist/50`, `act/5`)	~55%	~4%
+ `activation_layer_filter=["attn","mlp"]`	~38%	~2%
Minimal (`hist/50`, `act/50`, filter)	~18%	~1%

CPU measured on 2-layer mini-GPT (144 params), Apple M2. GPU measured on the same model with CUDA. Results will differ on larger models — histogram cost scales with parameter count, activation cost scales with layer count × sequence length.
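
To reproduce numbers like these on your own model, one option is to time the training step with and without trainscope attached. The sketch below assumes overhead means extra wall-clock time per step relative to the uninstrumented step; the README does not define it, and both helper names are hypothetical.

```python
import time


def mean_step_time(step_fn, n_steps=20):
    """Average wall-clock seconds per call of step_fn."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps


def overhead_percent(baseline_s, instrumented_s):
    """Extra time per step, as a percentage of the uninstrumented step time."""
    return 100.0 * (instrumented_s - baseline_s) / baseline_s


# Example with made-up timings: 1.00 s per plain step, 1.55 s with recording on.
print(f"{overhead_percent(1.00, 1.55):.0f}%")
```

On GPU, call `torch.cuda.synchronize()` before reading the clock, otherwise asynchronous kernels make the step look faster than it is.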

UI

Four views, one command:

View	What it shows
Timeline	Loss + grad norm, top-8 layers by grad variance
Layer Drill-down	Kurtosis / grad norm / weight norm per layer; histogram scrubber
Diff View	KL divergence of weight distributions between any two steps
Spike Inspector	Per-spike window: loss+grad timeline and layer kurtosis/grad breakdown
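
The Diff View metric can be illustrated with a short sketch. This assumes the two steps' weight histograms use the same bin edges and adds a small smoothing constant for empty bins; both are assumptions, and the function is not taken from trainscope.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-8):
    """KL(P || Q) in nats between two histograms given as bin counts."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    # Smooth before normalizing so empty bins do not produce log(0) or division by zero.
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


before = [0, 5, 40, 100, 40, 5, 0, 0]
after = [0, 2, 20, 60, 70, 30, 8, 0]   # mass has shifted to the right
print(kl_divergence(before, before), kl_divergence(before, after))
```

KL divergence is not symmetric, so the order of the two steps matters when comparing values.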

The UI works immediately after pip install — a built-in fallback HTML with Plotly CDN is served when the React build is absent. For the full React build:

cd frontend && npm install && npm run build

CLI

# Open UI for a completed or in-progress run
trainscope ui --run ./trainscope_runs/run_20250516_143022 [--host 127.0.0.1] [--port 7007]

# Generate replay_config.json (does NOT resume training automatically)
trainscope replay --checkpoint ./checkpoints/step_4400.pt --skip-batches 4521,4522,4523 [--resume]

To actually skip batches, use SkippingDataLoader in your training script:

from trainscope.replay import SkippingDataLoader
import json

with open("replay_config.json") as f:
    cfg = json.load(f)

loader = SkippingDataLoader(original_loader, skip_batches=cfg["skip_batches"])
for batch in loader:
    ...
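
If it helps to see what skipping involves, here is a minimal stand-in written from scratch. `SkipBatches` is a hypothetical name and this is not the implementation of `SkippingDataLoader`; it assumes the skip list holds 0-based batch indices in iteration order, matching the `batch_index` passed to `scope.step`, and that the loader yields batches in the same order as the original run.

```python
class SkipBatches:
    """Iterate over `loader`, dropping batches whose 0-based index is in `skip_batches`."""

    def __init__(self, loader, skip_batches):
        self.loader = loader
        self.skip = set(skip_batches)

    def __iter__(self):
        for index, batch in enumerate(self.loader):
            if index in self.skip:
                continue  # this batch is never handed to the training loop
            yield batch


batches = [f"batch-{i}" for i in range(6)]
kept = list(SkipBatches(batches, skip_batches=[1, 3]))
print(kept)
```

Note that the underlying loader still produces the skipped batches, so any shuffling RNG advances exactly as it did in the original run.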

Configuration

TrainScopeConfig(
    run_dir="./trainscope_runs",              # output root
    spike_threshold=3.5,                      # z-score threshold (rolling window baseline)
    full_resolution_window=500,               # last N steps at full resolution
    decimation_factor=10,                     # older steps: keep every Nth
    spike_window_before=50,                   # steps before spike to save (≤ full_resolution_window)
    spike_window_after=10,                    # steps after spike to save
    histogram_every_n_steps=50,               # weight histograms are expensive; sample them
    activation_metrics_every_n_steps=5,       # kurtosis sampling; always captured at spike
    activation_layer_filter=["attn", "mlp"],  # None = all leaf layers
    stop_on_spike=False,                      # raise StopTraining on detection
    trace_every_n_steps=1,                    # subsample for very large models
    rank=None,                                # DDP rank → adds _rank{N} suffix to run dir
)
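
The interaction between `full_resolution_window` and `decimation_factor` can be pictured with a small sketch. It assumes that "keep every Nth" means keeping older steps whose index is a multiple of `decimation_factor`; the actual rule inside trainscope may differ, and `keep_step` is a hypothetical helper.

```python
def keep_step(step, latest_step, full_resolution_window=500, decimation_factor=10):
    """Decide whether a recorded step survives under a window-plus-decimation policy."""
    age = latest_step - step
    if age < full_resolution_window:
        return True                        # recent steps are kept at full resolution
    return step % decimation_factor == 0   # older steps: keep every Nth


latest = 2000
kept = [s for s in range(latest + 1) if keep_step(s, latest)]
print(len(kept))
```

With the values shown, 2,001 recorded steps shrink to the 500 most recent plus every tenth older one.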

Demo

python examples/gpt2_spike_demo.py

Trains a 2-layer mini-GPT, injects a ×50 loss spike at step 50, and shows trainscope detecting it. Run trainscope ui on the output directory to explore the event.

Storage layout

trainscope_runs/<run-name>/
    meta.json                          model config + trainscope config
    global.arrow                       step-level scalars (Arrow IPC)
    layers/<param-name>.arrow          per-layer metrics
    spikes/spike_step_<N>.arrow        global window around spike N
    spikes/spike_step_<N>_layers/      per-layer data for that window
    rng_states/step_<N>.pkl            RNG state for replay

Estimated storage: ~10 MB/step at full resolution. Rolling 500-step window → ~5 GB max for a 1B-param model. Spike windows are small.
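
The 5 GB figure follows directly from the per-step estimate. As a quick check of the arithmetic, using decimal units and the numbers quoted above:

```python
MB = 10**6
GB = 10**9


def full_resolution_bytes(bytes_per_step, window_steps):
    """Upper bound on disk use for the rolling full-resolution window."""
    return bytes_per_step * window_steps


total = full_resolution_bytes(10 * MB, 500)
print(total / GB)  # 10 MB/step × 500 steps
```

Plugging in your own per-step size and `full_resolution_window` gives the corresponding bound for other models.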

Publishing

CI runs on every push to main and every PR (pytest + ruff, Python 3.11 + 3.12, Vite build).

To publish a release to PyPI:

Set up Trusted Publishing on PyPI for this repo (environment name: pypi).
Tag and push: git tag v0.1.0 && git push origin v0.1.0

The publish workflow builds the React frontend, bundles it into the wheel, and uploads via OIDC — no API token needed.


Release history

0.1.0, released May 18, 2026

Download files

Source Distribution

trainscope-0.1.0.tar.gz (71.2 kB), uploaded May 18, 2026

Built Distribution

trainscope-0.1.0-py3-none-any.whl (1.5 MB, Python 3), uploaded May 18, 2026

File details

Details for the file trainscope-0.1.0.tar.gz.

File metadata

Download URL: trainscope-0.1.0.tar.gz
Upload date: May 18, 2026
Size: 71.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for trainscope-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4d6509e79cb3a3881e9001be3cbfb43e9496e5ecaadb00e0dd4e0789b6be6c5e`
MD5	`7f99b3b5c24cf0aac3a74f8b0d31818c`
BLAKE2b-256	`f14fb1ae17a8dd59941b40710ac2fca327166f1e5474175d2d6a084779bc603c`

Provenance

The following attestation bundles were made for trainscope-0.1.0.tar.gz:

Publisher: publish.yml on kaelvalen/trainscope

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: trainscope-0.1.0.tar.gz
- Subject digest: 4d6509e79cb3a3881e9001be3cbfb43e9496e5ecaadb00e0dd4e0789b6be6c5e
- Sigstore transparency entry: 1568014864
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: kaelvalen/trainscope@0b5c95e49358b01c155fb8aae2994aa0da1536be
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/kaelvalen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0b5c95e49358b01c155fb8aae2994aa0da1536be
- Trigger Event: push

File details

Details for the file trainscope-0.1.0-py3-none-any.whl.

File metadata

Download URL: trainscope-0.1.0-py3-none-any.whl
Upload date: May 18, 2026
Size: 1.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for trainscope-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`32922f44d5770e439034443ce470bca442504fb97d853a4e60e941ce37e0b327`
MD5	`187a6233f189d369436759f4cd888ec4`
BLAKE2b-256	`b7ee3f02cabfee2e434ea11276f43a3ab662bf7b2ebb14ea23d6d6054eec2970`

Provenance

The following attestation bundles were made for trainscope-0.1.0-py3-none-any.whl:

Publisher: publish.yml on kaelvalen/trainscope

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: trainscope-0.1.0-py3-none-any.whl
- Subject digest: 32922f44d5770e439034443ce470bca442504fb97d853a4e60e941ce37e0b327
- Sigstore transparency entry: 1568015131
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: kaelvalen/trainscope@0b5c95e49358b01c155fb8aae2994aa0da1536be
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/kaelvalen
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0b5c95e49358b01c155fb8aae2994aa0da1536be
- Trigger Event: push
