trainpulse
Lightweight training health monitor. Detect loss spikes, gradient explosions, NaN/Inf, and plateaus — 2 lines of code, no server, no signup.
Why trainpulse?
| Feature | W&B / Neptune | TensorBoard | trainpulse |
|---|---|---|---|
| Setup | Account + API key | TF dependency | pip install trainpulse |
| NaN/Inf detection | Manual | No | Automatic |
| Loss spike alerts | Manual | No | Automatic |
| Gradient monitoring | Manual | Manual | Automatic |
| Plateau detection | No | No | Automatic |
| Zero dependencies | No | No | Yes |
| Works offline | No | Yes | Yes |
Install
pip install trainpulse
With PyTorch integration (quoted so the brackets survive shells such as zsh):
pip install "trainpulse[torch]"
With CLI:
pip install "trainpulse[cli]"
Quick Start
Minimal — 2 lines
from trainpulse import Monitor
monitor = Monitor()
for step in range(num_steps):
    loss = train_step()
    monitor.log("loss", step, loss)
report = monitor.report()
print(f"Health: {report.health_score:.0%}")
Full training loop
from trainpulse import Monitor, MonitorConfig
config = MonitorConfig(
    loss_spike_threshold=5.0,   # Alert if loss > 5x rolling average
    grad_norm_threshold=100.0,  # Alert if gradient norm > 100
    plateau_patience=200,       # Alert after 200 steps without improvement
)
monitor = Monitor(config)
for step in range(num_steps):
    monitor.step_start()
    loss = train_step()
    grad_norm = get_grad_norm()
    lr = scheduler.get_last_lr()[0]
    monitor.log("loss", step, loss)
    monitor.log("grad_norm", step, grad_norm)
    monitor.log("learning_rate", step, lr)
    monitor.step_end(step)
report = monitor.report()
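The loop above calls `get_grad_norm()` without defining it. A minimal sketch of one common choice, the global L2 norm over all parameter gradients, might look like the following. The name `global_grad_norm` and the list-of-lists input are assumptions for illustration and are not part of trainpulse.

```python
import math
from typing import Iterable, Sequence


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """Global L2 norm over all gradient values, treated as one flat vector."""
    total = 0.0
    for g in grads:
        total += sum(x * x for x in g)
    return math.sqrt(total)


# Two parameter gradients, flattened to plain lists for illustration.
grads = [[3.0, 0.0], [0.0, 4.0]]
norm = global_grad_norm(grads)  # sqrt(9 + 16)
```

In a PyTorch loop, the value returned by `torch.nn.utils.clip_grad_norm_` is this same total norm, so it can be logged directly.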
Callback API
from trainpulse import TrainingCallback
cb = TrainingCallback()
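for step in range(num_steps):
    cb.on_step_begin(step)
    loss = train_step()
    grad_norm = get_grad_norm()      # same placeholder helpers as in the full training loop
    lr = scheduler.get_last_lr()[0]
    cb.on_step_end(step, loss=loss, grad_norm=grad_norm, lr=lr)
report = cb.report()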
Real-time alerts
from trainpulse import Monitor, MonitorConfig

def my_alert_handler(alert):
    print(f"⚠ {alert}")
    # Or send to Slack, Discord, email...

config = MonitorConfig(alert_callbacks=[my_alert_handler])
monitor = Monitor(config)
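As a sketch of a handler that does more than print, the example below logs each alert through the standard `logging` module and keeps a history for later inspection. It assumes only that an alert has a string form, as in the f-string above; `make_logging_handler` and the logger name are hypothetical names, not part of trainpulse.

```python
import logging
from typing import Any, Callable, List, Tuple


def make_logging_handler(
    logger_name: str = "trainpulse.alerts",
) -> Tuple[Callable[[Any], None], List[str]]:
    """Return (handler, history): the handler logs each alert and keeps its text."""
    logger = logging.getLogger(logger_name)
    history: List[str] = []

    def handler(alert: Any) -> None:
        text = str(alert)  # only assumes the alert has a string form
        history.append(text)
        logger.warning("training alert: %s", text)

    return handler, history


handler, history = make_logging_handler()
handler("loss spike at step 120")  # a plain string stands in for a real alert object
```

The returned `handler` could then be passed as `MonitorConfig(alert_callbacks=[handler])`.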
Detectors
| Detector | What it catches | Default threshold |
|---|---|---|
| NaN/Inf | NaN or Inf in any metric | Always on |
| Loss spike | Sudden loss increase vs rolling average | 5x |
| Gradient explosion | Gradient norm too large | 100.0 |
| Gradient vanishing | Gradient norm too small | 1e-7 |
| LR anomaly | Learning rate jumps | 10x change |
| Plateau | No loss improvement | 100 steps |
| Step time | Unusually slow steps | 3x average |
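To make the "vs rolling average" idea concrete, here is a minimal sketch of a loss spike check combined with a NaN/Inf check, using the default threshold (5x) and window (50) from this README. It is an illustration under those assumptions, not trainpulse's actual implementation; the class name and the returned labels are hypothetical.

```python
import math
from collections import deque
from typing import Deque, Optional


class RollingSpikeDetector:
    """Flags a value above `threshold` times the mean of the last `window` values."""

    def __init__(self, threshold: float = 5.0, window: int = 50) -> None:
        self.threshold = threshold
        self.history: Deque[float] = deque(maxlen=window)

    def check(self, value: float) -> Optional[str]:
        # NaN/Inf is reported first and kept out of the rolling history.
        if math.isnan(value) or math.isinf(value):
            return "nan_inf"
        alert = None
        if self.history:
            mean = sum(self.history) / len(self.history)
            if mean > 0 and value > self.threshold * mean:
                alert = "loss_spike"
        # The spike itself still enters the window (a design choice in this sketch).
        self.history.append(value)
        return alert


detector = RollingSpikeDetector(threshold=5.0, window=50)
losses = [2.0, 1.9, 2.1, 2.0, 15.0, float("nan")]
alerts = [detector.check(v) for v in losses]
```

Here the first four losses average 2.0, so 15.0 exceeds 5 x 2.0 and is flagged, and the final NaN is caught by the separate check.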
CLI
Analyze training logs (JSONL format):
trainpulse analyze train.jsonl
trainpulse analyze train.jsonl --json-out report.json
trainpulse show report.json
Expected JSONL format:
{"step": 0, "loss": 2.5, "grad_norm": 1.2, "learning_rate": 0.001}
{"step": 1, "loss": 2.3, "grad_norm": 1.1, "learning_rate": 0.001}
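If your training script does not already write such a log, a minimal sketch of producing and reading this format with the standard `json` module is shown below. The helper names `write_record` and `read_records` are hypothetical, and an in-memory buffer stands in for `train.jsonl`.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def write_record(fh: TextIO, step: int, **metrics: float) -> None:
    """Append one training step as a single JSON object on its own line."""
    fh.write(json.dumps({"step": step, **metrics}) + "\n")


def read_records(fh: TextIO) -> Iterator[Dict[str, float]]:
    """Yield one dict per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory buffer stands in for train.jsonl.
buf = io.StringIO()
write_record(buf, 0, loss=2.5, grad_norm=1.2, learning_rate=0.001)
write_record(buf, 1, loss=2.3, grad_norm=1.1, learning_rate=0.001)
buf.seek(0)
records = list(read_records(buf))
```

With a real file, replacing the buffer with `open("train.jsonl", "a")` for writing gives a log that `trainpulse analyze` should accept, assuming the key names above.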
Health Score
The health score ranges from 0.0 to 1.0. Each alert lowers it by a penalty that depends on severity:
- Critical alerts (NaN, gradient explosion): −0.15 each
- Warning alerts (spikes, plateaus): −0.05 each
- Info alerts: −0.01 each
A score above 0.80 generally indicates healthy training.
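The penalties above can be worked through with a short sketch. It assumes the score starts at 1.0 and is clamped to the 0.0 to 1.0 range; the README does not state either detail, and `health_score` here is a hypothetical stand-alone function, not the library's code.

```python
PENALTIES = {"critical": 0.15, "warning": 0.05, "info": 0.01}


def health_score(n_critical: int = 0, n_warnings: int = 0, n_info: int = 0) -> float:
    """Start at 1.0, subtract a penalty per alert, clamp to [0.0, 1.0] (assumed)."""
    penalty = (
        n_critical * PENALTIES["critical"]
        + n_warnings * PENALTIES["warning"]
        + n_info * PENALTIES["info"]
    )
    return max(0.0, min(1.0, 1.0 - penalty))


# One critical alert and two warnings: 1.0 - 0.15 - 2 * 0.05
score = health_score(n_critical=1, n_warnings=2)
print(f"Health: {score:.0%}")
```

Under these assumptions, one critical alert plus two warnings gives 0.75, just below the 0.80 guideline.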
API Reference
Monitor(config=None)
Main class. Call .log(name, step, value) to record a metric and .report() to get a TrainingReport.
MonitorConfig
| Parameter | Default | Description |
|---|---|---|
| loss_spike_threshold | 5.0 | Multiplier over rolling average |
| loss_spike_window | 50 | Rolling window size |
| grad_norm_threshold | 100.0 | Max acceptable gradient norm |
| grad_vanish_threshold | 1e-7 | Min acceptable gradient norm |
| check_nan | True | Enable NaN/Inf detection |
| lr_change_threshold | 10.0 | Max LR change ratio per step |
| plateau_patience | 100 | Steps without improvement |
| plateau_min_delta | 1e-5 | Minimum improvement delta |
| step_time_spike_threshold | 3.0 | Step time spike multiplier |
| alert_callbacks | [] | Functions called on each alert |
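To show how `plateau_patience` and `plateau_min_delta` could interact, here is a minimal sketch of a plateau check: a step counts as an improvement only if the loss beats the best value so far by more than the delta. This is one plausible reading of the two parameters, not trainpulse's actual logic; `PlateauDetector` is a hypothetical name.

```python
from typing import Optional


class PlateauDetector:
    """Reports a plateau after `patience` steps without an improvement of `min_delta`."""

    def __init__(self, patience: int = 100, min_delta: float = 1e-5) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best: Optional[float] = None
        self.stale_steps = 0

    def update(self, loss: float) -> bool:
        # An improvement resets the counter; anything else counts as a stale step.
        if self.best is None or loss < self.best - self.min_delta:
            self.best = loss
            self.stale_steps = 0
            return False
        self.stale_steps += 1
        return self.stale_steps >= self.patience


# A small patience keeps the example short.
detector = PlateauDetector(patience=3, min_delta=1e-5)
flags = [detector.update(v) for v in [1.0, 0.9, 0.9, 0.9, 0.9, 0.5]]
```

The flag turns true on the third consecutive step stuck at 0.9 and clears again once the loss drops to 0.5.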
TrainingReport
| Property | Type | Description |
|---|---|---|
| .health_score | float | 0.0 (terrible) to 1.0 (perfect) |
| .is_healthy | bool | True if no critical alerts |
| .n_warnings | int | Number of warning alerts |
| .n_critical | int | Number of critical alerts |
| .alerts | list[Alert] | All triggered alerts |
| .metrics_summary | dict | Per-metric min/max/mean/last |
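For a sense of what a per-metric min/max/mean/last summary contains, the sketch below computes one from raw metric histories. The dictionary layout and the key names `min`, `max`, `mean`, and `last` are assumptions based on the description above; the real `.metrics_summary` may be structured differently.

```python
from statistics import fmean
from typing import Dict, List


def summarize(metrics: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Build a min/max/mean/last summary for every metric with at least one value."""
    return {
        name: {
            "min": min(values),
            "max": max(values),
            "mean": fmean(values),
            "last": values[-1],
        }
        for name, values in metrics.items()
        if values  # skip metrics that were never logged
    }


summary = summarize({"loss": [4.0, 2.0, 3.0], "grad_norm": []})
```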
See Also
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| toksight | Tokenizer analysis & comparison |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
License
Apache-2.0
File details
Details for the file trainpulse-0.4.0.tar.gz.
File metadata
- Download URL: trainpulse-0.4.0.tar.gz
- Upload date:
- Size: 50.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df5e4464940386a3d10644893767ed19ba10c10e81c64ea23a4dd92f6a833e8f |
| MD5 | d8df1075b46c5400df285b5ac21c4a68 |
| BLAKE2b-256 | 27761d01d64afed23684af8a4ad772c9a2ea2977c82a7edf986de0f5ef8f4126 |
File details
Details for the file trainpulse-0.4.0-py3-none-any.whl.
File metadata
- Download URL: trainpulse-0.4.0-py3-none-any.whl
- Upload date:
- Size: 32.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8f315d8a35472903ec33edee53de2e065c83d0fb4d92bb29566b1882fd765d9 |
| MD5 | 64118b1a6dcee8bf613dfccd577a87ac |
| BLAKE2b-256 | 8b1c437981b98af9ecf26ca2abb402d4e5caf6e843c92f14594ddb51d517919a |