TrainLens: Lightweight training runtime health monitor.
TrainLens
Find why PyTorch training got slow — while the run is still live.
TrainLens is a lightweight, step-aware bottleneck finder for PyTorch training runs. It attaches to your training loop and surfaces what is actually slowing things down — per step, per rank — without heavyweight profiling overhead.
The gap it fills: system dashboards show utilization over time. TrainLens shows what happened per training step and, in DDP, which rank is holding the run back.
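To make the per-rank view concrete, the worst-rank versus median-rank comparison for a single step can be computed from per-rank step durations. The following is a minimal sketch of that arithmetic, not TrainLens's implementation; `rank_skew` and the sample timings are hypothetical.

```python
from statistics import median


def rank_skew(step_times_by_rank):
    """Summarize one training step across ranks.

    step_times_by_rank maps rank id -> step duration in seconds.
    Returns (worst_rank, worst_time, median_time, skew), where skew is
    worst_time / median_time and 1.0 means perfectly balanced ranks.
    """
    if not step_times_by_rank:
        raise ValueError("need at least one rank")
    worst_rank = max(step_times_by_rank, key=step_times_by_rank.get)
    worst_time = step_times_by_rank[worst_rank]
    median_time = median(step_times_by_rank.values())
    skew = worst_time / median_time if median_time > 0 else float("inf")
    return worst_rank, worst_time, median_time, skew


# Hypothetical timings for one step of a 4-rank job (seconds).
times = {0: 0.50, 1: 0.52, 2: 0.90, 3: 0.51}
worst_rank, worst_time, median_time, skew = rank_skew(times)
print(f"rank {worst_rank}: {worst_time:.2f}s vs median {median_time:.3f}s (x{skew:.2f})")
```

In a synchronous DDP step every rank waits for the slowest one, so a skew well above 1.0 on the same rank across many steps points at a straggler rather than at noise.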
What it catches
- Input pipeline stalls (dataloader / preprocessing wait)
- Step time drift and jitter over the run
- DDP rank stragglers in single-node and multi-node setups
- Memory creep and OOM trajectory
- Gradient explosions and NaN/Inf conditions
- FSDP and Pipeline Parallel overhead (--grad-diagnostics)
- NCCL communication failures with root-cause attribution
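Two of the signals in the list above, step time jitter and drift, reduce to simple statistics over recorded step durations. The sketch below shows one plausible way to define them; the function names, the window size, and the sample data are assumptions for illustration, not the metrics TrainLens actually computes.

```python
from statistics import mean, median, pstdev


def jitter(step_times):
    """Coefficient of variation of step time (standard deviation / mean)."""
    m = mean(step_times)
    return pstdev(step_times) / m if m > 0 else 0.0


def drift(step_times, window=5):
    """Median of the last `window` steps divided by the median of the first `window`.

    A value near 1.0 means the run is steady; 1.2 means recent steps
    take about 20% longer than early ones.
    """
    if len(step_times) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    early = median(step_times[:window])
    late = median(step_times[-window:])
    return late / early if early > 0 else float("inf")


# Hypothetical step durations in seconds: steady at first, then slower.
times = [0.50, 0.51, 0.49, 0.50, 0.50, 0.55, 0.58, 0.60, 0.61, 0.60, 0.62, 0.60]
print(f"jitter={jitter(times):.3f} drift={drift(times):.2f}")
```

Medians are used for drift so that a single slow step (for example, a checkpoint write) does not dominate the comparison.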
Supported configurations
| Configuration | Status |
|---|---|
| Single GPU | Supported |
| Single-node DDP | Supported |
| Multi-node DDP (2–4 nodes, up to ~32 ranks) | Collector can ingest this scale, but true multi-node launch needs deployment-specific launcher integration |
| Multi-node DDP (8+ nodes, 64+ ranks) | Experimental; load-test runtime, storage, and collector throughput first |
| FSDP diagnostics | Supported (--grad-diagnostics) |
| Pipeline Parallel bubble diagnostics | Supported (--grad-diagnostics) |
| Tensor Parallel diagnostics | Partial (trace_tp_model) |
| Full fleet backend / multi-aggregator coordination | Planned |
| TensorFlow / Keras | Planned |
Quick start
pip install trainlens-ai
trainlens server start
trainlens server start pulls the protected server image from GHCR, starts the
collector on localhost:29765, and serves the dashboard at
http://localhost:8765.
Wrap your training step:
from trainlens.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
Run your script through TrainLens and stream it to the dashboard:
trainlens run --aggregator-host localhost:29765 train.py
For local terminal-only development, trainlens run train.py still works and
starts a collector for that one process.
See docs/quickstart.md for full setup details.
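For intuition about what wrapping a step makes measurable, the sketch below separates input wait (time blocked fetching the next batch) from compute time using only the standard library. It is an illustration under stated assumptions, not how trace_step is implemented: `StepTimings`, `slow_loader`, and the sleep durations are hypothetical stand-ins.

```python
import time
from contextlib import contextmanager


class StepTimings:
    """Record input wait and compute time for each training step."""

    def __init__(self):
        self.records = []
        self._input_wait = 0.0

    def iterate(self, iterable):
        """Yield batches while timing how long each next() call blocks."""
        it = iter(iterable)
        while True:
            start = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return
            self._input_wait = time.perf_counter() - start
            yield batch

    @contextmanager
    def step(self):
        """Time the body of one step and store it with the preceding input wait."""
        start = time.perf_counter()
        try:
            yield
        finally:
            compute = time.perf_counter() - start
            self.records.append({"input_wait": self._input_wait, "compute": compute})


def slow_loader(n, delay):
    """Stand-in for a dataloader with slow preprocessing."""
    for i in range(n):
        time.sleep(delay)
        yield i


timings = StepTimings()
for batch in timings.iterate(slow_loader(3, 0.02)):
    with timings.step():
        time.sleep(0.005)  # stand-in for forward / backward / optimizer

for r in timings.records:
    print(f"input_wait={r['input_wait']:.3f}s compute={r['compute']:.3f}s")
```

On a GPU, CUDA kernels launch asynchronously, so wall-clock timing like this only reflects device work if the measurement synchronizes with the device or uses CUDA events.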
For design-partner or client trials, start with the
docs/client-test-kit.md packaging guide.
Use docs/tutorials.md for guided examples across
PyTorch, DDP, Hugging Face, Lightning, text, image, and LLM fine-tuning
workloads.
For client onboarding, docs/client-setup-agent.md
describes an optional agent workflow that applies the GitBook installation steps
as a small, reviewable patch.
The notebook path is covered in
docs/client-onboarding-notebooks.md.
What TrainLens shows
- Step time and its breakdown (forward / backward / optimizer / overhead)
- Dataloader and input wait per step
- Step jitter and drift over time
- GPU memory trend and OOM trajectory
- CPU / RAM / GPU utilization signals
- In DDP: worst-rank vs. median-rank timing and skew per step
This lets you tell whether a slowdown is coming from input, compute, the optimizer, or rank imbalance — before reaching for torch.profiler.
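The memory trend and OOM trajectory signal can be pictured as a line fitted to per-step memory samples and extended to the device limit. The following sketch uses an ordinary least-squares fit; the function name, the linear-growth assumption, and the sample numbers are illustrative and are not taken from TrainLens.

```python
def steps_until_limit(memory_mb, limit_mb):
    """Project how many more steps fit before memory reaches limit_mb.

    memory_mb holds one sample per step. Returns the estimated number of
    further steps until the fitted line crosses limit_mb, or None when the
    fitted trend is flat or decreasing.
    """
    n = len(memory_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(memory_mb) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(memory_mb))
    slope = sxy / sxx  # MB gained per step
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_step = (limit_mb - intercept) / slope
    return max(0.0, crossing_step - (n - 1))


# Hypothetical samples: memory grows by 10 MB per step toward a 1100 MB limit.
remaining = steps_until_limit([1000, 1010, 1020, 1030, 1040], limit_mb=1100)
print(f"about {remaining:.0f} steps until the limit")
```

Real allocator behavior is rarely this linear (caching and fragmentation produce plateaus and jumps), so a projection like this is best read as an early warning, not a precise countdown.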
Integrations
Plain PyTorch
from trainlens.decorators import trace_step

with trace_step(model):
    ...
Hugging Face Trainer
from trainlens.integrations.huggingface import TrainLensTrainer

trainer = TrainLensTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    trainlens_enabled=True,
)
See docs/huggingface.md.
PyTorch Lightning
import lightning as L
from trainlens.integrations.lightning import TrainLensCallback
trainer = L.Trainer(callbacks=[TrainLensCallback()])
See docs/lightning.md.
ML Intelligence
TrainLens watches the run in real time, estimates 5 outcome probabilities, and surfaces an action recommendation. When a termination-grade condition is confirmed, the aggregator writes a .trainlens_terminate signal file; trace_step() polls that file at step boundaries.
Prediction chain:
- ColdStartFallback: rule-based heuristics, active from step 1
- XGBoost predictor: 90-feature tabular run-outcome classifier trained from RunStore history
- ROCKET+Ridge predictor: optional sequence model over per-step loss, grad norm, memory, and step time when enough labeled step_series data exists
- Parallel ensemble: runs XGBoost and ROCKET together when both are available, with an agreement gate for alert-grade actions
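The chain above follows a common pattern: fall back to rules when no trained model exists, and require agreement between models before raising an alert. The sketch below illustrates that control flow only. Every name, the two-label output, the averaging step, and the 0.8 threshold are assumptions made for the example; the real predictors, their five outcome probabilities, and the actual gate logic are not shown here.

```python
def cold_start_rules(features):
    """Always-available rule-based fallback (hypothetical rule)."""
    p_fail = 0.9 if features.get("nan_loss", False) else 0.1
    return {"fail": p_fail, "ok": 1.0 - p_fail}


def predict(features, tabular=None, sequence=None, alert_threshold=0.8):
    """Use whichever predictors are available and gate alerts on agreement.

    tabular and sequence are optional callables mapping features to a
    {"fail": p, "ok": 1 - p} dict. With no models, rules are used. With
    both, probabilities are averaged and an alert needs both to agree.
    """
    models = [m for m in (tabular, sequence) if m is not None]
    if not models:
        probs = cold_start_rules(features)
        return {"source": "cold_start", "probs": probs,
                "alert": probs["fail"] >= alert_threshold}
    outputs = [m(features) for m in models]
    probs = {k: sum(o[k] for o in outputs) / len(outputs) for k in outputs[0]}
    # Agreement gate: every available model must individually cross the threshold.
    alert = all(o["fail"] >= alert_threshold for o in outputs)
    source = "ensemble" if len(outputs) == 2 else "single_model"
    return {"source": source, "probs": probs, "alert": alert}


# Stand-in models that return fixed probabilities.
confident = lambda f: {"fail": 0.9, "ok": 0.1}
unsure = lambda f: {"fail": 0.6, "ok": 0.4}

print(predict({"nan_loss": True}))                        # rules only
print(predict({}, tabular=confident, sequence=unsure))    # models disagree
print(predict({}, tabular=confident, sequence=confident)) # models agree
```

Gating on agreement trades some sensitivity for fewer false alarms, which matters when the downstream action is stopping a run.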
Termination signal: if your loop is wrapped in trace_step(), TrainLens checks for the .trainlens_terminate file at the end of each traced step. You can also check for it manually:
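from trainlens.decorators import trace_step
from trainlens.runtime.auto_terminate import check_terminate_signal

for step in range(max_steps):
    with trace_step(model):
        ...
    # Manual check at the step boundary; session_dir must be defined by your script.
    if check_terminate_signal(session_dir):
        print("TrainLens: auto-terminating")
        break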
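The signal-file handshake itself is a simple pattern: one side creates a file, the other checks for it at step boundaries. The sketch below reproduces that pattern with the standard library so it can be tried without a GPU. Only the file name comes from the text above; the helper names, the file contents, and placing the file directly in the session directory are assumptions.

```python
import tempfile
from pathlib import Path

SIGNAL_NAME = ".trainlens_terminate"  # file name taken from the text above


def request_termination(session_dir, reason):
    """Monitor side (hypothetical helper): create the signal file."""
    path = Path(session_dir) / SIGNAL_NAME
    path.write_text(reason)
    return path


def termination_requested(session_dir):
    """Training side (hypothetical helper): cheap existence check."""
    return (Path(session_dir) / SIGNAL_NAME).exists()


def run(session_dir, max_steps, fail_at):
    """Simulated loop that stops at the first step boundary after the signal."""
    completed = 0
    for step in range(max_steps):
        # ... one training step would run here ...
        completed += 1
        if step == fail_at:
            # Simulate the monitor confirming a termination-grade condition.
            request_termination(session_dir, "simulated NaN loss")
        if termination_requested(session_dir):
            break
    return completed


with tempfile.TemporaryDirectory() as session_dir:
    steps_done = run(session_dir, max_steps=10, fail_at=3)
print(f"stopped after {steps_done} steps")
```

Checking only at step boundaries keeps the cost to one filesystem lookup per step and guarantees the loop never stops in the middle of a backward pass or optimizer update.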
Run history:
trainlens history # auto-discover from ./logs
trainlens history --db path/to/ml.db # specific database
trainlens history --n 50 # most recent 50 runs
Train a predictor manually:
trainlens train-model --db ./logs/<session>/aggregator/telemetry_ml.db --model-dir ./models
See docs/trainlens.md for the full ML intelligence reference.
CLI reference
| Subcommand | Description |
|---|---|
| trainlens run train.py | Live bottleneck diagnosis |
| trainlens deep train.py | Adds per-layer timing and memory signals |
| trainlens inspect telemetry.msgpack | Decode and print binary telemetry logs |
| trainlens history | Review ML run outcomes from RunStore |
| trainlens train-model | Train an XGBoost run-outcome predictor |
| trainlens server start | Start the protected Docker server with collector and dashboard |
| trainlens serve | Serve the React/FastAPI dashboard over existing logs |
| trainlens collect | Run a standalone TCP collector for remote GPU pods |
Add --grad-diagnostics to run or deep to enable gradient diagnostics:
trainlens run train.py --grad-diagnostics
trainlens deep train.py --grad-diagnostics --nproc-per-node=4
Activates: gradient norm tracking, NaN/Inf detection, MFU, FSDP latency, comm-overlap, and pipeline bubble ratio. Confirmed NaN/Inf conditions write the termination signal file.
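Several of these diagnostics are short formulas once the raw numbers are available. The sketch below shows textbook definitions for a global gradient norm, a NaN/Inf check, and the idle fraction of an idealized GPipe-style pipeline schedule. It operates on plain Python lists to stay self-contained; the function names are hypothetical, and TrainLens may define or measure these quantities differently.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values; grads is a list of flat lists."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def has_non_finite(grads):
    """True if any gradient value is NaN or +/-Inf."""
    return any(not math.isfinite(g) for layer in grads for g in layer)


def pipeline_bubble_ratio(num_stages, num_microbatches):
    """Idle fraction of an idealized GPipe-style schedule.

    With p stages and m microbatches of equal cost, each stage is idle
    for (p - 1) of the (m + p - 1) time slots.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("stages and microbatches must be positive")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Hypothetical gradients for two parameter tensors, flattened.
grads = [[3.0, 0.0], [4.0]]
print(f"grad norm = {global_grad_norm(grads):.1f}")
print(f"non-finite present = {has_non_finite(grads)}")
print(f"bubble ratio (4 stages, 12 microbatches) = {pipeline_bubble_ratio(4, 12):.2f}")
```

The bubble formula shows why increasing the number of microbatches relative to stages reduces pipeline idle time: at 4 stages, moving from 4 to 12 microbatches drops the idle fraction from about 0.43 to 0.20.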
Optional model hooks
from trainlens.decorators import trace_model_instance
trace_model_instance(model)
Use alongside trace_step(model) for per-layer timing and memory signals. The core step-level view works without it.
Scope
TrainLens is for lightweight diagnosis during real PyTorch training runs.
It is not:
- a kernel-level tracer
- a general-purpose auto-tuner
- a replacement for torch.profiler for deep kernel analysis
- a managed fleet observability platform
Start with TrainLens when you need a fast answer. Reach for deeper profiling after you know where to look.
Feedback
If TrainLens caught a slowdown, please open an issue and include:
- hardware / CUDA / PyTorch versions
- single GPU or DDP
- whether you used core tracing only or model hooks
- the end-of-run summary
- a minimal repro if possible
Email: vsnm.tej@gmail.com
Contributing
External contribution workflow is currently managed through GitHub issues and email. Please open an issue with a minimal reproduction before sending a patch.
License
Proprietary. Copyright 2026 Venkata Pydipalli. All Rights Reserved.
Use of this software requires explicit written permission. See LICENSE for details or contact vsnm.tej@gmail.com.
Download files
File details
Details for the file trainlens_ai-1.2.7.tar.gz.
File metadata
- Download URL: trainlens_ai-1.2.7.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ddd6360db3287f313a0a59903950fa1d1780528e61709644716b131bc25cc4ae |
| MD5 | 98cd3b79a2f102d32cf8fcd837cf681f |
| BLAKE2b-256 | dd7f97f1af2eafb4b61480120cfaf0a795550a216aaba4f66aec982ae92589aa |
File details
Details for the file trainlens_ai-1.2.7-py3-none-any.whl.
File metadata
- Download URL: trainlens_ai-1.2.7-py3-none-any.whl
- Upload date:
- Size: 1.3 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dde0851533de55a1469eebc4b7338898c1d34a552b2d8c0a1c6e90e60180061f |
| MD5 | 80973b164e24219f4d93257fd56f2409 |
| BLAKE2b-256 | 85abd469b8a76275521754e07ba0bd83425bb30c3d2f569f553c868543b449ab |