TrainLens: Lightweight training runtime health monitor.

These details have not been verified by PyPI

Project description

TrainLens

Find why PyTorch training got slow — while the run is still live.

TrainLens is a lightweight, step-aware bottleneck finder for PyTorch training runs. It attaches to your training loop and surfaces what is actually slowing things down — per step, per rank — without heavyweight profiling overhead.

The gap it fills: system dashboards show utilization over time. TrainLens shows what happened per training step and, in DDP, which rank is holding the run back.

What it catches

Input pipeline stalls (dataloader / preprocessing wait)
Step time drift and jitter over the run
DDP rank stragglers in single-node and multi-node setups
Memory creep and OOM trajectory
Gradient explosions and NaN/Inf conditions
FSDP and Pipeline Parallel overhead (--grad-diagnostics)
NCCL communication failures with root-cause attribution
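Step time drift and jitter can be quantified in several ways. The sketch below is not TrainLens's implementation. It shows one common approach, assuming jitter is the coefficient of variation of step times and drift is the relative change between the median of the first and the last `window` steps. The function name, the window size, and both definitions are illustrative assumptions.

```python
import statistics


def step_time_stats(step_times, window=20):
    """Summarize jitter and drift for per-step durations in seconds.

    Hypothetical helper for illustration; not part of the TrainLens API.
    """
    if len(step_times) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    early = statistics.median(step_times[:window])   # typical early step time
    late = statistics.median(step_times[-window:])   # typical recent step time
    mean = statistics.fmean(step_times)
    return {
        # Spread relative to the mean: higher means less regular steps.
        "jitter_cv": statistics.pstdev(step_times) / mean,
        # Positive values mean steps got slower over the run.
        "drift_ratio": (late - early) / early,
    }


steady = [0.100] * 40
slowing = [0.100] * 20 + [0.150] * 20
steady_stats = step_time_stats(steady)
slowing_stats = step_time_stats(slowing)
print(steady_stats)
print(slowing_stats)
```

Medians are used for the two windows so that a single slow step (for example, a checkpoint write) does not register as drift.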

Supported configurations

Configuration	Status
Single GPU	Supported
Single-node DDP	Supported
Multi-node DDP (2–4 nodes, up to ~32 ranks)	Collector can ingest this scale, but true multi-node launch needs deployment-specific launcher integration
Multi-node DDP (8+ nodes, 64+ ranks)	Experimental; load-test runtime, storage, and collector throughput first
FSDP diagnostics	Supported (`--grad-diagnostics`)
Pipeline Parallel bubble diagnostics	Supported (`--grad-diagnostics`)
Tensor Parallel diagnostics	Partial (`trace_tp_model`)
Full fleet backend / multi-aggregator coordination	Planned
TensorFlow / Keras	Planned

Quick start

pip install trainlens-ai
trainlens server start

trainlens server start pulls the protected server image from GHCR (GitHub Container Registry), starts the collector on localhost:29765, and serves the dashboard at http://localhost:8765.

Wrap your training step:

from trainlens.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
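For intuition about what a step-scoped wrapper can record, here is a minimal sketch of a timing context manager. It is not how trace_step is implemented; `timed_step` and `records` are hypothetical names, and the sleep stands in for a real training step. Because CUDA work is asynchronous, wall-clock timing of GPU steps is only meaningful with synchronization or CUDA events, which this CPU-only sketch leaves out.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(records):
    """Append the wall-clock duration of the wrapped block to `records`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the step raises, so failed steps are still timed.
        records.append(time.perf_counter() - start)


records = []
for _ in range(3):
    with timed_step(records):
        time.sleep(0.01)  # stand-in for forward / backward / optimizer work

print(len(records))
```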

Run your script through TrainLens and stream it to the dashboard:

trainlens run --aggregator-host localhost:29765 train.py

For local terminal-only development, trainlens run train.py still works and starts a collector for that one process.

See docs/quickstart.md for full setup details. Further guides:

docs/client-test-kit.md: packaging guide for design-partner or client trials
docs/tutorials.md: guided examples across PyTorch, DDP, Hugging Face, Lightning, text, image, and LLM fine-tuning workloads
docs/client-setup-agent.md: optional agent workflow for client onboarding that applies the GitBook installation steps as a small, reviewable patch
docs/client-onboarding-notebooks.md: the notebook onboarding path

What TrainLens shows

Step time and its breakdown (forward / backward / optimizer / overhead)
Dataloader and input wait per step
Step jitter and drift over time
GPU memory trend and OOM trajectory
CPU / RAM / GPU utilization signals
In DDP: worst-rank vs. median-rank timing and skew per step

This lets you tell whether a slowdown is coming from input, compute, the optimizer, or rank imbalance — before reaching for torch.profiler.
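The worst-rank vs. median-rank comparison can be illustrated with a few lines of Python. This is a sketch under assumptions, not TrainLens's actual metric: it takes per-rank durations for one step (for example, time spent before the gradient all-reduce) and defines skew as how much slower the worst rank is than the median rank. `rank_skew` and the dictionary layout are hypothetical.

```python
import statistics


def rank_skew(step_times_by_rank):
    """Compare the slowest rank with the median rank for one training step.

    `step_times_by_rank` maps rank id -> duration in seconds (assumed layout).
    """
    worst_rank = max(step_times_by_rank, key=step_times_by_rank.get)
    worst = step_times_by_rank[worst_rank]
    median = statistics.median(step_times_by_rank.values())
    return {
        "worst_rank": worst_rank,
        "worst_s": worst,
        "median_s": median,
        # 0.0 means perfectly balanced; 0.5 means the worst rank is 50% slower.
        "skew": worst / median - 1.0,
    }


one_step = {0: 0.20, 1: 0.21, 2: 0.20, 3: 0.30}
report = rank_skew(one_step)
print(report["worst_rank"], round(report["skew"], 3))
```

In synchronous data-parallel training every rank waits at the gradient synchronization point, so a rank that is consistently the worst is a likely straggler, while a worst rank that changes from step to step points to noise instead.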

Integrations

Plain PyTorch

from trainlens.decorators import trace_step

with trace_step(model):
    ...

Hugging Face Trainer

from trainlens.integrations.huggingface import TrainLensTrainer

trainer = TrainLensTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    trainlens_enabled=True,
)

See docs/huggingface.md.

PyTorch Lightning

import lightning as L
from trainlens.integrations.lightning import TrainLensCallback

trainer = L.Trainer(callbacks=[TrainLensCallback()])

See docs/lightning.md.

ML Intelligence

TrainLens watches the run in real time, estimates five outcome probabilities, and surfaces a recommended action. When a termination-grade condition is confirmed, the aggregator writes a .trainlens_terminate signal file; trace_step() polls that file at step boundaries.

Prediction chain:

ColdStartFallback — rule-based heuristics, active from step 1
XGBoost predictor — 90-feature tabular run-outcome classifier trained from RunStore history
ROCKET+Ridge predictor — optional sequence model over per-step loss, grad norm, memory, and step time when enough labeled step_series data exists
Parallel ensemble — runs XGBoost and ROCKET together when both are available, with an agreement gate for alert-grade actions
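The chain above can be read as a fallback order plus an agreement gate. The sketch below is only one plausible reading, with every detail assumed: the function name, the single failure probability per predictor, the 0.8 threshold, and the "alert" / "watch" / "ok" labels are all hypothetical and are not taken from TrainLens.

```python
from typing import Optional

ALERT_THRESHOLD = 0.8  # hypothetical cutoff, not a TrainLens default


def choose_action(
    rule_prob: float,
    xgb_prob: Optional[float] = None,
    rocket_prob: Optional[float] = None,
) -> str:
    """Pick an action from whichever predictors are available.

    Each argument is an estimated failure probability; None means that
    predictor has not been trained yet.
    """
    if xgb_prob is not None and rocket_prob is not None:
        # Agreement gate: both models must cross the threshold to alert.
        if min(xgb_prob, rocket_prob) >= ALERT_THRESHOLD:
            return "alert"
        if max(xgb_prob, rocket_prob) >= ALERT_THRESHOLD:
            return "watch"  # the models disagree, so do not escalate
        return "ok"
    # Single trained model if there is one, otherwise the rule-based fallback.
    if xgb_prob is not None:
        prob = xgb_prob
    elif rocket_prob is not None:
        prob = rocket_prob
    else:
        prob = rule_prob
    return "alert" if prob >= ALERT_THRESHOLD else "ok"


print(choose_action(0.9))              # only the rule-based estimate exists
print(choose_action(0.1, 0.9, 0.5))    # the two models disagree
print(choose_action(0.1, 0.9, 0.85))   # the two models agree
```

Requiring agreement trades some sensitivity for fewer false alerts, which matters when an alert can lead to terminating a run.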

Termination signal: when TrainLens confirms a termination-grade condition, it writes a .trainlens_terminate signal file. If your loop is wrapped in trace_step(), TrainLens checks this file at the end of each traced step. Manual checks are also available:

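from trainlens.decorators import trace_step
from trainlens.runtime.auto_terminate import check_terminate_signal

# model, max_steps, and session_dir are defined elsewhere in your script
for step in range(max_steps):
    with trace_step(model):
        ...
    # Manual check after each traced step
    if check_terminate_signal(session_dir):
        print("TrainLens: auto-terminating")
        break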
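A signal-file protocol like this is simple to reason about: one process creates a file, and the training loop checks for it between steps. The sketch below illustrates the idea with the standard library only. It assumes the file sits directly inside the session directory and that its mere existence is the signal; the real check_terminate_signal may look elsewhere or inspect the file's contents. `terminate_requested` is a hypothetical name.

```python
import tempfile
from pathlib import Path

SIGNAL_NAME = ".trainlens_terminate"  # file name taken from the text above


def terminate_requested(session_dir):
    """Return True if the signal file exists in session_dir (assumed location)."""
    return (Path(session_dir) / SIGNAL_NAME).exists()


with tempfile.TemporaryDirectory() as tmp:
    before = terminate_requested(tmp)   # no signal file yet
    (Path(tmp) / SIGNAL_NAME).touch()   # simulate the aggregator writing it
    after = terminate_requested(tmp)

print(before, after)
```

Checking only at step boundaries keeps the cost to one filesystem lookup per step and lets the loop exit after a complete optimizer update.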

Run history:

trainlens history                          # auto-discover from ./logs
trainlens history --db path/to/ml.db      # specific database
trainlens history --n 50                   # most recent 50 runs

Train a predictor manually:

trainlens train-model --db ./logs/<session>/aggregator/telemetry_ml.db --model-dir ./models

See docs/trainlens.md for the full ML intelligence reference.

CLI reference

Subcommand	Description
`trainlens run train.py`	Live bottleneck diagnosis
`trainlens deep train.py`	Adds per-layer timing and memory signals
`trainlens inspect telemetry.msgpack`	Decode and print binary telemetry logs
`trainlens history`	Review ML run outcomes from RunStore
`trainlens train-model`	Train an XGBoost run-outcome predictor
`trainlens server start`	Start the protected Docker server with collector and dashboard
`trainlens serve`	Serve the React/FastAPI dashboard over existing logs
`trainlens collect`	Run a standalone TCP collector for remote GPU pods

Add --grad-diagnostics to run or deep to enable gradient diagnostics:

trainlens run train.py --grad-diagnostics
trainlens deep train.py --grad-diagnostics --nproc-per-node=4

This enables gradient norm tracking, NaN/Inf detection, MFU (model FLOPs utilization), FSDP latency, communication-overlap measurement, and pipeline bubble ratio. Confirmed NaN/Inf conditions write the termination signal file.
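Two of these metrics have widely used textbook definitions, sketched below. How TrainLens computes them is not documented here, so treat this as background. The bubble formula assumes a GPipe-style schedule with p stages and m microbatches. The MFU (model FLOPs utilization) estimate assumes a dense transformer, where training costs roughly 6 FLOPs per parameter per token. All function names and example numbers are hypothetical.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def mfu(params, tokens_per_step, step_seconds, peak_flops_per_second):
    """Model FLOPs utilization using the ~6 * params * tokens approximation."""
    achieved_flops_per_second = 6 * params * tokens_per_step / step_seconds
    return achieved_flops_per_second / peak_flops_per_second


# 4 pipeline stages fed with 12 microbatches per step.
bubble = pipeline_bubble_fraction(4, 12)

# Example numbers only: 1B parameters, 65,536 tokens per step, 2 s per step,
# 4 accelerators with an assumed peak of 312 TFLOP/s each.
utilization = mfu(1e9, 65536, 2.0, 4 * 312e12)

print(bubble)
print(round(utilization, 3))
```

Adding microbatches shrinks the bubble, since the fixed fill and drain cost of p - 1 slots is spread over more useful work.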

Optional model hooks

from trainlens.decorators import trace_model_instance

trace_model_instance(model)

Use alongside trace_step(model) for per-layer timing and memory signals. The core step-level view works without it.

Scope

TrainLens is for lightweight diagnosis during real PyTorch training runs.

It is not:

a kernel-level tracer
a general-purpose auto-tuner
a replacement for torch.profiler for deep kernel analysis
a managed fleet observability platform

Start with TrainLens when you need a fast answer. Reach for deeper profiling after you know where to look.

Feedback

If TrainLens caught a slowdown, please open an issue and include:

hardware / CUDA / PyTorch versions
single GPU or DDP
whether you used core tracing only or model hooks
the end-of-run summary
a minimal repro if possible

Email: vsnm.tej@gmail.com

Contributing

External contributions are currently managed through GitHub issues and email. Please open an issue with a minimal reproduction before sending a patch.

License

Use of this software requires explicit written permission. See LICENSE for details or contact vsnm.tej@gmail.com.

Project details

Release history

1.2.9 (Jun 9, 2026)
1.2.7 (Jun 3, 2026), this version
1.2.6 (May 21, 2026)
1.2.5 (May 20, 2026)
1.2.4 (May 19, 2026)
1.2.3 (May 18, 2026)

Download files

Source Distribution

trainlens_ai-1.2.7.tar.gz (1.1 MB)

Uploaded Jun 3, 2026 Source

Built Distribution

trainlens_ai-1.2.7-py3-none-any.whl (1.3 MB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file trainlens_ai-1.2.7.tar.gz.

File metadata

Download URL: trainlens_ai-1.2.7.tar.gz
Upload date: Jun 3, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for trainlens_ai-1.2.7.tar.gz
Algorithm	Hash digest
SHA256	`ddd6360db3287f313a0a59903950fa1d1780528e61709644716b131bc25cc4ae`
MD5	`98cd3b79a2f102d32cf8fcd837cf681f`
BLAKE2b-256	`dd7f97f1af2eafb4b61480120cfaf0a795550a216aaba4f66aec982ae92589aa`

File details

Details for the file trainlens_ai-1.2.7-py3-none-any.whl.

File metadata

Download URL: trainlens_ai-1.2.7-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 1.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for trainlens_ai-1.2.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dde0851533de55a1469eebc4b7338898c1d34a552b2d8c0a1c6e90e60180061f`
MD5	`80973b164e24219f4d93257fd56f2409`
BLAKE2b-256	`85abd469b8a76275521754e07ba0bd83425bb30c3d2f569f553c868543b449ab`

trainlens-ai 1.2.7
