
TrainLens: Lightweight training runtime health monitor.


TrainLens

Find why PyTorch training got slow — while the run is still live.

Requires Python 3.10+.


TrainLens is a lightweight, step-aware bottleneck finder for PyTorch training runs. It attaches to your training loop and surfaces what is actually slowing things down — per step, per rank — without heavyweight profiling overhead.

The gap it fills: system dashboards show utilization over time. TrainLens shows what happened per training step and, in DDP, which rank is holding the run back.


What it catches

  • Input pipeline stalls (dataloader / preprocessing wait)
  • Step time drift and jitter over the run
  • DDP rank stragglers in single-node and multi-node setups
  • Memory creep and OOM trajectory
  • Gradient explosions and NaN/Inf conditions
  • FSDP and Pipeline Parallel overhead (--grad-diagnostics)
  • NCCL communication failures with root-cause attribution
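
As an illustration of what an "OOM trajectory" can mean, here is a minimal sketch that fits a straight line to per-step memory samples and extrapolates to a capacity limit. The function name `steps_until_oom` and the linear-trend model are assumptions made for this example; this is not TrainLens's actual detector.

```python
from statistics import mean


def steps_until_oom(mem_bytes, capacity_bytes):
    """Estimate how many more steps fit before memory reaches capacity.

    mem_bytes: one memory reading per step, oldest first.
    Returns None when there are too few samples or memory is not growing.
    """
    n = len(mem_bytes)
    if n < 2:
        return None
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(mem_bytes)
    # Ordinary least-squares slope: bytes gained per step.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, mem_bytes))
    var = sum((x - x_bar) ** 2 for x in xs)
    slope = cov / var
    if slope <= 0:
        return None
    headroom = capacity_bytes - mem_bytes[-1]
    return max(0.0, headroom / slope)
```

A steady gain of 10 units per step with 70 units of headroom gives an estimate of 7 steps. Real memory curves are noisier, so a production detector would likely smooth the samples or require the trend to persist before alerting.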

Supported configurations

| Configuration | Status |
| --- | --- |
| Single GPU | Supported |
| Single-node DDP | Supported |
| Multi-node DDP (2–4 nodes, up to ~32 ranks) | Collector can ingest this scale, but true multi-node launch needs deployment-specific launcher integration |
| Multi-node DDP (8+ nodes, 64+ ranks) | Experimental; load-test runtime, storage, and collector throughput first |
| FSDP diagnostics | Supported (--grad-diagnostics) |
| Pipeline Parallel bubble diagnostics | Supported (--grad-diagnostics) |
| Tensor Parallel diagnostics | Partial (trace_tp_model) |
| Full fleet backend / multi-aggregator coordination | Planned |
| TensorFlow / Keras | Planned |

Quick start

pip install trainlens-ai

Wrap your training step:

from trainlens.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Run your script through TrainLens:

trainlens run train.py

TrainLens opens a live terminal view alongside your logs and prints a compact summary when the run ends.

See docs/quickstart.md for full setup details.
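
To make the step-wrapper pattern from the quick start more concrete, the following is a rough sketch of a context manager that records wall-clock time per step and reports mean and jitter. `StepTimer` is a hypothetical name for this example and says nothing about how `trace_step` is implemented internally.

```python
import time
from contextlib import contextmanager
from statistics import mean, pstdev


class StepTimer:
    """Collects wall-clock duration for each wrapped training step."""

    def __init__(self):
        self.durations = []

    @contextmanager
    def step(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step body raises.
            self.durations.append(time.perf_counter() - start)

    def summary(self):
        if not self.durations:
            return {"steps": 0, "mean_s": 0.0, "jitter_s": 0.0}
        return {
            "steps": len(self.durations),
            "mean_s": mean(self.durations),
            # Population standard deviation as a simple jitter measure.
            "jitter_s": pstdev(self.durations),
        }
```

Usage mirrors the quick start: wrap the body of the loop in `with timer.step():` and call `timer.summary()` at the end. On a GPU, CUDA kernels run asynchronously, so wall-clock timing like this only reflects device work if the step is synchronized (for example with `torch.cuda.synchronize()`) before the timer stops.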


What TrainLens shows

  • Step time and its breakdown (forward / backward / optimizer / overhead)
  • Dataloader and input wait per step
  • Step jitter and drift over time
  • GPU memory trend and OOM trajectory
  • CPU / RAM / GPU utilization signals
  • In DDP: worst-rank vs. median-rank timing and skew per step

This lets you tell whether a slowdown is coming from input, compute, the optimizer, or rank imbalance — before reaching for torch.profiler.
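
The worst-rank vs. median-rank comparison can be sketched in a few lines. This example assumes that skew is defined as the worst rank's excess over the median, relative to the median; that definition and the name `rank_skew` are illustrative choices, not taken from TrainLens.

```python
from statistics import median


def rank_skew(step_times_by_rank):
    """Compare the slowest rank to the median rank for a single step.

    step_times_by_rank: dict mapping rank id -> step time in seconds.
    """
    worst_rank = max(step_times_by_rank, key=step_times_by_rank.get)
    worst = step_times_by_rank[worst_rank]
    med = median(step_times_by_rank.values())
    # Relative skew: 0.5 means the worst rank took 50% longer than the median.
    skew = (worst - med) / med if med > 0 else 0.0
    return {
        "worst_rank": worst_rank,
        "worst_s": worst,
        "median_s": med,
        "skew": skew,
    }
```

Because DDP synchronizes gradients every step, the slowest rank sets the pace for all of them, which is why the worst-vs-median gap is a more useful signal here than the mean.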


Integrations

Plain PyTorch

from trainlens.decorators import trace_step

with trace_step(model):
    ...

Hugging Face Trainer

from trainlens.integrations.huggingface import TrainLensTrainer

trainer = TrainLensTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    trainlens_enabled=True,
)

See docs/huggingface.md.

PyTorch Lightning

import lightning as L
from trainlens.integrations.lightning import TrainLensCallback

trainer = L.Trainer(callbacks=[TrainLensCallback()])

See docs/lightning.md.


ML Intelligence

TrainLens watches the run in real time, estimates five outcome probabilities, and surfaces an action recommendation. When a termination-grade condition is confirmed, the aggregator writes a .trainlens_terminate signal file; trace_step() polls that file at step boundaries.

Prediction chain:

  1. ColdStartFallback — rule-based heuristics, active from step 1
  2. XGBoost predictor — 90-feature tabular run-outcome classifier trained from RunStore history
  3. ROCKET+Ridge predictor — optional sequence model over per-step loss, grad norm, memory, and step time when enough labeled step_series data exists
  4. Parallel ensemble — runs XGBoost and ROCKET together when both are available, with an agreement gate for alert-grade actions
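
The chain above can be read as a fallback order with an agreement check at the top. The sketch below shows one way such a selection could work. Everything in it is hypothetical: the action labels ("continue", "warn", "terminate"), the `ALERT_GRADE` set, the downgrade-to-"warn" behavior on disagreement, and the predictor call signature are assumptions for illustration only.

```python
# Hypothetical set of actions that need both models to agree.
ALERT_GRADE = {"terminate"}


def choose_action(features, cold_start, tabular=None, sequence=None):
    """Pick an action from whichever predictors are available.

    Each predictor is a callable: features -> action string.
    Returns (source, action).
    """
    if tabular is not None and sequence is not None:
        a, b = tabular(features), sequence(features)
        if a == b:
            return "ensemble", a
        # Agreement gate: a lone alert-grade vote is downgraded.
        if a in ALERT_GRADE or b in ALERT_GRADE:
            return "ensemble", "warn"
        return "ensemble", a
    # Only one trained model exists: use it directly.
    for name, model in (("tabular", tabular), ("sequence", sequence)):
        if model is not None:
            return name, model(features)
    # No trained model yet: rule-based heuristics.
    return "cold_start", cold_start(features)
```

The point of the gate is that a disruptive action such as stopping a run should not rest on a single model's opinion, while low-stakes recommendations can pass through without agreement.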

Termination signal: if your loop is wrapped in trace_step(), TrainLens checks for the .trainlens_terminate signal file at the end of each traced step. You can also check for it manually:

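from trainlens.decorators import trace_step
from trainlens.runtime.auto_terminate import check_terminate_signal

# session_dir, max_steps, and model are defined by your training script.
for step in range(max_steps):
    with trace_step(model):
        ...
    # Stop the loop once the termination signal file has been written.
    if check_terminate_signal(session_dir):
        print("TrainLens: auto-terminating")
        break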
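
For readers unfamiliar with file-based signalling, here is a small standalone sketch of the general pattern: one process creates a marker file, and the training loop checks for its existence between steps. The file name comes from the text above; placing it directly inside the session directory, the file contents, and the helper names `write_signal` and `signal_present` are assumptions for this example and are not part of the TrainLens API.

```python
from pathlib import Path

SIGNAL_NAME = ".trainlens_terminate"  # file name as given in the text


def write_signal(session_dir, reason):
    """Writer side: create the marker file with a short reason string."""
    path = Path(session_dir) / SIGNAL_NAME
    path.write_text(reason)
    return path


def signal_present(session_dir):
    """Reader side: a cheap existence check, suitable for once per step."""
    return (Path(session_dir) / SIGNAL_NAME).exists()
```

A marker file works across process boundaries without sockets or shared memory, and checking it only at step boundaries means the loop always stops between steps rather than partway through one.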

Run history:

trainlens history                          # auto-discover from ./logs
trainlens history --db path/to/ml.db      # specific database
trainlens history --n 50                   # most recent 50 runs

Train a predictor manually:

trainlens train-model --db ./logs/<session>/aggregator/telemetry_ml.db --model-dir ./models

See docs/trainlens.md for the full ML intelligence reference.


CLI reference

| Subcommand | Description |
| --- | --- |
| trainlens run train.py | Live bottleneck diagnosis |
| trainlens deep train.py | Adds per-layer timing and memory signals |
| trainlens inspect telemetry.msgpack | Decode and print binary telemetry logs |
| trainlens history | Review ML run outcomes from RunStore |
| trainlens train-model | Train an XGBoost run-outcome predictor |
| trainlens serve | Serve the React/FastAPI dashboard over existing logs |
| trainlens collect | Run a standalone TCP collector for remote GPU pods |

Add --grad-diagnostics to run or deep to enable gradient diagnostics:

trainlens run train.py --grad-diagnostics
trainlens deep train.py --grad-diagnostics --nproc-per-node=4

Activates: gradient norm tracking, NaN/Inf detection, MFU (model FLOPs utilization), FSDP latency, comm-overlap, and pipeline bubble ratio. Confirmed NaN/Inf conditions write the termination signal file.
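
To show what gradient norm tracking and NaN/Inf detection compute, here is a dependency-free sketch over plain lists of floats. It is an illustration of the arithmetic only. The function name `global_grad_norm` is hypothetical, and in PyTorch the same idea would operate on each parameter's `.grad` tensor (for example with `torch.isfinite`) rather than on Python lists.

```python
import math


def global_grad_norm(grads):
    """Compute the global L2 norm and flag non-finite values.

    grads: iterable of flat lists of floats, one list per parameter.
    Returns (l2_norm_over_finite_values, has_non_finite).
    """
    total = 0.0
    has_non_finite = False
    for g in grads:
        for v in g:
            if not math.isfinite(v):
                # NaN or +/-Inf: flag it and keep it out of the norm.
                has_non_finite = True
            else:
                total += v * v
    return math.sqrt(total), has_non_finite
```

A norm that climbs steadily over many steps points to instability building up, while a single non-finite value usually means the step's update is already corrupted, which is why the two are reported separately.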


Optional model hooks

from trainlens.decorators import trace_model_instance

trace_model_instance(model)

Use alongside trace_step(model) for per-layer timing and memory signals. The core step-level view works without it.


Scope

TrainLens is for lightweight diagnosis during real PyTorch training runs.

It is not:

  • a kernel-level tracer
  • a general-purpose auto-tuner
  • a replacement for torch.profiler for deep kernel analysis
  • a managed fleet observability platform

Start with TrainLens when you need a fast answer. Reach for deeper profiling after you know where to look.


Feedback

If TrainLens caught a slowdown, please open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single GPU or DDP
  • whether you used core tracing only or model hooks
  • the end-of-run summary
  • a minimal repro if possible

Email: vsnm.tej@gmail.com


Contributing

External contributions are currently handled through GitHub issues and email. Please open an issue with a minimal reproduction before sending a patch.


License

Proprietary. Copyright 2026 Venkata Pydipalli. All Rights Reserved.

Use of this software requires explicit written permission. See LICENSE for details or contact vsnm.tej@gmail.com.

