
TrainLens: Lightweight training runtime health monitor.


TrainLens

Find why PyTorch training got slow — while the run is still live.

Requires Python 3.10+.

Quickstart | Client Test Kit | Tutorials | Client Onboarding | Setup Agent | ML Intelligence | Server Operations | Known Limitations | Examples

TrainLens is a lightweight, step-aware bottleneck finder for PyTorch training runs. It attaches to your training loop and surfaces what is actually slowing things down — per step, per rank — without heavyweight profiling overhead.

The gap it fills: system dashboards show utilization over time. TrainLens shows what happened per training step and, in DDP, which rank is holding the run back.
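To make the per-rank view concrete, here is a minimal sketch of how a straggler could be identified from one step's per-rank timings. It is an illustration only, not TrainLens's implementation: `find_straggler`, the skew definition, and the sample timings are all assumptions made for this example.

```python
from statistics import median


def find_straggler(step_times_by_rank: dict[int, float]) -> tuple[int, float]:
    """Return (slowest_rank, skew) for a single training step.

    skew = (worst - median) / median, so 0.0 means the ranks are balanced.
    This definition is an assumption for illustration.
    """
    worst_rank = max(step_times_by_rank, key=step_times_by_rank.get)
    med = median(step_times_by_rank.values())
    skew = (step_times_by_rank[worst_rank] - med) / med
    return worst_rank, skew


# Hypothetical wall-clock step times in seconds, keyed by DDP rank.
times = {0: 0.50, 1: 0.52, 2: 0.80, 3: 0.50}
rank, skew = find_straggler(times)
print(f"rank {rank} is {skew:.0%} slower than the median rank")
```

Because DDP synchronizes gradients every step, the slowest rank sets the pace for all ranks, which is why comparing the worst rank against the median is more informative than an average.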


What it catches

  • Input pipeline stalls (dataloader / preprocessing wait)
  • Step time drift and jitter over the run
  • DDP rank stragglers in single-node and multi-node setups
  • Memory creep and OOM trajectory
  • Gradient explosions and NaN/Inf conditions
  • FSDP and Pipeline Parallel overhead (--grad-diagnostics)
  • NCCL communication failures with root-cause attribution

Supported configurations

| Configuration | Status |
| --- | --- |
| Single GPU | Supported |
| Single-node DDP | Supported |
| Multi-node DDP (2–4 nodes, up to ~32 ranks) | Collector can ingest this scale, but true multi-node launch needs deployment-specific launcher integration |
| Multi-node DDP (8+ nodes, 64+ ranks) | Experimental; load-test runtime, storage, and collector throughput first |
| FSDP diagnostics | Supported (--grad-diagnostics) |
| Pipeline Parallel bubble diagnostics | Supported (--grad-diagnostics) |
| Tensor Parallel diagnostics | Partial (trace_tp_model) |
| Full fleet backend / multi-aggregator coordination | Planned |
| TensorFlow / Keras | Planned |

Quick start

pip install trainlens-ai
trainlens server start

trainlens server start pulls the protected server image from GHCR, starts the collector on localhost:29765, and serves the dashboard at http://localhost:8765.

Wrap your training step:

from trainlens.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
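For intuition about what a step wrapper can observe, the following sketch times the body of each step and the gap between consecutive steps, which is mostly time spent waiting for the next batch. It is not how `trace_step` is implemented; `StepTimer` and `slow_batches` are hypothetical names, and `time.sleep` stands in for real work. On a GPU, kernels launch asynchronously, so plain wall-clock timing like this would also need device synchronization to be accurate.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Records time inside each step and the wait between consecutive steps."""

    def __init__(self) -> None:
        self.step_times: list[float] = []
        self.wait_times: list[float] = []
        self._last_exit: float | None = None

    @contextmanager
    def step(self):
        start = time.perf_counter()
        if self._last_exit is not None:
            # Gap since the previous step ended: mostly fetching the next batch.
            self.wait_times.append(start - self._last_exit)
        try:
            yield
        finally:
            end = time.perf_counter()
            self.step_times.append(end - start)
            self._last_exit = end


def slow_batches(n: int, delay: float):
    """Stand-in for a dataloader that needs `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


timer = StepTimer()
for batch in slow_batches(3, delay=0.02):
    with timer.step():
        time.sleep(0.005)  # stand-in for forward / backward / optimizer

print("step times:", [f"{t * 1000:.1f} ms" for t in timer.step_times])
print("input waits:", [f"{t * 1000:.1f} ms" for t in timer.wait_times])
```

In this toy run the input wait is larger than the step body, which is the pattern an input pipeline stall produces. The wait before the first step is not recorded because there is no previous step to measure from.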

Run your script through TrainLens and stream it to the dashboard:

trainlens run --aggregator-host localhost:29765 train.py

For local terminal-only development, trainlens run train.py still works and starts a collector for that one process.

Further documentation:

  • docs/quickstart.md: full setup details.
  • docs/client-test-kit.md: packaging guide, the starting point for design-partner or client trials.
  • docs/tutorials.md: guided examples across PyTorch, DDP, Hugging Face, Lightning, text, image, and LLM fine-tuning workloads.
  • docs/client-setup-agent.md: an optional agent workflow for client onboarding that applies the GitBook installation steps as a small, reviewable patch.
  • docs/client-onboarding-notebooks.md: the notebook path.


What TrainLens shows

  • Step time and its breakdown (forward / backward / optimizer / overhead)
  • Dataloader and input wait per step
  • Step jitter and drift over time
  • GPU memory trend and OOM trajectory
  • CPU / RAM / GPU utilization signals
  • In DDP: worst-rank vs. median-rank timing and skew per step

This lets you tell whether a slowdown is coming from input, compute, the optimizer, or rank imbalance — before reaching for torch.profiler.
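As an illustration of that kind of attribution, the sketch below compares a per-phase timing breakdown against an earlier baseline and reports which phase grew the most. The phase names, the sample numbers, and `dominant_change` are assumptions for this example, not TrainLens output or API.

```python
def dominant_change(
    baseline: dict[str, float], current: dict[str, float]
) -> tuple[str, float]:
    """Return the phase whose time grew the most, and the growth in seconds."""
    deltas = {phase: current[phase] - baseline[phase] for phase in baseline}
    phase = max(deltas, key=deltas.get)
    return phase, deltas[phase]


# Hypothetical per-step breakdowns in seconds: early in the run vs. now.
baseline = {"input_wait": 0.02, "forward": 0.10, "backward": 0.20, "optimizer": 0.03}
current = {"input_wait": 0.15, "forward": 0.10, "backward": 0.21, "optimizer": 0.03}

phase, delta = dominant_change(baseline, current)
print(f"{phase} grew by {delta * 1000:.0f} ms per step")
```

Here the compute phases are nearly unchanged while the input wait has grown, so the dataloader is the place to look before profiling kernels.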


Integrations

Plain PyTorch

from trainlens.decorators import trace_step

with trace_step(model):
    ...

Hugging Face Trainer

from trainlens.integrations.huggingface import TrainLensTrainer

trainer = TrainLensTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    trainlens_enabled=True,
)

See docs/huggingface.md.

PyTorch Lightning

import lightning as L
from trainlens.integrations.lightning import TrainLensCallback

trainer = L.Trainer(callbacks=[TrainLensCallback()])

See docs/lightning.md.


ML Intelligence

TrainLens watches the run in real time, estimates 5 outcome probabilities, and surfaces an action recommendation. When a termination-grade condition is confirmed, the aggregator writes a .trainlens_terminate signal file; trace_step() polls that file at step boundaries.

Prediction chain:

  1. ColdStartFallback — rule-based heuristics, active from step 1
  2. XGBoost predictor — 90-feature tabular run-outcome classifier trained from RunStore history
  3. ROCKET+Ridge predictor — optional sequence model over per-step loss, grad norm, memory, and step time when enough labeled step_series data exists
  4. Parallel ensemble — runs XGBoost and ROCKET together when both are available, with an agreement gate for alert-grade actions
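The chain above can be read as a fallback order plus a gate. The sketch below shows one way such a selection could be wired, under stated assumptions: the action names, the `recommend` function, and the behavior on disagreement (dropping the alert-grade vote) are all hypothetical, and the predictors are plain callables rather than real XGBoost or ROCKET models.

```python
from typing import Callable, Optional

# A predictor maps run features to a recommended action name.
Predictor = Callable[[dict], str]

# Hypothetical set of actions that count as alert-grade.
ALERT_GRADE = {"terminate"}


def recommend(
    features: dict,
    cold_start: Predictor,
    tabular: Optional[Predictor] = None,
    sequence: Optional[Predictor] = None,
) -> str:
    """Pick an action using whichever predictors are available."""
    available = [p for p in (tabular, sequence) if p is not None]
    if not available:
        # No trained model yet: fall back to rule-based heuristics.
        return cold_start(features)
    votes = [p(features) for p in available]
    if len(votes) == 1 or votes[0] == votes[1]:
        return votes[0]
    # Both models ran and disagree: the assumed gate blocks alert-grade actions.
    return next((v for v in votes if v not in ALERT_GRADE), "continue")


rules = lambda f: "continue"
wants_stop = lambda f: "terminate"
wants_warn = lambda f: "warn"

print(recommend({}, rules))                          # heuristics only
print(recommend({}, rules, wants_stop, wants_stop))  # both models agree
print(recommend({}, rules, wants_stop, wants_warn))  # gate holds back the alert
```

Requiring agreement before an alert-grade action trades some sensitivity for fewer false terminations, which matters when a wrong call stops a long run.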

Termination signal: if your loop is wrapped in trace_step(), TrainLens checks for the .trainlens_terminate file at the end of each traced step. Manual checks are also available:

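from trainlens.decorators import trace_step
from trainlens.runtime.auto_terminate import check_terminate_signal

# session_dir and max_steps come from your own run configuration
for step in range(max_steps):
    with trace_step(model):
        ...
    # Manual check after each step, outside the traced block
    if check_terminate_signal(session_dir):
        print("TrainLens: auto-terminating")
        break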
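The signal-file pattern itself is simple enough to sketch without TrainLens. The example below simulates a monitor writing the file and a loop polling for it at step boundaries. Only the file name `.trainlens_terminate` comes from the text above; the helper names, the file contents, and placing the file directly in the session directory are assumptions for illustration.

```python
import tempfile
from pathlib import Path

SIGNAL_NAME = ".trainlens_terminate"  # file name taken from the text above


def write_signal(session_dir: Path, reason: str) -> Path:
    """Hypothetical monitor side: create the signal file with a reason."""
    path = session_dir / SIGNAL_NAME
    path.write_text(reason)
    return path


def signal_present(session_dir: Path) -> bool:
    """Hypothetical training side: cheap existence check at a step boundary."""
    return (session_dir / SIGNAL_NAME).exists()


with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp)
    completed = 0
    for step in range(10):
        completed += 1  # stand-in for one training step
        if step == 3:
            # Simulate the monitor confirming a termination-grade condition.
            write_signal(session, "nan_loss")
        if signal_present(session):
            break

print(f"stopped after {completed} steps")
```

A file on disk works across processes without any shared connection, and checking only at step boundaries means the loop never stops in the middle of an optimizer update.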

Run history:

trainlens history                       # auto-discover from ./logs
trainlens history --db path/to/ml.db    # specific database
trainlens history --n 50                # most recent 50 runs

Train a predictor manually:

trainlens train-model --db ./logs/<session>/aggregator/telemetry_ml.db --model-dir ./models

See docs/trainlens.md for the full ML intelligence reference.


CLI reference

| Subcommand | Description |
| --- | --- |
| trainlens run train.py | Live bottleneck diagnosis |
| trainlens deep train.py | Adds per-layer timing and memory signals |
| trainlens inspect telemetry.msgpack | Decode and print binary telemetry logs |
| trainlens history | Review ML run outcomes from RunStore |
| trainlens train-model | Train an XGBoost run-outcome predictor |
| trainlens server start | Start the protected Docker server with collector and dashboard |
| trainlens serve | Serve the React/FastAPI dashboard over existing logs |
| trainlens collect | Run a standalone TCP collector for remote GPU pods |

Add --grad-diagnostics to run or deep to enable gradient diagnostics:

trainlens run train.py --grad-diagnostics
trainlens deep train.py --grad-diagnostics --nproc-per-node=4

The flag activates gradient norm tracking, NaN/Inf detection, MFU (model FLOPs utilization), FSDP latency, comm-overlap, and pipeline bubble ratio. Confirmed NaN/Inf conditions write the termination signal file.
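Two of these metrics have widely used textbook definitions, sketched below for orientation. The source does not state which formulas TrainLens uses, so treat both as assumptions: MFU here uses the common estimate of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, and the bubble figure is the idle fraction of a GPipe-style schedule. The function names and sample numbers are hypothetical.

```python
def mfu(
    params: float, tokens_per_step: float, step_time_s: float, peak_flops: float
) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by hardware peak.

    Uses the 6 * params * tokens approximation for dense transformers.
    """
    achieved = 6.0 * params * tokens_per_step / step_time_s
    return achieved / peak_flops


def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


# Hypothetical run: 1B parameters, 100k tokens per step, 1.0 s per step,
# on hardware with an aggregate peak of 1e15 FLOP/s.
print(f"MFU: {mfu(1e9, 1e5, 1.0, 1e15):.0%}")

# Hypothetical pipeline: 4 stages, 12 microbatches per step.
print(f"bubble: {pipeline_bubble_fraction(4, 12):.0%}")
```

Both numbers move in intuitive directions: MFU falls as step time grows for the same work, and the bubble shrinks as the number of microbatches increases relative to the number of stages.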


Optional model hooks

from trainlens.decorators import trace_model_instance

trace_model_instance(model)

Use alongside trace_step(model) for per-layer timing and memory signals. The core step-level view works without it.


Scope

TrainLens is for lightweight diagnosis during real PyTorch training runs.

It is not:

  • a kernel-level tracer
  • a general-purpose auto-tuner
  • a replacement for torch.profiler for deep kernel analysis
  • a managed fleet observability platform

Start with TrainLens when you need a fast answer. Reach for deeper profiling after you know where to look.


Feedback

If TrainLens caught a slowdown, please open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single GPU or DDP
  • whether you used core tracing only or model hooks
  • the end-of-run summary
  • a minimal repro if possible

Email: vsnm.tej@gmail.com


Contributing

External contributions are currently managed through GitHub issues and email. Please open an issue with a minimal reproduction before sending a patch.


License

Proprietary. Copyright 2026 Venkata Pydipalli. All Rights Reserved.

Use of this software requires explicit written permission. See LICENSE for details or contact vsnm.tej@gmail.com.
