TraceML: Lightweight training runtime health monitor.

TraceML

Find why training is slow, while it is still running.

TraceML is a lightweight bottleneck finder for PyTorch training. It helps you catch:

  • input stalls
  • unstable or drifting step times
  • DDP rank stragglers
  • memory creep over time

without jumping straight to heavyweight profiling.

The gap it fills: system dashboards show utilization over time. TraceML shows what happens during training steps and, in distributed settings, which rank is slowing the run down.

Works today: single GPU, single-node DDP/FSDP

Not yet: multi-node, tensor parallelism (TP), pipeline parallelism (PP)

With minimal setup, you can observe system and process behavior during training:

pip install traceml-ai
traceml watch train.py

When to use TraceML

Use it when training feels:

  • slower than expected
  • jittery from step to step
  • imbalanced across distributed ranks
  • stable in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal. Reach for torch.profiler once you know where to dig.


Quick start

Zero-code first look

traceml watch train.py

Use watch for a zero-code live view of system and process behavior while training is running.

Step-aware bottleneck diagnosis

Wrap your training step to see where time goes:

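from traceml.decorators import trace_step

# model, criterion, optimizer, and dataloader come from your existing training script.
for batch in dataloader:
    # Wrap one full training step so TraceML can time it.
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)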

Run through TraceML:

traceml run train.py
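
To make the idea of a step-time breakdown concrete, here is a minimal pure-Python sketch that separates input wait from compute using wall-clock timers. It is not TraceML's implementation: `PhaseTimer`, `slow_batches`, and the phase names are hypothetical, and the sleeps stand in for a real dataloader and model.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def slow_batches(n, delay):
    """Stand-in for a dataloader that stalls `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


timer = PhaseTimer()
batches = iter(slow_batches(5, delay=0.02))
while True:
    # Time spent waiting for the next batch counts as input wait.
    with timer.phase("input_wait"):
        batch = next(batches, None)
    if batch is None:
        break
    # Stand-in for forward, backward, and optimizer work.
    with timer.phase("compute"):
        time.sleep(0.005)

shares = timer.shares()
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

In this toy run the loop is input-bound, because the simulated loader is slower than the simulated compute. On a GPU, kernels launch asynchronously, so plain wall-clock timers like these can misattribute time unless the device is synchronized; the sketch only illustrates the bookkeeping.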

During training, TraceML opens a live CLI view alongside your logs.

(Screenshot: TraceML terminal dashboard)

At the end of the run, it prints a compact summary.

(Screenshot: TraceML summary)

TraceML also includes a local UI. See docs/quickstart.md for setup details.


Run modes

traceml watch train.py

Zero-code live visibility for system and process behavior.

traceml run train.py

Default mode for live bottleneck diagnosis.

traceml deep train.py

Adds per-layer timing and memory signals for deeper inspection (experimental).

Start with watch for fast visibility. Use run when you need step-aware diagnosis. Use deep only when you need layer-level root cause.


What TraceML shows

  • CPU / RAM / GPU signals
  • step time and its breakdown
  • dataloader / input wait
  • forward / backward / optimizer / overhead timing
  • step jitter and drift
  • GPU memory trend
  • in distributed settings: worst-rank vs median-rank timing and skew

This helps you tell whether the slowdown is coming from input, compute, optimizer work, or rank imbalance.
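
This README does not define how TraceML computes jitter, drift, or skew. As a rough sketch under assumed definitions, jitter can be read as the spread of step times relative to their mean, drift and memory trend as a least-squares slope over steps, and rank skew as the worst rank's step time divided by the median. All function names and numbers below are illustrative.

```python
import statistics


def jitter(step_times):
    """Coefficient of variation: population stdev divided by mean."""
    return statistics.pstdev(step_times) / statistics.fmean(step_times)


def slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def rank_skew(per_rank_times):
    """Return (worst_rank, worst_time / median_time) for a {rank: seconds} map."""
    median = statistics.median(per_rank_times.values())
    worst_rank = max(per_rank_times, key=per_rank_times.get)
    return worst_rank, per_rank_times[worst_rank] / median


# Made-up measurements for illustration only.
step_times = [0.100, 0.102, 0.104, 0.106, 0.108]  # seconds per step
memory_mb = [1000, 1010, 1020, 1030]              # GPU memory per step
rank_times = {0: 0.10, 1: 0.11, 2: 0.10, 3: 0.20}  # seconds per rank

step_jitter = jitter(step_times)
step_drift = slope(step_times)      # seconds gained per step
memory_trend = slope(memory_mb)     # MB gained per step
worst_rank, skew = rank_skew(rank_times)

print(f"jitter={step_jitter:.3f} drift={step_drift:.4f}s/step")
print(f"memory trend={memory_trend:.1f} MB/step")
print(f"worst rank={worst_rank} skew={skew:.2f}x median")
```

A positive slope on step time suggests drift, a positive slope on memory suggests creep, and a skew well above 1 points at a straggler rank. What counts as "too high" depends on the workload, so no thresholds are assumed here.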


Supported stacks

Standard PyTorch loop

Use trace_step(model) around your training step.

Hugging Face Trainer

from traceml.integrations.huggingface import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)

See docs/huggingface.md for the full setup.

PyTorch Lightning

import lightning as L
from traceml.integrations.lightning import TraceMLCallback

trainer = L.Trainer(callbacks=[TraceMLCallback()])

See docs/lightning.md for the full setup.


Optional model hooks (experimental)

from traceml.decorators import trace_model_instance

trace_model_instance(model)

Use this with trace_step(model) when you want optional per-layer timing and memory signals. The core step-level view works without it.

This is experimental and may not work with torch.compile, especially with full-graph compilation.


Scope

TraceML is for lightweight diagnosis during real PyTorch training runs.

It is not:

  • a kernel-level tracer
  • an auto-tuner
  • a replacement for deep profilers
  • a full observability platform

Example cases

Start with examples such as:

  • basic example
  • input / dataloader stall
  • DDP straggler / rank skew

See Examples for runnable cases.


Feedback

If TraceML caught a slowdown for you, please open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single-GPU or multi-GPU
  • whether you used watch, run, or deep
  • whether you used core tracing only or model hooks
  • the end-of-run summary
  • a minimal repro if possible
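
A small script can gather most of the version details for the report. This is a sketch, not a TraceML utility: `collect_env` is a hypothetical helper, and it falls back gracefully when PyTorch is not installed.

```python
import platform


def collect_env():
    """Gather version details for a bug report.

    Torch-related fields stay at their defaults if torch is not installed.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "torch": None,
        "cuda": None,
        "gpu_count": 0,
    }
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["gpu_count"] = torch.cuda.device_count()
    return info


env = collect_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

Paste the printed lines into the issue alongside the end-of-run summary.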

📧 Email: support@traceopt.ai

📋 User Survey: https://forms.gle/KwPSLaPmJnJjoVXSA


Contributing

Contributions are welcome, especially:

  • reproducible slowdown cases
  • integrations
  • bug reports
  • examples

License

Apache 2.0. See LICENSE.

Download files

Source Distribution

traceml_ai-0.2.6.tar.gz (212.0 kB)

Uploaded Source

Built Distribution

traceml_ai-0.2.6-py3-none-any.whl (297.5 kB)

Uploaded Python 3

File details

Details for the file traceml_ai-0.2.6.tar.gz.

File metadata

  • Download URL: traceml_ai-0.2.6.tar.gz
  • Upload date:
  • Size: 212.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.2.6.tar.gz
Algorithm Hash digest
SHA256 6c930cf5d4551fb9c1332a600f64fba07f8c7b00c1e827d1ff0274d405c74cab
MD5 43bc421bdb08e29f41e7a3fff018d53a
BLAKE2b-256 abf51f20c862f759b932eb99aec350aa1000d470fb6010def20128e00452555d

File details

Details for the file traceml_ai-0.2.6-py3-none-any.whl.

File metadata

  • Download URL: traceml_ai-0.2.6-py3-none-any.whl
  • Upload date:
  • Size: 297.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.2.6-py3-none-any.whl
Algorithm Hash digest
SHA256 bd5b2cff8ef51c128828780627448733e8867f43c990121cf8fd216830523359
MD5 c0bcb597295c3db081b18af6429bcfaf
BLAKE2b-256 8b15ecffe7997d47eb11ce7155b3d7223d4f6b2ea3b04dac05baa6f88fe114bb
