TraceML: Lightweight training runtime health monitor.

TraceML

Find why PyTorch training is slow while the job is still running.

Requires Python 3.10+.

TraceML helps you catch the following while the job is still running, without jumping straight to a heavyweight profiler:

  • input bottlenecks
  • compute-bound steps
  • DDP stragglers
  • wait-heavy training
  • memory creep over time

Why this exists: dashboards show utilization and curves. TraceML shows why throughput is poor inside the training step.


The fastest way to try it

Install:

pip install traceml-ai

Wrap your training step:

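import traceml

# Assumes dataloader, model, criterion, and optimizer are already defined.
for batch in dataloader:
    # Everything inside this block is traced as one training step.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()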

Run:

traceml run train.py

During training, TraceML opens a live terminal view alongside your logs.

[Figure: TraceML terminal dashboard]

At the end of the run, it prints a compact summary you can review or share.

[Figure: TraceML summary]

If you want a low-noise run and a structured summary you can log to W&B or MLflow, launch in summary mode and call traceml.final_summary() near the end of your script:

traceml run train.py --mode=summary
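
The shape of what traceml.final_summary() returns is not described here, so the following is only a sketch of one way to forward a nested summary to a tracker that expects flat metric names. The `summary` dict and its keys are hypothetical stand-ins, and `flatten_summary` is an illustrative helper, not part of TraceML.

```python
from typing import Any, Dict


def flatten_summary(summary: Dict[str, Any], prefix: str = "traceml") -> Dict[str, float]:
    """Flatten a nested dict into {"prefix/a/b": value}, keeping numeric leaves only."""
    flat: Dict[str, float] = {}
    for key, value in summary.items():
        name = f"{prefix}/{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the metric name.
            flat.update(flatten_summary(value, prefix=name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
        # Strings, booleans, and other non-numeric values are skipped.
    return flat


# Hypothetical stand-in for a structured end-of-run summary (assumed shape).
summary = {
    "step_time_ms": {"median": 182.0, "p95": 240.5},
    "dataloader_share": 0.31,
    "verdict": "input-bound",
}

metrics = flatten_summary(summary)
for name, value in sorted(metrics.items()):
    print(f"{name} = {value}")
```

A flat dict like `metrics` can be passed to `wandb.log(metrics)` or `mlflow.log_metrics(metrics)`. Check the real summary structure before relying on any of the key names used here.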

For full setup details, see docs/quickstart.md.

Not sure how to interpret the output? Read How to Read TraceML Output.


What TraceML tells you

TraceML helps answer questions like:

  • Is training input-bound or compute-bound?
  • Is one DDP rank slower than the others?
  • Is the job wait-heavy because of uneven progress?
  • Is memory drifting upward over time?
  • Is the slowdown coming from dataloader, forward, backward, or optimizer work?
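
To make the input-bound versus compute-bound question concrete, here is a minimal sketch of per-phase timing in plain Python. It is not how TraceML is implemented; `PhaseTimer` and the phase names are hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> Dict[str, float]:
        """Fraction of total measured time spent in each phase."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("dataloader"):
        time.sleep(0.02)   # stand-in for fetching a batch
    with timer.phase("compute"):
        time.sleep(0.005)  # stand-in for forward + backward + optimizer

shares = timer.shares()
bottleneck = max(shares, key=lambda name: shares[name])
print(bottleneck, {name: round(share, 2) for name, share in shares.items()})
```

On a GPU, CUDA kernels launch asynchronously, so a wall-clock timer like this only reflects device time if the device is synchronized (for example with `torch.cuda.synchronize()`) before each timestamp.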

When to use TraceML

Use TraceML when training feels:

  • slower than expected
  • unstable from step to step
  • imbalanced across distributed ranks
  • fine in dashboards but still underperforming
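
For the rank imbalance case, one simple heuristic is to compare each rank's typical step time against the median across ranks. The sketch below uses made-up numbers; `find_stragglers` and the 10% tolerance are illustrative assumptions, not TraceML's detection logic. In DDP, gradient all-reduce makes ranks wait for each other, so the comparison is most meaningful on time measured before synchronization.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(step_ms_by_rank: Dict[int, List[float]], tolerance: float = 0.10) -> List[int]:
    """Return ranks whose median step time exceeds the cross-rank median by more than `tolerance`."""
    rank_medians = {rank: median(times) for rank, times in step_ms_by_rank.items()}
    baseline = median(rank_medians.values())
    return sorted(
        rank for rank, value in rank_medians.items()
        if value > baseline * (1.0 + tolerance)
    )


# Made-up per-rank pre-synchronization step times in milliseconds.
step_ms_by_rank = {
    0: [100.0, 102.0, 101.0],
    1: [99.0, 101.0, 100.0],
    2: [131.0, 128.0, 135.0],
    3: [101.0, 100.0, 103.0],
}

stragglers = find_stragglers(step_ms_by_rank)
print(stragglers)
```

Using medians rather than means keeps a single slow step (a checkpoint write, for instance) from flagging an otherwise healthy rank.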

Start with TraceML when you need a fast answer in the terminal. Reach for torch.profiler once you know where to dig deeper.


How it fits with your stack

TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.

Use those for:

  • experiment tracking
  • artifacts
  • dashboards
  • team reporting

Use TraceML for:

  • bottleneck diagnosis
  • rank imbalance / straggler detection
  • memory trend debugging
  • structured final summaries you can forward into W&B or MLflow

See Use TraceML with W&B / MLflow.
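
For the memory trend debugging use case listed above, a rough way to tell creep from noise is to fit a line through per-step memory samples and look at the slope. This is a minimal sketch with made-up samples; `slope_per_step` is a hypothetical helper and does not describe how TraceML measures memory.

```python
from typing import Sequence


def slope_per_step(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Made-up allocated-memory samples in MiB, one per training step.
steady = [4096.0, 4101.0, 4094.0, 4099.0, 4097.0, 4095.0]
creeping = [4096.0 + 3.0 * i for i in range(6)]

print(f"steady:   {slope_per_step(steady):+.2f} MiB/step")
print(f"creeping: {slope_per_step(creeping):+.2f} MiB/step")
```

In a PyTorch job the samples could come from `torch.cuda.memory_allocated()` read once per step. A slope that stays clearly positive over many steps points at growth worth investigating, while a slope near zero suggests ordinary fluctuation.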


Current support

Works today:

  • single GPU
  • single-node DDP/FSDP

Not yet:

  • multi-node
  • tensor parallel
  • pipeline parallel


Feedback

If TraceML helped you find a slowdown, please open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single GPU or multi-GPU
  • whether you used run, watch, or deep
  • the end-of-run summary
  • a minimal repro if possible

GitHub issues: https://github.com/traceopt-ai/traceml/issues

Email: support@traceopt.ai


Contributing

Contributions are welcome, especially:

  • reproducible slowdown cases
  • bug reports
  • docs improvements
  • integrations
  • examples

License

Apache 2.0. See LICENSE.
