TraceML: Lightweight training runtime health monitor.

TraceML

Catch PyTorch training slowdowns early, while the job is still running.

Requires Python 3.10+.

Quickstart | Compare Runs | How to Read Output | FAQ | Use with W&B / MLflow | Issues

TraceML is an open-source tool for catching PyTorch training slowdowns early, so bad runs do not quietly waste costly compute.

It gives you lightweight step-level signals while the job is still running, so you can quickly tell whether the slowdown looks input-bound, compute-bound, wait-heavy, imbalanced across ranks, or memory-related.

Use TraceML when you want a fast answer before reaching for a heavyweight profiler.

⭐ If TraceML helps you, please consider starring the repo.

Upcoming rename: TraceML will transition to TraceOpt in a future release. For now, the active package remains traceml-ai and Python imports remain traceml. The future PyPI package name traceopt-ai is now in place as we prepare the migration.


The fastest way to try it

Install:

pip install traceml-ai

Wrap your training step:

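import traceml

# Assumes model, optimizer, criterion, and dataloader are already defined.
for batch in dataloader:
    # Wrap one full training step: forward, backward, and optimizer update.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()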

Run:

traceml run train.py

During training, TraceML opens a live terminal view alongside your logs.

[Screenshot: TraceML terminal dashboard]

At the end of the run, it prints a compact summary you can review or share.

[Screenshot: TraceML summary]

Start with traceml run train.py. Most users do not need watch or deep at first.


Core workflows

1. Live diagnosis

Use the default workflow when you want live step-aware diagnosis during training plus the end-of-run summary.

traceml run train.py

2. Low-noise summary runs

Use summary mode when you mainly want the structured final summary for logging into W&B or MLflow.

traceml run train.py --mode=summary

Then call traceml.final_summary() near the end of your script.

TraceML also writes canonical summary artifacts for the run, including final_summary.json, which is the intended machine-readable output for downstream logging and later run comparison.
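As a rough sketch of forwarding that file to a tracker: the helper below flattens the numeric leaves of a JSON document into dotted metric names. The function names flatten_numeric and load_summary_metrics are hypothetical, and the sketch makes no assumption about which fields final_summary.json contains or where TraceML writes it. It simply walks whatever structure is there.

```python
import json
from typing import Any


def flatten_numeric(obj: Any, prefix: str = "") -> dict[str, float]:
    """Collect the numeric leaves of a nested JSON value under dotted keys."""
    metrics: dict[str, float] = {}
    if isinstance(obj, bool):
        # bool is a subclass of int in Python; do not report flags as metrics.
        return metrics
    if isinstance(obj, (int, float)):
        metrics[prefix or "value"] = float(obj)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            metrics.update(flatten_numeric(value, child))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            child = f"{prefix}.{index}" if prefix else str(index)
            metrics.update(flatten_numeric(value, child))
    # Strings, None, and other non-numeric leaves are ignored.
    return metrics


def load_summary_metrics(path: str) -> dict[str, float]:
    """Read a summary JSON file (path supplied by the caller) and flatten it."""
    with open(path, encoding="utf-8") as handle:
        return flatten_numeric(json.load(handle))
```

The resulting dictionary could then be passed to something like mlflow.log_metrics(metrics) or wandb.log(metrics), assuming the corresponding tracker is installed and a run is active.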

3. Compare two runs

If you have final_summary.json from two runs, compare them directly:

traceml compare run_a.json run_b.json

TraceML writes both a structured compare JSON and a compact text report.

See docs/compare.md.
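To make the idea of a run-to-run comparison concrete, here is a minimal sketch that computes relative changes between two flat metric dictionaries. It is not what traceml compare computes internally, and the metric names in the usage example are invented for illustration.

```python
def relative_changes(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return (candidate - baseline) / |baseline| for metrics present in both runs."""
    changes: dict[str, float] = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        base = baseline[name]
        if base == 0:
            # Relative change is undefined against a zero baseline; skip it.
            continue
        changes[name] = (candidate[name] - base) / abs(base)
    return changes


if __name__ == "__main__":
    # Hypothetical metric names, not actual TraceML summary fields.
    run_a = {"step_time_ms": 100.0, "peak_memory_gb": 10.0}
    run_b = {"step_time_ms": 125.0, "peak_memory_gb": 9.0}
    for name, change in relative_changes(run_a, run_b).items():
        print(f"{name}: {change:+.1%}")
```

Metrics that appear in only one run are left out, so the result only covers quantities both runs actually reported.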


What TraceML helps you see

TraceML is currently strongest at surfacing:

  • step-time slowdowns while training is still running
  • whether the pattern looks input-bound, compute-bound, or wait-heavy
  • whether work is uneven across distributed ranks
  • whether memory is drifting upward over time
  • where time is showing up across dataloader, forward, backward, and optimizer phases
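One way to picture the phase breakdown is as per-step shares of time. The sketch below takes timings for the four phases named above and applies a simple rule of thumb. The function names and the 40% threshold are assumptions made for illustration, not TraceML's actual heuristics, and the sketch does not try to model the wait-heavy case.

```python
def phase_shares(phase_ms: dict[str, float]) -> dict[str, float]:
    """Convert per-phase durations into fractions of the total step time."""
    total = sum(phase_ms.values())
    if total <= 0:
        raise ValueError("total step time must be positive")
    return {phase: duration / total for phase, duration in phase_ms.items()}


def label_step(phase_ms: dict[str, float], input_threshold: float = 0.4) -> str:
    """Label a step as input-bound when data loading takes a large share.

    The threshold is an illustrative assumption; tune it for your workload.
    """
    shares = phase_shares(phase_ms)
    if shares.get("dataloader", 0.0) >= input_threshold:
        return "input-bound"
    return "compute-bound"


if __name__ == "__main__":
    # Made-up timings in milliseconds for a single step.
    step = {"dataloader": 60.0, "forward": 15.0, "backward": 20.0, "optimizer": 5.0}
    print(phase_shares(step))
    print(label_step(step))
```

A step that spends most of its time waiting on the dataloader usually points at input pipeline work (more workers, faster storage, cheaper preprocessing) rather than at the model itself.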

It is designed to help you decide quickly whether a run looks healthy or whether it is worth digging deeper.


When to use TraceML

Use TraceML when training feels:

  • slower than expected
  • unstable from step to step
  • imbalanced across distributed ranks
  • fine in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal. Reach for torch.profiler once you know where to dig deeper.


How it fits with your stack

TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.

Use those for:

  • experiment tracking
  • artifacts
  • dashboards
  • team reporting

Use TraceML for:

  • bottleneck diagnosis while a run is still in progress
  • spotting throughput drift during a run
  • checking for rank imbalance or straggler patterns
  • checking for memory creep or pressure signals
  • structured final summaries you can forward into W&B or MLflow
  • simple run-to-run comparison from saved TraceML summary JSON files
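Two of the signals above, rank imbalance and memory creep, can be illustrated with simple statistics. The sketch below is a hand-rolled approximation with hypothetical function names: a straggler ratio (slowest rank over the median rank) and a least-squares slope of memory against step index. It is not TraceML's implementation, and it assumes you have already collected the per-rank step times and memory samples yourself.

```python
from statistics import median


def straggler_ratio(step_ms_by_rank: dict[int, float]) -> float:
    """Slowest rank's step time divided by the median step time across ranks."""
    times = list(step_ms_by_rank.values())
    if not times:
        raise ValueError("need at least one rank")
    return max(times) / median(times)


def memory_slope(samples_mb: list[float]) -> float:
    """Least-squares slope of memory versus step index, in MB per step."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Made-up numbers: rank 3 lags the others, and memory grows 10 MB per step.
    print(straggler_ratio({0: 100.0, 1: 102.0, 2: 98.0, 3: 150.0}))
    print(memory_slope([1000.0, 1010.0, 1020.0, 1030.0]))
```

A ratio close to 1.0 suggests balanced ranks, and a slope near zero suggests stable memory. What counts as "too high" for either value depends on the workload.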

See Use TraceML with W&B / MLflow.


Current support

Works today:

  • single GPU
  • single-node DDP/FSDP

Not yet:

  • multi-node
  • tensor parallel
  • pipeline parallel

Learn more

Need a lighter zero-code first look or a deeper follow-up run? See the Quickstart and FAQ for watch and deep.


Feedback

If TraceML helped you catch a slowdown, please open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single GPU or multi-GPU
  • whether you used run, watch, or deep
  • the end-of-run summary
  • a minimal repro if possible

GitHub issues: https://github.com/traceopt-ai/traceml/issues

Email: support@traceopt.ai


Contributing

Contributions are welcome, especially:

  • reproducible slowdown cases
  • bug reports
  • docs improvements
  • integrations
  • examples

License

Apache 2.0. See LICENSE.
