TraceML: Lightweight training runtime health monitor.

TraceML

Find why PyTorch training is slow while the job is still running.

PyPI version Python 3.10+ License GitHub stars

Quickstart | How to Read Output | FAQ | Use with W&B / MLflow | Issues

TraceML surfaces training bottlenecks in PyTorch while the job is still running, so you can diagnose them without jumping straight to a heavyweight profiler. It helps you catch:

  • input bottlenecks
  • compute-bound steps
  • DDP stragglers
  • wait-heavy training
  • memory creep over time

Why this exists: dashboards show utilization and metric curves. TraceML looks inside the training step and shows why throughput is poor.


The fastest way to try it

Install:

pip install traceml-ai

Wrap your training step:

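from traceml.decorators import trace_step

# model, optimizer, criterion, and dataloader come from your existing training script.
for batch in dataloader:
    # Wrap the whole step (zero_grad through optimizer.step) in trace_step.
    with trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()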

Run:

traceml run train.py

During training, TraceML opens a live terminal view alongside your logs.

[Image: TraceML terminal dashboard]

At the end of the run, it prints a compact summary you can review or share.

[Image: TraceML summary]

For full setup details, see docs/quickstart.md.

Not sure how to interpret the output? Read How to Read TraceML Output.


What TraceML tells you

TraceML helps answer questions like:

  • Is training input-bound or compute-bound?
  • Is one DDP rank slower than the others?
  • Is the job wait-heavy because of uneven progress?
  • Is memory drifting upward over time?
  • Is the slowdown coming from dataloader, forward, backward, or optimizer work?
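
To make the input-bound versus compute-bound question concrete, here is a minimal sketch of the underlying idea: time the wait for the next batch separately from the work done on it, then compare the shares. This is not TraceML's implementation. The helpers `timed`, `slow_loader`, `train_step`, and `run` are hypothetical names, and `time.sleep` stands in for real data loading and model work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical helper state (not part of TraceML): wall-clock seconds per phase.
phase_seconds = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the wall-clock time spent inside the block to `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start


def slow_loader(num_batches, delay):
    """Stand-in for a dataloader that needs `delay` seconds per batch."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


def train_step(batch, work):
    """Stand-in for forward, backward, and optimizer work."""
    time.sleep(work)


def run(num_batches=5, load_delay=0.03, step_work=0.003):
    """Run a fake training loop and report each phase's share of total time."""
    phase_seconds.clear()
    batches = slow_loader(num_batches, load_delay)
    while True:
        with timed("dataloader"):
            batch = next(batches, None)  # time spent waiting for input
        if batch is None:
            break
        with timed("compute"):
            train_step(batch, step_work)
    total = sum(phase_seconds.values())
    shares = {name: secs / total for name, secs in phase_seconds.items()}
    verdict = "input-bound" if shares["dataloader"] > 0.5 else "compute-bound"
    return shares, verdict


if __name__ == "__main__":
    print(run())
```

On a GPU, CUDA kernels launch asynchronously, so host-side timers like this only give meaningful per-phase numbers if the device is synchronized (or CUDA events are used) at the phase boundaries. The 0.5 threshold is an arbitrary choice for illustration.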

When to use TraceML

Use TraceML when training feels:

  • slower than expected
  • unstable from step to step
  • imbalanced across distributed ranks
  • fine in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal. Reach for torch.profiler once you know where to dig deeper.


How it fits with your stack

TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.

Use those for:

  • experiment tracking
  • artifacts
  • dashboards
  • team reporting

Use TraceML for:

  • bottleneck diagnosis
  • rank imbalance / straggler detection
  • memory trend debugging
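
The last two items can be illustrated with a small sketch, assuming you already have per-rank step timings and a per-step memory series in hand. It does not reflect how TraceML computes its results. The function names, the 1.2x tolerance, and the sample numbers are all hypothetical.

```python
from statistics import median


def find_stragglers(step_ms_by_rank, tolerance=1.2):
    """Return ranks whose median step time exceeds the cross-rank median by `tolerance`x."""
    per_rank = {rank: median(times) for rank, times in step_ms_by_rank.items()}
    baseline = median(per_rank.values())
    return sorted(rank for rank, t in per_rank.items() if t > tolerance * baseline)


def memory_slope(mem_mb):
    """Least-squares slope of memory (MB per step); a sustained positive value suggests creep."""
    n = len(mem_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(mem_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(mem_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


if __name__ == "__main__":
    # Made-up sample data: rank 2 is consistently slower than its peers.
    timings = {
        0: [100, 102, 101],
        1: [99, 100, 101],
        2: [150, 148, 152],
        3: [100, 100, 100],
    }
    print(find_stragglers(timings))
    print(memory_slope([1000, 1010, 1020, 1030]))
```

In synchronous DDP, gradient all-reduce forces every rank to finish a step at roughly the same time, so end-to-end step times tend to look alike across ranks. The comparison is more informative when the input is each rank's compute time before gradient synchronization.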

See Use TraceML with W&B / MLflow.


Current support

Works today:

  • single GPU
  • single-node DDP/FSDP

Not yet:

  • multi-node
  • tensor parallel
  • pipeline parallel

Feedback

If TraceML helped you find a slowdown, please open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single GPU or multi-GPU
  • whether you used run, watch, or deep
  • the end-of-run summary
  • a minimal repro if possible

GitHub issues: https://github.com/traceopt-ai/traceml/issues

Email: support@traceopt.ai


Contributing

Contributions are welcome, especially:

  • reproducible slowdown cases
  • bug reports
  • docs improvements
  • integrations
  • examples

License

Apache 2.0. See LICENSE.
