TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.

TraceML

Runtime bottleneck detection for PyTorch training jobs.

PyPI version | CI | CodeQL | Python 3.10+ | License | GitHub stars | Discord

Quickstart | Compare Runs | How to Read Output | Use With Your Stack | FAQ | Security | Issues | Discussions

TraceML gives every PyTorch training run a structured performance fingerprint with low overhead (<2% in our current benchmark runs). It answers the questions that usually come before heavyweight operator-level profiling:

  • Are my GPUs waiting on a slow dataloader (input-bound)?
  • Is one distributed rank consistently slower than the others (straggler)?
  • Is memory usage silently creeping upward during the run (memory creep)?
  • Did a recent code or infrastructure change slow training down (regression)?

Where TraceML Fits in the Stack

TraceML does not replace torch.profiler. It is the low-overhead, always-on first pass that tells you where to aim heavier profiling tools.

| Tool | Best used for | Output | Cost / overhead |
|---|---|---|---|
| TraceML | Classifying high-level bottlenecks: input, compute, wait, memory, rank skew | JSON fingerprint, text summary, live views | <2% in current benchmark runs; small code wrapper |
| torch.profiler | Inspecting expensive ops, kernels, and CUDA activity | Profiler trace | Higher overhead; requires profiler context |
| Nsight Systems | Debugging low-level CUDA and kernel behavior | GPU timeline | Separate profiler run |
| W&B / MLflow | Tracking training metrics and experiment history | Metrics dashboard / run history | Logging integration |
| nvidia-smi | Checking machine-level GPU health and utilization | Terminal metrics | No code changes |

3-Minute Quickstart

1. Install the package

pip install traceml-ai

2. Wrap your training step

import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

3. Run your script

traceml run train.py

For DDP, FSDP, and multi-node runs, see Distributed Training.

What You Get: The Output

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt

Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went:

+----------------------------------------------------------------------------+
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms    |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
+----------------------------------------------------------------------------+

In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
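
The "Why" line in the report compares one rank's input time against the median across ranks. A rough sketch of that kind of comparison follows. It is not TraceML's implementation: the function `find_input_straggler` and the `ratio` threshold are hypothetical, and only rank 0's 254.5 ms and the 3.8 ms median come from the report above.

```python
from statistics import median


def find_input_straggler(input_ms_by_rank: dict[int, float], ratio: float = 2.0):
    """Return (rank, time_ms, median_ms) for the slowest rank if it exceeds
    `ratio` times the median input time; otherwise return None."""
    med = median(input_ms_by_rank.values())
    slow_rank = max(input_ms_by_rank, key=input_ms_by_rank.get)
    slow_ms = input_ms_by_rank[slow_rank]
    if med > 0 and slow_ms > ratio * med:
        return slow_rank, slow_ms, med
    return None


# Rank 0's 254.5 ms comes from the report above; the other per-rank
# values are hypothetical, chosen so the median matches the reported 3.8 ms.
input_ms = {0: 254.5, 1: 3.6, 2: 3.8, 3: 3.8}

result = find_input_straggler(input_ms)
if result is not None:
    rank, slow_ms, med = result
    print(f"r{rank} input was slower than median rank ({slow_ms}/{med}ms)")
```

Using the median rather than the mean keeps a single slow rank from skewing the reference value it is compared against.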

For experiment trackers, call traceml.summary() near the end of your script to get a flat dict of diagnosis statuses and average metrics. Keep final_summary.json when you want the full run artifact or an input for traceml compare.
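
Because most experiment trackers log numeric metrics and text fields through different calls, it can help to split the flat dict before logging. The sketch below assumes nothing about TraceML beyond "a flat dict of diagnosis statuses and average metrics": the helper `split_summary` and every key name in `example_summary` are hypothetical stand-ins, so check the real output of traceml.summary() before relying on specific keys.

```python
from numbers import Real


def split_summary(summary: dict) -> tuple[dict, dict]:
    """Split a flat summary dict into numeric metrics and non-numeric statuses."""
    metrics, statuses = {}, {}
    for key, value in summary.items():
        # bool is a subclass of int, so treat it as a status, not a metric.
        if isinstance(value, Real) and not isinstance(value, bool):
            metrics[key] = float(value)
        else:
            statuses[key] = value
    return metrics, statuses


# Hypothetical stand-in for the dict returned by traceml.summary();
# the real key names may differ.
example_summary = {
    "step_time_diagnosis": "INPUT STRAGGLER",
    "step_time_total_ms": 303.7,
    "step_time_input_ms": 254.5,
}

metrics, statuses = split_summary(example_summary)
print(metrics)   # numeric values, suitable for a tracker's metric-logging call
print(statuses)  # string statuses, suitable for tags or run notes
```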


Catching Regressions (Compare Mode)

Compare a slow run against a known-good baseline to identify which metrics changed:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Metric                         A                B                Delta              |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
+--------------------------------------------------------------------------------------+

See Compare Runs for the full report format.
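
The Delta column can be read as the change from run A to run B, in absolute and relative terms. The sketch below reproduces the Input row from the report above; the helper name `metric_delta` is hypothetical and this is not necessarily how TraceML computes the column internally.

```python
def metric_delta(a_ms: float, b_ms: float) -> tuple[float, float]:
    """Return (absolute delta in ms, relative delta in percent) from A to B."""
    delta = b_ms - a_ms
    pct = 100.0 * delta / a_ms
    return delta, pct


# Input row from the report above: A = 66.4 ms, B = 2.7 ms.
delta, pct = metric_delta(66.4, 2.7)
print(f"{delta:+.1f} ms ({pct:+.1f}%)")  # -63.7 ms (-95.9%)
```

Applying the same arithmetic to the rounded Total step values gives -281.0 ms rather than the reported -280.9 ms, which suggests the report computes deltas from unrounded measurements.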

Display Modes

The --mode flag controls what you see during training. It does not change the final saved artifacts.

| Mode flag | Experience during training | Supported topology |
|---|---|---|
| --mode=summary (default) | Silent execution | Single-node and multi-node multi-GPU |
| --mode=cli | Live terminal display | Single-node, including multi-GPU |
| --mode=dashboard | Live browser display | Single-node; requires pip install "traceml-ai[dashboard]" |

Current Support

Works today:

  • Single GPU training
  • Single-node multi-GPU DDP / FSDP
  • Multi-node DDP summary reports
  • Run-to-run comparison from final_summary.json
  • Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train

On the roadmap:

  • Slurm launch examples
  • Multi-node live CLI / browser dashboard
  • Explicit collective / NCCL timing

Overhead

In our benchmark runs, TraceML adds <2% overhead on a single GPU and <1% on single-node multi-GPU setups at default settings.
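
To check overhead on your own workload, one straightforward approach is to compare the average step time of the same job with and without tracing. The sketch below only shows the arithmetic; the helper `overhead_pct` and the step times are hypothetical, not measured results.

```python
def overhead_pct(baseline_ms: float, traced_ms: float) -> float:
    """Relative slowdown of the traced run versus the baseline, in percent."""
    return 100.0 * (traced_ms - baseline_ms) / baseline_ms


# Hypothetical average step times, not measured results.
baseline_ms = 100.0
traced_ms = 101.5

print(f"overhead: {overhead_pct(baseline_ms, traced_ms):.1f}%")  # overhead: 1.5%
```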


Feedback

For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.

GitHub issues: open an issue

If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.

Security reports: see SECURITY.md

Email: support@traceopt.ai


Contributing

Contributions are welcome, especially:

  • real slowdown examples and repros
  • distributed training edge cases
  • docs improvements
  • framework integrations

See CONTRIBUTING.md for development setup and contribution guidelines.


License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).
