TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.
TraceML
Runtime bottleneck detection for PyTorch training jobs.
TraceML gives every PyTorch training run a structured performance fingerprint with low overhead (<2% in our current benchmark runs). It answers the questions that usually come before heavyweight operator-level profiling:
- Are my GPUs waiting on a slow dataloader (input-bound)?
- Is one distributed rank consistently slower than the others (straggler)?
- Is memory usage silently creeping upward during the run (memory creep)?
- Did a recent code or infrastructure change slow training down (regression)?
Where TraceML Fits in the Stack
TraceML does not replace torch.profiler. It is the low-overhead, always-on
first pass that tells you where to aim heavier profiling tools.
| Tool | Best used for | Output | Cost / overhead |
|---|---|---|---|
| TraceML | Classifying high-level bottlenecks: input, compute, wait, memory, rank skew | JSON fingerprint, text summary, live views | <2% in current benchmark runs; small code wrapper |
| torch.profiler | Inspecting expensive ops, kernels, and CUDA activity | Profiler trace | Higher overhead; requires profiler context |
| Nsight Systems | Debugging low-level CUDA and kernel behavior | GPU timeline | Separate profiler run |
| W&B / MLflow | Tracking training metrics and experiment history | Metrics dashboard / run history | Logging integration |
| nvidia-smi | Checking machine-level GPU health and utilization | Terminal metrics | No code changes |
3-Minute Quickstart
1. Install the package
pip install traceml-ai
2. Wrap your training step
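import traceml_ai as traceml

traceml.init(mode="auto")

# model, dataloader, criterion, and optimizer come from your existing training script.
for batch in dataloader:
    # Everything inside the context manager is traced as one training step.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()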
3. Run your script
traceml run train.py
For DDP, FSDP, and multi-node runs, see Distributed Training.
What You Get: The Output
TraceML writes two end-of-run artifacts:
logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went:
+----------------------------------------------------------------------------+
| Step Time |
| - Diagnosis: INPUT STRAGGLER |
| - Scope: compared over last 460 aligned steps across 4 global ranks |
| - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms |
| - Why: r0 input was slower than median global rank (254.5/3.8ms). |
+----------------------------------------------------------------------------+
In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
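The "Why" line above compares one rank's input time against the median across ranks. The sketch below illustrates that kind of check and is not TraceML's internal logic. The function name `find_input_stragglers`, the `ratio_threshold` parameter, and the values for ranks 1 to 3 are hypothetical; only rank 0's 254.5 ms and the 3.8 ms median come from the example.

```python
from statistics import median


def find_input_stragglers(input_ms_by_rank, ratio_threshold=2.0):
    """Return ranks whose input time is at least ratio_threshold x the median.

    Illustrative only: this is not TraceML's actual algorithm.
    """
    med = median(input_ms_by_rank.values())
    if med <= 0:
        return []
    return sorted(
        rank
        for rank, ms in input_ms_by_rank.items()
        if ms / med >= ratio_threshold
    )


# Rank 0 matches the example above; ranks 1-3 are made-up values
# chosen so that the median is 3.8 ms.
input_ms = {0: 254.5, 1: 3.6, 2: 3.8, 3: 3.8}
stragglers = find_input_stragglers(input_ms)
```

Using the median rather than the mean keeps the baseline stable when a single rank is far off, which is the situation a straggler check is meant to catch.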
For experiment trackers, call traceml.summary() near the end of your script
to get a flat dict of diagnosis statuses and average metrics. Keep
final_summary.json when you want the full run artifact or an input for
traceml compare.
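One possible way to forward such a flat dict to an experiment tracker is to separate numeric metrics from string statuses, since trackers typically chart numbers and store strings as tags. In the sketch below, `example_summary` is a hypothetical stand-in for the dict returned by traceml.summary(); its key names are assumptions, and the helper `split_summary` is not part of TraceML.

```python
from numbers import Real


def split_summary(summary):
    """Split a flat dict into numeric metrics and string-valued tags."""
    metrics, tags = {}, {}
    for key, value in summary.items():
        # bool is a subclass of int, so keep it out of the numeric metrics.
        if isinstance(value, Real) and not isinstance(value, bool):
            metrics[key] = float(value)
        else:
            tags[key] = str(value)
    return metrics, tags


# Hypothetical stand-in for traceml.summary(); the real keys may differ.
example_summary = {
    "step_time_diagnosis": "INPUT STRAGGLER",
    "step_time_total_ms": 303.7,
    "step_time_input_ms": 254.5,
}
metrics, tags = split_summary(example_summary)
```

With MLflow, for example, `metrics` could then go to `mlflow.log_metrics(metrics)` and `tags` to `mlflow.set_tags(tags)`.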
Catching Regressions (Compare Mode)
Compare a slow run against a known good baseline to identify which metrics changed:
traceml compare input_slow/final_summary.json input_fixed/final_summary.json
+--------------------------------------------------------------------------------------+
| TraceML Compare |
+--------------------------------------------------------------------------------------+
| Verdict: IMPROVEMENT |
| Why: Step time decreased by 95.6%. |
| |
| Metric A B Delta |
| Total step 294.0 ms 13.0 ms -280.9 ms (-95.6%) |
| Input 66.4 ms 2.7 ms -63.7 ms (-95.9%) |
+--------------------------------------------------------------------------------------+
See Compare Runs for the full report format.
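The Delta column can be reproduced from the A and B columns with simple arithmetic. The sketch below uses a hypothetical helper, `metric_delta`, which is not TraceML's implementation, and works through the Input row of the report above.

```python
def metric_delta(a_ms, b_ms):
    """Return (absolute delta in ms, percent change) going from run A to run B."""
    delta = b_ms - a_ms
    pct = 100.0 * delta / a_ms if a_ms else float("nan")
    return delta, pct


# Input row from the report above: A = 66.4 ms, B = 2.7 ms.
delta, pct = metric_delta(66.4, 2.7)
line = f"{delta:+.1f} ms ({pct:+.1f}%)"
```

The Total step row shows -280.9 ms where the rounded columns would give -281.0 ms. This suggests the report computes deltas from unrounded values, though that is an inference rather than documented behavior.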
Display Modes
The --mode flag controls what you see during training. It does not change
the final saved artifacts.
| Mode flag | Experience during training | Supported topology |
|---|---|---|
| --mode=summary (default) | Silent execution | Single-node and multi-node multi-GPU |
| --mode=cli | Live terminal display | Single-node, including multi-GPU |
| --mode=dashboard | Live browser display | Single-node; requires pip install "traceml-ai[dashboard]" |
Current support
Works today:
- Single GPU training
- Single-node multi-GPU DDP / FSDP
- Multi-node DDP summary reports
- Run-to-run comparison from final_summary.json
- Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train
On the roadmap:
- Slurm launch examples
- Multi-node live CLI / browser dashboard
- Explicit collective / NCCL timing
Overhead
In our benchmark runs, TraceML adds <2% overhead on a single GPU and <1% on single-node multi-GPU at default settings.
Feedback
For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.
GitHub issues: open an issue
If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.
Security reports: see SECURITY.md
Email: support@traceopt.ai
Contributing
Contributions are welcome, especially:
- real slowdown examples and repros
- distributed training edge cases
- docs improvements
- framework integrations
See CONTRIBUTING.md for development setup and contribution guidelines.
License
Apache 2.0. See LICENSE.
TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).