TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.
TraceML
Runtime bottleneck detection for PyTorch training jobs.
TraceML gives every PyTorch training run a structured performance fingerprint: where time went, whether ranks skewed, and whether memory drifted. It answers the questions that usually come before operator-level profiling:
- Is the run input-bound, compute-bound, wait-heavy, or memory-constrained?
- How much time is spent in dataloader, forward, backward, and optimizer?
- Are some distributed ranks consistently slower than others?
- Did memory usage drift upward during the run?
- Did a recent change cause a regression?
How TraceML Fits
TraceML fits between experiment tracking and heavyweight profiling. It gives you a first-pass diagnosis of where a training run is likely wasting time.
| Tool | Setup cost | Output | Best for | When to use |
|---|---|---|---|---|
| TraceML | Small training-step wrapper | Live step breakdown + final_summary.json | Classifying input, compute, wait, memory, and rank-skew issues | First pass on normal training jobs |
| torch.profiler | Profiler schedule/context | Operator and CUDA activity traces | Finding expensive PyTorch ops/kernels | When compute/model path needs deep inspection |
| Nsight Systems / Compute | External profiler run | CUDA timeline / kernel-level detail | Kernel scheduling, CUDA stalls, low-level GPU analysis | Deep dive on a specific GPU performance issue |
| W&B / MLflow / TensorBoard | Metric logging/integration | Loss, accuracy, throughput, experiment history | Tracking outcomes across runs | Experiment management and dashboards |
| nvidia-smi / cluster dashboards | No code changes | GPU/CPU utilization and memory | Machine-level health and capacity signals | Sanity checks and cluster monitoring |
TraceML does not replace these tools. It is the cheap first pass that tells you where to look.
Quickstart
Install:
pip install traceml-ai
Initialize TraceML and wrap your training step:
import traceml_ai as tml

tml.init()

for batch in dataloader:
    # Everything inside the context is traced as one training step.
    with tml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
Run your script with the traceml CLI:
traceml run train.py
The CLI command is traceml. New Python code should use import traceml_ai as tml. The old import traceml path still works for now, but emits a FutureWarning and will be removed in a future release.
TraceML writes two end-of-run artifacts:
logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
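To inspect the JSON artifact programmatically, a minimal sketch follows. It assumes only that final_summary.json is a JSON object at the top level. The helper names load_summary and top_level_keys are hypothetical, and no specific keys are assumed, because the schema is not described here.

```python
import json
from pathlib import Path


def load_summary(run_dir):
    """Load final_summary.json from a run directory and return the parsed data."""
    path = Path(run_dir) / "final_summary.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(summary):
    """Return the sorted top-level keys (assumes the summary is a JSON object)."""
    return sorted(summary)


if __name__ == "__main__":
    # "logs/my-run" is a placeholder; substitute your own logs/<run_name> directory.
    run_dir = Path("logs") / "my-run"
    if (run_dir / "final_summary.json").exists():
        print(top_level_keys(load_summary(run_dir)))
```

Listing the top-level keys first is a cheap way to discover the layout before relying on any particular field.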
Example Output
End-of-run summary
At the end of training, TraceML prints the same compact text report written to
final_summary.txt.
Example from a 4-rank DDP run configured as 2 nodes x 2 GPUs:
+----------------------------------------------------------------------------+
| TraceML Run Summary | duration 122.5s |
+----------------------------------------------------------------------------+
| |
| System |
| - Diagnosis: NORMAL |
| - Scope: nodes 2/2 | samples 124 |
| - Stats: CPU med/worst 3%/3% n0 | RAM med/worst 4%/4% n1 | GPU util |
| med/worst 74%/74% n0 | GPU temp med/worst 47.9C/47.9C n1 |
| - Why: CPU, RAM, and GPU showed no system pressure. |
| |
| Process |
| - Diagnosis: GPU MEMORY RESERVED OVERHANG |
| - Stats: global ranks 4 | CPU avg 75% | RSS peak 1.3 / 540.7 GB | GPU |
| reserved peak 1% |
| - Why: Reserved GPU memory was 1.70x active use. |
| |
| Step Time |
| - Diagnosis: INPUT STRAGGLER |
| - Scope: compared over last 460 aligned steps across 4 global ranks |
| - Stats: median/worst | total 303.7/303.7ms | input 3.8/254.5ms | |
| compute 259.5/259.5ms | wait 40.5/40.5ms |
| - Ranks: median/worst | total r3/r2 | input r2/r0 | compute r3/r1 | wait |
| r2/r1 |
| - Why: r0 input was slower than median global rank (254.5/3.8ms). |
| |
| Step Memory |
| - Diagnosis: BALANCED |
| - Scope: last 460 aligned steps |
| - Stats: peak reserved worst 192 MB on r0 | skew 0.0% |
| - Why: No clear pressure, imbalance, or creep signal. |
+----------------------------------------------------------------------------+
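The INPUT STRAGGLER line in the report above compares the worst rank against the median rank (254.5 ms vs 3.8 ms). The sketch below illustrates that kind of median-versus-worst check. It is not TraceML's actual rule: the function name find_straggler, the 2.0 ratio threshold, and the per-rank values are assumptions chosen so that the median and worst match the report.

```python
from statistics import median


def find_straggler(per_rank_ms, ratio_threshold=2.0):
    """Return (worst_rank, worst_ms, median_ms) when the slowest rank is at
    least ratio_threshold times the median; otherwise return None.

    ratio_threshold is a hypothetical cutoff, not a TraceML setting.
    """
    med = median(per_rank_ms.values())
    worst_rank = max(per_rank_ms, key=per_rank_ms.get)
    worst = per_rank_ms[worst_rank]
    if med > 0 and worst / med >= ratio_threshold:
        return worst_rank, worst, med
    return None


if __name__ == "__main__":
    # Hypothetical per-rank input times in ms for a 4-rank run.
    input_ms = {0: 254.5, 1: 3.8, 2: 3.8, 3: 3.7}
    print(find_straggler(input_ms))
```

Comparing against the median rather than the mean keeps one very slow rank from inflating the baseline it is measured against.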
For experiment trackers, call tml.summary() near the end of your script
to get a flat dict of diagnosis statuses and average metrics. Keep
final_summary.json when you want the full run artifact or an input for
traceml compare.
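Because the summary is described as a flat dict of diagnosis statuses and average metrics, one way to hand it to a tracker is to split numeric values from text values first. The sketch below assumes nothing about the real key names: split_summary is a hypothetical helper and the sample keys are invented for illustration.

```python
def split_summary(flat):
    """Split a flat summary dict into numeric metrics and text statuses."""
    metrics, statuses = {}, {}
    for key, value in flat.items():
        # bool is a subclass of int, so treat it as a status rather than a metric.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            metrics[key] = float(value)
        else:
            statuses[key] = str(value)
    return metrics, statuses


if __name__ == "__main__":
    # Hypothetical stand-in for the dict returned by tml.summary().
    sample = {"step_time_diagnosis": "INPUT STRAGGLER", "step_time_avg_ms": 303.7}
    metrics, statuses = split_summary(sample)
    print(metrics)
    print(statuses)
```

With MLflow, for example, the numeric part could go to mlflow.log_metrics and the text part to mlflow.set_tags; other trackers have similar metric and tag or config calls.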
Compare two runs
Compare a slow or suspicious run against a baseline or fixed run:
traceml compare input_slow/final_summary.json input_fixed/final_summary.json
The compact text report shows the verdict first, then the changed metrics:
+--------------------------------------------------------------------------------------+
| TraceML Compare |
+--------------------------------------------------------------------------------------+
| |
| A: input_slow |
| B: input_fixed |
| Delta: B - A |
| |
| Verdict: IMPROVEMENT |
| Why: Step time decreased by 95.6%. |
| |
| Step Time |
| Metric A B Delta |
| Step time diagnosis INPUT STRAGGLER BALANCED changed |
| Total step 294.0 ms 13.0 ms -280.9 ms (-95.6%) |
| Input 66.4 ms 2.7 ms -63.7 ms (-95.9%) |
| Compute 197.2 ms 8.6 ms -188.6 ms (-95.6%) |
| Wait 30.4 ms 1.7 ms -28.6 ms (-94.3%) |
| Forward 45.0 ms 2.1 ms -42.9 ms (-95.3%) |
| Backward 130.0 ms 5.4 ms -124.6 ms (-95.8%) |
| Optimizer 22.2 ms 1.1 ms -21.1 ms (-95.0%) |
+--------------------------------------------------------------------------------------+
The full compare report also includes Step Memory, Process, and System sections when those signals are available. TraceML writes both a structured compare JSON and a compact text report.
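The Delta column follows the "B - A" convention stated in the report header. As a worked example, the arithmetic can be reproduced from the rounded values in the table. The function name delta is hypothetical, and small differences from the printed report (for example -280.9 ms versus -281.0 ms for Total step) are presumably because TraceML computes from unrounded values.

```python
def delta(a, b):
    """Return (absolute delta, percent change) of B relative to A."""
    diff = b - a
    # Percent change is undefined when the baseline is zero.
    pct = diff / a * 100.0 if a else float("nan")
    return diff, pct


if __name__ == "__main__":
    # (A, B) pairs in ms, copied from the compare report above.
    rows = {"Total step": (294.0, 13.0), "Input": (66.4, 2.7)}
    for name, (a, b) in rows.items():
        diff, pct = delta(a, b)
        print(f"{name}: {diff:+.1f} ms ({pct:+.1f}%)")
```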
See Compare Runs.
Live CLI view
Live CLI view while TraceML collects the same signals used for final_summary.json.
Modes
All modes write final_summary.json and final_summary.txt at the end of the run. The mode controls only what you see during training.
| Mode | During training | Topology |
|---|---|---|
| --mode=summary | Silent | single-node and multi-node multi-GPU |
| --mode=cli | Live terminal display | single-node, including multi-GPU |
| --mode=dashboard | Live browser display | single-node, including multi-GPU |
Summary mode is the default and works across all topologies. Use --mode=cli or --mode=dashboard when you want live feedback on a single-node job.
Dashboard mode requires the optional dashboard extra:
pip install "traceml-ai[dashboard]"
Multi-node live views are on the roadmap.
For very long jobs, tune the final-summary window with
--summary-window-rows N. TraceML analyzes the latest N rows per node or
rank and retains a small alignment buffer internally.
Common Workflows
Diagnose one run
traceml run train.py
Multi-node distributed run
On node 0:
traceml run train.py \
--nnodes=2 \
--node-rank=0 \
--nproc-per-node=4 \
--master-addr=<node0-ip> \
--run-name=my-run
On node 1:
traceml run train.py \
--nnodes=2 \
--node-rank=1 \
--nproc-per-node=4 \
--master-addr=<node0-ip> \
--run-name=my-run
Use the same --run-name, --nnodes, --nproc-per-node, and
--master-addr on every node. Node 0 starts the TraceML aggregator. Other
nodes connect to <node0-ip>:29765 by default. If workers need a different
reachable address or port for TraceML telemetry, add
--aggregator-host=<host> or --aggregator-port=<port> on every node. For
multi-node runs, node 0 binds the aggregator to 0.0.0.0 by default; override
that only when needed with --aggregator-bind-host=<bind-host>.
--session-id remains accepted as a backward-compatible alias for
--run-name.
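Since every node must use the same --run-name, --nnodes, --nproc-per-node, and --master-addr, it can help to generate the per-node commands from a single place. The sketch below only builds command strings from the flags shown above. The helper name traceml_command and the address 10.0.0.1 are hypothetical.

```python
import shlex


def traceml_command(script, node_rank, nnodes, nproc_per_node, master_addr, run_name):
    """Build the `traceml run` command line for one node."""
    args = [
        "traceml", "run", script,
        f"--nnodes={nnodes}",
        f"--node-rank={node_rank}",
        f"--nproc-per-node={nproc_per_node}",
        f"--master-addr={master_addr}",
        f"--run-name={run_name}",
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


if __name__ == "__main__":
    # Shared settings are defined once so every node receives identical values.
    shared = dict(nnodes=2, nproc_per_node=4, master_addr="10.0.0.1", run_name="my-run")
    for rank in range(shared["nnodes"]):
        print(traceml_command("train.py", node_rank=rank, **shared))
```

Only --node-rank differs between the generated commands, which matches the requirement above.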
Watch mode (no code changes)
traceml watch train.py
System and process telemetry only. No step instrumentation needed.
Compare two runs
traceml compare before/final_summary.json after/final_summary.json
What TraceML measures
| Signal | What it means |
|---|---|
| Input-bound | Dataloader is the bottleneck — GPU is waiting on data |
| Compute-bound | GPU is saturated — expected in a healthy run |
| Wait-heavy | Unattributed step time outside the traced phases |
| Rank imbalance | One rank consistently slower — straggler or uneven data |
| Memory creep | Peak allocation growing step-over-step |
| High pressure | Memory near capacity — risk of OOM |
wait is residual step time, not direct NCCL or all-reduce timing. In DDP, communication may overlap with backward. Use PyTorch Profiler or Nsight when you need explicit collective or kernel-level timing.
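To make "residual" concrete, the sketch below assumes the traced phases are input and compute (forward, backward, and optimizer), which is consistent with the sample numbers in the compare report above (294.0 - 66.4 - 197.2 = 30.4 ms). The function name residual_wait and the clamp at zero are assumptions, not TraceML's documented formula.

```python
def residual_wait(total_ms, input_ms, compute_ms):
    """Return step time not attributed to the input or compute phases.

    Clamped at zero so measurement jitter cannot produce a negative wait.
    """
    return max(0.0, total_ms - input_ms - compute_ms)


if __name__ == "__main__":
    # Values in ms taken from run A in the compare report.
    print(round(residual_wait(294.0, 66.4, 197.2), 1))
```

Because the value is a remainder, anything not explicitly traced lands in it, which is why it should not be read as direct communication time.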
Current support
Works today:
- Single GPU training
- Single-node multi-GPU DDP / FSDP training
- Multi-node DDP summary reports
- Ray Train through a thin TorchTrainer wrapper
- Step Time, Step Memory, System, and Process diagnostics
- Run-to-run comparison from final_summary.json
- Custom PyTorch loops, Hugging Face, and PyTorch Lightning
Next:
- Slurm launch examples
- Broader multi-node FSDP validation
- Multi-node live CLI / dashboard
- Explicit collective / NCCL timing
Overhead
In our benchmark runs, TraceML adds less than 2% overhead on a single GPU and less than 1% on single-node multi-GPU at default settings.
Learn more
- Quickstart
- Compare Runs
- How to Read TraceML Output
- Examples
- FAQ
- Use TraceML with W&B / MLflow
- Hugging Face integration
- PyTorch Lightning integration
- Ray Train integration
Feedback
If TraceML helped, a GitHub star helps others find it.
If you hit a problem or unexpected result, open an issue and include:
- hardware / CUDA / PyTorch versions
- single GPU or multi-GPU setup
- training framework
- the end-of-run summary
- a minimal repro if possible
GitHub issues: open an issue
Email: support@traceopt.ai
Contributing
Contributions are welcome, especially:
- real slowdown examples and repros
- distributed training edge cases
- docs improvements
- framework integrations
License
Apache 2.0. See LICENSE.
TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).