TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.

TraceML

Runtime bottleneck detection for PyTorch training jobs.

PyPI version CI Python 3.10+ License GitHub stars

Quickstart | Compare Runs | How to Read Output | Ray Train | W&B / MLflow | FAQ | Issues | Discussions

TraceML gives every PyTorch training run a structured performance fingerprint: where time went, whether ranks skewed, and whether memory drifted. It answers the questions that usually come before operator-level profiling:

  • Is the run input-bound, compute-bound, wait-heavy, or memory-constrained?
  • How much time is spent in dataloader, forward, backward, and optimizer?
  • Are some distributed ranks consistently slower than others?
  • Did memory usage drift upward during the run?
  • Did a recent change cause a regression?

How TraceML Fits

TraceML fits between experiment tracking and heavyweight profiling. It gives you a first-pass diagnosis of where a training run is likely wasting time.

| Tool | Setup cost | Output | Best for | When to use |
|---|---|---|---|---|
| TraceML | Small training-step wrapper | Live step breakdown + final_summary.json | Classifying input, compute, wait, memory, and rank-skew issues | First pass on normal training jobs |
| torch.profiler | Profiler schedule/context | Operator and CUDA activity traces | Finding expensive PyTorch ops/kernels | When compute/model path needs deep inspection |
| Nsight Systems / Compute | External profiler run | CUDA timeline / kernel-level detail | Kernel scheduling, CUDA stalls, low-level GPU analysis | Deep dive on a specific GPU performance issue |
| W&B / MLflow / TensorBoard | Metric logging/integration | Loss, accuracy, throughput, experiment history | Tracking outcomes across runs | Experiment management and dashboards |
| nvidia-smi / cluster dashboards | No code changes | GPU/CPU utilization and memory | Machine-level health and capacity signals | Sanity checks and cluster monitoring |

TraceML does not replace these tools. It is the cheap first pass that tells you where to look.


Quickstart

Install:

pip install traceml-ai

Initialize TraceML and wrap your training step:

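import traceml_ai as tml

# model, dataloader, criterion, and optimizer are your existing PyTorch objects.
tml.init()

for batch in dataloader:
    # Wrap one full training step: zero_grad, forward, loss, backward, optimizer.
    with tml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()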

Run your script with the traceml CLI:

traceml run train.py

The CLI command is traceml. New Python code should use import traceml_ai as tml. The old import traceml path still works for now, but emits a FutureWarning and will be removed in a future release.

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
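Because final_summary.json is plain JSON, it can be inspected with the standard library. The following is a minimal sketch, assuming the default logs/<run_name>/ layout shown above. The helper name load_final_summary is hypothetical, and the file's schema is not described here, so the helper only loads the file and makes no assumption about its keys.

```python
import json
from pathlib import Path


def load_final_summary(run_name: str, log_root: str = "logs") -> dict:
    """Load <log_root>/<run_name>/final_summary.json (hypothetical helper)."""
    path = Path(log_root) / run_name / "final_summary.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `sorted(load_final_summary("my-run"))` would list the top-level keys of a run named my-run, which is a quick way to see what the artifact contains before writing any tooling against it.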

Example Output

End-of-run summary

At the end of training, TraceML prints the same compact text report written to final_summary.txt.

Example from a 4-rank DDP run configured as 2 nodes x 2 GPUs:

+----------------------------------------------------------------------------+
|  TraceML Run Summary | duration 122.5s                                     |
+----------------------------------------------------------------------------+
|                                                                            |
|  System                                                                    |
|  - Diagnosis: NORMAL                                                       |
|  - Scope: nodes 2/2 | samples 124                                          |
|  - Stats: CPU med/worst 3%/3% n0 | RAM med/worst 4%/4% n1 | GPU util       |
|  med/worst 74%/74% n0 | GPU temp med/worst 47.9C/47.9C n1                  |
|  - Why: CPU, RAM, and GPU showed no system pressure.                       |
|                                                                            |
|  Process                                                                   |
|  - Diagnosis: GPU MEMORY RESERVED OVERHANG                                 |
|  - Stats: global ranks 4 | CPU avg 75% | RSS peak 1.3 / 540.7 GB | GPU     |
|  reserved peak 1%                                                          |
|  - Why: Reserved GPU memory was 1.70x active use.                          |
|                                                                            |
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: median/worst | total 303.7/303.7ms | input 3.8/254.5ms |         |
|  compute 259.5/259.5ms | wait 40.5/40.5ms                                  |
|  - Ranks: median/worst | total r3/r2 | input r2/r0 | compute r3/r1 | wait  |
|  r2/r1                                                                     |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
|                                                                            |
|  Step Memory                                                               |
|  - Diagnosis: BALANCED                                                     |
|  - Scope: last 460 aligned steps                                           |
|  - Stats: peak reserved worst 192 MB on r0 | skew 0.0%                     |
|  - Why: No clear pressure, imbalance, or creep signal.                     |
+----------------------------------------------------------------------------+
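The Step Time section reports a median and a worst value per phase and names the rank behind each. The sketch below illustrates that kind of comparison; it is not TraceML's actual rule, which is not described here. The 254.5 ms and 3.8 ms values come from the report above, while the other per-rank values, the find_straggler helper, and the 2x-over-median threshold are assumptions made for illustration.

```python
from statistics import median_low


def find_straggler(per_rank_ms: dict[int, float], ratio_threshold: float = 2.0):
    """Return (median_ms, worst_rank, worst_ms, is_straggler).

    A rank is flagged when its time exceeds ratio_threshold times the
    median across ranks. The threshold is a hypothetical choice.
    """
    med = median_low(per_rank_ms.values())
    worst_rank = max(per_rank_ms, key=per_rank_ms.get)
    worst_ms = per_rank_ms[worst_rank]
    return med, worst_rank, worst_ms, worst_ms > ratio_threshold * med


# Input time per global rank in ms; ranks 1 and 3 are invented values.
input_ms = {0: 254.5, 1: 3.7, 2: 3.8, 3: 3.9}
print(find_straggler(input_ms))  # → (3.8, 0, 254.5, True)
```

With these numbers rank 0 is the worst rank and rank 2 holds the median, matching the "input r2/r0" entry in the Ranks line.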

For experiment trackers, call tml.summary() near the end of your script to get a flat dict of diagnosis statuses and average metrics. Keep final_summary.json when you want the full run artifact or an input for traceml compare.
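One way to feed such a flat dict to a tracker is to separate numeric metrics from string statuses, since trackers typically chart numbers and store strings as tags or parameters. This is a sketch under assumptions: split_summary is a hypothetical helper, not part of TraceML, and the keys in the example are invented because the actual key names are not listed here.

```python
def split_summary(flat: dict) -> tuple[dict, dict]:
    """Split a flat summary dict into (numeric metrics, string tags)."""
    metrics, tags = {}, {}
    for key, value in flat.items():
        if isinstance(value, bool):
            # bool is a subclass of int; treat it as a tag, not a metric.
            tags[key] = str(value)
        elif isinstance(value, (int, float)):
            metrics[key] = float(value)
        elif value is not None:
            tags[key] = str(value)
    return metrics, tags


# Invented keys, standing in for whatever tml.summary() returns.
example = {"step_time_diagnosis": "INPUT STRAGGLER", "step_time_total_ms": 303.7}
metrics, tags = split_summary(example)
```

In a real script the input would come from tml.summary(), and the two dicts would be passed to the tracker's own logging calls.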


Compare two runs

Compare a slow or suspicious run against a baseline or fixed run:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

The compact text report shows the verdict first, then the changed metrics:

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|                                                                                      |
|  A: input_slow                                                                       |
|  B: input_fixed                                                                      |
|  Delta: B - A                                                                        |
|                                                                                      |
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Step Time                                                                           |
|  Metric                         A                B                Delta              |
|  Step time diagnosis            INPUT STRAGGLER  BALANCED         changed            |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
|  Compute                        197.2 ms         8.6 ms           -188.6 ms (-95.6%) |
|  Wait                           30.4 ms          1.7 ms           -28.6 ms (-94.3%)  |
|  Forward                        45.0 ms          2.1 ms           -42.9 ms (-95.3%)  |
|  Backward                       130.0 ms         5.4 ms           -124.6 ms (-95.8%) |
|  Optimizer                      22.2 ms          1.1 ms           -21.1 ms (-95.0%)  |
+--------------------------------------------------------------------------------------+

The full compare report also includes Step Memory, Process, and System sections when those signals are available. TraceML writes both a structured compare JSON and a compact text report.
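The Delta column follows the "Delta: B - A" convention stated in the report header, with the percentage taken relative to A. A short worked sketch using three rows from the table above (the delta helper is hypothetical):

```python
import math


def delta(a: float, b: float) -> tuple[float, float]:
    """Return (B - A, percent change relative to A), both rounded to 0.1."""
    diff = b - a
    pct = math.nan if a == 0 else diff / a * 100.0
    return round(diff, 1), round(pct, 1)


rows = {"Input": (66.4, 2.7), "Compute": (197.2, 8.6), "Forward": (45.0, 2.1)}
for name, (a, b) in rows.items():
    diff, pct = delta(a, b)
    print(f"{name:<8} {diff:+.1f} ms ({pct:+.1f}%)")
```

Recomputing from the displayed values can differ from the report in the last digit. The Total step row, for instance, shows -280.9 ms although 13.0 - 294.0 is -281.0, presumably because the report subtracts unrounded values before rounding.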

See Compare Runs.

Live CLI view

Live CLI view while TraceML collects the same signals used for final_summary.json.


Modes

All modes write final_summary.json and final_summary.txt at the end of the run. The mode controls only what you see during training.

| Mode | During training | Topology |
|---|---|---|
| --mode=summary | Silent | single-node and multi-node multi-GPU |
| --mode=cli | Live terminal display | single-node, including multi-GPU |
| --mode=dashboard | Live browser display | single-node, including multi-GPU |

Summary mode is the default and works across all topologies. Use --mode=cli or --mode=dashboard when you want live feedback on a single-node job. Dashboard mode requires the optional dashboard extra:

pip install "traceml-ai[dashboard]"

Multi-node live views are on the roadmap.

For very long jobs, tune the final-summary window with --summary-window-rows N. TraceML analyzes the latest N rows per node or rank and retains a small alignment buffer internally.
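The idea of analyzing only the latest N rows while keeping a little extra history can be pictured with a bounded deque. This is a conceptual sketch only: TraceML's internal data structures are not described here, and the RollingWindow class and the buffer size of 8 are assumptions.

```python
from collections import deque


class RollingWindow:
    """Keep the latest window_rows rows plus a small alignment buffer."""

    def __init__(self, window_rows: int, buffer_rows: int = 8):
        if window_rows <= 0:
            raise ValueError("window_rows must be positive")
        self.window_rows = window_rows
        # Older rows fall off automatically once maxlen is reached.
        self._rows = deque(maxlen=window_rows + buffer_rows)

    def append(self, row: dict) -> None:
        self._rows.append(row)

    def analysis_rows(self) -> list[dict]:
        """Rows the final summary would look at: the latest window_rows."""
        return list(self._rows)[-self.window_rows:]
```

Memory stays bounded no matter how long the job runs, which is the reason to tune the window on very long jobs rather than keeping every step.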


Common Workflows

Diagnose one run

traceml run train.py

Multi-node distributed run

On node 0:

traceml run train.py \
  --nnodes=2 \
  --node-rank=0 \
  --nproc-per-node=4 \
  --master-addr=<node0-ip> \
  --run-name=my-run

On node 1:

traceml run train.py \
  --nnodes=2 \
  --node-rank=1 \
  --nproc-per-node=4 \
  --master-addr=<node0-ip> \
  --run-name=my-run

Use the same --run-name, --nnodes, --nproc-per-node, and --master-addr on every node. Node 0 starts the TraceML aggregator. Other nodes connect to <node0-ip>:29765 by default. If workers need a different reachable address or port for TraceML telemetry, add --aggregator-host=<host> or --aggregator-port=<port> on every node. For multi-node runs, node 0 binds the aggregator to 0.0.0.0 by default; override that only when needed with --aggregator-bind-host=<bind-host>.

--session-id remains accepted as a backward-compatible alias for --run-name.
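Since every node needs identical flags apart from --node-rank, generating the commands from one place avoids copy-paste drift. A minimal sketch using only the flags shown above; the node_commands helper and the 10.0.0.1 address are hypothetical.

```python
import shlex


def node_commands(script: str, nnodes: int, nproc_per_node: int,
                  master_addr: str, run_name: str) -> list[str]:
    """Build one `traceml run` command line per node; only --node-rank varies."""
    commands = []
    for node_rank in range(nnodes):
        argv = [
            "traceml", "run", script,
            f"--nnodes={nnodes}",
            f"--node-rank={node_rank}",
            f"--nproc-per-node={nproc_per_node}",
            f"--master-addr={master_addr}",
            f"--run-name={run_name}",
        ]
        commands.append(shlex.join(argv))  # shell-quotes arguments if needed
    return commands


for cmd in node_commands("train.py", 2, 4, "10.0.0.1", "my-run"):
    print(cmd)
```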

Watch mode (no code changes)

traceml watch train.py

System and process telemetry only. No step instrumentation needed.

Compare two runs

traceml compare before/final_summary.json after/final_summary.json

What TraceML measures

| Signal | What it means |
|---|---|
| Input-bound | Dataloader is the bottleneck; GPU is waiting on data |
| Compute-bound | GPU is saturated; expected in a healthy run |
| Wait-heavy | Unattributed step time outside the traced phases |
| Rank imbalance | One rank consistently slower; straggler or uneven data |
| Memory creep | Peak allocation growing step-over-step |
| High pressure | Memory near capacity; risk of OOM |

wait is residual step time, not direct NCCL or all-reduce timing. In DDP, communication may overlap with backward. Use PyTorch Profiler or Nsight when you need explicit collective or kernel-level timing.
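Read as arithmetic, "residual" means wait is what remains of the step after the traced phases are subtracted. The sketch below assumes that subtraction and checks it against column A of the compare table shown earlier; the residual_wait_ms helper and the clamp at zero are illustrative choices, not TraceML's documented implementation.

```python
def residual_wait_ms(total_ms: float, traced_phases_ms: dict[str, float]) -> float:
    """Step time not attributed to any traced phase, clamped at zero."""
    return max(0.0, total_ms - sum(traced_phases_ms.values()))


# Column A of the compare table: total 294.0 ms.
phases_a = {"input": 66.4, "forward": 45.0, "backward": 130.0, "optimizer": 22.2}
print(round(residual_wait_ms(294.0, phases_a), 1))  # → 30.4
```

The result matches the 30.4 ms Wait value reported for run A. Because it is a remainder, anything not traced as a phase lands in it, which is why it cannot be read as direct communication time.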


Current support

Works today:

  • Single GPU training
  • Single-node multi-GPU DDP / FSDP training
  • Multi-node DDP summary reports
  • Ray Train through a thin TorchTrainer wrapper
  • Step Time, Step Memory, System, and Process diagnostics
  • Run-to-run comparison from final_summary.json
  • Custom PyTorch loops, Hugging Face, and PyTorch Lightning

Next:

  • Slurm launch examples
  • Broader multi-node FSDP validation
  • Multi-node live CLI / dashboard
  • Explicit collective / NCCL timing

Overhead

In our benchmark runs at default settings, TraceML adds <2% overhead on single GPU and <1% on single-node multi-GPU.


Feedback

If TraceML helped, a GitHub star helps others find it.

If you hit a problem or unexpected result, open an issue and include:

  • hardware / CUDA / PyTorch versions
  • single GPU or multi-GPU setup
  • training framework
  • the end-of-run summary
  • a minimal repro if possible

GitHub issues: open an issue

Email: support@traceopt.ai


Contributing

Contributions are welcome, especially:

  • real slowdown examples and repros
  • distributed training edge cases
  • docs improvements
  • framework integrations

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).
