TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.

TraceML

Runtime bottleneck detection for PyTorch training jobs.

TraceML records lightweight signals during a PyTorch training run and produces a structured end-of-run summary. It answers the questions that usually come before operator-level profiling:

Is the run input-bound, compute-bound, wait-heavy, or memory-constrained?
Where is time going across dataloader, forward, backward, and optimizer?
Are some distributed ranks consistently slower than others?
Did memory usage drift upward during the run?
Did a recent change cause a regression?

Quickstart

Install:

pip install traceml-ai

Initialize TraceML and wrap your training step:

import traceml

traceml.init()

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

Run your script with TraceML:

traceml run train.py

TraceML writes two end-of-run artifacts:

logs/<session_id>/final_summary.json
logs/<session_id>/final_summary.txt
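
Because each run writes to its own `logs/<session_id>/` directory, a small helper can pick up the newest summary for scripting. The sketch below is a convenience built on assumptions, not part of TraceML: `latest_summary` is a hypothetical name, it assumes `logs/` sits under the current working directory, and it treats the JSON as opaque because the schema is not described here.

```python
import json
from pathlib import Path


def latest_summary(logs_dir="logs"):
    """Load the most recently modified final_summary.json under logs_dir.

    Hypothetical helper: relies only on the logs/<session_id>/ layout shown
    above and makes no assumption about the keys inside the JSON file.
    """
    candidates = sorted(
        Path(logs_dir).glob("*/final_summary.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None  # no finished run found
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    summary = latest_summary()
    print("no summary found" if summary is None else type(summary).__name__)
```

Sorting by modification time avoids guessing how session IDs are named.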

Example Output

End-of-run summary

At the end of training, TraceML prints the same compact text report written to final_summary.txt.

Example from a 4-rank DDP run configured as 2 nodes x 2 GPUs:

+----------------------------------------------------------------------------+
|  TraceML Run Summary | duration 122.5s                                     |
+----------------------------------------------------------------------------+
|                                                                            |
|  System                                                                    |
|  - Diagnosis: NORMAL                                                       |
|  - Scope: nodes 2/2 | samples 124                                          |
|  - Stats: CPU med/worst 3%/3% n0 | RAM med/worst 4%/4% n1 | GPU util       |
|  med/worst 74%/74% n0 | GPU temp med/worst 47.9C/47.9C n1                  |
|  - Why: CPU, RAM, and GPU showed no system pressure.                       |
|                                                                            |
|  Process                                                                   |
|  - Diagnosis: GPU MEMORY RESERVED OVERHANG                                 |
|  - Stats: global ranks 4 | CPU avg 75% | RSS peak 1.3 / 540.7 GB | GPU     |
|  reserved peak 1%                                                          |
|  - Why: Reserved GPU memory was 1.70x active use.                          |
|                                                                            |
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: median/worst | total 303.7/303.7ms | input 3.8/254.5ms |         |
|  compute 259.5/259.5ms | wait 40.5/40.5ms                                  |
|  - Ranks: median/worst | total r3/r2 | input r2/r0 | compute r3/r1 | wait  |
|  r2/r1                                                                     |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
|                                                                            |
|  Step Memory                                                               |
|  - Diagnosis: BALANCED                                                     |
|  - Scope: last 460 aligned steps                                           |
|  - Stats: peak reserved worst 192 MB on r0 | skew 0.0%                     |
|  - Why: No clear pressure, imbalance, or creep signal.                     |
+----------------------------------------------------------------------------+
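
The `median/worst` pairs in the Step Time section can be read as a per-phase comparison across ranks. The sketch below reproduces that reading for the `input` phase. It is an illustration only: `find_straggler` is a hypothetical name, the per-rank values are invented so that the median and worst match the report above, and using the low median (an actual rank's value) is an assumption rather than TraceML's documented method.

```python
from statistics import median_low


def find_straggler(per_rank_ms):
    """Return (worst_rank, worst_ms, median_ms) for one phase.

    median_low picks a value that a real rank produced, so a 4-rank run
    reports a measurement instead of an average of the two middle ranks.
    """
    median_ms = median_low(per_rank_ms.values())
    worst_rank = max(per_rank_ms, key=per_rank_ms.get)
    return worst_rank, per_rank_ms[worst_rank], median_ms


# Hypothetical per-rank input times (ms); only the median (3.8) and the
# worst value (254.5 on r0) come from the example report.
input_ms = {"r0": 254.5, "r1": 3.7, "r2": 3.8, "r3": 3.9}

rank, worst, med = find_straggler(input_ms)
print(f"{rank} input was slower than the median rank ({worst}/{med}ms)")
```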

The final_summary.json is machine-readable and designed for logging to W&B or MLflow, storing as a run artifact, or comparing against another run.
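
Experiment trackers generally expect flat name/value metrics, so one way to log the summary is to flatten its numeric leaves first. The sketch below assumes nothing about the summary schema: `flatten_numeric` is a hypothetical helper, and the sample keys in the usage lines are placeholders, not real TraceML field names.

```python
def flatten_numeric(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": number}.

    Non-numeric leaves (strings, None, booleans) are dropped so the result
    contains only values a metrics logger can accept.
    """
    flat = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # bool is a subclass of int; skip it so flags are not logged as 0/1
        if isinstance(obj, (int, float)) and not isinstance(obj, bool):
            flat[prefix] = float(obj)
        return flat
    for key, value in items:
        child = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten_numeric(value, child))
    return flat


# Placeholder structure, NOT the real final_summary.json schema.
sample = {"step_time": {"total_ms": 303.7, "diagnosis": "INPUT STRAGGLER"}}
print(flatten_numeric(sample))
```

The resulting dict can then be handed to a tracker, for example `mlflow.log_metrics(...)` or `wandb.log(...)`, while the original file is stored unchanged as a run artifact.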

Compare two runs

Compare a slow or suspicious run against a baseline or fixed run:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

The compact text report shows the verdict first, then the changed metrics:

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|                                                                                      |
|  A: input_slow                                                                       |
|  B: input_fixed                                                                      |
|  Delta: B - A                                                                        |
|                                                                                      |
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Step Time                                                                           |
|  Metric                         A                B                Delta              |
|  Step time diagnosis            INPUT STRAGGLER  BALANCED         changed            |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
|  Compute                        197.2 ms         8.6 ms           -188.6 ms (-95.6%) |
|  Wait                           30.4 ms          1.7 ms           -28.6 ms (-94.3%)  |
|  Forward                        45.0 ms          2.1 ms           -42.9 ms (-95.3%)  |
|  Backward                       130.0 ms         5.4 ms           -124.6 ms (-95.8%) |
|  Optimizer                      22.2 ms          1.1 ms           -21.1 ms (-95.0%)  |
+--------------------------------------------------------------------------------------+

The full compare report also includes Step Memory, Process, and System sections when those signals are available. TraceML writes both a structured compare JSON and a compact text report.
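
The Delta column follows the `Delta: B - A` convention stated in the report header: an absolute change, plus a percent change relative to A. A minimal sketch of that arithmetic follows; `delta_row` is a hypothetical helper, not a TraceML function. Because it works from the rounded values printed in the table, the last digit can differ from the report for some rows.

```python
def delta_row(a_ms, b_ms):
    """Format a Delta cell: absolute change B - A and percent change vs. A."""
    diff = b_ms - a_ms
    pct = diff / a_ms * 100 if a_ms else float("nan")
    return f"{diff:+.1f} ms ({pct:+.1f}%)"


# Input row from the example report: A = 66.4 ms, B = 2.7 ms
print(delta_row(66.4, 2.7))
```

A negative delta means run B is faster than run A, which is why the example verdict is IMPROVEMENT.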

See Compare Runs.

Modes

All modes write final_summary.json and final_summary.txt at the end of the run. The mode controls only what you see during training.

| Mode | During training | Topology |
| --- | --- | --- |
| `--mode=summary` | Silent | single-node and multi-node multi-GPU |
| `--mode=cli` | Live terminal display | single-node, including multi-GPU |
| `--mode=dashboard` | Live browser display | single-node, including multi-GPU |

Summary mode is the default and works across all topologies. Use --mode=cli or --mode=dashboard when you want live feedback on a single-node job.

Deep/layer profiling has been removed from the public CLI for now.

Multi-node live views are on the roadmap.

For very long jobs, tune the final-summary window with --summary-window-rows N. TraceML analyzes the latest N rows per node or rank and retains a small alignment buffer internally.
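
The idea of "latest N rows per rank" can be pictured as a bounded buffer per rank. The sketch below only illustrates that windowing behavior: `SummaryWindow` is a hypothetical class, it ignores the alignment buffer mentioned above, and it is not a description of TraceML's internals.

```python
from collections import defaultdict, deque


class SummaryWindow:
    """Keep only the latest max_rows step records for each rank."""

    def __init__(self, max_rows):
        # One bounded deque per rank, created on first use.
        self._rows = defaultdict(lambda: deque(maxlen=max_rows))

    def add(self, rank, step_ms):
        self._rows[rank].append(step_ms)  # oldest row is evicted when full

    def rows(self, rank):
        return list(self._rows[rank])


window = SummaryWindow(max_rows=3)
for step_ms in (10.0, 11.0, 12.0, 13.0, 14.0):
    window.add(rank=0, step_ms=step_ms)
print(window.rows(0))  # only the three most recent rows remain
```

A bounded window keeps memory use constant on long jobs, at the cost of dropping early-run behavior from the final analysis.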

Common Workflows

Diagnose one run

traceml run train.py

Multi-node distributed run

On node 0:

traceml run train.py \
  --nnodes=2 \
  --node-rank=0 \
  --nproc-per-node=4 \
  --master-addr=<node0-ip> \
  --session-id=my-run

On node 1:

traceml run train.py \
  --nnodes=2 \
  --node-rank=1 \
  --nproc-per-node=4 \
  --master-addr=<node0-ip> \
  --session-id=my-run

Use the same --session-id, --nnodes, --nproc-per-node, and --master-addr on every node. Node 0 starts the TraceML aggregator. Other nodes connect to <node0-ip>:29765 by default. If workers need a different reachable address or port for TraceML telemetry, add --aggregator-host=<host> or --aggregator-port=<port> on every node. For multi-node runs, node 0 binds the aggregator to 0.0.0.0 by default; override that only when needed with --aggregator-bind-host=<bind-host>.
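
Since every flag except `--node-rank` must match across nodes, generating the commands from one place helps avoid drift. The sketch below is a hypothetical launcher helper that uses only the flags shown above; `node_command` and the `10.0.0.1` address are placeholders for illustration.

```python
import shlex


def node_command(node_rank, *, master_addr, script="train.py", nnodes=2,
                 nproc_per_node=4, session_id="my-run"):
    """Build the traceml argv for one node; only --node-rank varies."""
    return [
        "traceml", "run", script,
        f"--nnodes={nnodes}",
        f"--node-rank={node_rank}",
        f"--nproc-per-node={nproc_per_node}",
        f"--master-addr={master_addr}",
        f"--session-id={session_id}",
    ]


# Print one command per node; 10.0.0.1 stands in for <node0-ip>.
for rank in range(2):
    print(shlex.join(node_command(rank, master_addr="10.0.0.1")))
```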

Zero-code first look

traceml watch train.py

`traceml watch` collects system and process telemetry only, so no step instrumentation or code change is needed.

Compare two runs

traceml compare before/final_summary.json after/final_summary.json

What TraceML measures

| Signal | What it means |
| --- | --- |
| Input-bound | Dataloader is the bottleneck; the GPU is waiting on data |
| Compute-bound | GPU is saturated; expected in a healthy run |
| Wait-heavy | Unattributed step time outside the traced phases |
| Rank imbalance | One rank is consistently slower (straggler or uneven data) |
| Memory creep | Peak allocation growing step over step |
| High pressure | Memory near capacity; risk of OOM |

wait is residual step time, not direct NCCL or all-reduce timing. In DDP, communication may overlap with backward. Use PyTorch Profiler or Nsight when you need explicit collective or kernel-level timing.
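
Read as a residual, `wait` is whatever remains of the total step time after the traced phases are subtracted. The sketch below checks that reading against column A of the compare example. The exact formula and the clamp at zero are assumptions consistent with those numbers, not a confirmed description of how TraceML computes the value, and `residual_wait_ms` is a hypothetical name.

```python
def residual_wait_ms(total, input_, forward, backward, optimizer):
    """Step time not attributed to any traced phase."""
    traced = input_ + forward + backward + optimizer
    # Clamp at zero (an assumption) so timer jitter cannot yield negative wait.
    return max(total - traced, 0.0)


# Column A of the compare example, in milliseconds.
wait = residual_wait_ms(total=294.0, input_=66.4, forward=45.0,
                        backward=130.0, optimizer=22.2)
print(f"{wait:.1f} ms")
```

Under this reading, a large `wait` says only that time went somewhere untraced. It does not say the time was spent in communication.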

When to use TraceML

Use TraceML when you want a lightweight performance fingerprint for a PyTorch training run:

keep a small final_summary.json you can share, store, diff, or log
see where step time went across dataloader, forward, backward, optimizer, and wait time
compare a new run against a previous baseline
check whether ranks, nodes, process memory, or system resources look imbalanced
collect enough evidence before opening PyTorch Profiler or Nsight

When not to use TraceML: If you already need operator, kernel, or collective-level timing, go straight to torch.profiler or Nsight. TraceML is the cheap first pass that tells you where to look.

How it fits with your stack

TraceML sits between experiment tracking and heavyweight profiling.

Run PyTorch training with TraceML
        ↓
Save final_summary.json as a lightweight performance fingerprint
        ↓
Review final_summary.txt for the likely bottleneck
        ↓
Compare against a previous summary when behavior changes
        ↓
Open torch.profiler or Nsight only if you need operator/kernel detail

Use W&B, MLflow, or TensorBoard for experiment tracking, metrics, and dashboards. Use TraceML for bottleneck diagnosis, distributed run summaries, and run-to-run performance comparison.

See Use TraceML with W&B / MLflow.

Current support

Works today:

Single GPU training
Single-node multi-GPU DDP / FSDP training
Multi-node DDP summary reports
Step Time, Step Memory, System, and Process diagnostics
Run-to-run comparison from final_summary.json
Custom PyTorch loops, Hugging Face, and PyTorch Lightning

Next:

Ray Train integration
Slurm launch examples
Broader multi-node FSDP validation
Multi-node live CLI / dashboard
Explicit collective / NCCL timing

Overhead

TraceML adds fixed per-step instrumentation overhead. Relative overhead is highest when training steps are very short. In larger jobs the fixed cost is amortized over longer step time.

In our early DDP benchmarks, TraceML did not produce a measurable slowdown beyond normal run-to-run variation.
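
The amortization argument is simple arithmetic: a fixed per-step cost divided by the step time. The sketch below applies it to the two step times from the compare example. The 0.5 ms figure is purely hypothetical and is not a measured TraceML overhead.

```python
def relative_overhead_pct(fixed_ms, step_ms):
    """Fixed instrumentation cost as a percentage of one training step."""
    return fixed_ms / step_ms * 100


FIXED_MS = 0.5  # hypothetical cost, NOT a measured TraceML figure

for step_ms in (13.0, 294.0):
    pct = relative_overhead_pct(FIXED_MS, step_ms)
    print(f"{step_ms:6.1f} ms step: {pct:.1f}% relative overhead")
```

The same fixed cost matters more on the short step than on the long one, which is why relative overhead is highest for very short training steps.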

Feedback

If TraceML helped you catch a slowdown, please open an issue and include:

hardware / CUDA / PyTorch versions
single GPU or multi-GPU setup
training framework
the end-of-run summary
a minimal repro if possible

GitHub issues: open an issue

Email: support@traceopt.ai

Contributing

If TraceML helped you catch a slowdown, a GitHub star helps others find it.

Contributions are welcome, especially:

real slowdown examples and repros
distributed training edge cases
docs improvements
framework integrations

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).

Upcoming rename: traceml-ai will be renamed to traceopt-ai in a future release. Python imports will change from traceml to traceopt. The active package today remains traceml-ai.
