TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.

TraceML

Runtime bottleneck detection for PyTorch training jobs.

TraceML gives every PyTorch training run a structured performance fingerprint: where time went, whether ranks skewed, and whether memory drifted. It answers the questions that usually come before operator-level profiling:

Is the run input-bound, compute-bound, wait-heavy, or memory-constrained?
How much time is spent in dataloader, forward, backward, and optimizer?
Are some distributed ranks consistently slower than others?
Did memory usage drift upward during the run?
Did a recent change cause a regression?

How TraceML Fits

TraceML fits between experiment tracking and heavyweight profiling. It gives you a first-pass diagnosis of where a training run is likely wasting time.

Tool	Setup cost	Output	Best for	When to use
TraceML	Small training-step wrapper	Live step breakdown + `final_summary.json`	Classifying input, compute, wait, memory, and rank-skew issues	First pass on normal training jobs
`torch.profiler`	Profiler schedule/context	Operator and CUDA activity traces	Finding expensive PyTorch ops/kernels	When compute/model path needs deep inspection
Nsight Systems / Compute	External profiler run	CUDA timeline / kernel-level detail	Kernel scheduling, CUDA stalls, low-level GPU analysis	Deep dive on a specific GPU performance issue
W&B / MLflow / TensorBoard	Metric logging/integration	Loss, accuracy, throughput, experiment history	Tracking outcomes across runs	Experiment management and dashboards
`nvidia-smi` / cluster dashboards	No code changes	GPU/CPU utilization and memory	Machine-level health and capacity signals	Sanity checks and cluster monitoring

TraceML does not replace these tools. It is the cheap first pass that tells you where to look.

Quickstart

Install:

pip install traceml-ai

Initialize TraceML and wrap your training step:

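import traceml_ai as tml

# Assumes model, dataloader, criterion, and optimizer are already defined.
tml.init()

for batch in dataloader:
    # Keep the whole training step inside the trace_step context.
    with tml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()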

Run your script with the traceml CLI:

traceml run train.py

The CLI command is traceml. New Python code should use import traceml_ai as tml. The old import traceml path still works for now, but emits a FutureWarning and will be removed in a future release.
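
To find lingering uses of the old import path before it is removed, one option is to check for FutureWarning in tests or CI. The sketch below uses only the standard `warnings` module. `legacy_import()` is a hypothetical stand-in that warns the way the note above describes, so the example runs without TraceML installed; the warning text is made up.

```python
import warnings


def legacy_import():
    """Hypothetical stand-in for a deprecated import path that warns."""
    warnings.warn("import traceml is deprecated; use traceml_ai", FutureWarning)
    return "ok"


def uses_deprecated_path(fn):
    """Return True if calling fn emits a FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        fn()
    return any(issubclass(w.category, FutureWarning) for w in caught)


if __name__ == "__main__":
    print(uses_deprecated_path(legacy_import))
```

In plain Python, the standard `PYTHONWARNINGS=error::FutureWarning` environment variable gives a similar check without code changes by turning the warning into an exception.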

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
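
As a minimal sketch of picking up the JSON artifact from another script, the helper below builds the `logs/<run_name>/final_summary.json` path shown above and parses it. The function name `load_final_summary` and the `logs_dir` parameter are illustrative, not part of TraceML, and nothing is assumed about the keys inside the file.

```python
import json
from pathlib import Path


def load_final_summary(run_name, logs_dir="logs"):
    """Load logs/<run_name>/final_summary.json and return the parsed object."""
    path = Path(logs_dir) / run_name / "final_summary.json"
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    try:
        summary = load_final_summary("my-run")
    except FileNotFoundError:
        print("no summary found; has the run finished?")
    else:
        # Print top-level keys only; the schema is not assumed here.
        print(sorted(summary) if isinstance(summary, dict) else type(summary))
```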

Example Output

End-of-run summary

At the end of training, TraceML prints the same compact text report written to final_summary.txt.

Example from a 4-rank DDP run configured as 2 nodes x 2 GPUs:

+----------------------------------------------------------------------------+
|  TraceML Run Summary | duration 122.5s                                     |
+----------------------------------------------------------------------------+
|                                                                            |
|  System                                                                    |
|  - Diagnosis: NORMAL                                                       |
|  - Scope: nodes 2/2 | samples 124                                          |
|  - Stats: CPU med/worst 3%/3% n0 | RAM med/worst 4%/4% n1 | GPU util       |
|  med/worst 74%/74% n0 | GPU temp med/worst 47.9C/47.9C n1                  |
|  - Why: CPU, RAM, and GPU showed no system pressure.                       |
|                                                                            |
|  Process                                                                   |
|  - Diagnosis: GPU MEMORY RESERVED OVERHANG                                 |
|  - Stats: global ranks 4 | CPU avg 75% | RSS peak 1.3 / 540.7 GB | GPU     |
|  reserved peak 1%                                                          |
|  - Why: Reserved GPU memory was 1.70x active use.                          |
|                                                                            |
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: median/worst | total 303.7/303.7ms | input 3.8/254.5ms |         |
|  compute 259.5/259.5ms | wait 40.5/40.5ms                                  |
|  - Ranks: median/worst | total r3/r2 | input r2/r0 | compute r3/r1 | wait  |
|  r2/r1                                                                     |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
|                                                                            |
|  Step Memory                                                               |
|  - Diagnosis: BALANCED                                                     |
|  - Scope: last 460 aligned steps                                           |
|  - Stats: peak reserved worst 192 MB on r0 | skew 0.0%                     |
|  - Why: No clear pressure, imbalance, or creep signal.                     |
+----------------------------------------------------------------------------+
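
The Step Time section above compares a worst rank against the median across ranks. The sketch below illustrates that kind of comparison; it is not TraceML's detection logic. Only r0's 254.5 ms and the 3.8 ms median come from the report, and the other per-rank values are hypothetical.

```python
from statistics import median


def find_straggler(per_rank_ms):
    """Return (worst_rank, worst_ms, median_ms) for a {rank: ms} mapping."""
    med = median(per_rank_ms.values())
    worst_rank = max(per_rank_ms, key=per_rank_ms.get)
    return worst_rank, per_rank_ms[worst_rank], med


if __name__ == "__main__":
    # Hypothetical per-rank input times in ms, chosen so the median is 3.8.
    input_ms = {"r0": 254.5, "r1": 3.6, "r2": 3.8, "r3": 3.8}
    rank, worst, med = find_straggler(input_ms)
    print(f"{rank} input was {worst / med:.0f}x the median ({worst}/{med}ms)")
```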

For experiment trackers, call tml.summary() near the end of your script to get a flat dict of diagnosis statuses and average metrics. Keep final_summary.json when you want the full run artifact or an input for traceml compare.
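
Experiment trackers usually take numeric metrics and string labels through different calls, so a flat dict like the one described above may need splitting first. The helper below is a sketch of that step. The key names in the example are hypothetical; the real keys returned by `tml.summary()` may differ.

```python
def split_summary(summary):
    """Split a flat dict into numeric metrics and string tags."""
    metrics, tags = {}, {}
    for key, value in summary.items():
        # bool is a subclass of int, so exclude it from metrics explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            metrics[key] = float(value)
        elif isinstance(value, str):
            tags[key] = value
    return metrics, tags


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    example = {"step_time_diagnosis": "INPUT STRAGGLER", "step_total_ms": 303.7}
    metrics, tags = split_summary(example)
    print(metrics, tags)
```

With MLflow, for instance, `metrics` could go to `mlflow.log_metrics` and `tags` to `mlflow.set_tags`.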

Compare two runs

Compare a slow or suspicious run against a baseline or fixed run:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

The compact text report shows the verdict first, then the changed metrics:

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|                                                                                      |
|  A: input_slow                                                                       |
|  B: input_fixed                                                                      |
|  Delta: B - A                                                                        |
|                                                                                      |
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Step Time                                                                           |
|  Metric                         A                B                Delta              |
|  Step time diagnosis            INPUT STRAGGLER  BALANCED         changed            |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
|  Compute                        197.2 ms         8.6 ms           -188.6 ms (-95.6%) |
|  Wait                           30.4 ms          1.7 ms           -28.6 ms (-94.3%)  |
|  Forward                        45.0 ms          2.1 ms           -42.9 ms (-95.3%)  |
|  Backward                       130.0 ms         5.4 ms           -124.6 ms (-95.8%) |
|  Optimizer                      22.2 ms          1.1 ms           -21.1 ms (-95.0%)  |
+--------------------------------------------------------------------------------------+

The full compare report also includes Step Memory, Process, and System sections when those signals are available. TraceML writes both a structured compare JSON and a compact text report.
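
The Delta column can be read as B minus A, with the percentage taken relative to A. The sketch below recomputes two rows from the example report under that reading. It is not the `traceml compare` implementation, and because the report shows rounded values, other rows can differ from this arithmetic in the last digit.

```python
def delta_row(a, b):
    """Return (absolute delta, percent change) for B relative to A."""
    delta = b - a
    pct = delta / a * 100.0 if a else float("nan")
    return delta, pct


if __name__ == "__main__":
    # Input and Compute rows from the example report, in milliseconds.
    for name, a, b in [("Input", 66.4, 2.7), ("Compute", 197.2, 8.6)]:
        delta, pct = delta_row(a, b)
        print(f"{name}: {delta:+.1f} ms ({pct:+.1f}%)")
```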

See Compare Runs.

Live CLI view

[Figure: TraceML live CLI view, shown while TraceML collects the same signals used for final_summary.json.]

Modes

All modes write final_summary.json and final_summary.txt at the end of the run. The mode controls only what you see during training.

Mode	During training	Topology
`--mode=summary`	Silent	single-node and multi-node multi-GPU
`--mode=cli`	Live terminal display	single-node, including multi-GPU
`--mode=dashboard`	Live browser display	single-node, including multi-GPU

Summary mode is the default and works across all topologies. Use --mode=cli or --mode=dashboard when you want live feedback on a single-node job. Dashboard mode requires the optional dashboard extra:

pip install "traceml-ai[dashboard]"

Multi-node live views are on the roadmap.

For very long jobs, tune the final-summary window with --summary-window-rows N. TraceML analyzes the latest N rows per node or rank and retains a small alignment buffer internally.
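
Keeping only the latest N rows can be pictured as a bounded buffer per rank. The sketch below uses `collections.deque` for that. It illustrates the windowing idea only: it is not TraceML's internal implementation, it does not model the alignment buffer, and the `RowWindow` class and `step_ms` key are hypothetical.

```python
from collections import deque


class RowWindow:
    """Keep only the latest n rows for one rank."""

    def __init__(self, n):
        self.rows = deque(maxlen=n)  # oldest rows fall off automatically

    def add(self, row):
        self.rows.append(row)

    def mean(self, key):
        """Average of one field over the retained rows, or None if empty."""
        if not self.rows:
            return None
        return sum(r[key] for r in self.rows) / len(self.rows)


if __name__ == "__main__":
    window = RowWindow(n=3)
    for step_ms in [1.0, 2.0, 3.0, 4.0, 5.0]:
        window.add({"step_ms": step_ms})
    # Only the last three rows (3.0, 4.0, 5.0) contribute to the mean.
    print(len(window.rows), window.mean("step_ms"))
```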

Common Workflows

Diagnose one run

traceml run train.py

Multi-node distributed run

On node 0:

traceml run train.py \
  --nnodes=2 \
  --node-rank=0 \
  --nproc-per-node=4 \
  --master-addr=<node0-ip> \
  --run-name=my-run

On node 1:

traceml run train.py \
  --nnodes=2 \
  --node-rank=1 \
  --nproc-per-node=4 \
  --master-addr=<node0-ip> \
  --run-name=my-run

Use the same --run-name, --nnodes, --nproc-per-node, and --master-addr on every node. Node 0 starts the TraceML aggregator. Other nodes connect to <node0-ip>:29765 by default. If workers need a different reachable address or port for TraceML telemetry, add --aggregator-host=<host> or --aggregator-port=<port> on every node. For multi-node runs, node 0 binds the aggregator to 0.0.0.0 by default; override that only when needed with --aggregator-bind-host=<bind-host>.
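
Because every node needs identical flags apart from `--node-rank`, it can help to generate the commands from one place. The helper below is a sketch of that. `node_commands` is a hypothetical name, and it uses only the flags shown in the commands above.

```python
import shlex


def node_commands(script, nnodes, nproc_per_node, master_addr, run_name):
    """Return one `traceml run` command string per node rank."""
    commands = []
    for node_rank in range(nnodes):
        args = [
            "traceml", "run", script,
            f"--nnodes={nnodes}",
            f"--node-rank={node_rank}",  # the only flag that differs per node
            f"--nproc-per-node={nproc_per_node}",
            f"--master-addr={master_addr}",
            f"--run-name={run_name}",
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder address for node 0.
    for cmd in node_commands("train.py", 2, 4, "10.0.0.1", "my-run"):
        print(cmd)
```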

--session-id remains accepted as a backward-compatible alias for --run-name.

Watch mode (no code changes)

traceml watch train.py

Watch mode collects system and process telemetry only, so no step instrumentation is needed.

Compare two runs

traceml compare before/final_summary.json after/final_summary.json

What TraceML measures

Signal	What it means
Input-bound	Dataloader is the bottleneck — GPU is waiting on data
Compute-bound	GPU is saturated — expected in a healthy run
Wait-heavy	Unattributed step time outside the traced phases
Rank imbalance	One rank consistently slower — straggler or uneven data
Memory creep	Peak allocation growing step-over-step
High pressure	Memory near capacity — risk of OOM
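
One way to picture the memory creep signal in the table above is as a positive trend in per-step peak memory. The sketch below fits a least-squares slope to a series of peaks. It is an illustration under that assumption, not TraceML's detector, and the sample values are hypothetical.

```python
def creep_slope(peaks_mb):
    """Least-squares slope of peak memory versus step index (MB per step)."""
    n = len(peaks_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(peaks_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(peaks_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


if __name__ == "__main__":
    flat = [192.0, 192.0, 192.0, 192.0]      # no growth
    growing = [100.0, 102.0, 104.0, 106.0]   # grows 2 MB per step
    print(creep_slope(flat), creep_slope(growing))
```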

wait is residual step time, not direct NCCL or all-reduce timing. In DDP, communication may overlap with backward. Use PyTorch Profiler or Nsight when you need explicit collective or kernel-level timing.
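
The residual reading of wait can be checked against run A in the compare report earlier in this README: total step time minus the input, forward, backward, and optimizer phases leaves the reported wait. The sketch below does that subtraction. The formula is inferred from the example figures, so TraceML's exact accounting may differ.

```python
def residual_wait(total_ms, input_ms, forward_ms, backward_ms, optimizer_ms):
    """Step time left over after subtracting the traced phases."""
    traced = input_ms + forward_ms + backward_ms + optimizer_ms
    return total_ms - traced


if __name__ == "__main__":
    # Run A from the compare report, in milliseconds.
    wait = residual_wait(294.0, 66.4, 45.0, 130.0, 22.2)
    print(f"wait = {wait:.1f} ms")  # the report lists 30.4 ms for run A
```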

Current support

Works today:

Single GPU training
Single-node multi-GPU DDP / FSDP training
Multi-node DDP summary reports
Ray Train through a thin TorchTrainer wrapper
Step Time, Step Memory, System, and Process diagnostics
Run-to-run comparison from final_summary.json
Custom PyTorch loops, Hugging Face, and PyTorch Lightning

Next:

Slurm launch examples
Broader multi-node FSDP validation
Multi-node live CLI / dashboard
Explicit collective / NCCL timing

Overhead

In our benchmark runs, TraceML adds less than 2% overhead on a single GPU and less than 1% on single-node multi-GPU at default settings.
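
To check overhead on your own workload, one simple approach is to time the same job with and without TraceML and compare wall-clock durations. The sketch below shows the arithmetic. The durations are hypothetical, and this is not necessarily how the benchmark figures above were produced.

```python
def overhead_pct(baseline_s, traced_s):
    """Percent slowdown of the traced run relative to the baseline run."""
    if baseline_s <= 0:
        raise ValueError("baseline duration must be positive")
    return (traced_s - baseline_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Hypothetical wall-clock durations in seconds.
    print(f"{overhead_pct(200.0, 203.0):.1f}% overhead")
```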

Feedback

If TraceML helped, a GitHub star helps others find it.

If you hit a problem or unexpected result, open an issue and include:

hardware / CUDA / PyTorch versions
single GPU or multi-GPU setup
training framework
the end-of-run summary
a minimal repro if possible
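
The version details in the list above can be gathered with a short script. The sketch below is one way to do it. `collect_env` is a hypothetical helper, not part of TraceML, and PyTorch is treated as optional so the script still runs where it is not installed.

```python
import platform
import sys


def collect_env():
    """Gather version details worth pasting into an issue."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch  # optional; reported as missing if not installed
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["gpu_count"] = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```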

GitHub issues: open an issue

Email: support@traceopt.ai

Contributing

Contributions are welcome, especially:

real slowdown examples and repros
distributed training edge cases
docs improvements
framework integrations

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).

Release history

Version	Released
0.3.0 (this version)	May 26, 2026
0.2.15	May 19, 2026
0.2.14	May 7, 2026
0.2.13	Apr 30, 2026
0.2.12	Apr 27, 2026
0.2.11	Apr 23, 2026
0.2.10	Apr 22, 2026
0.2.9	Apr 17, 2026
0.2.8	Apr 13, 2026
0.2.7	Apr 7, 2026
0.2.6	Apr 4, 2026
0.2.5	Mar 20, 2026
0.2.4	Mar 15, 2026
0.2.3	Mar 7, 2026
0.2.2	Feb 28, 2026
0.2.1	Feb 26, 2026
0.2.0	Feb 9, 2026
0.2.0a0 (pre-release)	Jan 27, 2026
0.1.9	Jan 3, 2026
0.1.8	Dec 25, 2025
0.1.6	Dec 11, 2025
0.1.5	Dec 10, 2025
0.1.3	Oct 8, 2025
0.1.1	Oct 2, 2025
0.1.0	Oct 2, 2025
