TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.

These details have not been verified by PyPI

Project description

TraceML

Find out why your PyTorch training is slow—before it wastes GPU hours.


Figure: TraceML live browser dashboard. Live performance diagnostics for single-node PyTorch training. Multi-node jobs are supported through summary mode.

TraceML is open-source performance observability for PyTorch training. It runs alongside your training loop and identifies where training time is going across the full job—not just a small window of profiled steps.

It helps you answer:

Is the GPU computing or waiting for the input pipeline?
Which phase is making each training step slower?
Is one distributed rank holding back the others?
Is memory usage silently growing?
Did a code, data, or infrastructure change make the run slower?

TraceML produces actionable diagnostics with under 1% overhead in current benchmarks.

Quickstart

1. Install TraceML

For the live browser dashboard, install TraceML with the dashboard extra:

pip install "traceml-ai[dashboard]"

Using Hugging Face Trainer, PyTorch Lightning, Ray Train, W&B, or MLflow? Start with the native integration path in Use With Your Stack.

2. Instrument the training step

Add TraceML around the core training step. You do not need to change your model, optimizer, loss function, or dataloader.

import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()

3. Run your training

Start your training with the live browser dashboard:

traceml run train.py

TraceML prints the dashboard URL, usually http://127.0.0.1:8765. Open it to see live bottleneck diagnostics while the job runs.

Or try the self-contained example first:

traceml run examples/quickstart.py

Running on a remote server?

SSH into the server and start TraceML there:

traceml run train.py

TraceML prints a tunnel command like this:

ssh -L 8765:127.0.0.1:8765 user@remote-host

Copy that tunnel command into a local terminal on your laptop. Leave the training command running on the server, then open http://127.0.0.1:8765 locally.
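If the page does not load, it can help to confirm that the forwarded port is actually accepting connections on your laptop. The following is a minimal sketch that uses only the Python standard library. The helper name `dashboard_reachable` is hypothetical and not part of TraceML, and the default port assumes the usual 8765 mentioned above.

```python
import socket


def dashboard_reachable(host: str = "127.0.0.1", port: int = 8765,
                        timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it only checks that the port is
    open, not that the listener is the TraceML dashboard.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if dashboard_reachable():
        print("Port 8765 is open; try http://127.0.0.1:8765 in a browser.")
    else:
        print("Nothing is listening on 127.0.0.1:8765; check the SSH tunnel.")
```

A failed check usually means the tunnel is not running locally or the training command on the server has already exited.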

If you want a live view without a browser or SSH tunnel, use terminal mode:

traceml run train.py --mode=cli

Use summary mode when no live display is needed, for example in headless jobs, CI, DDP, FSDP, Slurm, or multi-node runs:

traceml run train.py --mode=summary

For DDP, FSDP, Slurm, and multi-node runs, see Distributed Training.

Example Diagnosis

Instead of showing only utilization charts, TraceML explains what is slowing the job, presents the supporting evidence, and tells you where to investigate.

INPUT STRAGGLER / CRITICAL

Rank 0 spent 254.5 ms waiting for input versus 3.8 ms on the median rank.

Next: inspect the dataloader, preprocessing, collate_fn, storage, and worker configuration on rank 0.

Complete terminal report:

+----------------------------------------------------------------------------+
|  TraceML Run Summary | duration 40.1s                                      |
+----------------------------------------------------------------------------+
|                                                                            |
|  TraceML Verdict: INPUT STRAGGLER / CRITICAL                               |
|  Why: Rank r0 input wait was 254.5ms vs median rank r1 at 3.8ms.           |
|  Next: Inspect dataloader, collate_fn, preprocessing, and storage on the   |
|  slow rank.                                                                |
|                                                                            |
|  Section Status                                                            |
|  Section       Status                  Severity                            |
|  ------------------------------------------------                          |
|  Step Time     INPUT STRAGGLER         CRITICAL                            |
|  System        LOW GPU UTIL            INFO                                |
|  Process       NORMAL                  INFO                                |
|  Step Memory   BALANCED                INFO                                |
|                                                                            |
|  System Evidence                                                           |
|  Metric          Median        Worst         Skew        Scope             |
|  --------------------------------------------------------------------------|
|  CPU Util        18.4%         71.2%         52.8pp      node=n1           |
|  GPU Util        14.0%         0.0%          14.0pp      node=n0           |
|  GPU Memory      6.20GB        8.90GB        43.5%       node=n1           |
|  GPU Temp        42C           58C           16C         node=n1           |
|                                                                            |
|  Step Time Evidence                                                        |
|  Phase           Median        Worst         Skew        Scope             |
|  --------------------------------------------------------------------------|
|  Total           303.7ms       304.1ms       0.1%        rank=r0 node=n0   |
|  Input Wait      3.8ms         254.5ms       6597.4%     rank=r0 node=n0   |
|  Compute         259.5ms       261.0ms       0.6%        rank=r2 node=n1   |
+----------------------------------------------------------------------------+

In this example, rank 0 is the slow input rank and can hold back the aligned distributed step.
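The Skew column in the report above can be reproduced from the Median and Worst columns. The sketch below is an inference from the printed numbers, not a documented TraceML formula: time and memory rows appear to use the relative gap (worst minus median, divided by median), while rows already expressed in percent or degrees appear to use the absolute gap. The function names are hypothetical.

```python
def relative_skew_pct(median: float, worst: float) -> float:
    """Percent by which the worst value exceeds the median (assumed formula)."""
    return (worst - median) / median * 100.0


def absolute_skew(median: float, worst: float) -> float:
    """Absolute gap, as the report seems to use for 'pp' and temperature rows."""
    return abs(worst - median)


if __name__ == "__main__":
    # (median, worst) pairs copied from the example report.
    relative_rows = {
        "Total (ms)": (303.7, 304.1),
        "Input Wait (ms)": (3.8, 254.5),
        "Compute (ms)": (259.5, 261.0),
        "GPU Memory (GB)": (6.20, 8.90),
    }
    for name, (median, worst) in relative_rows.items():
        print(f"{name:<16} {relative_skew_pct(median, worst):.1f}%")

    absolute_rows = {"CPU Util": (18.4, 71.2), "GPU Util": (14.0, 0.0)}
    for name, (median, worst) in absolute_rows.items():
        print(f"{name:<16} {absolute_skew(median, worst):.1f}pp")
```

Read this way, the 6597.4% on the Input Wait row says the slow rank waited roughly 67 times longer for input than the median rank, which is why the verdict points at the input pipeline rather than compute.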

Want to reproduce a specific bottleneck? See examples/ for self-contained demos covering dataloader bottlenecks, H2D timing, DDP rank stragglers, Lightning, Hugging Face, Ray, and tracker-friendly summary logging.

What TraceML Helps You Triage

Use TraceML as the first check before opening a heavier profiler. It surfaces the likely bottleneck category so you know where to look next.

Area	What TraceML surfaces	What to inspect next
Input pipeline	High input time or a slow input rank	`num_workers`, `pin_memory`, transforms, tokenization, `collate_fn`, dataset and storage latency
GPU utilization	Step time split across input, compute, and residual time	input pipeline, CPU/GPU handoff, synchronization, distributed coordination
Distributed skew	One DDP or FSDP rank slower than the others	rank-local dataloading, data imbalance, node variance, storage, and network differences
Memory creep	Memory usage growing during the run	retained tensors, logging references, loss accumulation, cached activations
Run regression	Changed metrics versus a known-good run	code, data, batch size, container, driver, hardware, and infrastructure changes
Compute-heavy runs	Most time is spent in compute	`torch.profiler`, Kineto, or Nsight for operator- and kernel-level detail
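To make the input-versus-compute split in the table concrete, here is a minimal, framework-free sketch that separates time spent waiting for the next batch from time spent in the step itself. It is not how TraceML is implemented; `timed_batches` and `profile_loop` are hypothetical names, and the sleeps stand in for a slow dataloader and a fast training step. On a GPU, host-side timers like these only reflect device time if the step synchronizes, because CUDA kernels are launched asynchronously.

```python
import time
from typing import Any, Callable, Iterable, Iterator, Tuple


def timed_batches(loader: Iterable) -> Iterator[Tuple[float, Any]]:
    """Yield (seconds spent waiting for the batch, batch) for each batch."""
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield time.perf_counter() - start, batch


def profile_loop(loader: Iterable, step: Callable[[Any], None]) -> dict:
    """Accumulate input-wait and step time over one pass through the loader."""
    totals = {"input_s": 0.0, "compute_s": 0.0, "steps": 0}
    for wait_s, batch in timed_batches(loader):
        start = time.perf_counter()
        step(batch)
        totals["compute_s"] += time.perf_counter() - start
        totals["input_s"] += wait_s
        totals["steps"] += 1
    return totals


if __name__ == "__main__":
    def slow_loader():
        # Simulates an input pipeline that takes 20 ms per batch.
        for i in range(5):
            time.sleep(0.02)
            yield i

    def fast_step(batch):
        # Simulates a step that takes about 1 ms.
        time.sleep(0.001)

    totals = profile_loop(slow_loader(), fast_step)
    print(f"input   {totals['input_s'] * 1000:.1f} ms")
    print(f"compute {totals['compute_s'] * 1000:.1f} ms")
```

When the input total dominates, as in this simulated run, the first things to inspect are the items in the "Input pipeline" row of the table.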

Display Modes

Choose the interface that fits the environment without changing the saved end-of-run artifacts.

Mode	Experience during training	Supported topology
`--mode=dashboard`	Live browser dashboard	Single-node; requires `pip install "traceml-ai[dashboard]"`
`--mode=cli`	Live terminal diagnostics	Single-node, including multi-GPU
`--mode=summary`	Silent execution with end-of-run report	Single-node and multi-node multi-GPU
`mode="auto"`	Selects an appropriate runtime display	Use when embedding TraceML in training code

Headless, CI, or capturing stdout? Use --mode=summary. TraceML still writes the same .json and .txt artifacts at the end of the run.

Figure: TraceML live terminal view (--mode=cli). Live terminal diagnostics for local and SSH workflows.

Saved Run Artifacts

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt

Reprint a saved summary without rerunning training:

traceml view logs/<run_name>/final_summary.json

Create a self-contained HTML report during the run:

traceml run train.py --html-report

Or render one later from a saved summary:

traceml view logs/<run_name>/final_summary.json --html

For experiment trackers, call traceml.summary() near the end of your script to get a flat dictionary of diagnosis statuses and average metrics. Keep final_summary.json when you want the complete run artifact or an input for traceml compare.
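The schema of final_summary.json is not described here, so a cautious way to explore a saved artifact is to flatten whatever JSON it contains into dotted key paths. The sketch below uses only the standard library; `flatten` and `load_flat_summary` are hypothetical helpers, and it makes no assumption about which keys TraceML actually writes.

```python
import json
from pathlib import Path
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {"a.b.0.c": value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Leaf value: return it under the path built so far.
        return {prefix: obj}
    flat: Dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def load_flat_summary(path: str) -> Dict[str, Any]:
    """Read a JSON file and return its contents as a flat dictionary."""
    return flatten(json.loads(Path(path).read_text(encoding="utf-8")))


if __name__ == "__main__":
    import sys

    # Usage: python inspect_summary.py logs/<run_name>/final_summary.json
    for key, value in sorted(load_flat_summary(sys.argv[1]).items()):
        print(f"{key} = {value}")
```

Printing the flattened keys once is a quick way to decide which fields are worth forwarding to an experiment tracker.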

Compare Runs and Catch Regressions

Compare a slow run against a known-good baseline:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Metric                         A                B                Delta              |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
+--------------------------------------------------------------------------------------+

See Compare Runs for the full report format.
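The Delta column can be checked by hand with the usual absolute and percent change from run A to run B. This is a small sketch under that assumption, with a hypothetical function name. Note that it works from the rounded values printed in the table, so the absolute change for the total step comes out as -281.0 ms rather than the -280.9 ms shown, which presumably was computed from unrounded measurements.

```python
from typing import Tuple


def delta(a: float, b: float) -> Tuple[float, float]:
    """Return (absolute change, percent change) going from run A to run B."""
    diff = b - a
    return diff, diff / a * 100.0


if __name__ == "__main__":
    # (A, B) values in milliseconds, copied from the example compare report.
    rows = {"Total step": (294.0, 13.0), "Input": (66.4, 2.7)}
    for name, (a, b) in rows.items():
        diff_ms, diff_pct = delta(a, b)
        print(f"{name:<12} {diff_ms:+.1f} ms ({diff_pct:+.1f}%)")
```

A negative delta means run B is faster than run A, which matches the IMPROVEMENT verdict in the example.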

Use With Your Stack

TraceML supports:

Custom PyTorch training loops
Hugging Face Trainer
PyTorch Lightning
Ray Train
W&B and MLflow summary logging
DDP and FSDP
Slurm and multi-node summary reports

See Use With Your Stack for integration examples.

Where TraceML Fits

Tool	Use it for	Not for
TraceML	Full-run bottlenecks, runtime diagnostics, rank skew, and run regressions	Kernel- or operator-level timelines
`torch.profiler` / Kineto	Operator and CUDA traces for selected steps	Always-on full-run summaries
Nsight Systems	Deep GPU and kernel timeline debugging	Everyday training triage
Holistic Trace Analysis	Analyzing collected profiler traces	Live or full-run collection
W&B / MLflow	Experiment tracking, metrics, and run history	Runtime bottleneck diagnosis

Start with TraceML to identify the bottleneck category. Open a deeper profiler when you need operator- or kernel-level detail.

Current Support

Works today:

Single-GPU training
Single-node multi-GPU DDP and FSDP
Multi-node DDP summary reports
Multi-node runs on Slurm
Run-to-run comparison from final_summary.json
Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train

On the roadmap:

Multi-node live CLI and browser dashboard
Explicit collective and NCCL timing

Troubleshooting Guides

Feedback

For bugs, unexpected results, or feature requests, open a GitHub issue using the matching issue template.

The templates ask for the information needed to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch and CUDA versions, and redacted summary output.

If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.

Contributing

Contributions are welcome, especially:

Real slowdown examples and reproductions
Distributed training edge cases
Documentation improvements
Framework integrations

See CONTRIBUTING.md for development setup and contribution guidelines.

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).

Project details


Release history

Version	Released
0.3.4 (this version)	Jul 14, 2026
0.3.3	Jul 3, 2026
0.3.2	Jun 30, 2026
0.3.1	Jun 2, 2026
0.3.0	May 26, 2026
0.2.15	May 19, 2026
0.2.14	May 7, 2026
0.2.13	Apr 30, 2026
0.2.12	Apr 27, 2026
0.2.11	Apr 23, 2026
0.2.10	Apr 22, 2026
0.2.9	Apr 17, 2026
0.2.8	Apr 13, 2026
0.2.7	Apr 7, 2026
0.2.6	Apr 4, 2026
0.2.5	Mar 20, 2026
0.2.4	Mar 15, 2026
0.2.3	Mar 7, 2026
0.2.2	Feb 28, 2026
0.2.1	Feb 26, 2026
0.2.0	Feb 9, 2026
0.2.0a0 (pre-release)	Jan 27, 2026
0.1.9	Jan 3, 2026
0.1.8	Dec 25, 2025
0.1.6	Dec 11, 2025
0.1.5	Dec 10, 2025
0.1.3	Oct 8, 2025
0.1.1	Oct 2, 2025
0.1.0	Oct 2, 2025

Download files


Source Distribution

traceml_ai-0.3.4.tar.gz (392.9 kB view details)

Uploaded Jul 14, 2026 Source

Built Distribution


traceml_ai-0.3.4-py3-none-any.whl (521.3 kB view details)

Uploaded Jul 14, 2026 Python 3

File details

Details for the file traceml_ai-0.3.4.tar.gz.

File metadata

Download URL: traceml_ai-0.3.4.tar.gz
Upload date: Jul 14, 2026
Size: 392.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for traceml_ai-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`ac85f9df9c37309319040ffee4632b6ea8b27af00f0ebfe040eb5765f6ddaccf`
MD5	`a581654df2f4c5f11de9b57e8252135e`
BLAKE2b-256	`1303fb71f6d3a452a6885c67157f9c79f83a3790d9c840dce7a01e24c3c35069`


File details

Details for the file traceml_ai-0.3.4-py3-none-any.whl.

File metadata

Download URL: traceml_ai-0.3.4-py3-none-any.whl
Upload date: Jul 14, 2026
Size: 521.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for traceml_ai-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`58f5b3d8f2bf0293545ff57d150c8c66627e633b64d050b704071edc51ef9042`
MD5	`547612e52b1714655bc28ff49809b067`
BLAKE2b-256	`98853f4b9a5796ebdac911f0a851a848f9e6282b93ffebdd4490076469d19a73`
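The published digests can be used to check a manually downloaded file before installing it. The following is a minimal sketch with the standard library's hashlib; the function names are hypothetical, and the expected value should be copied from the SHA256 row for the file you downloaded.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the published one, ignoring case and spaces."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <downloaded file> <published SHA256 digest>
    ok = matches_sha256(sys.argv[1], sys.argv[2])
    print("digest matches" if ok else "digest MISMATCH: do not install this file")
```

For automated installs, pip's hash-checking mode (`--require-hashes` with `--hash=sha256:...` entries in a requirements file) performs the same comparison at install time.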

