TraceML: Lightweight runtime bottleneck diagnostics for PyTorch training.

TraceML

Runtime bottleneck detection for PyTorch training jobs.

TraceML gives every PyTorch training run a structured performance fingerprint with low overhead (<2% in our current benchmark runs). It answers the questions that usually come before heavyweight operator-level profiling:

Are my GPUs waiting on a slow dataloader (input-bound)?
Is one distributed rank consistently slower than the others (straggler)?
Is memory usage silently creeping upward during the run (memory creep)?
Did a recent code or infrastructure change slow training down (regression)?
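As an illustration of the memory-creep question above, here is a minimal sketch that fits a least-squares line to per-step memory samples and flags a sustained upward slope. The function names, the sample values, and the 1 MB/step threshold are assumptions made for this example; the source does not describe how TraceML itself detects memory creep.

```python
from statistics import mean


def memory_slope_mb_per_step(samples_mb):
    """Least-squares slope of memory (MB) against step index."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples_mb)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples_mb))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_creep(samples_mb, min_slope_mb=1.0):
    # min_slope_mb is an illustrative threshold, not a TraceML default.
    return memory_slope_mb_per_step(samples_mb) >= min_slope_mb


# Hypothetical samples: one flat but noisy run, one growing 5 MB per step.
steady = [1000.0, 1002.0, 999.0, 1001.0, 1000.0]
creeping = [1000.0 + 5.0 * i for i in range(5)]

print(looks_like_creep(steady), looks_like_creep(creeping))
```

A slope fit is less sensitive to single-step spikes than comparing the first and last sample, which is why it is used in this sketch.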

Where TraceML Fits in the Stack

TraceML does not replace torch.profiler. It is the low-overhead, always-on first pass that tells you where to aim heavier profiling tools.

| Tool | Best used for | Output | Cost / overhead |
| --- | --- | --- | --- |
| TraceML | Classifying high-level bottlenecks: input, compute, wait, memory, rank skew | JSON fingerprint, text summary, live views | <2% in current benchmark runs; small code wrapper |
| `torch.profiler` | Inspecting expensive ops, kernels, and CUDA activity | Profiler trace | Higher overhead; requires profiler context |
| Nsight Systems | Debugging low-level CUDA and kernel behavior | GPU timeline | Separate profiler run |
| W&B / MLflow | Tracking training metrics and experiment history | Metrics dashboard / run history | Logging integration |
| `nvidia-smi` | Checking machine-level GPU health and utilization | Terminal metrics | No code changes |

3-Minute Quickstart

1. Install the package

pip install traceml-ai

2. Wrap your training step

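import traceml_ai as traceml

# Initialize TraceML once, before the training loop starts.
traceml.init(mode="auto")

# model, dataloader, criterion, and optimizer are your existing training objects.
for batch in dataloader:
    # Wrap one full training step: forward, backward, and optimizer update.
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()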

3. Run your script

traceml run train.py

For DDP, FSDP, and multi-node runs, see Distributed Training.

What You Get: The Output

TraceML writes two end-of-run artifacts:

logs/<run_name>/final_summary.json
logs/<run_name>/final_summary.txt
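If you keep several runs under the same logs directory, a small helper can collect their JSON artifacts for later inspection. This is a sketch that relies only on the logs/<run_name>/final_summary.json layout shown above; the helper name `load_summaries` is hypothetical, and the demo payload is a placeholder rather than the real TraceML schema.

```python
import json
import tempfile
from pathlib import Path


def load_summaries(log_root="logs"):
    """Map each run directory name to its parsed final_summary.json."""
    summaries = {}
    for path in sorted(Path(log_root).glob("*/final_summary.json")):
        with path.open(encoding="utf-8") as f:
            summaries[path.parent.name] = json.load(f)
    return summaries


# Demo against a throwaway directory. The JSON content is a placeholder,
# not the real TraceML schema.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run_a"
    run_dir.mkdir()
    (run_dir / "final_summary.json").write_text(
        json.dumps({"placeholder": 1}), encoding="utf-8"
    )
    # A run directory without a summary is skipped.
    (Path(tmp) / "run_without_summary").mkdir()
    found = load_summaries(tmp)

print(found)
```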

Instead of guessing why training feels slow, you get a compact diagnosis of where step time and memory went:

+----------------------------------------------------------------------------+
|  Step Time                                                                 |
|  - Diagnosis: INPUT STRAGGLER                                              |
|  - Scope: compared over last 460 aligned steps across 4 global ranks       |
|  - Stats: total 303.7ms | input 254.5ms | compute 259.5ms | wait 40.5ms    |
|  - Why: r0 input was slower than median global rank (254.5/3.8ms).         |
+----------------------------------------------------------------------------+

In this example, rank 0 is the slow input rank, which can hold back the aligned distributed step.
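The "slower than median global rank" reasoning can be sketched as a comparison of each rank's input time against the median across ranks. The code below is an illustration under assumptions, not TraceML's actual rule: the function name and the 2x ratio threshold are hypothetical, and only the rank 0 value (254.5 ms) and the median (3.8 ms) come from the example above; the other per-rank values are made up to match that median.

```python
from statistics import median


def find_input_stragglers(input_ms_by_rank, ratio_threshold=2.0):
    """Return ranks whose input time is at least ratio_threshold x the median."""
    med = median(input_ms_by_rank.values())
    if med <= 0:
        return []
    return [
        rank
        for rank, ms in sorted(input_ms_by_rank.items())
        if ms / med >= ratio_threshold
    ]


# Rank 0 and the median follow the report above; ranks 1-3 are hypothetical.
input_ms = {0: 254.5, 1: 3.6, 2: 3.8, 3: 3.8}
stragglers = find_input_stragglers(input_ms)

print(stragglers)
```

Comparing against the median rather than the mean keeps the baseline from being pulled upward by the straggler itself.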

For experiment trackers, call traceml.summary() near the end of your script to get a flat dict of diagnosis statuses and average metrics. Keep final_summary.json when you want the full run artifact or an input for traceml compare.
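Most experiment trackers log numeric metrics and string tags separately, so a flat dict usually needs to be split before logging. The sketch below shows one way to do that. The keys in `example_summary` are hypothetical stand-ins: the source only says the dict holds diagnosis statuses and average metrics, so check the real output of traceml.summary() before relying on any key name.

```python
def split_for_tracker(summary, prefix="traceml/"):
    """Split a flat summary dict into numeric metrics and string tags."""
    metrics, tags = {}, {}
    for key, value in summary.items():
        name = f"{prefix}{key}"
        if isinstance(value, bool):
            # bool is a subclass of int, so handle it before the numeric check.
            tags[name] = str(value)
        elif isinstance(value, (int, float)):
            metrics[name] = float(value)
        elif value is not None:
            tags[name] = str(value)
    return metrics, tags


# Hypothetical keys; the values echo the step-time report shown earlier.
example_summary = {
    "step_time_diagnosis": "INPUT STRAGGLER",
    "step_total_ms": 303.7,
    "step_input_ms": 254.5,
}
metrics, tags = split_for_tracker(example_summary)

print(metrics)
print(tags)
```

The resulting `metrics` dict can then be passed to whatever logging call your tracker provides, with `tags` stored as run metadata.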

Catching Regressions (Compare Mode)

Compare a slow run against a known good baseline to identify which metrics changed:

traceml compare input_slow/final_summary.json input_fixed/final_summary.json

+--------------------------------------------------------------------------------------+
|  TraceML Compare                                                                     |
+--------------------------------------------------------------------------------------+
|  Verdict: IMPROVEMENT                                                                |
|  Why: Step time decreased by 95.6%.                                                  |
|                                                                                      |
|  Metric                         A                B                Delta              |
|  Total step                     294.0 ms         13.0 ms          -280.9 ms (-95.6%) |
|  Input                          66.4 ms          2.7 ms           -63.7 ms (-95.9%)  |
+--------------------------------------------------------------------------------------+

See Compare Runs for the full report format.
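The Delta column in the report above is an absolute change plus a percentage relative to run A. A minimal sketch of that arithmetic follows. The function names, the 5% tolerance, and the "REGRESSION" and "NO CHANGE" labels are assumptions for illustration; only the IMPROVEMENT verdict appears in the report. Because the inputs here are the rounded values from the table, the total-step delta comes out as -281.0 ms rather than the -280.9 ms shown, which presumably reflects unrounded inputs.

```python
def compare_metric(a_ms, b_ms):
    """Return (absolute delta in ms, percent change relative to run A)."""
    delta = b_ms - a_ms
    pct = (delta / a_ms * 100.0) if a_ms else float("nan")
    return delta, pct


def verdict(total_pct, tolerance_pct=5.0):
    # tolerance_pct and the non-IMPROVEMENT labels are illustrative assumptions.
    if total_pct <= -tolerance_pct:
        return "IMPROVEMENT"
    if total_pct >= tolerance_pct:
        return "REGRESSION"
    return "NO CHANGE"


# Rounded values taken from the compare report above.
total_delta, total_pct = compare_metric(294.0, 13.0)
input_delta, input_pct = compare_metric(66.4, 2.7)

print(f"Total step: {total_delta:+.1f} ms ({total_pct:+.1f}%)")
print(f"Input:      {input_delta:+.1f} ms ({input_pct:+.1f}%)")
print("Verdict:", verdict(total_pct))
```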

Display Modes

The --mode flag controls what you see during training. It does not change the final saved artifacts.

| Mode flag | Experience during training | Supported topology |
| --- | --- | --- |
| `--mode=summary` (default) | Silent execution | Single-node and multi-node multi-GPU |
| `--mode=cli` | Live terminal display | Single-node, including multi-GPU |
| `--mode=dashboard` | Live browser display | Single-node; requires `pip install "traceml-ai[dashboard]"` |

Current support

Works today:

Single GPU training
Single-node multi-GPU DDP / FSDP
Multi-node DDP summary reports
Run-to-run comparison from final_summary.json
Custom PyTorch loops, Hugging Face, PyTorch Lightning, and Ray Train

On the roadmap:

Slurm launch examples
Multi-node live CLI / browser dashboard
Explicit collective / NCCL timing

Overhead

In our benchmark runs, TraceML adds <2% overhead on single GPU and <1% on single-node multi-GPU at default settings.
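To check overhead on your own workload, one common approach is to compare mean step time with and without instrumentation. The sketch below shows that arithmetic; the function name and the timing values are hypothetical, and the source does not describe how the benchmark figures above were measured.

```python
from statistics import mean


def overhead_percent(baseline_step_ms, traced_step_ms):
    """Relative slowdown of the traced run versus the baseline, in percent."""
    base = mean(baseline_step_ms)
    if base <= 0:
        raise ValueError("baseline step time must be positive")
    return (mean(traced_step_ms) - base) / base * 100.0


# Hypothetical per-step timings (ms) from two otherwise identical runs.
baseline = [100.0, 101.0, 99.0, 100.0]
traced = [101.0, 102.0, 101.0, 102.0]
pct = overhead_percent(baseline, traced)

print(f"overhead: {pct:.2f}%")
```

Skip warm-up steps and use the same seed, data, and hardware for both runs; otherwise run-to-run noise can easily exceed the effect being measured.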

Feedback

For bugs, unexpected results, or feature requests, open a GitHub issue and use the matching issue template. The templates ask for the details we need to reproduce training-environment problems, including hardware, topology, launch command, TraceML version, PyTorch/CUDA versions, and redacted summary output.

GitHub issues: open an issue

If TraceML helped you find a real bottleneck, use the "I found a bottleneck" issue template. These reports help other training teams recognize similar problems.

Security reports: see SECURITY.md

Email: support@traceopt.ai

Contributing

Contributions are welcome, especially:

real slowdown examples and repros
distributed training edge cases
docs improvements
framework integrations

See CONTRIBUTING.md for development setup and contribution guidelines.

License

Apache 2.0. See LICENSE.

TraceOpt is a trademark of OptAI UG (haftungsbeschränkt).

Release history

0.3.1 (this version): Jun 2, 2026
0.3.0: May 26, 2026
0.2.15: May 19, 2026
0.2.14: May 7, 2026
0.2.13: Apr 30, 2026
0.2.12: Apr 27, 2026
0.2.11: Apr 23, 2026
0.2.10: Apr 22, 2026
0.2.9: Apr 17, 2026
0.2.8: Apr 13, 2026
0.2.7: Apr 7, 2026
0.2.6: Apr 4, 2026
0.2.5: Mar 20, 2026
0.2.4: Mar 15, 2026
0.2.3: Mar 7, 2026
0.2.2: Feb 28, 2026
0.2.1: Feb 26, 2026
0.2.0: Feb 9, 2026
0.2.0a0 (pre-release): Jan 27, 2026
0.1.9: Jan 3, 2026
0.1.8: Dec 25, 2025
0.1.6: Dec 11, 2025
0.1.5: Dec 10, 2025
0.1.3: Oct 8, 2025
0.1.1: Oct 2, 2025
0.1.0: Oct 2, 2025

Download files

Source Distribution

traceml_ai-0.3.1.tar.gz (318.5 kB)

Uploaded Jun 2, 2026 Source

Built Distribution

traceml_ai-0.3.1-py3-none-any.whl (465.1 kB)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file traceml_ai-0.3.1.tar.gz.

File metadata

Download URL: traceml_ai-0.3.1.tar.gz
Upload date: Jun 2, 2026
Size: 318.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for traceml_ai-0.3.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `27e069ddb1c20fb0635029f2d935e6cfbbd502ab7552ec4d76f7d9d9eeff5994` |
| MD5 | `9ee480f8309a35d1e2435b70991673d2` |
| BLAKE2b-256 | `245bd868aafe1a6c9723c65f3fdf3eaaff646baabf31bde4f0724c91040a0c66` |

File details

Details for the file traceml_ai-0.3.1-py3-none-any.whl.

File metadata

Download URL: traceml_ai-0.3.1-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 465.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for traceml_ai-0.3.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `90d457b89ddaa2d05a5d256d92e11bc24a48a079cbd723d2e6c7b0ea47980a45` |
| MD5 | `29fdca6d7b980c6a172cb83cb5d4df86` |
| BLAKE2b-256 | `d11d8cd37b6558b4757e60bbc1bf6655f90b310b597fa09a9ad1811927b9dac7` |
