TraceML: Lightweight training runtime health monitor.

These details have not been verified by PyPI

Project description

TraceML

Catch wasted GPU time during live PyTorch training

TraceML is a lightweight bottleneck finder for PyTorch training. It helps you catch input stalls, DDP rank imbalance, unstable step times, and memory drift while the run is still in progress.

Works today: Single GPU, single-node DDP, Hugging Face Trainer, PyTorch Lightning

Not yet: Multi-node DDP, FSDP / TP / PP

Why TraceML

When training feels slow, a wall-clock timer tells you only that it is slow. TraceML helps show where the time is going and what looks wrong while the job is still running.

Use it to answer:

Is the input pipeline starving the GPU?
Are step times drifting or jittering?
Is one DDP rank lagging behind the others?
Is memory creeping up over time?
How much time is going into forward, backward, optimizer, and overhead?

TraceML is designed for real runs, not only postmortem profiling.

What TraceML gives you

Live during training

step-time breakdown
dataloader / input wait visibility
forward / backward / optimizer / overhead timing
step jitter and drift
GPU memory trend
CPU / RAM / GPU signals
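
To make "step jitter and drift" concrete, here is a minimal sketch of how such signals could be computed from a list of per-step durations. It is not TraceML's implementation: the function name `step_time_stats`, the `window` parameter, and the definitions used (jitter as coefficient of variation, drift as relative change between early and late medians) are all assumptions chosen for illustration.

```python
import statistics


def step_time_stats(step_times, window=20):
    """Summarize jitter and drift for per-step durations in seconds.

    Hypothetical helper: the definitions below are illustrative assumptions.
    """
    if len(step_times) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    mean = statistics.fmean(step_times)
    # Jitter (assumed definition): standard deviation relative to the mean.
    jitter = statistics.pstdev(step_times) / mean
    # Drift (assumed definition): relative change between the median of the
    # first `window` steps and the median of the last `window` steps.
    early = statistics.median(step_times[:window])
    late = statistics.median(step_times[-window:])
    drift = (late - early) / early
    return {"mean_s": mean, "jitter_cv": jitter, "drift": drift}


# Synthetic run: 40 steady steps at 100 ms, then 40 slower steps at 120 ms.
times = [0.100] * 40 + [0.120] * 40
stats = step_time_stats(times)
print(f"jitter={stats['jitter_cv']:.3f} drift={stats['drift']:+.1%}")
```

Medians are used for drift so that a single slow step (for example a checkpoint write) does not dominate the result.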

At the end of the run

a compact summary you can review quickly
something easy to paste into an issue or share with a teammate
a clearer starting point before using heavier profilers

Quick Start

Install:

pip install traceml-ai

Wrap your training step:

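from traceml.decorators import trace_step

# model, dataloader, criterion, and optimizer come from your existing training script.
for batch in dataloader:
    # Wrap the whole training step so it is traced as one unit.
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)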

Run your script through TraceML:

traceml run train.py

During training, TraceML opens a live terminal view alongside your logs.

TraceML terminal dashboard

At run end, it prints a compact summary.

TraceML summary

If you want a richer view, TraceML also includes a local UI for reviewing and comparing runs.

TraceML local UI

See docs/quickstart.md for more setup details.

Why not just use timers?

Simple timers are useful, but they usually do not show:

which part of the training step is growing
whether the slowdown is coming from input, compute, optimizer, or overhead
whether one DDP rank is slower than the others
whether memory is drifting over time
what the run looked like before it fully finished

TraceML is built to make those patterns visible with minimal code changes.
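
As a rough illustration of separating input wait from compute, the sketch below times `next()` on the batch iterator separately from the step itself. This is a hand-rolled example, not how TraceML measures things: `measure_input_wait`, `slow_batches`, and the injectable `clock` parameter are hypothetical names introduced here.

```python
import time


def measure_input_wait(batches, train_step, clock=time.perf_counter):
    """Split loop time into waiting for the next batch vs. running the step."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)  # blocks while the input pipeline produces a batch
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)  # forward / backward / optimizer would run here
        t2 = clock()
        wait_s += t1 - t0
        compute_s += t2 - t1
    total = wait_s + compute_s
    return {
        "wait_s": wait_s,
        "compute_s": compute_s,
        "wait_fraction": wait_s / total if total else 0.0,
    }


def slow_batches(n, delay_s=0.02):
    """Simulated input pipeline that takes delay_s to produce each batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


result = measure_input_wait(slow_batches(5), train_step=lambda b: time.sleep(0.005))
print(f"input wait fraction: {result['wait_fraction']:.0%}")
```

A high wait fraction suggests the input pipeline is the bottleneck. Note that on a GPU, CUDA work is launched asynchronously, so a host-side timer like this one would need explicit synchronization (or CUDA events) to attribute compute time accurately.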

Works with your training stack

Plain PyTorch

Use trace_step(model) around your training step.

Hugging Face Trainer

Replace Trainer with TraceMLTrainer:

from traceml.hf_decorators import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)

See docs/huggingface.md.

PyTorch Lightning

Add TraceMLCallback() to your trainer:

import lightning as L
from traceml.utils.lightning import TraceMLCallback

trainer = L.Trainer(callbacks=[TraceMLCallback()])

See the Lightning docs for the full setup.

What TraceML surfaces

Step-level breakdown

TraceML tracks:

dataloader -> forward -> backward -> optimizer -> overhead
step time
GPU memory (allocated + peak)
CPU / RAM / GPU signals
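
One way to read the dataloader -> forward -> backward -> optimizer -> overhead breakdown is as named phases inside a step, with overhead being whatever time is left over. The sketch below is a hypothetical pure-Python illustration of that idea, not TraceML's implementation: `StepBreakdown` and its methods are made-up names, and treating overhead as the residual is an assumption.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepBreakdown:
    """Accumulate per-phase durations within one step (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.phases = defaultdict(float)
        self.total = 0.0

    @contextmanager
    def step(self):
        # Reset phase timings and measure the whole step.
        self.phases.clear()
        start = self.clock()
        try:
            yield self
        finally:
            self.total = self.clock() - start

    @contextmanager
    def phase(self, name):
        # Add the elapsed time of this block to the named phase.
        start = self.clock()
        try:
            yield
        finally:
            self.phases[name] += self.clock() - start

    def report(self):
        out = dict(self.phases)
        # Assumption: overhead is step time not covered by any named phase.
        out["overhead"] = self.total - sum(self.phases.values())
        out["step"] = self.total
        return out


bd = StepBreakdown()
with bd.step():
    with bd.phase("forward"):
        time.sleep(0.01)
    with bd.phase("backward"):
        time.sleep(0.02)
    with bd.phase("optimizer"):
        time.sleep(0.005)
    time.sleep(0.003)  # untracked work shows up as overhead
print({name: round(seconds, 3) for name, seconds in bd.report().items()})
```

Defining overhead as a residual means the phases always sum to the step time, so a growing overhead share is visible even when no individual phase changes.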

DDP imbalance

In single-node DDP, TraceML surfaces:

median rank
worst rank
skew (%)

This makes stragglers easier to spot without extra instrumentation.
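
For intuition, the three numbers above can be derived from per-rank step times roughly as follows. This is a sketch under assumptions: the README does not state the exact formula, so skew is taken here to mean how much slower the worst rank is than the median, and `rank_skew` is a hypothetical helper name.

```python
import statistics


def rank_skew(step_time_by_rank):
    """Return median time, worst rank, and skew (%) from {rank: seconds}."""
    if not step_time_by_rank:
        raise ValueError("no ranks provided")
    median_s = statistics.median(step_time_by_rank.values())
    worst_rank = max(step_time_by_rank, key=step_time_by_rank.get)
    worst_s = step_time_by_rank[worst_rank]
    # Assumed definition: percentage by which the worst rank exceeds the median.
    skew_pct = 100.0 * (worst_s - median_s) / median_s
    return {
        "median_s": median_s,
        "worst_rank": worst_rank,
        "worst_s": worst_s,
        "skew_pct": skew_pct,
    }


# Four ranks; rank 3 is a straggler.
times = {0: 0.200, 1: 0.200, 2: 0.200, 3: 0.250}
summary = rank_skew(times)
print(f"worst rank {summary['worst_rank']}, skew {summary['skew_pct']:.1f}%")
```

In a real DDP job the per-rank values would first have to be collected on one process, for example with `torch.distributed.all_gather_object`.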

Optional model-level hooks

If you want extra model-level context, enable lightweight hooks:

from traceml.decorators import trace_model_instance

trace_model_instance(model)

Use this together with trace_step(model) to add optional per-layer timing and memory signals. The core step-level view works without it.

Scope

TraceML focuses on lightweight diagnosis during real PyTorch training runs.

It is not:

a kernel-level tracer
an auto-tuner
a replacement for deep profiling tools
a full observability platform

Safe to try on real runs

TraceML is built for practical training workflows:

lightweight enough to use during real runs
compact terminal output during training
end-of-run summary for quick review and sharing
fail-open behavior: if instrumentation hits a problem, training is meant to continue, so TraceML does not become the center of your training script
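
The fail-open idea in the last point can be sketched generically: errors raised by instrumentation are logged and swallowed, while the training code itself runs untouched. The decorator below is an illustrative pattern only, not TraceML's code; `fail_open` and `record_step_time` are hypothetical names.

```python
import functools
import logging

log = logging.getLogger("instrumentation")


def fail_open(fn):
    """Run an instrumentation callback without letting its errors reach training."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Log and continue: losing a measurement is preferable to losing the run.
            log.warning("instrumentation error in %s; continuing", fn.__name__, exc_info=True)
            return None

    return wrapper


@fail_open
def record_step_time(step, seconds):
    raise RuntimeError("metrics sink unavailable")  # simulated failure


completed = []
for step in range(3):
    completed.append(step)  # stand-in for the real training step
    record_step_time(step, 0.1)  # fails every time, but the loop keeps going
print(completed)
```

Only the instrumentation call is wrapped, so genuine training errors still surface normally.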

Start with examples

If you want to see what TraceML is good at, start with example cases such as:

input / dataloader stall
DDP straggler / rank skew
memory drift over time

See the examples folder for runnable cases and expected output.

Feedback

If TraceML caught a slowdown for you, please open an issue and include:

hardware / CUDA / PyTorch versions
single GPU or DDP
whether you used core step tracing only or model hooks
the TraceML end-of-run summary
a minimal repro if possible

Useful bug reports, slowdown cases, and integration feedback are especially valuable right now.

📧 Email: abhinav@traceopt.ai
📋 User Survey: https://forms.gle/KwPSLaPmJnJjoVXSA

Contributing

Contributions are welcome.

Examples, reproducible slowdown cases, integration feedback, and bug reports are especially helpful.

License

TraceML is released under the Apache 2.0 license. See LICENSE for details.

Project details

Release history

Version	Release date
0.3.0	May 26, 2026
0.2.15	May 19, 2026
0.2.14	May 7, 2026
0.2.13	Apr 30, 2026
0.2.12	Apr 27, 2026
0.2.11	Apr 23, 2026
0.2.10	Apr 22, 2026
0.2.9	Apr 17, 2026
0.2.8	Apr 13, 2026
0.2.7	Apr 7, 2026
0.2.6	Apr 4, 2026
0.2.5	Mar 20, 2026
0.2.4 (this version)	Mar 15, 2026
0.2.3	Mar 7, 2026
0.2.2	Feb 28, 2026
0.2.1	Feb 26, 2026
0.2.0	Feb 9, 2026
0.2.0a0 (pre-release)	Jan 27, 2026
0.1.9	Jan 3, 2026
0.1.8	Dec 25, 2025
0.1.6	Dec 11, 2025
0.1.5	Dec 10, 2025
0.1.3	Oct 8, 2025
0.1.1	Oct 2, 2025
0.1.0	Oct 2, 2025

Download files

Download the file for your platform.

Source Distribution

traceml_ai-0.2.4.tar.gz (166.5 kB)

Uploaded Mar 15, 2026 Source

Built Distribution

traceml_ai-0.2.4-py3-none-any.whl (227.5 kB)

Uploaded Mar 15, 2026 Python 3

File details

Details for the file traceml_ai-0.2.4.tar.gz.

File metadata

Download URL: traceml_ai-0.2.4.tar.gz
Upload date: Mar 15, 2026
Size: 166.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`18f72e0d2af666c913b2a160366ccea00e69599cb94930f7cd6b4f90f85eaf59`
MD5	`a4ce5a19bb86751fab1dc8d755c6d4a1`
BLAKE2b-256	`50709453e114d141018453fe004b29bd3b8759ca5ba5a5030d09a2cf09c5efcc`


File details

Details for the file traceml_ai-0.2.4-py3-none-any.whl.

File metadata

Download URL: traceml_ai-0.2.4-py3-none-any.whl
Upload date: Mar 15, 2026
Size: 227.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0d5c4e1ad1d7b7cf7f91f81963111a2187ad2e1aab898e2262f78c88cde66559`
MD5	`2112d1c8c8286e229e9b75eadfaf2a57`
BLAKE2b-256	`ae38be51e53ac1f4faee4af1e2a68eae2689cc720c465683a8f5cb6d812e64a8`
