TraceML: Lightweight training runtime health monitor.

These details have not been verified by PyPI

TraceML

Find why training is slow, while it is still running.

TraceML is a lightweight bottleneck finder for PyTorch training. It helps you catch:

input stalls
unstable or drifting step times
DDP rank stragglers
memory creep over time

without jumping straight to heavyweight profiling.
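To make "memory creep" concrete, the sketch below fits a straight line to a series of memory samples and reports the slope in MB per step. This is only an illustration of the idea, not TraceML's implementation; the function name `memory_slope` and the sample values are hypothetical.

```python
from statistics import mean


def memory_slope(samples_mb):
    """Least-squares slope of memory samples, in MB per step.

    Hypothetical helper for illustration; not part of TraceML.
    """
    n = len(samples_mb)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples_mb)
    # Standard simple linear regression: cov(x, y) / var(x).
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples_mb))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Made-up samples: one run that hovers around 1000 MB, one that grows steadily.
steady = [1000.0, 1002.0, 999.0, 1001.0, 1000.0, 998.0]
creeping = [1000.0 + 5.0 * i for i in range(6)]

print(f"steady slope:   {memory_slope(steady):+.2f} MB/step")
print(f"creeping slope: {memory_slope(creeping):+.2f} MB/step")
```

A slope that stays clearly positive over many steps suggests growth rather than noise. What counts as "clearly positive" depends on the workload.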

The gap it fills: system dashboards show utilization over time. TraceML shows what happens during training steps and, in distributed settings, which rank is slowing the run down.

Works today: single GPU, single-node DDP/FSDP

Not yet: multi-node, tensor parallelism (TP), pipeline parallelism (PP)

With minimal setup, you can observe system and process behavior during training:

pip install traceml-ai
traceml watch train.py

When to use TraceML

Use it when training feels:

slower than expected
jittery from step to step
imbalanced across distributed ranks
stable in dashboards but still underperforming

Start with TraceML when you need a fast answer in the terminal. Reach for torch.profiler once you know where to dig.
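One way to put numbers on "jittery" and "drifting" is sketched below, assuming you already have a list of per-step wall-clock times. The metric definitions (coefficient of variation for jitter, ratio of late to early median for drift) and the name `step_time_stats` are assumptions made for this example; TraceML may define these signals differently.

```python
from statistics import mean, median, pstdev


def step_time_stats(step_times_s):
    """Summarize step-time jitter and drift (hypothetical helper).

    jitter_cv:   population std dev divided by mean (0.0 means perfectly steady)
    drift_ratio: median of the second half divided by median of the first half
                 (above 1.0 means steps got slower over the run)
    """
    if len(step_times_s) < 2:
        raise ValueError("need at least two step times")
    half = len(step_times_s) // 2
    early, late = step_times_s[:half], step_times_s[half:]
    return {
        "jitter_cv": pstdev(step_times_s) / mean(step_times_s),
        "drift_ratio": median(late) / median(early),
    }


# Made-up step times in seconds: the second half is visibly slower.
times = [0.10, 0.11, 0.10, 0.12, 0.15, 0.16, 0.15, 0.17]
stats = step_time_stats(times)
print(f"jitter (CV): {stats['jitter_cv']:.3f}")
print(f"drift ratio: {stats['drift_ratio']:.3f}")
```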

Quick start

Zero-code first look

traceml watch train.py

Use watch for a zero-code live view of system and process behavior while training is running.

Step-aware bottleneck diagnosis

Wrap your training step to see where time goes:

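from traceml.decorators import trace_step

# model, criterion, optimizer, and dataloader are defined earlier in train.py
for batch in dataloader:
    with trace_step(model):  # everything inside this block counts as one training step
        outputs = model(batch["x"])            # forward pass
        loss = criterion(outputs, batch["y"])  # loss computation
        loss.backward()                        # backward pass
        optimizer.step()                       # parameter update
        optimizer.zero_grad(set_to_none=True)  # clear gradients for the next step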

Run through TraceML:

traceml run train.py
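For intuition about what "where time goes" separates, the sketch below hand-times a loop with `time.perf_counter`, splitting each iteration into time spent waiting for the next batch and time spent in the step itself. It uses `time.sleep` stand-ins rather than a real model, every name in it (`slow_batches`, `fake_train_step`, `time_loop`) is hypothetical, and it says nothing about how TraceML measures these phases internally.

```python
import time


def slow_batches(n, wait_s):
    """Yield n dummy batches, sleeping before each to mimic a slow input pipeline."""
    for i in range(n):
        time.sleep(wait_s)
        yield i


def fake_train_step(batch, compute_s):
    """Stand-in for forward + backward + optimizer work."""
    time.sleep(compute_s)


def time_loop(batches, compute_s):
    """Return (total input wait, total compute) in seconds."""
    input_wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time blocked here is input wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        fake_train_step(batch, compute_s)
        t2 = time.perf_counter()
        input_wait += t1 - t0
        compute += t2 - t1
    return input_wait, compute


input_wait, compute = time_loop(slow_batches(5, 0.02), compute_s=0.005)
print(f"input wait: {input_wait:.3f}s  compute: {compute:.3f}s")
```

In this toy run the loop is input-bound, because most of each iteration is spent waiting for data. On a GPU, hand-timing like this is less reliable: CUDA kernels launch asynchronously, so wall-clock timestamps need explicit synchronization to be meaningful.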

During training, TraceML opens a live CLI view alongside your logs.

[Image: TraceML terminal dashboard]

At the end of the run, it prints a compact summary.

[Image: TraceML summary]

TraceML also includes a local UI. See docs/quickstart.md for setup details.

Run modes

`traceml watch train.py`

Zero-code live visibility for system and process behavior.

`traceml run train.py`

Default mode for live bottleneck diagnosis.

`traceml deep train.py`

Adds per-layer timing and memory signals for deeper inspection (experimental).

Start with watch for fast visibility. Use run when you need step-aware diagnosis. Use deep only when you need layer-level root cause.

What TraceML shows

CPU / RAM / GPU signals
step time and its breakdown
dataloader / input wait
forward / backward / optimizer / overhead timing
step jitter and drift
GPU memory trend
in distributed settings: worst-rank vs median-rank timing and skew

This helps you tell whether the slowdown is coming from input, compute, optimizer work, or rank imbalance.
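The worst-rank vs median-rank comparison can be sketched in a few lines, assuming you have one step time per rank. The skew definition used here, (worst - median) / median, and the name `rank_skew` are assumptions for illustration; the exact formula TraceML reports is not specified above.

```python
from statistics import median


def rank_skew(step_time_by_rank):
    """Compare the slowest rank against the median rank (hypothetical helper)."""
    worst_rank = max(step_time_by_rank, key=step_time_by_rank.get)
    worst = step_time_by_rank[worst_rank]
    med = median(step_time_by_rank.values())
    return {
        "worst_rank": worst_rank,
        "worst_s": worst,
        "median_s": med,
        "skew": (worst - med) / med,  # 0.0 means the worst rank matches the median
    }


# Made-up per-rank step times in seconds; rank 3 is the straggler.
result = rank_skew({0: 0.20, 1: 0.21, 2: 0.20, 3: 0.30})
print(f"worst rank: {result['worst_rank']}")
print(f"median: {result['median_s']:.3f}s  worst: {result['worst_s']:.3f}s")
print(f"skew: {result['skew']:.1%}")
```

Comparing against the median rather than the mean keeps a single straggler from pulling the baseline toward itself.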

Supported stacks

Standard PyTorch loop

Use trace_step(model) around your training step.

Hugging Face Trainer

from traceml.integrations.huggingface import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)

See docs/huggingface.md for the full setup.

PyTorch Lightning

import lightning as L
from traceml.integrations.lightning import TraceMLCallback

trainer = L.Trainer(callbacks=[TraceMLCallback()])

See docs/lightning.md for the full setup.

Optional model hooks (experimental)

from traceml.decorators import trace_model_instance

trace_model_instance(model)

Use this together with trace_step(model) when you want per-layer timing and memory signals. The core step-level view works without it.

Model hooks are experimental and may not work with torch.compile, especially with full-graph compilation.

Scope

TraceML is for lightweight diagnosis during real PyTorch training runs.

It is not:

a kernel-level tracer
an auto-tuner
a replacement for deep profilers
a full observability platform

Example cases

Start with examples such as:

basic example
input / dataloader stall
DDP straggler / rank skew

See Examples for runnable cases.

Feedback

If TraceML caught a slowdown for you, please open an issue and include:

hardware / CUDA / PyTorch versions
single or multi GPU
whether you used watch, run, or deep
whether you used core tracing only or model hooks
the end-of-run summary
a minimal repro if possible
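A small script along these lines can collect the version details for an issue. It is a convenience sketch, not something shipped with TraceML; the function name `environment_report` is hypothetical, and the PyTorch import is guarded so the script also runs where PyTorch is not installed.

```python
import platform


def environment_report():
    """Gather Python, OS, and (if available) PyTorch/CUDA details as a dict."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None on CPU-only builds
    report["gpu_count"] = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```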

📧 Email: support@traceopt.ai

📋 User Survey: https://forms.gle/KwPSLaPmJnJjoVXSA

Contributing

Contributions are welcome, especially:

reproducible slowdown cases
integrations
bug reports
examples

License

Apache 2.0. See LICENSE.

Release history

0.3.0 (May 26, 2026)
0.2.15 (May 19, 2026)
0.2.14 (May 7, 2026)
0.2.13 (Apr 30, 2026)
0.2.12 (Apr 27, 2026)
0.2.11 (Apr 23, 2026)
0.2.10 (Apr 22, 2026)
0.2.9 (Apr 17, 2026)
0.2.8 (Apr 13, 2026)
0.2.7 (Apr 7, 2026)
0.2.6 (Apr 4, 2026), this version
0.2.5 (Mar 20, 2026)
0.2.4 (Mar 15, 2026)
0.2.3 (Mar 7, 2026)
0.2.2 (Feb 28, 2026)
0.2.1 (Feb 26, 2026)
0.2.0 (Feb 9, 2026)
0.2.0a0 pre-release (Jan 27, 2026)
0.1.9 (Jan 3, 2026)
0.1.8 (Dec 25, 2025)
0.1.6 (Dec 11, 2025)
0.1.5 (Dec 10, 2025)
0.1.3 (Oct 8, 2025)
0.1.1 (Oct 2, 2025)
0.1.0 (Oct 2, 2025)
