TraceML: Lightweight training runtime health monitor.
TraceML is a lightweight bottleneck finder for PyTorch training. It helps you catch:
- input stalls
- unstable or drifting step times
- DDP rank stragglers
- memory creep over time
without jumping straight to heavyweight profiling.
The gap it fills: system dashboards show utilization over time. TraceML shows what happens during training steps and, in distributed settings, which rank is slowing the run down.
Works today: single GPU, single-node DDP/FSDP
Not yet: multi-node, tensor parallelism (TP), pipeline parallelism (PP)
With minimal setup, you can observe system and process behavior during training:
pip install traceml-ai
traceml watch train.py
When to use TraceML
Use it when training feels:
- slower than expected
- jittery from step to step
- imbalanced across distributed ranks
- stable in dashboards but still underperforming
Start with TraceML when you need a fast answer in the terminal. Reach for torch.profiler once you know where to dig.
Quick start
Zero-code first look
traceml watch train.py
Use watch for a zero-code live view of system and process behavior while training is running.
Step-aware bottleneck diagnosis
Wrap your training step to see where time goes:
from traceml.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
Run through TraceML:
traceml run train.py
During training, TraceML opens a live CLI view alongside your logs.
At the end of the run, it prints a compact summary.
TraceML also includes a local UI. See docs/quickstart.md for setup details.
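For intuition about what a step-level breakdown measures, the sketch below times named phases of a loop with a plain context manager. It is a conceptual illustration only and says nothing about how trace_step is implemented: `PhaseTimer` and the phase names are hypothetical, and the sleeps stand in for real work. On a GPU, wall-clock timing like this would also need device synchronization to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical helper: accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the phase body raises.
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each phase."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}


if __name__ == "__main__":
    timer = PhaseTimer()
    for _ in range(3):
        with timer.phase("input_wait"):
            time.sleep(0.02)   # stand-in for a slow dataloader
        with timer.phase("compute"):
            time.sleep(0.005)  # stand-in for forward/backward/optimizer
    print(timer.shares())
```

If one phase dominates the shares (here, the simulated input wait), that is the place to look first.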
Run modes
traceml watch train.py
Zero-code live visibility for system and process behavior.
traceml run train.py
Default mode for live bottleneck diagnosis.
traceml deep train.py
Adds per-layer timing and memory signals for deeper inspection (experimental).
Start with watch for fast visibility. Use run when you need step-aware diagnosis. Use deep only when you need layer-level root cause.
What TraceML shows
- CPU / RAM / GPU signals
- step time and its breakdown
- dataloader / input wait
- forward / backward / optimizer / overhead timing
- step jitter and drift
- GPU memory trend
- in distributed settings: worst-rank vs median-rank timing and skew
This helps you tell whether the slowdown is coming from input, compute, optimizer work, or rank imbalance.
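To make these signals concrete, here is a minimal sketch of how jitter, drift, rank skew, and a memory trend could be computed from recorded values. This is not TraceML's implementation: the function names and the exact definitions (coefficient of variation, half-vs-half median, least-squares slope) are assumptions chosen for clarity.

```python
from statistics import mean, median, pstdev


def step_jitter(step_times):
    """Coefficient of variation of step times (assumed definition)."""
    avg = mean(step_times)
    return pstdev(step_times) / avg if avg > 0 else 0.0


def step_drift(step_times):
    """Relative change in median step time, second half vs first half."""
    half = len(step_times) // 2
    if half == 0:
        return 0.0
    early = median(step_times[:half])
    late = median(step_times[half:])
    return (late - early) / early if early > 0 else 0.0


def rank_skew(per_rank_times):
    """Return (worst_rank, relative gap between worst and median rank)."""
    worst_rank = max(per_rank_times, key=per_rank_times.get)
    med = median(per_rank_times.values())
    gap = per_rank_times[worst_rank] - med
    return worst_rank, (gap / med if med > 0 else 0.0)


def memory_slope(samples):
    """Least-squares slope of memory samples, in units per step."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = mean(samples)
    numer = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denom = sum((x - mean_x) ** 2 for x in range(n))
    return numer / denom


if __name__ == "__main__":
    times = [0.10, 0.10, 0.10, 0.10, 0.12, 0.12, 0.12, 0.12]
    print("jitter:", step_jitter(times))
    print("drift:", step_drift(times))
    print("skew:", rank_skew({0: 0.10, 1: 0.10, 2: 0.10, 3: 0.15}))
    print("memory slope:", memory_slope([100, 102, 104, 106]))
```

A persistently positive memory slope suggests creep, while a large gap between the worst and median rank points to a straggler rather than a uniformly slow step.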
Supported stacks
Standard PyTorch loop
Use trace_step(model) around your training step.
Hugging Face Trainer
from traceml.integrations.huggingface import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)
See docs/huggingface.md for the full setup.
PyTorch Lightning
import lightning as L
from traceml.integrations.lightning import TraceMLCallback
trainer = L.Trainer(callbacks=[TraceMLCallback()])
See docs/lightning.md for the full setup.
Optional model hooks (experimental)
from traceml.decorators import trace_model_instance
trace_model_instance(model)
Use this with trace_step(model) when you want optional per-layer timing and memory signals.
This is experimental and may not work with torch.compile, especially with full-graph compilation. The core step-level view works without model hooks.
Scope
TraceML is for lightweight diagnosis during real PyTorch training runs.
It is not:
- a kernel-level tracer
- an auto-tuner
- a replacement for deep profilers
- a full observability platform
Example cases
Start with examples such as:
- basic example
- input / dataloader stall
- DDP straggler / rank skew
See Examples for runnable cases.
Feedback
If TraceML caught a slowdown for you, please open an issue and include:
- hardware / CUDA / PyTorch versions
- single or multi GPU
- whether you used watch, run, or deep
- whether you used core tracing only or model hooks
- the end-of-run summary
- a minimal repro if possible
📧 Email: support@traceopt.ai
📋 User Survey: https://forms.gle/KwPSLaPmJnJjoVXSA
Contributing
Contributions are welcome, especially:
- reproducible slowdown cases
- integrations
- bug reports
- examples
License
Apache 2.0. See LICENSE.