TraceML: Lightweight training runtime health monitor.
TraceML
Catch wasted GPU time during live PyTorch training
TraceML is a lightweight bottleneck finder for PyTorch training. It helps you catch input stalls, DDP rank imbalance, unstable step times, and memory drift while the run is still in progress.
Works today: Single GPU, single-node DDP, Hugging Face Trainer, PyTorch Lightning
Not yet: Multi-node DDP, FSDP / TP / PP
Why TraceML
When training feels slow, a wall-clock timer tells you that it is slow. TraceML helps show where the time is going and what looks wrong while the job is still running.
Use it to answer:
- Is the input pipeline starving the GPU?
- Are step times drifting or jittering?
- Is one DDP rank lagging behind the others?
- Is memory creeping up over time?
- How much time is going into forward, backward, optimizer, and overhead?
TraceML is designed for real runs, not only postmortem profiling.
What TraceML gives you
Live during training
- step-time breakdown
- dataloader / input wait visibility
- forward / backward / optimizer / overhead timing
- step jitter and drift
- GPU memory trend
- CPU / RAM / GPU signals
At the end of the run
- a compact summary you can review quickly
- something easy to paste into an issue or share with a teammate
- a clearer starting point before using heavier profilers
Quick Start
Install:
pip install traceml-ai
Wrap your training step:
from traceml.decorators import trace_step
for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
Run your script through TraceML:
traceml run train.py
During training, TraceML opens a live terminal view alongside your logs.
At run end, it prints a compact summary.
If you want a richer view, TraceML also includes a local UI for reviewing and comparing runs.
See docs/quickstart.md for more setup details.
Why not just use timers?
Simple timers are useful, but they usually do not show:
- which part of the training step is growing
- whether the slowdown is coming from input, compute, optimizer, or overhead
- whether one DDP rank is slower than the others
- whether memory is drifting over time
- what the run looked like before it fully finished
TraceML is built to make those patterns visible with minimal code changes.
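For comparison, here is a minimal sketch of what a hand-rolled timer typically looks like. It is plain Python, not TraceML code; `timed_epoch` and `slow_batches` are hypothetical names invented for this illustration. It separates input wait from compute, but it lumps forward, backward, and optimizer together and knows nothing about ranks or memory.

```python
import time


def timed_epoch(batches, train_step):
    """Split wall-clock time into input wait and compute for one pass."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # forward + backward + optimizer, all lumped together
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def slow_batches(n, delay_s):
    """Hypothetical stand-in for a dataloader that stalls on every batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


# A no-op training step makes the input stall dominate the measurement.
wait_s, compute_s = timed_epoch(slow_batches(5, 0.01), lambda batch: None)
print(f"input wait: {wait_s:.3f}s, compute: {compute_s:.3f}s")
```

On a GPU, CUDA kernels launch asynchronously, so a host-side timer like this one can under-report compute unless the device is synchronized first.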
Works with your training stack
Plain PyTorch
Use trace_step(model) around your training step.
Hugging Face Trainer
Replace Trainer with TraceMLTrainer:
from traceml.hf_decorators import TraceMLTrainer
trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)
See docs/huggingface.md.
PyTorch Lightning
Add TraceMLCallback() to your trainer:
import lightning as L
from traceml.utils.lightning import TraceMLCallback
trainer = L.Trainer(callbacks=[TraceMLCallback()])
See the Lightning docs for the full setup.
What TraceML surfaces
Step-level breakdown
TraceML tracks:
- dataloader -> forward -> backward -> optimizer -> overhead
- step time
- GPU memory (allocated + peak)
- CPU / RAM / GPU signals
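To make "jitter" and "drift" in step time concrete, the sketch below computes both from a list of per-step durations. The definitions are assumptions chosen for illustration (coefficient of variation for jitter, late-window over early-window median for drift); TraceML's own metrics may be defined differently, and `jitter_and_drift` is a hypothetical helper.

```python
import statistics


def jitter_and_drift(step_times_s, window=20):
    """Return (jitter, drift) for a list of per-step durations in seconds.

    jitter: coefficient of variation (population stdev / mean) over all steps.
    drift:  median of the last `window` steps divided by the median of the
            first `window` steps; values above 1.0 mean steps got slower.
    These definitions are assumptions for this sketch.
    """
    if len(step_times_s) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    mean = statistics.fmean(step_times_s)
    jitter = statistics.pstdev(step_times_s) / mean
    early = statistics.median(step_times_s[:window])
    late = statistics.median(step_times_s[-window:])
    return jitter, late / early


# Synthetic data: one steady run, one run that slows by 1 ms per step.
steady = [0.100] * 40
slowing = [0.100 + 0.001 * i for i in range(40)]

steady_jitter, steady_drift = jitter_and_drift(steady)
slow_jitter, slow_drift = jitter_and_drift(slowing)
print(f"steady:  jitter={steady_jitter:.3f} drift={steady_drift:.3f}")
print(f"slowing: jitter={slow_jitter:.3f} drift={slow_drift:.3f}")
```

Medians are used for the drift windows so that a single slow step, such as a checkpoint save, does not register as drift.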
DDP imbalance
In single-node DDP, TraceML surfaces:
- median rank
- worst rank
- skew (%)
This makes stragglers easier to spot without extra instrumentation.
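As a rough illustration of how such a summary can be derived, the sketch below takes one step time per rank and reports the median, the worst rank, and the skew. The skew formula (worst relative to median) and the helper name `rank_skew` are assumptions for this example, not TraceML's documented definitions.

```python
import statistics


def rank_skew(step_time_by_rank):
    """Summarize per-rank step times (seconds): median, worst rank, skew %.

    Skew is assumed here to be how much slower the worst rank is than the
    median, as a percentage of the median.
    """
    median_time = statistics.median(step_time_by_rank.values())
    worst_rank = max(step_time_by_rank, key=step_time_by_rank.get)
    worst_time = step_time_by_rank[worst_rank]
    skew_pct = 100.0 * (worst_time - median_time) / median_time
    return {
        "median_time_s": median_time,
        "worst_rank": worst_rank,
        "worst_time_s": worst_time,
        "skew_pct": skew_pct,
    }


# Synthetic example: rank 3 is a straggler.
summary = rank_skew({0: 0.20, 1: 0.21, 2: 0.20, 3: 0.30})
print(summary)
```

In synchronous DDP every rank waits for the slowest one at gradient all-reduce, so the worst rank effectively sets the step time for the whole job.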
Optional model-level hooks
If you want extra model-level context, enable lightweight hooks:
from traceml.decorators import trace_model_instance
trace_model_instance(model)
Use this together with trace_step(model) to add optional per-layer timing and memory signals.
The core step-level view works without it.
Scope
TraceML focuses on lightweight diagnosis during real PyTorch training runs.
It is not:
- a kernel-level tracer
- an auto-tuner
- a replacement for deep profiling tools
- a full observability platform
Safe to try on real runs
TraceML is built for practical training workflows:
- lightweight enough to use during real runs
- compact terminal output during training
- end-of-run summary for quick review and sharing
- fail-open behavior so instrumentation does not become the center of your training script
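The general idea behind fail-open instrumentation can be sketched as a context manager that swallows errors raised by its own hooks but never hides errors from the training code. This is an illustration of the pattern only; `FailOpenStep` and `broken_hook` are hypothetical names, and TraceML's internals may work differently.

```python
import logging

log = logging.getLogger("fail_open_demo")


class FailOpenStep:
    """Run instrumentation hooks around a step without letting them break it."""

    def __init__(self, on_start, on_end):
        self._on_start = on_start
        self._on_end = on_end

    def __enter__(self):
        try:
            self._on_start()
        except Exception as err:  # instrumentation failure: log and continue
            log.warning("instrumentation start failed: %s", err)
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._on_end()
        except Exception as err:  # instrumentation failure: log and continue
            log.warning("instrumentation end failed: %s", err)
        return False  # never suppress exceptions raised by the training step


def broken_hook():
    """Hypothetical hook that always fails, e.g. a collector that is down."""
    raise RuntimeError("collector unavailable")


# Training still completes even though every hook call raises.
steps_done = 0
for _ in range(3):
    with FailOpenStep(broken_hook, broken_hook):
        steps_done += 1
print(f"completed {steps_done} steps despite broken instrumentation")
```

Returning `False` from `__exit__` matters: a real failure in the training step, such as a bad batch, still propagates to the caller.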
Start with examples
If you want to see what TraceML is good at, start with example cases such as:
- input / dataloader stall
- DDP straggler / rank skew
- memory drift over time
See the examples folder for runnable cases and expected output.
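For intuition about the memory-drift case, the sketch below fits a least-squares line to memory samples and reports the slope in MB per step. It is a generic illustration with synthetic numbers, not how TraceML detects drift; `memory_slope` is a hypothetical helper. In a PyTorch run the samples could come from `torch.cuda.memory_allocated()`.

```python
def memory_slope(samples_mb):
    """Least-squares slope of memory (MB) against step index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Synthetic samples: one run that fluctuates around a level, one that leaks.
noisy_flat = [1024.0, 1030.0, 1026.0, 1026.0, 1030.0, 1024.0]
leaky = [1024.0 + 2.0 * i for i in range(6)]

flat_slope = memory_slope(noisy_flat)
leaky_slope = memory_slope(leaky)
print(f"flat:  {flat_slope:+.2f} MB/step")
print(f"leaky: {leaky_slope:+.2f} MB/step")
```

A slope that stays positive over many steps is the signal to look for; short-lived spikes from allocator caching tend to average out in the fit.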
Feedback
If TraceML caught a slowdown for you, please open an issue and include:
- hardware / CUDA / PyTorch versions
- single GPU or DDP
- whether you used core step tracing only or model hooks
- the TraceML end-of-run summary
- a minimal repro if possible
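One way to gather the version details for a report is a small script like the sketch below. It is a convenience example, not part of TraceML; `environment_report` is a hypothetical helper, and it degrades gracefully when PyTorch is not installed.

```python
import platform


def environment_report():
    """Collect Python, PyTorch, CUDA, and GPU details for a bug report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is None on builds without CUDA support.
    info["cuda"] = torch.version.cuda or "none"
    if torch.cuda.is_available():
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    else:
        info["gpus"] = []
    return info


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```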
Useful bug reports, slowdown cases, and integration feedback are especially valuable right now.
- 📧 Email: abhinav@traceopt.ai
- 📋 User Survey: https://forms.gle/KwPSLaPmJnJjoVXSA
Contributing
Contributions are welcome.
Examples, reproducible slowdown cases, integration feedback, and bug reports are especially helpful.
License
TraceML is released under the Apache 2.0 license. See LICENSE for details.
File details
Details for the file traceml_ai-0.2.4.tar.gz.
File metadata
- Download URL: traceml_ai-0.2.4.tar.gz
- Upload date:
- Size: 166.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 18f72e0d2af666c913b2a160366ccea00e69599cb94930f7cd6b4f90f85eaf59 |
| MD5 | a4ce5a19bb86751fab1dc8d755c6d4a1 |
| BLAKE2b-256 | 50709453e114d141018453fe004b29bd3b8759ca5ba5a5030d09a2cf09c5efcc |
File details
Details for the file traceml_ai-0.2.4-py3-none-any.whl.
File metadata
- Download URL: traceml_ai-0.2.4-py3-none-any.whl
- Upload date:
- Size: 227.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0d5c4e1ad1d7b7cf7f91f81963111a2187ad2e1aab898e2262f78c88cde66559 |
| MD5 | 2112d1c8c8286e229e9b75eadfaf2a57 |
| BLAKE2b-256 | ae38be51e53ac1f4faee4af1e2a68eae2689cc720c465683a8f5cb6d812e64a8 |