TraceML: Lightweight training runtime health monitor.
TraceML
Find why PyTorch training is slow while the job is still running.
TraceML catches the following problems without requiring a heavyweight profiler:
- input bottlenecks
- compute-bound steps
- DDP stragglers
- wait-heavy training
- memory creep over time
Why this exists: dashboards show utilization and curves. TraceML shows why throughput is poor inside the training step.
The fastest way to try it
Install:
pip install traceml-ai
Wrap your training step:
import traceml

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
Run:
traceml run train.py
During training, TraceML opens a live terminal view alongside your logs.
At the end of the run, it prints a compact summary you can review or share.
If you want a low-noise run and a structured summary you can log into W&B or
MLflow, launch in summary mode and call traceml.final_summary() near the end
of your script:
traceml run train.py --mode=summary
For full setup details, see docs/quickstart.md.
Not sure how to interpret the output? Read How to Read TraceML Output.
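If you forward the final summary to an experiment tracker, keep in mind that most trackers expect flat scalar metrics. The sketch below assumes, hypothetically, that the summary is (or can be converted to) a nested dictionary. This README does not document the return type of traceml.final_summary(), and the keys in example_summary are invented for illustration. The helper flatten_metrics is also hypothetical; it joins nested keys with slashes and keeps only numeric values.

```python
from numbers import Real


def flatten_metrics(summary, prefix="traceml"):
    """Flatten a nested dict into {"prefix/a/b": number} pairs.

    Non-numeric leaves (strings, None, lists) are skipped because
    tracker metric APIs generally expect scalar numbers.
    """
    flat = {}
    for key, value in summary.items():
        name = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, prefix=name))
        elif isinstance(value, Real) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


# Hypothetical summary shape, invented for this example only.
example_summary = {
    "step_time_ms": {"median": 212.5, "p95": 260.0},
    "dataloader_share": 0.41,
    "verdict": "input-bound",
}

metrics = flatten_metrics(example_summary)
print(metrics)
```

A flat dictionary like this can be passed to calls such as wandb.log(metrics) or mlflow.log_metrics(metrics). Check docs/quickstart.md for the actual summary format before relying on this shape.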
What TraceML tells you
TraceML helps answer questions like:
- Is training input-bound or compute-bound?
- Is one DDP rank slower than the others?
- Is the job wait-heavy because of uneven progress?
- Is memory drifting upward over time?
- Is the slowdown coming from dataloader, forward, backward, or optimizer work?
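To make the input-bound versus compute-bound question concrete, the sketch below splits wall time into time spent waiting for the next batch and time spent in the step body. It is an illustration of the idea only, not a description of how TraceML measures anything. The names timed_steps, slow_loader, and fast_step are hypothetical, and the sleeps stand in for real data loading and training work.

```python
import time


def timed_steps(batches, step_fn):
    """Return (seconds waiting for batches, seconds running step_fn)."""
    wait_s = 0.0
    compute_s = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward, backward, optimizer in a real loop
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def slow_loader(n):
    """Stand-in for a dataloader that takes about 20 ms per batch."""
    for i in range(n):
        time.sleep(0.02)
        yield i


def fast_step(batch):
    """Stand-in for a training step that takes about 2 ms."""
    time.sleep(0.002)


wait_s, compute_s = timed_steps(slow_loader(5), fast_step)
input_share = wait_s / (wait_s + compute_s)
print(f"input share of step time: {input_share:.0%}")
```

When most of the step time is spent waiting for data, as here, the job is input-bound. On a GPU, CUDA kernels run asynchronously, so host-side timing of the step body needs extra care (for example torch.cuda.synchronize()), which is one reason to use a dedicated tool rather than hand-rolled timers.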
When to use TraceML
Use TraceML when training feels:
- slower than expected
- unstable from step to step
- imbalanced across distributed ranks
- fine in dashboards but still underperforming
Start with TraceML when you need a fast answer in the terminal.
Reach for torch.profiler once you know where to dig deeper.
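The rank-imbalance case can be illustrated with a toy calculation. In synchronous data-parallel training, every rank waits at gradient synchronization for the slowest one, so a single slow rank sets the pace for the whole job. The sketch assumes you already have per-rank compute times for a few steps (the values here are invented). The function find_straggler and the 1.2 threshold are hypothetical choices, not TraceML's method.

```python
from statistics import median


def find_straggler(step_ms_by_rank, threshold=1.2):
    """Return (rank, ratio) for the slowest rank if its median step time
    is at least `threshold` times the median across ranks, else None."""
    per_rank = {rank: median(times) for rank, times in step_ms_by_rank.items()}
    typical = median(per_rank.values())
    slowest = max(per_rank, key=per_rank.get)
    ratio = per_rank[slowest] / typical
    return (slowest, ratio) if ratio >= threshold else None


# Invented per-rank compute times in milliseconds for four ranks.
step_ms_by_rank = {
    0: [100, 102, 101],
    1: [99, 101, 100],
    2: [150, 148, 152],
    3: [101, 100, 103],
}

result = find_straggler(step_ms_by_rank)
print(result)
```

Here rank 2 is flagged because its median step time is well above the typical rank. Comparing medians rather than means keeps a single slow step from triggering a false alarm.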
How it fits with your stack
TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.
Use those for:
- experiment tracking
- artifacts
- dashboards
- team reporting
Use TraceML for:
- bottleneck diagnosis
- rank imbalance / straggler detection
- memory trend debugging
- structured final summaries you can forward into W&B or MLflow
See Use TraceML with W&B / MLflow.
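Memory trend debugging, listed above, comes down to asking whether usage has a persistent upward slope rather than just noise. A minimal sketch of that idea uses a least-squares slope over per-step memory samples. The sample values and the function memory_slope_mb_per_step are assumptions made for illustration, not TraceML's method; in a real PyTorch job the samples might come from torch.cuda.memory_allocated().

```python
def memory_slope_mb_per_step(samples_mb):
    """Least-squares slope of memory (MB) against step index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_mb))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Invented samples: a flat run and a run that grows by 2 MB per step.
steady = [1000, 1002, 999, 1001, 1000, 1001]
creeping = [1000 + 2 * step for step in range(6)]

steady_slope = memory_slope_mb_per_step(steady)
creeping_slope = memory_slope_mb_per_step(creeping)
print(f"steady: {steady_slope:.3f} MB/step, creeping: {creeping_slope:.3f} MB/step")
```

A slope near zero suggests stable memory, while a consistently positive slope over many steps points to creep, for example tensors accumulated in a Python list across iterations.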
Current support
Works today:
- single GPU
- single-node DDP/FSDP
Not yet:
- multi-node
- tensor parallel
- pipeline parallel
Next steps
- Quickstart
- Examples
- How to Read TraceML Output
- FAQ
- Use TraceML with W&B / MLflow
- Hugging Face integration: docs/huggingface.md
- PyTorch Lightning integration: docs/lightning.md
Feedback
If TraceML helped you find a slowdown, please open an issue and include:
- hardware / CUDA / PyTorch versions
- single GPU or multi-GPU
- whether you used run, watch, or deep
- the end-of-run summary
- a minimal repro if possible
GitHub issues: https://github.com/traceopt-ai/traceml/issues
Email: support@traceopt.ai
Contributing
Contributions are welcome, especially:
- reproducible slowdown cases
- bug reports
- docs improvements
- integrations
- examples
License
Apache 2.0. See LICENSE.