
TraceML: Lightweight training runtime health monitor.


TraceML

Know what’s slowing your (PyTorch) training, while it runs


TraceML provides step-level training visibility for PyTorch workloads. It shows where time and memory go inside each training step so you can quickly understand performance behavior across single-GPU and single-node DDP runs.

Current support

  • ✅ Single GPU
  • ✅ Single-node multi-GPU (DDP)
  • ❌ Multi-node DDP (not yet)
  • ❌ FSDP / TP / PP (not yet)

What You See in Minutes

  • System signals (CPU, RAM, GPU)
  • Breakdown of each training step:
    • dataloader → forward → backward → optimizer → overhead
  • Median vs. worst rank (for DDP runs)
  • Skew (%) to surface imbalance
  • GPU memory (allocated + peak)

Healthy runs look visibly stable. Unstable runs reveal drift, imbalance, or memory creep early.
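To make the step breakdown concrete, here is a minimal sketch of how a step can be split into named phases, with "overhead" taken as whatever time is left over. It is an illustration only, not TraceML's implementation: `StepTimer` and its methods are hypothetical names, the `time.sleep` calls stand in for real work, and plain wall-clock timing like this does not account for asynchronous GPU kernels.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Hypothetical helper: accumulates wall-clock time per named phase of one step."""

    def __init__(self):
        self.phases = {}
        self._start = time.perf_counter()

    @contextmanager
    def phase(self, name):
        # Time the body of the `with` block and add it to the named phase.
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - t0
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def breakdown(self):
        # Overhead is the part of the step not covered by any named phase.
        total = time.perf_counter() - self._start
        measured = sum(self.phases.values())
        result = dict(self.phases)
        result["overhead"] = max(total - measured, 0.0)
        result["total"] = total
        return result


timer = StepTimer()
with timer.phase("dataloader"):
    time.sleep(0.01)   # stand-in for fetching a batch
with timer.phase("forward"):
    time.sleep(0.02)   # stand-in for the forward pass
with timer.phase("backward"):
    time.sleep(0.02)   # stand-in for the backward pass
with timer.phase("optimizer"):
    time.sleep(0.005)  # stand-in for optimizer.step()

report = timer.breakdown()
for name, seconds in report.items():
    print(f"{name:>10}: {seconds * 1000:.1f} ms")
```

Treating overhead as a remainder means the phases always sum to the total step time, so any time spent outside the named phases (logging, Python bookkeeping) still shows up somewhere.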


Quick Start

Install:

pip install traceml-ai

Wrap your training step:

from traceml.decorators import trace_step

for batch in dataloader:
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Run with the CLI:

traceml run train.py

The terminal dashboard opens alongside your logs.

Optional web UI:

traceml run train.py --mode=dashboard



What TraceML Surfaces

Step-Level Signals

  • Dataloader fetch time
  • Step time (low-overhead, GPU-aware)
  • Step GPU memory (allocated + peak)

Across ranks:

  • Median (typical behavior)
  • Worst rank (slowest / highest memory)
  • Skew (% difference)

This makes rank imbalance and straggler behavior immediately visible.
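As a rough illustration of the three cross-rank numbers, the sketch below summarizes one metric across ranks. The README does not state the exact skew formula, so this assumes skew is the worst rank's excess over the median, expressed as a percentage of the median. `rank_summary` is a hypothetical helper, not part of the TraceML API.

```python
from statistics import median


def rank_summary(per_rank_values):
    """Summarize one metric (e.g. step time in ms) across DDP ranks.

    Assumption: skew = (worst - median) / median * 100.
    """
    if not per_rank_values:
        raise ValueError("need at least one rank")
    typical = median(per_rank_values)
    # Index of the rank with the largest value (slowest / highest memory).
    worst_rank = max(range(len(per_rank_values)), key=per_rank_values.__getitem__)
    worst = per_rank_values[worst_rank]
    skew_pct = 0.0 if typical == 0 else (worst - typical) / typical * 100.0
    return {
        "median": typical,
        "worst": worst,
        "worst_rank": worst_rank,
        "skew_pct": skew_pct,
    }


# Four ranks; rank 3 is a straggler.
step_ms = [100.0, 102.0, 98.0, 150.0]
summary = rank_summary(step_ms)
print(summary)
```

With these sample values the median is 101 ms and the worst rank is rank 3 at 150 ms, giving a skew of roughly 48.5%. A healthy run would keep this figure close to zero.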


Deep-Dive Mode (Optional)

Enable model-level hooks for diagnostic context:

from traceml.decorators import trace_model_instance
trace_model_instance(model)

Use it together with trace_step(model) to enable:

  • Per-layer memory signals
  • Per-layer forward/backward timing
  • Lightweight failure attribution (experimental)

If deep-dive mode is not enabled, ESSENTIAL signals remain unchanged.


What It Is Not

  • Not a replacement for PyTorch Profiler or Nsight
  • Not an auto-tuner
  • Not a kernel-level tracer

TraceML focuses on step-level visibility that is practical during real training runs.


Supported Environments

  • Python 3.9-3.13
  • PyTorch 1.12+
  • macOS (Intel/ARM), Linux
  • Single GPU
  • Single-node DDP

Known limitation: with gradient accumulation enabled, step-level metrics may be unreliable, because a traced step may be a micro-step rather than a full optimizer step. A fix is in progress.
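The micro-step vs optimizer-step distinction can be sketched as follows. With an accumulation factor of N, each optimizer update spans N forward/backward passes, so per-micro-step timings need to be grouped before they describe a full update. This is an illustration under that assumption only: `optimizer_step_times` is a hypothetical helper and the timings are made up. It is not TraceML's API or its planned fix.

```python
def optimizer_step_times(micro_step_times, accumulation_steps):
    """Group micro-step durations into per-optimizer-step durations."""
    if accumulation_steps < 1:
        raise ValueError("accumulation_steps must be >= 1")
    grouped = []
    for start in range(0, len(micro_step_times), accumulation_steps):
        chunk = micro_step_times[start:start + accumulation_steps]
        # Ignore a trailing partial window that never reached optimizer.step().
        if len(chunk) == accumulation_steps:
            grouped.append(sum(chunk))
    return grouped


# Illustrative timings in ms: every 4th micro-step also runs optimizer.step().
micro = [50.0, 50.0, 50.0, 80.0, 50.0, 50.0, 50.0, 80.0]
per_update = optimizer_step_times(micro, accumulation_steps=4)
print(per_update)  # [230.0, 230.0]
```

Viewed per micro-step, the series alternates between 50 ms and 80 ms and looks unstable. Viewed per optimizer step, it is a steady 230 ms.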


Hugging Face Integration

TraceML integrates with Hugging Face transformers through TraceMLTrainer.

Usage

Replace transformers.Trainer with traceml.hf_decorators.TraceMLTrainer.

from traceml.hf_decorators import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)

Roadmap

Near-term:

  • Single-node DDP hardening
  • Disk run logging
  • Compatibility validation (gradient accumulation, torch.compile)
  • Accelerate / Lightning wrappers

Next:

  • Multi-node DDP
  • Initial FSDP support

Later:

  • Tensor / Pipeline parallel awareness


Contributing

Contributions are welcome.

When opening issues, include:

  • Minimal repro script
  • Hardware + CUDA + PyTorch versions
  • ESSENTIAL vs DEEP-DIVE
  • Single GPU vs DDP


Community & Support

Founding Engineer / Co-Founder track (Berlin/Germany): We are looking for a senior systems+ML builder to help grow TraceML into a sustainable AI infra product. See the GitHub Discussion: https://github.com/traceopt-ai/traceml/discussions/36

Stars help more teams find the project. 🌟


License

TraceML is released under the Apache 2.0 license.

See LICENSE for details.


Citation

If TraceML helps your research, please cite:

@software{traceml2024,
  author = {TraceOpt},
  title = {TraceML: Real-time Training Observability for PyTorch},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}

Made with ❤️ by TraceOpt

