TraceML: Lightweight training runtime health monitor.

TraceML

Know what’s slowing your (PyTorch) training, while it runs

TraceML provides step-level training visibility for PyTorch workloads. It shows where time and memory go inside each training step so you can quickly understand performance behavior across single-GPU and single-node DDP runs.

Current support

  • ✅ Single GPU
  • ✅ Single-node multi-GPU (DDP)
  • ❌ Multi-node DDP (not yet)
  • ❌ FSDP / TP / PP (not yet)

What You See in Minutes

  • System signals (CPU, RAM, GPU)
  • Breakdown of each training step:
    • dataloader → forward → backward → optimizer → overhead
  • Median vs worst rank (in case of DDP)
  • Skew (%) to surface imbalance
  • GPU memory (allocated + peak)

Healthy runs look stable from step to step. Unstable runs reveal drift, imbalance, or memory creep early.
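
To make the step breakdown concrete, here is a minimal sketch of how per-phase timings and an overhead bucket could be derived by hand. It is not TraceML's implementation: the `StepTimer` class and its `phase` and `breakdown` methods are hypothetical names, and the sketch assumes that overhead means the part of the step not covered by any named phase. It measures CPU wall-clock time only, so asynchronous GPU work would not be attributed correctly.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Hypothetical helper: times named phases inside one training step."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the timer can be tested
        self._phases = {}        # phase name -> accumulated seconds
        self._start = clock()    # the step starts when the timer is created

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self._phases[name] = self._phases.get(name, 0.0) + elapsed

    def breakdown(self):
        """Return per-phase seconds plus the unattributed remainder."""
        total = self._clock() - self._start
        result = dict(self._phases)
        # Assumption: overhead is whatever no named phase accounts for.
        result["overhead"] = max(0.0, total - sum(self._phases.values()))
        result["total"] = total
        return result


if __name__ == "__main__":
    timer = StepTimer()
    for name, seconds in [("dataloader", 0.02), ("forward", 0.03),
                          ("backward", 0.04), ("optimizer", 0.01)]:
        with timer.phase(name):
            time.sleep(seconds)  # stand-in for real work
    for name, seconds in timer.breakdown().items():
        print(f"{name:>10}: {seconds * 1000:.1f} ms")
```

A growing overhead share in a breakdown like this usually points at work outside the four named phases, such as logging or host-side synchronization.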


Quick Start

Install:

pip install traceml-ai

Wrap your training step:

from traceml.decorators import trace_step

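# model, dataloader, criterion, and optimizer are your existing training objects.
for batch in dataloader:
    with trace_step(model):  # wrap the whole step, from forward pass to zero_grad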
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Run with the CLI:

traceml run train.py

The terminal dashboard opens alongside your logs.

Optional web UI:

traceml run train.py --mode=dashboard


What TraceML Surfaces

Step-Level Signals

  • Dataloader fetch time
  • Step time (low-overhead, GPU-aware)
  • Step GPU memory (allocated + peak)

Across ranks:

  • Median (typical behavior)
  • Worst rank (slowest / highest memory)
  • Skew (% difference)

This makes rank imbalance and straggler behavior immediately visible.
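
As an illustration of the three cross-rank numbers, the sketch below summarizes one metric collected per rank. The function name `rank_summary` is hypothetical, the input values are made up, and the skew formula (the worst rank's excess over the median, as a percentage) is an assumption; TraceML's exact definition may differ.

```python
from statistics import median


def rank_summary(values_by_rank):
    """Summarize one per-rank metric, e.g. step time in milliseconds.

    Assumption: skew is the worst rank's excess over the median, in percent.
    """
    if not values_by_rank:
        raise ValueError("need at least one rank")
    med = median(values_by_rank.values())
    # "Worst" means the largest value: slowest step or highest memory.
    worst_rank = max(values_by_rank, key=values_by_rank.get)
    worst = values_by_rank[worst_rank]
    skew_pct = 0.0 if med == 0 else (worst - med) / med * 100.0
    return {
        "median": med,
        "worst": worst,
        "worst_rank": worst_rank,
        "skew_pct": skew_pct,
    }


# Made-up step times in ms; rank 3 is a straggler.
step_ms = {0: 100.0, 1: 102.0, 2: 98.0, 3: 150.0}
summary = rank_summary(step_ms)
print(summary)
```

With these sample values the median stays close to the three well-behaved ranks, while the skew exposes rank 3 as roughly 48.5% slower than typical.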


Deep-Dive Mode (Optional)

Enable model-level hooks for diagnostic context:

from traceml.decorators import trace_model_instance
trace_model_instance(model)

Use together with trace_step(model) to enable:

  • Per-layer memory signals
  • Per-layer forward/backward timing
  • Lightweight failure attribution (experimental)

If deep-dive mode is not enabled, the default ESSENTIAL signals remain unchanged.


What It Is Not

  • Not a replacement for PyTorch Profiler or Nsight
  • Not an auto-tuner
  • Not a kernel-level tracer

TraceML focuses on step-level visibility that is practical during real training runs.


Supported Environments

  • Python 3.9-3.13
  • PyTorch 1.12+
  • macOS (Intel/ARM), Linux
  • Single GPU
  • Single-node DDP

Hugging Face Integration

TraceML integrates with Hugging Face transformers through TraceMLTrainer.

Usage

Replace transformers.Trainer with traceml.hf_decorators.TraceMLTrainer.

from traceml.hf_decorators import TraceMLTrainer

trainer = TraceMLTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    traceml_enabled=True,
)

Roadmap

Near-term:

  • Single-node DDP hardening
  • Disk run logging
  • Compatibility validation (gradient accumulation, torch.compile)
  • Accelerate / Lightning wrappers

Next:

  • Multi-node DDP
  • Initial FSDP support

Later:

  • Tensor / Pipeline parallel awareness


Contributing

Contributions are welcome.

When opening issues, include:

  • Minimal repro script
  • Hardware + CUDA + PyTorch versions
  • ESSENTIAL vs DEEP-DIVE
  • Single GPU vs DDP
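
The hardware and version details requested for issues can be gathered with a short script like the one below. This is only a sketch: `environment_report` is a hypothetical helper, not a TraceML command, and the PyTorch import is guarded so the script still runs where PyTorch is not installed.

```python
import importlib.util
import platform
import sys


def environment_report():
    """Collect version details that are useful in a bug report."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    # torch is optional here so the script works without it.
    if importlib.util.find_spec("torch") is None:
        report["torch"] = "not installed"
        return report

    import torch

    report["torch"] = torch.__version__
    # torch.version.cuda is None on CPU-only builds.
    report["cuda"] = torch.version.cuda or "none (CPU build)"
    if torch.cuda.is_available():
        report["gpus"] = [torch.cuda.get_device_name(i)
                          for i in range(torch.cuda.device_count())]
    else:
        report["gpus"] = []
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Paste the printed lines into the issue together with the repro script and the mode you ran in.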


Community & Support

Founding Engineer / Co-Founder track (Berlin/Germany): We are looking for a senior systems + ML builder to help grow TraceML into a sustainable AI infra product. See the GitHub Discussion: https://github.com/traceopt-ai/traceml/discussions/36

Stars help more teams find the project. 🌟


License

TraceML is released under the Apache 2.0 license.

See LICENSE for details.


Citation

If TraceML helps your research, please cite:

@software{traceml2024,
  author = {TraceOpt},
  title = {TraceML: Real-time Training Observability for PyTorch},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}

Made with ❤️ by TraceOpt
