TraceML: Lightweight ML Profiler

TraceML

Always-on, live observability and failure attribution for distributed PyTorch training (Alpha)

TraceML is a lightweight runtime observability tool for distributed PyTorch training.
It makes training behavior visible while the job runs, using semantic, step-level signals that are typically missing from infrastructure metrics and that full profilers are too expensive to keep enabled for.

Status: Alpha
Current focus: single-node DDP stability, signal accuracy, and overhead optimization (Python/GIL behavior, communication paths, synchronization strategy, and UI/collector performance).
Multi-node distributed training (DDP/FSDP) is planned.

Why TraceML

Training deep learning models often becomes a black box once you scale beyond toy workloads.

Common pain points:

Slow / unstable steps without knowing whether the bottleneck is dataloader, compute, communication, or optimizer
CUDA OOM errors with limited attribution to the responsible layer
Layer-level opacity: unclear memory and compute hotspots
Heavy profilers that are too intrusive to keep enabled during real training

TraceML is designed to be always-on, giving you actionable attribution during long-running jobs.

What TraceML Shows (Core Signals)

TraceML focuses on the signals you actually debug with:

Step-aware signals (synchronized across ranks)

For each training step (in single-node DDP):

Dataloader fetch time
Training step time (GPU-aware via CUDA events)
Step GPU memory (allocated + peak)
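To make "dataloader fetch time" concrete: it is the time the training loop spends waiting for the next batch. The sketch below measures it by timing each next() call on an iterator. Both timed_batches and slow_loader are hypothetical helpers written for illustration; they are not part of TraceML's API, and TraceML's own measurement may work differently.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_batches(loader: Iterable[T]) -> Iterator[Tuple[T, float]]:
    """Yield (batch, fetch_seconds) pairs for any iterable of batches.

    Hypothetical helper: fetch time is the wall-clock time spent inside
    next(), i.e. how long the loop waited for the batch to be produced.
    """
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def slow_loader(n: int, delay: float):
    """Stand-in for a dataloader that takes `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


fetch_times = [t for _, t in timed_batches(slow_loader(3, 0.01))]
print([f"{t * 1000:.1f} ms" for t in fetch_times])
```

If fetch time is a large share of the total step time, the input pipeline rather than the model is the likely bottleneck.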

Across ranks, TraceML reports:

Median rank (typical behavior)
Worst rank (straggler / bottleneck)

This makes it easy to catch cases like “8 GPUs slower than 1” as they happen, and to understand whether you’re bottlenecked by the input pipeline, compute, or rank-level stragglers.
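As a rough illustration of the median/worst-rank summary, the sketch below reduces per-rank step times to those two numbers. The function name summarize_ranks and the timing values are made up for this example; this is not TraceML's implementation.

```python
from statistics import median
from typing import Dict, Tuple


def summarize_ranks(step_ms: Dict[int, float]) -> Tuple[float, int, float]:
    """Return (median_ms, worst_rank, worst_ms) for one training step."""
    med = median(step_ms.values())
    worst_rank = max(step_ms, key=step_ms.get)
    return med, worst_rank, step_ms[worst_rank]


# Made-up step times in milliseconds for a 4-GPU run; rank 2 is a straggler.
step_ms = {0: 101.0, 1: 99.0, 2: 180.0, 3: 100.0}
med, worst_rank, worst_ms = summarize_ranks(step_ms)
print(f"median={med:.1f} ms, worst=rank {worst_rank} ({worst_ms:.1f} ms)")
```

Because synchronous DDP waits for every rank during gradient synchronization, a large gap between the median and the worst rank usually points at a straggler rather than a uniformly slow step.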

Failure attribution

OOM attribution (Deep-Dive mode): surface the layer most likely responsible during forward/backward

What TraceML Is Not

TraceML is not an auto-tuner or a profiler replacement.

It does not automatically optimize your batch size
It does not always “find a problem”
It does not replace Nsight or PyTorch Profiler

Instead, TraceML answers a more basic question:

“Which part of my training step is responsible for what I’m seeing — or is everything behaving normally?”

If your run is healthy, TraceML will tell you that explicitly.

Views

TraceML supports two ways to consume runtime signals:

🖥️ Terminal dashboard — live updates in your console
🌐 Web dashboard — local browser at http://localhost:8765

Note: The notebook view is temporarily disabled in the alpha release.

Tracking Profiles

TraceML provides two tracking profiles so you can choose the right trade-off between insight and overhead.

ESSENTIAL mode (always-on runtime signals)

Designed for day-to-day training and long-running jobs.

Tracks:

Dataloader fetch time
Training step time (GPU-aware)
Step-level GPU memory (allocated and peak)
System metrics (CPU, RAM, GPU)
Basic failure signals

This mode is intended to run continuously during real training.

DEEP-DIVE mode (diagnostic)

Designed for performance pathology debugging and OOM investigations.

Includes everything in ESSENTIAL, plus:

Per-layer memory (parameters, activations, gradients)
Per-layer forward and backward compute time
OOM layer attribution (forward/backward)
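For intuition about the parameter part of per-layer memory, it can be estimated as element count times element size. The sketch below does this for a few invented layers; param_bytes, the layer names, and the shapes are all hypothetical, and this is not how TraceML collects its measurements.

```python
from math import prod
from typing import Dict, Tuple

# Bytes per element for a few common dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}


def param_bytes(
    shapes: Dict[str, Tuple[int, ...]], dtype: str = "float32"
) -> Dict[str, int]:
    """Estimate parameter memory per layer as element count x element size."""
    size = DTYPE_BYTES[dtype]
    return {name: prod(shape) * size for name, shape in shapes.items()}


# Hypothetical layers: name -> weight shape (biases omitted for brevity).
layers = {"embed": (30000, 768), "fc1": (768, 3072), "fc2": (3072, 768)}
usage = param_bytes(layers)
largest = max(usage, key=usage.get)
for name, nbytes in usage.items():
    print(f"{name}: {nbytes / 1e6:.1f} MB")
print(f"largest: {largest}")
```

Gradients are typically about the same size as the parameters they belong to, while activation memory depends on batch size and input shape, which is why it has to be measured at runtime.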

Installation

pip install traceml-ai

For development:

git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'

Requirements: Python 3.9–3.13, PyTorch 1.12+
Platform support: macOS (Intel/ARM), Linux
Training support: Single GPU and single-node DDP (alpha)

Quick Start

1) Step-level tracking (required)

TraceML computes step timing / memory only inside a trace_step() scope.

from traceml.decorators import trace_step

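for batch in dataloader:
    # Step timing and memory are computed only for work inside this scope.
    with trace_step(model):
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        # Clear gradients so they do not accumulate into the next step.
        optimizer.zero_grad(set_to_none=True)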

Without trace_step():

Step timing is not computed
Step memory is not recorded
Live dashboards will not update

2) Optional: Time specific code regions

Use @trace_time to time specific functions.
This works in all modes and is designed to have low overhead.

from traceml.decorators import trace_time

@trace_time("backward", use_gpu=True)
def backward_pass(loss):
    loss.backward()

Notes:

use_gpu=True uses CUDA events (correct for async GPU work)
use_gpu=False uses CPU wall-clock time
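To show what CPU wall-clock timing of a named region looks like, here is a minimal sketch of a timing decorator. The names time_region and timings are hypothetical and exist only for this example; the sketch is not TraceML's implementation of @trace_time.

```python
import functools
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical in-memory store: region name -> list of durations in seconds.
timings: DefaultDict[str, List[float]] = defaultdict(list)


def time_region(name: str) -> Callable:
    """Hypothetical CPU-only analogue of a named timing decorator."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the duration even if fn raises.
                timings[name].append(time.perf_counter() - start)

        return wrapper

    return decorator


@time_region("preprocess")
def preprocess(xs):
    return [x * 2 for x in xs]


result = preprocess([1, 2, 3])
print(result, len(timings["preprocess"]))
```

A wall-clock timer like this only measures the time until the function returns. CUDA kernels are launched asynchronously, so a function can return before the GPU has finished its work, which is why GPU regions need event-based timing instead.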

Deprecation (Breaking change)

@trace_timestep is deprecated — use @trace_time instead

3) Deep-Dive: model registration (only for Deep-Dive)

from traceml.decorators import trace_model_instance

trace_model_instance(model)

Enables forward/backward hooks required for:

per-layer memory and timing (layerwise worst across ranks)
OOM layer attribution (experimental, work-in-progress)

Running TraceML

traceml run train.py --nproc-per-node=2

You’ll see a live terminal dashboard tracking:

System resources (CPU, RAM, GPU)
Dataloader fetch time, step time, step GPU memory
(Deep-Dive only) per-layer memory + compute time

Tip: for DDP, run TraceML on rank 0 and collect rank signals via the TraceML runtime.

Web Dashboard

traceml run train.py --nproc-per-node=2 --mode=dashboard

Opens http://localhost:8765 with interactive charts and real-time updates.

Roadmap

TraceML prioritizes clear attribution and low overhead over exhaustive tracing.

Near-term:

Optimize single-node DDP: reduce overhead, improve rank synchronization accuracy, improve comm + GIL behavior
Broaden workload coverage with validated examples and benchmarks for representative workloads:
- CV (e.g., ResNet / ViT)
- NLP / LLM fine-tuning (e.g., BERT / small decoder models)
- Diffusion / vision-language (as time permits)
Documentation improvements: clearer docs + examples (targeting beta)

Multi-node distributed support (DDP → FSDP)
Integrations: PyTorch Lightning / Hugging Face Accelerate (as optional wrappers)
Advanced diagnostics: leak detection, regression attribution, and automated “why is my step slower?” summaries

Contributing

Contributions are welcome.

⭐ Star the repo
🐛 Report bugs via GitHub Issues
💡 Request features / workloads you want supported
🔧 Submit PRs (small focused PRs are ideal)

If you hit an issue, please open a GitHub Issue with:

minimal repro script
hardware + CUDA + PyTorch versions
whether you used ESSENTIAL or DEEP-DIVE
single GPU vs DDP
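One way to gather the version details for a report is a short script like the sketch below. collect_env is a hypothetical helper, not something TraceML ships; it uses only the standard library plus PyTorch's public version attributes, and it still runs if PyTorch is not installed.

```python
import platform
from typing import Dict


def collect_env() -> Dict[str, str]:
    """Gather version details that are useful in a bug report."""
    info = {
        "python": platform.python_version(),
        "os": f"{platform.system()} {platform.release()}",
    }
    try:
        import torch  # optional: only reported when PyTorch is installed
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    info["cuda"] = str(torch.version.cuda)  # "None" on CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


env = collect_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

Paste the printed lines into the issue alongside the repro script, the tracking profile, and whether the run was single GPU or DDP.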

We’ll try to respond and resolve quickly.

Community & Support

📧 Email: abhinav@traceopt.ai
🐙 LinkedIn: Abhinav Srivastav
📋 User Survey: Help shape the roadmap (2 minutes) https://forms.gle/KwPSLaPmJnJjoVXSA
Stars help the project grow and make it easier for others to find our work. 🌟

---

License

TraceML is released under the MIT License with Commons Clause.

Summary:

✅ Free for personal use
✅ Free for research and academic use
✅ Free for internal company use
❌ Not allowed for resale or SaaS products

See LICENSE for full details.
For commercial licensing, contact: abhinav@traceopt.ai

Citation

If TraceML helps your research, please cite:

@software{traceml2024,
  author = {TraceOpt AI},
  title = {TraceML: Real-time Training Observability for PyTorch},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}

TraceML — Stop guessing. Start attributing.

Made with ❤️ by TraceOpt AI

Project details

Release history

0.3.0

May 26, 2026

0.2.15

May 19, 2026

0.2.14

May 7, 2026

0.2.13

Apr 30, 2026

0.2.12

Apr 27, 2026

0.2.11

Apr 23, 2026

0.2.10

Apr 22, 2026

0.2.9

Apr 17, 2026

0.2.8

Apr 13, 2026

0.2.7

Apr 7, 2026

0.2.6

Apr 4, 2026

0.2.5

Mar 20, 2026

0.2.4

Mar 15, 2026

0.2.3

Mar 7, 2026

0.2.2

Feb 28, 2026

0.2.1

Feb 26, 2026

0.2.0

Feb 9, 2026

This version

0.2.0a0 pre-release

Jan 27, 2026

0.1.9

Jan 3, 2026

0.1.8

Dec 25, 2025

0.1.6

Dec 11, 2025

0.1.5

Dec 10, 2025

0.1.3

Oct 8, 2025

0.1.1

Oct 2, 2025

0.1.0

Oct 2, 2025

Download files

Source Distribution

traceml_ai-0.2.0a0.tar.gz (104.3 kB view details)

Uploaded Jan 27, 2026 Source

Built Distribution

traceml_ai-0.2.0a0-py3-none-any.whl (142.5 kB view details)

Uploaded Jan 27, 2026 Python 3

File details

Details for the file traceml_ai-0.2.0a0.tar.gz.

File metadata

Download URL: traceml_ai-0.2.0a0.tar.gz
Upload date: Jan 27, 2026
Size: 104.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.2.0a0.tar.gz
Algorithm	Hash digest
SHA256	`f0fa5983e0ad3a831dbe51cf9c8a269c245d52c97b0a4f6f59a660e2b8ed8666`
MD5	`cc50781a8957bc1c7ee853d9277c2c4e`
BLAKE2b-256	`47d2bd6b4187cac0ec6d7f2efb4bd5424ed29aee84ea9ccfdd19d9e86631fad5`

File details

Details for the file traceml_ai-0.2.0a0-py3-none-any.whl.

File metadata

Download URL: traceml_ai-0.2.0a0-py3-none-any.whl
Upload date: Jan 27, 2026
Size: 142.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.2.0a0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5e90fb3cebaa34ce9a552a1d52a5a1a9750509b2bf7f315e11c8df5b78b259b5`
MD5	`71c21cc332068b57c71b3ec2a416b03d`
BLAKE2b-256	`ce316def01a1fedc597eb3fc8769aa4f9b28dcdc861e629c8ff6be4fd086e7fe`
