TraceML: Lightweight ML Profiler

TraceML

Real-time training observability and failure attribution tool for PyTorch — lightweight, always-on, and actionable.

The Problem TraceML Solves

Training deep learning models shouldn't feel like debugging a black box. Yet we constantly face:

💥 CUDA OOM errors with no insight into which layer caused the memory spike
🐌 Slow training without knowing if the bottleneck is data loading, forward pass, backward pass, or optimizer
🔍 Layer-level mysteries — which layers consume the most memory? Which are slowest?
📊 Heavy profilers that are impractical to keep running during actual training

TraceML changes this with continuous, low-overhead visibility while your training runs.

What TraceML Does

TraceML answers the questions you actually need answered:

Question	TraceML Answer
Which layer caused the OOM?	Automatic detection of the failing layer during the forward or backward pass
What's slowing down my training step?	Step-level timing: dataloader → forward → backward → optimizer
Where did that memory spike happen?	Step-level (batch-level) memory tracking with peak attribution
Which layer is eating my GPU memory?	Per-layer memory breakdown (params + forward + backward)
Which layer is slow?	Per-layer compute time (forward + backward)
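
To make the step-level timing row concrete, here is a minimal sketch of how a per-phase breakdown can be turned into a bottleneck answer. It is not TraceML's implementation: the function name `summarize_step` and the timing values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def summarize_step(phase_seconds: Dict[str, float]) -> Tuple[float, Dict[str, float], str]:
    """Return total step time, each phase's share of it, and the slowest phase."""
    total = sum(phase_seconds.values())
    if total <= 0:
        raise ValueError("step has no recorded time")
    shares = {name: secs / total for name, secs in phase_seconds.items()}
    bottleneck = max(phase_seconds, key=phase_seconds.get)
    return total, shares, bottleneck


# Illustrative (made-up) timings for one training step, in seconds.
timings = {"dataloader": 0.120, "forward": 0.040, "backward": 0.080, "optimizer": 0.010}

total, shares, bottleneck = summarize_step(timings)
for name, share in shares.items():
    print(f"{name:<10} {share:6.1%}")
print(f"total {total:.3f}s, bottleneck: {bottleneck}")
```

With these made-up numbers the dataloader accounts for about half of the step, which would point at input pipeline work (more workers, prefetching) rather than at the model.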

Three ways to view the results:

🖥️ Terminal — live updates in your console
🌐 Web UI — local browser at localhost:8765
📓 Jupyter notebooks — inline visualizations

Tracking Profiles (New)

TraceML supports two tracking profiles so you can choose the right trade-off between insight and overhead.

ESSENTIAL mode (lightweight, always-on)

Best for day-to-day training and long runs.

Tracks:

Dataloader fetch time
Training step time (GPU-aware)
Step GPU memory (allocated + peak)
System stats (CPU, RAM, GPU)
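
As an illustration of what "dataloader fetch time" means, the sketch below wraps any iterable and measures how long each `next()` call blocks. It uses only the standard library; `timed_batches` and `slow_loader` are hypothetical names, and TraceML may measure this differently.

```python
import time
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def timed_batches(loader: Iterable[T]) -> Iterator[Tuple[T, float]]:
    """Yield (batch, seconds spent waiting for that batch)."""
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return  # loader exhausted
        yield batch, time.perf_counter() - start


def slow_loader(n: int, delay: float) -> Iterator[int]:
    """Stand-in for a real dataloader: each batch takes `delay` seconds to produce."""
    for i in range(n):
        time.sleep(delay)
        yield i


batches: List[int] = []
fetch_times: List[float] = []
for batch, fetch_s in timed_batches(slow_loader(3, 0.01)):
    batches.append(batch)
    fetch_times.append(fetch_s)

print([f"{t * 1e3:.1f} ms" for t in fetch_times])
```

Fetch time measured this way is plain CPU wall-clock time, which is appropriate here because the training loop is genuinely blocked while it waits for the next batch.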

DEEP-DIVE mode (diagnostic)

Best for debugging OOMs and performance pathologies.

Tracks everything in Essential, plus:

Per-layer memory (parameters, activations, gradients)
Per-layer forward and backward time
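
For a rough sense of the numbers a per-layer memory view reports, parameter memory can be estimated by hand. The sketch below does the arithmetic for fully connected layers, assuming float32 (4 bytes per element); the helper `linear_param_bytes` and the layer sizes are hypothetical and not part of TraceML.

```python
def linear_param_bytes(in_features: int, out_features: int,
                       bias: bool = True, bytes_per_element: int = 4) -> int:
    """Bytes held by the weight (and optional bias) of a fully connected layer."""
    n_params = in_features * out_features + (out_features if bias else 0)
    return n_params * bytes_per_element


# Hypothetical two-layer MLP: name -> (in_features, out_features)
layers = {"fc1": (1024, 4096), "fc2": (4096, 1024)}

param_bytes = {name: linear_param_bytes(i, o) for name, (i, o) in layers.items()}
for name, nbytes in param_bytes.items():
    print(f"{name}: {nbytes / 2**20:.2f} MiB")
```

Gradients of the same dtype occupy the same amount of memory as the parameters once they are populated, while activation memory also scales with batch size, which is why it usually has to be measured at runtime rather than computed up front.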

Timed Regions (optional, very low overhead)

Optional instrumentation for specific code blocks.
Executed once per step per decorated function.

Tracks:

Custom timing blocks (e.g. dataloader, forward, backward, optimizer)
CPU or GPU time (via CUDA events)

Overhead: low (one timing measurement per step)
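
The idea behind a timed region can be sketched with a small standard-library decorator that records one measurement per call. This is only an illustration of the pattern: `timed` and the `timings` registry are hypothetical and are not TraceML's API.

```python
import functools
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical registry: region name -> list of elapsed seconds, one per call.
timings: DefaultDict[str, List[float]] = defaultdict(list)


def timed(name: str) -> Callable:
    """Decorator that records the CPU wall-clock time of each call under `name`."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the measurement even if the function raises.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("optimizer")
def fake_optimizer_step() -> str:
    time.sleep(0.005)  # stand-in for real work
    return "ok"


for _ in range(3):
    fake_optimizer_step()

print(f"optimizer: {len(timings['optimizer'])} measurements")
```

Because the wrapper adds only two clock reads and a list append per call, the cost stays small as long as the decorated function runs once per step rather than inside a tight inner loop.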

Installation

pip install traceml-ai

For development:

git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'

Requirements: Python 3.9-3.13, PyTorch 1.12+

Platform support: macOS (Intel/ARM), Linux. Single-GPU training (DDP support coming soon).

Quick Start (Important)

Step-level tracking (required for all modes)

from traceml.decorators import trace_step

for batch in dataloader:
    # Everything inside the block is measured as one training step.
    with trace_step():
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

Without trace_step(...):

Step timing is not computed
Step memory is not recorded
Live dashboards will not update

Optional: Timing Specific Code Regions

Use @trace_time to time specific functions. This works in all modes and has low overhead.

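from traceml.decorators import trace_time

@trace_time("backward", use_gpu=True)
def backward_pass(loss, scaler=None):
    # With mixed precision, scale the loss before backpropagating.
    if scaler is not None:
        scaler.scale(loss).backward()
    else:
        loss.backward()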

Notes:

use_gpu=True uses CUDA events (correct for async GPU work)
use_gpu=False uses CPU wall-clock time
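
The difference between the two settings comes from CUDA kernels being launched asynchronously: a CPU timer around a GPU call can stop before the kernel has finished. The sketch below shows the general CUDA-event pattern in plain PyTorch, with a wall-clock fallback when no GPU is present. It is an assumption about the technique, not TraceML's code, and `time_region_ms` is a hypothetical helper.

```python
import time

import torch


def time_region_ms(fn, use_gpu: bool) -> float:
    """Return the elapsed time of fn() in milliseconds."""
    if use_gpu and torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()  # wait until the GPU has actually reached `end`
        return start.elapsed_time(end)  # milliseconds between the two events
    # CPU path: wall-clock time is accurate because the work is synchronous.
    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(512, 512, device=device)

elapsed_ms = time_region_ms(lambda: a @ a, use_gpu=True)
print(f"matmul took {elapsed_ms:.3f} ms on {device}")
```

The first GPU call in a process typically includes one-time initialization cost, so a warm-up call before measuring gives more representative numbers.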

Deprecation (⚠️ Breaking change)

@trace_timestep is deprecated; use @trace_time instead.

Deep-Dive: Model Registration

Required only for Deep-Dive mode.

from traceml.decorators import trace_model_instance

trace_model_instance(model)

Notes:

Enables forward/backward hooks
Required for per-layer memory and timing
Required for OOM layer attribution
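
To show what hook-based per-layer tracking looks like in general, here is a minimal sketch using PyTorch's standard `register_forward_pre_hook` and `register_forward_hook`. It measures CPU wall-clock forward time per leaf module; `attach_forward_timers` is a hypothetical helper, and nothing here is claimed to match how `trace_model_instance` works internally.

```python
import time
from collections import defaultdict

import torch
from torch import nn


def attach_forward_timers(model: nn.Module):
    """Attach hooks that accumulate forward time (seconds) per leaf module."""
    totals = defaultdict(float)
    starts = {}
    handles = []

    def make_hooks(name):
        def pre_hook(module, inputs):
            starts[name] = time.perf_counter()

        def post_hook(module, inputs, output):
            totals[name] += time.perf_counter() - starts.pop(name)

        return pre_hook, post_hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            pre, post = make_hooks(name)
            handles.append(module.register_forward_pre_hook(pre))
            handles.append(module.register_forward_hook(post))
    return totals, handles


model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
totals, handles = attach_forward_timers(model)

with torch.no_grad():
    model(torch.randn(32, 64))

for handle in handles:
    handle.remove()  # detach hooks once measurement is done

for name, seconds in totals.items():
    print(f"layer {name}: {seconds * 1e3:.3f} ms")
```

This version is only meaningful on CPU. On a GPU the hooks would need CUDA events or explicit synchronization, since the forward call returns before the kernels finish.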

Running TraceML

traceml run train.py

You'll immediately see a live terminal dashboard tracking:

System resources (CPU, RAM, GPU)
Dataloader fetch time, training step time and training step GPU memory
(Deep-Dive only) Per-layer memory and compute time

🌐 Web Dashboard

traceml run train.py --mode=dashboard

Opens http://localhost:8765 with interactive charts and real-time updates.

📓 Jupyter Notebooks

Please see the notebook example for inline visualizations.

Roadmap

Performance & Stability Improvements: Continuous reduction of tracing overhead, improved robustness for long-running training jobs, and better defaults for production-scale workloads.
Distributed Training Support: Support for multi-GPU training (DDP / FSDP) and, over time, multi-node distributed setups with clear failure and performance attribution.
Framework Integrations: Native integrations with popular training frameworks such as PyTorch Lightning and Hugging Face Accelerate.
Advanced Diagnostics: Memory leak detection, clearer attribution of performance regressions, and richer debugging signals for complex training runs.
Actionable Insights & Automation: Smarter summaries and recommendations to help users identify bottlenecks and optimize training configurations.

Contributing

We welcome contributions! Here's how to help:

⭐ Star the repo to show support
🐛 Report bugs via GitHub Issues
💡 Request features we should prioritize
🔧 Submit PRs for improvements

Community & Support

📧 Email: abhinav@traceopt.ai
🐙 LinkedIn: Abhinav Srivastav
📋 User Survey: Help shape the roadmap (2 minutes)

License

TraceML is released under the MIT License with Commons Clause.

What this means:

✅ Free for personal use
✅ Free for research and academic use
✅ Free for internal company use
❌ Not allowed for resale or SaaS products

For commercial licensing inquiries, contact abhinav@traceopt.ai.

See LICENSE for full details.

Citation

If TraceML helps your research, please cite:

@software{traceml2024,
  author = {TraceOpt AI},
  title = {TraceML: Real-time Training Observability for PyTorch},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}

TraceML — Stop guessing. Start profiling.

Made with ❤️ by TraceOpt AI
