TraceML: Lightweight ML Profiler

TraceML

Real-time profiling for PyTorch training — lightweight, always-on, and actionable.

The Problem TraceML Solves

Training deep learning models shouldn't feel like debugging a black box. Yet we constantly face:

💥 CUDA OOM errors with no insight into which layer caused the memory spike
🐌 Slow training without knowing if the bottleneck is data loading, forward pass, backward pass, or optimizer
🔍 Layer-level mysteries — which layers consume the most memory? Which are slowest?
📊 Heavy profilers that are impractical to keep running during actual training

TraceML changes this. It provides continuous, low-overhead visibility into your training process while it's running — no restarts, no heavy tooling, no guesswork.

What TraceML Does

TraceML answers the questions you actually need answered:

Question	TraceML Answer
Which layer is eating my GPU memory?	Per-layer memory breakdown (params + activations + gradients)
Where did that memory spike happen?	Real-time memory tracking during forward/backward passes
Which layer is slow?	Per-layer compute time (forward + backward)
What's slowing down my training step?	Step-level timing: dataloader → forward → backward → optimizer
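
As a rough illustration of what a per-layer breakdown of params, activations, and gradients adds up to, here is a back-of-the-envelope estimate for a single fully connected layer. It is plain arithmetic, not TraceML's measurement code. The function name `linear_layer_memory`, the float32 assumption (4 bytes per element), and the batch size are illustrative choices.

```python
def linear_layer_memory(in_features, out_features, batch_size, bytes_per_element=4):
    """Estimate memory in bytes for one fully connected layer.

    Assumes float32 values, a bias term, and a 2-D input of shape
    (batch_size, in_features). Real usage also depends on the optimizer,
    autograd bookkeeping, and allocator behavior.
    """
    param_count = in_features * out_features + out_features  # weights + bias
    params = param_count * bytes_per_element
    gradients = params  # one gradient value per parameter
    activations = batch_size * out_features * bytes_per_element  # layer output
    return {"params": params, "gradients": gradients, "activations": activations}


estimate = linear_layer_memory(in_features=512, out_features=1000, batch_size=32)
for name, size in estimate.items():
    print(f"{name}: {size / 1e6:.3f} MB")
```

An estimate like this gives a lower bound to compare against measured numbers; a large gap usually points at activations kept alive for the backward pass.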

Three ways to view your data:

🖥️ Terminal dashboard — live updates in your console
📓 Jupyter notebooks — inline visualizations
🌐 Web dashboard — local browser UI at localhost:8765

Installation

pip install traceml-ai

For development:

git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'

Requirements: Python 3.9-3.13, PyTorch 1.12+

Platform support: macOS (Intel/ARM), Linux. Single-GPU training (DDP support coming soon).

Quick Start

Step 1: Add One Decorator to Your Model

TraceML works by attaching lightweight hooks to your PyTorch model. Choose your preferred method:

Option A: Class decorator (recommended)

from traceml.decorators import trace_model
import torch.nn as nn

@trace_model()
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoder(...)
        self.decoder = nn.Linear(512, 1000)
    
    def forward(self, x):
        x = self.encoder(x)
        return self.decoder(x)

Option B: Instance registration

import torchvision
from traceml.decorators import trace_model_instance

model = torchvision.models.resnet50()
trace_model_instance(model)

That's it. No other code changes needed.
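
To make the hook mechanism from Step 1 concrete, the sketch below shows how forward hooks can record per-layer time and output size in plain PyTorch. It is not TraceML's implementation: `attach_timing_hooks` and the layout of the `stats` dictionary are hypothetical, and only standard `torch.nn.Module` hook methods are used.

```python
import time
from collections import defaultdict

import torch
import torch.nn as nn


def attach_timing_hooks(model):
    """Attach forward hooks to every leaf module and return (stats, handles)."""
    stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0, "output_bytes": 0})
    handles = []

    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers; only leaf layers are measured

        def pre_hook(mod, inputs, _name=name):
            # Runs just before the layer's forward(); remember the start time.
            stats[_name]["start"] = time.perf_counter()

        def post_hook(mod, inputs, output, _name=name):
            # Runs right after forward(); accumulate elapsed time and output size.
            entry = stats[_name]
            entry["seconds"] += time.perf_counter() - entry.pop("start")
            entry["calls"] += 1
            if isinstance(output, torch.Tensor):
                entry["output_bytes"] = output.element_size() * output.nelement()

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))

    return stats, handles


model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
stats, handles = attach_timing_hooks(model)
model(torch.randn(2, 8))

for layer, entry in stats.items():
    print(layer, entry["calls"], entry["output_bytes"], f"{entry['seconds']:.6f}s")

for handle in handles:
    handle.remove()  # detach the hooks once profiling is done
```

This version times on the CPU clock only. On a GPU, kernels run asynchronously, so accurate timings would need `torch.cuda.synchronize()` or CUDA events around the measured region.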

Step 2: Run Your Training Script

traceml run train.py

You'll immediately see a live terminal dashboard tracking:

System resources (CPU, RAM, GPU)
Per-layer memory usage and compute time
Training step breakdowns

Best for: Training on remote servers, quick debugging, CI/CD environments.

🌐 Web Dashboard

traceml run train.py --mode=dashboard

Opens http://localhost:8765 with interactive charts and real-time updates.

Best for: Local development, detailed analysis, sharing results with teammates.

📓 Jupyter Notebooks

from traceml.decorators import trace_model_instance
from traceml.manager.tracker_manager import TrackerManager

# Register your model
trace_model_instance(model)

# Start tracking
tracker = TrackerManager(interval_sec=1.0, mode="notebook")
tracker.start()

# Run your training (num_epochs, dataloader, and train_one_epoch are your own code)
for epoch in range(num_epochs):
    train_one_epoch(model, dataloader)

# Stop and view results
tracker.stop()
tracker.log_summaries()

Best for: Experimentation, teaching, sharing results in notebooks.

Advanced: Step Timing

Track specific operations in your training loop:

from traceml.decorators import trace_timestep

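@trace_timestep("dataloader", use_gpu=True)
def load_batch(batch_iter):
    # Pass a persistent iterator (batch_iter = iter(dataloader)).
    # Calling iter(dataloader) on every step would restart from the first batch.
    return next(batch_iter)

@trace_timestep("forward", use_gpu=True)
def forward_pass(model, batch):
    return model(batch)

@trace_timestep("backward", use_gpu=True)
def backward_pass(loss, optimizer):
    # This timing covers the backward pass and the optimizer update together.
    loss.backward()
    optimizer.step()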

Timings automatically appear in all dashboards and logs, helping you identify bottlenecks at a glance.
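
For readers curious how a step-timing decorator of this kind can work, here is a minimal standard-library sketch. It is an illustration under assumptions, not the code behind `trace_timestep`: the names `time_step` and `STEP_TIMINGS` are hypothetical, and only wall-clock time is measured.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: step name -> list of durations in seconds.
STEP_TIMINGS = defaultdict(list)


def time_step(name):
    """Decorator factory that records the wall-clock duration of each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped function raises.
                STEP_TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@time_step("forward")
def fake_forward(x):
    # Stand-in for a real forward pass.
    return [value * 2 for value in x]


for _ in range(3):
    fake_forward([1, 2, 3])

mean_seconds = sum(STEP_TIMINGS["forward"]) / len(STEP_TIMINGS["forward"])
print(f"forward: {len(STEP_TIMINGS['forward'])} calls, mean {mean_seconds:.6f}s")
```

On a GPU, CUDA work is launched asynchronously, so a wall-clock timer like this one would need a `torch.cuda.synchronize()` call (or CUDA events) before reading the clock. That is presumably the purpose of `use_gpu=True`, although its internals are not described here.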

Exporting Data

Enable JSON logging for offline analysis:

traceml run train.py --enable-logging

Logs are saved to ./logs/ with timestamps, ready for plotting or integration with your own monitoring tools.
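
A small loader is usually the first step for offline analysis. The sketch below reads every JSON file in a directory without assuming any field names, since the log schema is not documented here. The `.json` extension, the one-document-per-file layout, and the helper name `load_json_logs` are assumptions; the demo record is made up.

```python
import json
import tempfile
from pathlib import Path


def load_json_logs(log_dir):
    """Return {file name: parsed JSON} for every *.json file in log_dir."""
    logs = {}
    for path in sorted(Path(log_dir).glob("*.json")):
        with path.open("r", encoding="utf-8") as handle:
            logs[path.name] = json.load(handle)
    return logs


# Demo against a throwaway directory with a made-up record.
# For real runs, call load_json_logs("./logs") instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.json"
    sample.write_text(json.dumps({"step": 1, "note": "made-up record"}), encoding="utf-8")
    loaded = load_json_logs(tmp)
    print(sorted(loaded))
```

If the files turn out to be line-delimited JSON instead, replace `json.load(handle)` with a loop that calls `json.loads` on each line.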

Current Features

✅ Real-time system monitoring (CPU, RAM, GPU)
✅ Per-layer memory tracking (parameters, activations, gradients)
✅ Per-layer compute time (forward + backward)
✅ Training step timing (dataloader, forward, backward, optimizer)
✅ Terminal UI with live updates
✅ Jupyter notebook integration
✅ Local web dashboard
✅ JSON export for offline analysis
✅ Minimal overhead
✅ Zero code changes (beyond registration)

Roadmap

🔜 Multi-GPU distributed training (DDP, FSDP)
🔜 PyTorch Lightning integration
🔜 Hugging Face Accelerate support
🔜 Memory leak detection
🔜 Automatic optimization suggestions
🔜 Cloud dashboard (optional)

Examples

Explore complete examples in the repository:

BERT fine-tuning with TraceML

Contributing

We welcome contributions! Here's how to help:

⭐ Star the repo to show support
🐛 Report bugs via GitHub Issues
💡 Request features we should prioritize
🔧 Submit PRs for improvements

Development setup:

git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'
pytest tests/

Community & Support

📧 Email: abhinav@traceopt.ai
🐙 GitHub: traceopt-ai/traceml
📋 User Survey: Help shape the roadmap (2 minutes)

License

TraceML is released under the MIT License with Commons Clause.

What this means:

✅ Free for personal use
✅ Free for research and academic use
✅ Free for internal company use
❌ Not allowed for resale or SaaS products

For commercial licensing inquiries, contact abhinav@traceopt.ai.

See LICENSE for full details.

Citation

If TraceML helps your research, please cite:

@software{traceml2024,
  author = {TraceOpt AI},
  title = {TraceML: Real-time Profiling for PyTorch Training},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}

TraceML — Stop guessing. Start profiling.

Made with ❤️ by TraceOpt AI

Project details

Release history

Version	Release date
0.3.0	May 26, 2026
0.2.15	May 19, 2026
0.2.14	May 7, 2026
0.2.13	Apr 30, 2026
0.2.12	Apr 27, 2026
0.2.11	Apr 23, 2026
0.2.10	Apr 22, 2026
0.2.9	Apr 17, 2026
0.2.8	Apr 13, 2026
0.2.7	Apr 7, 2026
0.2.6	Apr 4, 2026
0.2.5	Mar 20, 2026
0.2.4	Mar 15, 2026
0.2.3	Mar 7, 2026
0.2.2	Feb 28, 2026
0.2.1	Feb 26, 2026
0.2.0	Feb 9, 2026
0.2.0a0 (pre-release)	Jan 27, 2026
0.1.9	Jan 3, 2026
0.1.8 (this version)	Dec 25, 2025
0.1.6	Dec 11, 2025
0.1.5	Dec 10, 2025
0.1.3	Oct 8, 2025
0.1.1	Oct 2, 2025
0.1.0	Oct 2, 2025

Download files

Source Distribution

traceml_ai-0.1.8.tar.gz (60.5 kB)

Uploaded Dec 25, 2025 Source

Built Distribution

traceml_ai-0.1.8-py3-none-any.whl (88.9 kB)

Uploaded Dec 25, 2025 Python 3

File details

Details for the file traceml_ai-0.1.8.tar.gz.

File metadata

Download URL: traceml_ai-0.1.8.tar.gz
Upload date: Dec 25, 2025
Size: 60.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`0b2de939d51ca786236365d29a6fc92b875d1d4ec8928f1ff035aba083e1a6f1`
MD5	`95640619048a9f8711b4ac30da95edfb`
BLAKE2b-256	`bb54c0757dad2fb15b9021657ba11e84d47a3bc2c1262a20519bfcd62cd14b1b`

File details

Details for the file traceml_ai-0.1.8-py3-none-any.whl.

File metadata

Download URL: traceml_ai-0.1.8-py3-none-any.whl
Upload date: Dec 25, 2025
Size: 88.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`956a84c264f2d4ee3bc5d6c86a3af080643883d146fe1a396f07df1a99bf2339`
MD5	`645e699f572752de62a47e35e1dfd920`
BLAKE2b-256	`869a4e16e3e87103390bd6a76238472285a61b8e8c51896b7e5e2622ad90b0c2`
