TraceML: Lightweight ML Profiler
TraceML
Real-time profiling for PyTorch training: lightweight, always-on, and actionable.
The Problem TraceML Solves
Training deep learning models shouldn't feel like debugging a black box. Yet we constantly face:
- CUDA OOM errors with no insight into which layer caused the memory spike
- Slow training without knowing if the bottleneck is data loading, forward pass, backward pass, or optimizer
- Layer-level mysteries: which layers consume the most memory? Which are slowest?
- Heavy profilers that are impractical to keep running during actual training
TraceML changes this. It provides continuous, low-overhead visibility into your training process while it's running: no restarts, no heavy tooling, no guesswork.
What TraceML Does
TraceML answers the questions you actually need answered:
| Question | TraceML Answer |
|---|---|
| Which layer is eating my GPU memory? | Per-layer memory breakdown (params + activations + gradients) |
| Where did that memory spike happen? | Real-time memory tracking during forward/backward passes |
| Which layer is slow? | Per-layer compute time (forward + backward) |
| What's slowing down my training step? | Step-level timing: dataloader → forward → backward → optimizer |
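To make the "params + activations + gradients" breakdown concrete, here is a back-of-the-envelope estimate for a single fully connected layer. It is only a sketch of the arithmetic, not TraceML's internal accounting: the function name `linear_layer_memory`, the dtype table, and the choice to count only the layer's output tensor as its activation are all assumptions made for illustration.

```python
# Rough per-layer memory estimate for a fully connected layer.
# Illustrative only: names and simplifications here are assumptions,
# not TraceML's actual accounting.

BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def linear_layer_memory(in_features, out_features, batch_size, dtype="float32"):
    """Return estimated bytes for params, gradients, and output activations."""
    elem = BYTES_PER_ELEMENT[dtype]
    n_params = in_features * out_features + out_features  # weight + bias
    return {
        "params": n_params * elem,
        "gradients": n_params * elem,  # one gradient value per parameter
        "activations": batch_size * out_features * elem,  # output tensor only
    }


est = linear_layer_memory(512, 1000, batch_size=32)
for name, n_bytes in est.items():
    print(f"{name:<12} {n_bytes / 2**20:6.2f} MiB")
```

Real activation memory depends on what autograd saves for the backward pass, so measured numbers can differ from this estimate.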
Three ways to view your data:
- Terminal dashboard: live updates in your console
- Jupyter notebooks: inline visualizations
- Web dashboard: local browser UI at localhost:8765
Installation
pip install traceml-ai
For development:
git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'
Requirements: Python 3.9-3.13, PyTorch 1.12+
Platform support: macOS (Intel/ARM), Linux. Single-GPU training (DDP support coming soon).
Quick Start
Step 1: Add One Decorator to Your Model
TraceML works by attaching lightweight hooks to your PyTorch model. Choose your preferred method:
Option A: Class decorator (recommended)
from traceml.decorators import trace_model
import torch.nn as nn

@trace_model()
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoder(...)
        self.decoder = nn.Linear(512, 1000)

    def forward(self, x):
        x = self.encoder(x)
        return self.decoder(x)
Option B: Instance registration
import torchvision
from traceml.decorators import trace_model_instance

# Register an existing model instance (useful for models you did not define)
model = torchvision.models.resnet50()
trace_model_instance(model)
That's it. No other code changes needed.
Step 2: Run Your Training Script
traceml run train.py
You'll immediately see a live terminal dashboard tracking:
- System resources (CPU, RAM, GPU)
- Per-layer memory usage and compute time
- Training step breakdowns
Best for: Training on remote servers, quick debugging, CI/CD environments.
Web Dashboard
traceml run train.py --mode=dashboard
Opens http://localhost:8765 with interactive charts and real-time updates.
Best for: Local development, detailed analysis, sharing results with teammates.
Jupyter Notebooks
from traceml.decorators import trace_model_instance
from traceml.manager.tracker_manager import TrackerManager
# Register your model
trace_model_instance(model)
# Start tracking
tracker = TrackerManager(interval_sec=1.0, mode="notebook")
tracker.start()
# Run your training
for epoch in range(num_epochs):
    train_one_epoch(model, dataloader)
# Stop and view results
tracker.stop()
tracker.log_summaries()
Best for: Experimentation, teaching, sharing results in notebooks.
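In a notebook, an exception in the training loop would skip `tracker.stop()`. One way to guard against that is a small context manager around the `start()`, `stop()`, and `log_summaries()` calls shown above. This is a sketch, not part of TraceML: `tracking` is a hypothetical helper, and `FakeTracker` is a stand-in so the example runs without the library installed.

```python
from contextlib import contextmanager


@contextmanager
def tracking(tracker):
    """Start `tracker`, then always stop it and log summaries on exit."""
    tracker.start()
    try:
        yield tracker
    finally:
        tracker.stop()
        tracker.log_summaries()


class FakeTracker:
    """Stand-in exposing the same three methods, used here only for the demo."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def log_summaries(self):
        self.calls.append("log_summaries")


fake = FakeTracker()
with tracking(fake):
    pass  # the training loop would go here
print(fake.calls)  # ['start', 'stop', 'log_summaries']
```

With the real library, the same wrapper would presumably accept a `TrackerManager` instance in place of `FakeTracker`, since it relies only on those three methods.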
Advanced: Step Timing
Track specific operations in your training loop:
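from traceml.decorators import trace_timestep

@trace_timestep("dataloader", use_gpu=True)
def load_batch(batch_iter):
    # Pass a persistent iterator (batch_iter = iter(dataloader));
    # calling iter(dataloader) here would restart the loader on every step.
    return next(batch_iter)

@trace_timestep("forward", use_gpu=True)
def forward_pass(model, batch):
    return model(batch)

@trace_timestep("backward", use_gpu=True)
def backward_pass(loss, optimizer):
    # Note: this label covers both the backward pass and the optimizer step.
    loss.backward()
    optimizer.step()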
Timings automatically appear in all dashboards and logs, helping you identify bottlenecks at a glance.
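For intuition about what a step-timing decorator does, here is a CPU-only sketch built on `time.perf_counter`. It is not TraceML's implementation: `timed_step`, `STEP_TIMES`, and `summarize` are hypothetical names. It also ignores the fact that CUDA kernels run asynchronously, so honest GPU timings need explicit synchronization or CUDA events, which is presumably what the `use_gpu=True` flag above is for.

```python
import functools
import time
from collections import defaultdict

# Hypothetical accumulator: label -> list of durations in seconds.
STEP_TIMES = defaultdict(list)


def timed_step(label):
    """Record the wall-clock duration of each call under `label`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if the wrapped function raises.
                STEP_TIMES[label].append(time.perf_counter() - start)
        return wrapper
    return decorate


def summarize(times):
    """Return each label's share (0..1) of the total recorded time."""
    totals = {label: sum(values) for label, values in times.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {label: 0.0 for label in totals}
    return {label: total / grand_total for label, total in totals.items()}


@timed_step("dataloader")
def fake_load():
    time.sleep(0.02)  # pretend to fetch a batch


@timed_step("forward")
def fake_forward():
    time.sleep(0.01)  # pretend to run the model


for _ in range(3):
    fake_load()
    fake_forward()

for label, share in summarize(STEP_TIMES).items():
    print(f"{label:<12} {share:6.1%}")
```

Reporting shares rather than raw seconds makes it easy to spot which phase dominates a step, whatever the absolute speed of the machine.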
Exporting Data
Enable JSON logging for offline analysis:
traceml run train.py --enable-logging
Logs are saved to ./logs/ with timestamps, ready for plotting or integration with your own monitoring tools.
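The log schema is not documented here, so a reasonable first step is to look at what the files contain before writing any plotting code. The sketch below assumes only that the export produces files ending in `.json` under `./logs/`, each holding one JSON document. That file layout is an assumption, and `summarize_json_logs` is a hypothetical helper. It counts the top-level keys it finds and makes no claim about what those keys are.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_json_logs(log_dir):
    """Count top-level keys across all *.json files under `log_dir`."""
    key_counts = Counter()
    n_files = 0
    root = Path(log_dir)
    if not root.is_dir():
        return n_files, key_counts
    for path in sorted(root.rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                payload = json.load(fh)
        except (ValueError, OSError):
            continue  # skip files that are not a single valid JSON document
        n_files += 1
        # Accept either one object per file or a list of objects.
        records = payload if isinstance(payload, list) else [payload]
        for record in records:
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_files, key_counts


files_read, keys_seen = summarize_json_logs("./logs")
print(f"parsed {files_read} file(s)")
for key, count in keys_seen.most_common():
    print(f"{key}: {count}")
```

Once the actual field names are known, the same loop can feed them into whatever plotting or monitoring tool you prefer.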
Current Features
- Real-time system monitoring (CPU, RAM, GPU)
- Per-layer memory tracking (parameters, activations, gradients)
- Per-layer compute time (forward + backward)
- Training step timing (dataloader, forward, backward, optimizer)
- Terminal UI with live updates
- Jupyter notebook integration
- Local web dashboard
- JSON export for offline analysis
- Minimal overhead
- Zero code changes (beyond registration)
Roadmap
- Multi-GPU distributed training (DDP, FSDP)
- PyTorch Lightning integration
- Hugging Face Accelerate support
- Memory leak detection
- Automatic optimization suggestions
- Cloud dashboard (optional)
Examples
Explore complete examples in the repository.
Contributing
We welcome contributions! Here's how to help:
- Star the repo to show support
- Report bugs via GitHub Issues
- Request features we should prioritize
- Submit PRs for improvements
Development setup:
git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'
pytest tests/
Community & Support
- Email: abhinav@traceopt.ai
- GitHub: traceopt-ai/traceml
- User Survey: Help shape the roadmap (2 minutes)
License
TraceML is released under the MIT License with Commons Clause.
What this means:
- Free for personal use
- Free for research and academic use
- Free for internal company use
- Not allowed for resale or SaaS products
For commercial licensing inquiries, contact abhinav@traceopt.ai.
See LICENSE for full details.
Citation
If TraceML helps your research, please cite:
@software{traceml2024,
  author = {TraceOpt AI},
  title = {TraceML: Real-time Profiling for PyTorch Training},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}
TraceML: Stop guessing. Start profiling.
Made with ❤️ by TraceOpt AI
Download files
File details
Details for the file traceml_ai-0.1.8.tar.gz.
File metadata
- Download URL: traceml_ai-0.1.8.tar.gz
- Upload date:
- Size: 60.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0b2de939d51ca786236365d29a6fc92b875d1d4ec8928f1ff035aba083e1a6f1 |
| MD5 | 95640619048a9f8711b4ac30da95edfb |
| BLAKE2b-256 | bb54c0757dad2fb15b9021657ba11e84d47a3bc2c1262a20519bfcd62cd14b1b |
File details
Details for the file traceml_ai-0.1.8-py3-none-any.whl.
File metadata
- Download URL: traceml_ai-0.1.8-py3-none-any.whl
- Upload date:
- Size: 88.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 956a84c264f2d4ee3bc5d6c86a3af080643883d146fe1a396f07df1a99bf2339 |
| MD5 | 645e699f572752de62a47e35e1dfd920 |
| BLAKE2b-256 | 869a4e16e3e87103390bd6a76238472285a61b8e8c51896b7e5e2622ad90b0c2 |