TraceML: Lightweight ML Profiler
TraceML
If you find it useful, consider giving it a ⭐ on GitHub — it helps others discover the project!
A lightweight, always-on profiler for PyTorch that makes memory, timing, and system usage visible in real time via:
- Terminal dashboards
- Jupyter notebooks
- A lightweight local web dashboard/server
- JSON logging for offline analysis
Minimal configuration. Minimal overhead. Plug-and-trace.
📊 Quick User Survey (2 min)
Using TraceML? Help shape the roadmap: https://forms.gle/vaDQao8L81oAoAkv9
🚨 The Problem
Training deep learning models often feels like debugging a black box:
- CUDA OOM errors appear without warning
- Steps run slowly, with no visibility into where the time goes
- Existing profilers are heavy, complicated, or lack activation/gradient memory details
TraceML provides continuous, lightweight observability with minimal impact on training speed.
💡 Why TraceML?
TraceML is designed to stay lightweight, always-on, and practical:
- Module-level memory tracking (params, activations, gradients)
- Step timing (forward, backward, optimizer, dataloader)
- Terminal + Notebook + Local Web Dashboard (port 8765)
- Minimal overhead (sampling-based — NOT full graph tracing)
A tool you can safely keep on in every training loop.
⭐ Quick Start
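1. Installation
From PyPI (the distribution is published as traceml_ai):
pip install traceml-ai
From a source checkout:
pip install .
Developer mode (from a source checkout):
pip install '.[dev]'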
🔧 2. Model Registration (Required)
TraceML needs to attach hooks to your model. There are two ways to register it:
A. Decorator (recommended)
from traceml.decorators import trace_model
import torch.nn as nn
@trace_model()
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(100, 10)
    def forward(self, x):
        return self.fc(x)
B. Register a model instance
from traceml.decorators import trace_model_instance
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10),
)
trace_model_instance(model)
This is all you need to enable memory + timing tracing across all workflows.
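To make the difference between the two registration styles concrete, here is a minimal pure-Python sketch of how a class decorator and an instance-level function can feed the same registry. The names `REGISTRY`, `register_instance`, and `register_class` are hypothetical, and this is not TraceML's implementation: the real library attaches PyTorch hooks, whereas this sketch only records objects.

```python
import functools

# Hypothetical registry; stands in for whatever bookkeeping a tracer keeps.
REGISTRY = []


def register_instance(obj):
    """Record an already-constructed object (instance-style registration)."""
    # Identity comparison avoids calling a custom __eq__ on the objects.
    if not any(existing is obj for existing in REGISTRY):
        REGISTRY.append(obj)
    return obj


def register_class():
    """Return a class decorator that registers every new instance."""
    def decorate(cls):
        original_init = cls.__init__

        @functools.wraps(original_init)
        def traced_init(self, *args, **kwargs):
            original_init(self, *args, **kwargs)
            register_instance(self)  # runs once construction has finished

        cls.__init__ = traced_init
        return cls
    return decorate


@register_class()
class TinyModel:
    def __init__(self, width):
        self.width = width


a = TinyModel(4)                  # registered by the decorator
b = register_instance(object())   # registered explicitly
register_instance(a)              # registering the same object again is a no-op
print(len(REGISTRY))
```

The decorator form covers every instance created from the class, while the instance form suits models built from existing classes such as `nn.Sequential`, where there is no class definition of your own to decorate.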
🚀 3. Running TraceML
You can run TraceML in three modes:
✅ A. CLI Mode (Terminal Dashboard — default)
traceml run your_script.py
This launches a live terminal dashboard showing:
- System metrics (CPU, RAM, GPU)
- Layer memory
- Activation + gradient memory
- Step timings
✅ B. Dashboard Mode (Local Web UI)
Run your training script with:
traceml run your_script.py --mode=dashboard
This opens a live dashboard at:
http://localhost:8765
The dashboard includes:
- Real-time charts
- Per-layer memory
- Peaks and summaries
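If the page does not load, it can help to check whether anything is listening on the dashboard port at all. The sketch below uses only the standard library; `dashboard_is_up` is a hypothetical helper, and it only confirms that a TCP connection is accepted, not that the listener is TraceML.

```python
import socket


def dashboard_is_up(host="localhost", port=8765, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


print("dashboard reachable:", dashboard_is_up())
```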
✅ C. Notebook Mode
from traceml.decorators import trace_model_instance
from traceml.manager.tracker_manager import TrackerManager
# `model` and `train` are your own model instance and training function.
trace_model_instance(model)
tracker = TrackerManager(interval_sec=1.0, mode="notebook")
tracker.start()
train(model)
tracker.stop()
tracker.log_summaries()
The notebook UI updates automatically while training runs.
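If training raises an exception, the `tracker.stop()` call in the snippet above is never reached. One way to guard against that is a small context manager. This is a sketch that assumes only the `start()`, `stop()`, and `log_summaries()` methods shown above; `tracking` and `FakeTracker` are hypothetical names, and `FakeTracker` exists purely so the example runs without TraceML installed.

```python
from contextlib import contextmanager


@contextmanager
def tracking(tracker):
    """Start `tracker`, and always stop it and log summaries on exit."""
    tracker.start()
    try:
        yield tracker
    finally:
        tracker.stop()
        tracker.log_summaries()


class FakeTracker:
    """Stand-in used only to demonstrate the call order without TraceML."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def log_summaries(self):
        self.calls.append("log_summaries")


fake = FakeTracker()
try:
    with tracking(fake):
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass

print(fake.calls)  # ['start', 'stop', 'log_summaries']
```

In a notebook, the same pattern would read `with tracking(TrackerManager(interval_sec=1.0, mode="notebook")): train(model)`.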
⏱ Step Timing Example
from traceml.decorators import trace_timestep
@trace_timestep("forward", use_gpu=True)
def forward_pass(model, batch):
    return model(**batch)
@trace_timestep("backward", use_gpu=True)
def backward_pass(loss, scaler):
    scaler.scale(loss).backward()
Timings automatically appear in CLI, dashboard, and notebook summaries.
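For intuition about what a step-timing decorator does, here is a minimal standard-library sketch. `timed_step` and `STEP_TIMES` are hypothetical names, not part of TraceML, and the sketch measures CPU wall-clock time only. CUDA kernels are launched asynchronously, so accurate GPU timing additionally needs synchronization or CUDA events; presumably that is what `use_gpu=True` handles, though that is an assumption about TraceML's internals.

```python
import functools
import time
from collections import defaultdict

# Hypothetical store of durations (in seconds) keyed by step name.
STEP_TIMES = defaultdict(list)


def timed_step(name):
    """Decorator that records the wall-clock duration of each call under `name`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if fn raises, so failed steps are still timed.
                STEP_TIMES[name].append(time.perf_counter() - start)
        return wrapper
    return decorate


@timed_step("forward")
def fake_forward(n):
    # Placeholder workload standing in for a real forward pass.
    return sum(i * i for i in range(n))


for _ in range(3):
    fake_forward(10_000)

print(len(STEP_TIMES["forward"]), "calls recorded")
```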
📤 Exporting Logs as JSON
Enable JSON logging:
traceml run your_script.py --enable-logging
Logs are stored in:
./logs/
Useful for plotting, analytics, or offline dashboards.
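As a starting point for offline analysis, the sketch below collects whatever JSON files exist under the log directory. The log schema and file naming are not documented here, so the `*.json` extension is an assumption and `load_json_logs` is a hypothetical helper; it simply parses each file and reports the top-level type.

```python
import json
from pathlib import Path


def load_json_logs(log_dir="./logs"):
    """Parse every *.json file under log_dir; return {relative_path: data}."""
    root = Path(log_dir)
    if not root.is_dir():
        return {}
    loaded = {}
    for path in sorted(root.rglob("*.json")):
        try:
            text = path.read_text(encoding="utf-8")
            loaded[str(path.relative_to(root))] = json.loads(text)
        except json.JSONDecodeError as exc:
            # Skip files that are still being written or are not plain JSON.
            print(f"skipping {path}: {exc}")
    return loaded


logs = load_json_logs()
for name, data in logs.items():
    print(name, type(data).__name__)
```

From there the parsed objects can be handed to a plotting or dataframe library once the actual structure of the files has been inspected.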
📊 How TraceML Works (Lightweight Samplers)
TraceML uses asynchronous samplers (NOT full tracing):
- SystemSampler — CPU, RAM, GPU
- LayerMemorySampler — Params
- ActivationMemorySampler — Forward activations
- GradientMemorySampler — Backward gradients
- StepTimeSampler — Forward/backward/optimizer timings
Because the samplers collect readings periodically instead of tracing every operation, overhead stays low.
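The general pattern behind a sampling-based profiler can be sketched with a background thread that wakes at a fixed interval, takes a reading, and goes back to sleep. `PeriodicSampler` below is a hypothetical illustration, not one of TraceML's sampler classes, and it reads process CPU time from the standard library rather than any GPU or PyTorch metric.

```python
import threading
import time


class PeriodicSampler:
    """Call `read_fn` every `interval_sec` on a daemon thread; keep results."""

    def __init__(self, read_fn, interval_sec=1.0):
        self._read_fn = read_fn
        self._interval = interval_sec
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._samples = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns True as
        # soon as stop() sets the event, which ends the loop promptly.
        while not self._stop.wait(self._interval):
            value = self._read_fn()
            with self._lock:
                self._samples.append((time.time(), value))

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def samples(self):
        # Return a copy so callers cannot mutate the internal list.
        with self._lock:
            return list(self._samples)


sampler = PeriodicSampler(read_fn=time.process_time, interval_sec=0.01)
sampler.start()
time.sleep(0.2)  # stands in for the training loop
sampler.stop()
print(f"collected {len(sampler.samples())} samples")
```

The training loop itself does no measurement work in this design; the cost is one reading per interval, which is the trade-off against full tracing: cheaper, but events shorter than the interval can be missed.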
📦 Current Features
- Live system usage (CPU, RAM, GPU)
- Per-layer memory tracking
- Activation & gradient memory
- Step timing
- Terminal UI
- Notebook display
- Local web dashboard
- JSON logging
🛠 Coming Soon
- Multi-node distributed tracing
- PyTorch Lightning / Accelerate integration
🤝 Contribute
- ⭐ the repo to support development
- Open issues for improvements or bugs
- Contributions welcome
📧 Contact: abhinavsriva@gmail.com
🧾 License
TraceML is released under the MIT License with the Commons Clause:
- Free for personal, research, academic, and internal use
- Not allowed for resale, SaaS, or commercial redistribution
For commercial licensing, contact abhinavsriva@gmail.com.
TraceML — Lightweight, real-time visibility for PyTorch training.
File details
Details for the file traceml_ai-0.1.5.tar.gz.
File metadata
- Download URL: traceml_ai-0.1.5.tar.gz
- Upload date:
- Size: 52.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 13a2a572f565e99958a351ffd117f811058bf8729f5ab1c6be2a9a5bc8d5da07 |
| MD5 | 6e3b2cead14fec375cf07650bfbd0c2e |
| BLAKE2b-256 | e86e1842a1ed9ccc158cc200aafdad281c033c329eb85e871564d6c071172be6 |
File details
Details for the file traceml_ai-0.1.5-py3-none-any.whl.
File metadata
- Download URL: traceml_ai-0.1.5-py3-none-any.whl
- Upload date:
- Size: 75.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6f04c5bd6d40e13e8f60d9bc932b34c6cad65cc83832bbd57693a0ca5bee62a8 |
| MD5 | 010de6d57057f3d1b6ec44e57b1f772a |
| BLAKE2b-256 | 0fd5016603981656091807126a77a6ba2077f0f0ef6c7c931ec95ebb00a38a11 |