Skip to main content

TraceML: Lightweight ML Profiler

Project description

TraceML

Real-time profiling for PyTorch training โ€” lightweight, always-on, and actionable.

License: MIT PyPI version Python 3.9-3.13 Open In Colab


The Problem TraceML Solves

Training deep learning models shouldn't feel like debugging a black box. Yet we constantly face:

  • ๐Ÿ’ฅ CUDA OOM errors with no insight into which layer caused the memory spike
  • ๐ŸŒ Slow training without knowing if the bottleneck is data loading, forward pass, backward pass, or optimizer
  • ๐Ÿ” Layer-level mysteries โ€” which layers consume the most memory? Which are slowest?
  • ๐Ÿ“Š Heavy profilers that are impractical to keep running during actual training

TraceML changes this. It provides continuous, low-overhead visibility into your training process while it's running โ€” no restarts, no heavy tooling, no guesswork.


What TraceML Does

TraceML answers the questions you actually need answered:

Question TraceML Answer
Which layer is eating my GPU memory? Per-layer memory breakdown (params + activations + gradients)
Where did that memory spike happen? Real-time memory tracking during forward/backward passes
Which layer is slow? Per-layer compute time (forward + backward)
What's slowing down my training step? Step-level timing: dataloader โ†’ forward โ†’ backward โ†’ optimizer

Three ways to view your data:

  • ๐Ÿ–ฅ๏ธ Terminal dashboard โ€” live updates in your console
  • ๐Ÿ““ Jupyter notebooks โ€” inline visualizations
  • ๐ŸŒ Web dashboard โ€” local browser UI at localhost:8765

Installation

pip install traceml-ai

For development:

git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'

Requirements: Python 3.9-3.13, PyTorch 1.12+

Platform support: macOS (Intel/ARM), Linux. Single-GPU training (DDP support coming soon).


Quick Start

Step 1: Add One Decorator to Your Model

TraceML works by attaching lightweight hooks to your PyTorch model. Choose your preferred method:

Option A: Class decorator (recommended)

from traceml.decorators import trace_model
import torch.nn as nn

@trace_model()
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoder(...)
        self.decoder = nn.Linear(512, 1000)
    
    def forward(self, x):
        x = self.encoder(x)
        return self.decoder(x)

Option B: Instance registration

from traceml.decorators import trace_model_instance

model = torchvision.models.resnet50()
trace_model_instance(model)

That's it. No other code changes needed.

Step 2: Run Your Training Script

traceml run train.py

You'll immediately see a live terminal dashboard tracking:

  • System resources (CPU, RAM, GPU)
  • Per-layer memory usage and compute time
  • Training step breakdowns

TraceML CLI Demo


Best for: Training on remote servers, quick debugging, CI/CD environments.

๐ŸŒ Web Dashboard

traceml run train.py --mode=dashboard

Opens http://localhost:8765 with interactive charts and real-time updates.

TraceML CLI Demo

Best for: Local development, detailed analysis, sharing results with teammates.

๐Ÿ““ Jupyter Notebooks

from traceml.decorators import trace_model_instance
from traceml.manager.tracker_manager import TrackerManager

# Register your model
trace_model_instance(model)

# Start tracking
tracker = TrackerManager(interval_sec=1.0, mode="notebook")
tracker.start()

# Run your training
for epoch in range(num_epochs):
    train_one_epoch(model, dataloader)

# Stop and view results
tracker.stop()
tracker.log_summaries()

Best for: Experimentation, teaching, sharing results in notebooks.


Advanced: Step Timing

Track specific operations in your training loop:

from traceml.decorators import trace_timestep

@trace_timestep("dataloader", use_gpu=True)
def load_batch(dataloader):
    return next(iter(dataloader))

@trace_timestep("forward", use_gpu=True)
def forward_pass(model, batch):
    return model(batch)

@trace_timestep("backward", use_gpu=True)
def backward_pass(loss, optimizer):
    loss.backward()
    optimizer.step()

Timings automatically appear in all dashboards and logs, helping you identify bottlenecks at a glance.


Exporting Data

Enable JSON logging for offline analysis:

traceml run train.py --enable-logging

Logs are saved to ./logs/ with timestamps, ready for plotting or integration with your own monitoring tools.


Current Features

โœ… Real-time system monitoring (CPU, RAM, GPU)
โœ… Per-layer memory tracking (parameters, activations, gradients)
โœ… Per-layer compute time (forward + backward)
โœ… Training step timing (dataloader, forward, backward, optimizer)
โœ… Terminal UI with live updates
โœ… Jupyter notebook integration
โœ… Local web dashboard
โœ… JSON export for offline analysis
โœ… Minimal overhead โœ… Zero code changes (beyond registration)


Roadmap

๐Ÿ”œ Multi-GPU distributed training (DDP, FSDP)
๐Ÿ”œ PyTorch Lightning integration
๐Ÿ”œ Hugging Face Accelerate support
๐Ÿ”œ Memory leak detection
๐Ÿ”œ Automatic optimization suggestions
๐Ÿ”œ Cloud dashboard (optional)


Examples

Explore complete examples in the repository:

Contributing

We welcome contributions! Here's how to help:

  1. โญ Star the repo to show support
  2. ๐Ÿ› Report bugs via GitHub Issues
  3. ๐Ÿ’ก Request features we should prioritize
  4. ๐Ÿ”ง Submit PRs for improvements

Development setup:

git clone https://github.com/traceopt-ai/traceml.git
cd traceml
pip install -e '.[dev]'
pytest tests/

Community & Support


License

TraceML is released under the MIT License with Commons Clause.

What this means:

  • โœ… Free for personal use
  • โœ… Free for research and academic use
  • โœ… Free for internal company use
  • โŒ Not allowed for resale or SaaS products

For commercial licensing inquiries, contact abhinav@traceopt.ai.

See LICENSE for full details.


Citation

If TraceML helps your research, please cite:

@software{traceml2024,
  author = {TraceOpt AI},
  title = {TraceML: Real-time Profiling for PyTorch Training},
  year = {2024},
  url = {https://github.com/traceopt-ai/traceml}
}

TraceML โ€” Stop guessing. Start profiling.

Made with โค๏ธ by TraceOpt AI

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

traceml_ai-0.1.8.tar.gz (60.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

traceml_ai-0.1.8-py3-none-any.whl (88.9 kB view details)

Uploaded Python 3

File details

Details for the file traceml_ai-0.1.8.tar.gz.

File metadata

  • Download URL: traceml_ai-0.1.8.tar.gz
  • Upload date:
  • Size: 60.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.1.8.tar.gz
Algorithm Hash digest
SHA256 0b2de939d51ca786236365d29a6fc92b875d1d4ec8928f1ff035aba083e1a6f1
MD5 95640619048a9f8711b4ac30da95edfb
BLAKE2b-256 bb54c0757dad2fb15b9021657ba11e84d47a3bc2c1262a20519bfcd62cd14b1b

See more details on using hashes here.

File details

Details for the file traceml_ai-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: traceml_ai-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 88.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for traceml_ai-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 956a84c264f2d4ee3bc5d6c86a3af080643883d146fe1a396f07df1a99bf2339
MD5 645e699f572752de62a47e35e1dfd920
BLAKE2b-256 869a4e16e3e87103390bd6a76238472285a61b8e8c51896b7e5e2622ad90b0c2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page