
TraceML: Lightweight ML Profiler


TraceML

If you find it useful, consider giving it a ⭐ on GitHub — it helps others discover the project!



A lightweight, always-on profiler for PyTorch that makes memory, timing, and system usage visible in real time via:

  • Terminal dashboards
  • Jupyter notebooks
  • A lightweight local web dashboard/server
  • JSON logging for offline analysis

Minimal configuration. Minimal overhead. Plug-and-trace.


📊 Quick User Survey (2 min)

Using TraceML? Help shape the roadmap: https://forms.gle/vaDQao8L81oAoAkv9

🚨 The Problem

Training deep learning models often feels like debugging a black box:

  • CUDA OOM errors appear without warning
  • Steps are slow, with no visibility into where the time goes
  • Existing profilers are heavy, complicated, or lack activation/gradient memory details

TraceML provides continuous, lightweight observability with minimal impact on training speed.


💡 Why TraceML?

TraceML is designed to stay lightweight, always-on, and practical:

  • Module-level memory tracking (params, activations, gradients)
  • Step timing (forward, backward, optimizer, dataloader)
  • Terminal + Notebook + Local Web Dashboard (port 8765)
  • Minimal overhead (sampling-based — NOT full graph tracing)

A tool you can safely keep on in every training loop.


⭐ Quick Start

1. Installation

From the root of a source checkout:

pip install .

Developer mode (installs the dev extras):

pip install '.[dev]'

🔧 2. Model Registration (Required)

TraceML needs to attach hooks to your model. There are two ways to register it:

A. Decorator (recommended)

from traceml.decorators import trace_model
import torch.nn as nn

@trace_model()
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(100, 10)

    def forward(self, x):
        return self.fc(x)

B. Register a model instance

from traceml.decorators import trace_model_instance
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

trace_model_instance(model)

This is all you need to enable memory + timing tracing across all workflows.
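
For intuition about what attaching hooks to a model involves, the sketch below uses plain PyTorch forward hooks to record the output size of each leaf module. It is not TraceML's implementation: the helper `attach_activation_hooks` and the `sizes` dictionary are hypothetical names used only for illustration, and the sketch handles only modules that return a single tensor.

```python
import torch
import torch.nn as nn


def attach_activation_hooks(model):
    """Record the output size in bytes of every leaf module (illustrative only)."""
    sizes = {}
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # This sketch only handles plain tensor outputs.
            if isinstance(output, torch.Tensor):
                sizes[name] = output.numel() * output.element_size()
        return hook

    for name, module in model.named_modules():
        # Leaf modules have no children; containers such as nn.Sequential are skipped.
        if len(list(module.children())) == 0:
            handles.append(module.register_forward_hook(make_hook(name)))
    return sizes, handles


model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
sizes, handles = attach_activation_hooks(model)

with torch.no_grad():
    model(torch.randn(4, 100))  # one forward pass triggers the hooks

for handle in handles:
    handle.remove()  # detach the hooks once the measurement is done

print(sizes)
```

Removing the handles afterwards leaves the model unchanged, which is the usual way to keep hook-based measurement from leaking into later runs.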


🚀 3. Running TraceML

You can run TraceML in three modes:


A. CLI Mode (Terminal Dashboard — default)

traceml run your_script.py

This launches a live terminal dashboard showing:

  • System metrics (CPU, RAM, GPU)
  • Layer memory
  • Activation + gradient memory
  • Step timings



B. Dashboard Mode (Local Web UI)

Run your training script with:

traceml run your_script.py --mode=dashboard

This opens a live dashboard at:

http://localhost:8765

Includes:

  • Real-time charts
  • Per-layer memory
  • Peaks and summaries



C. Notebook Mode

from traceml.decorators import trace_model_instance
from traceml.manager.tracker_manager import TrackerManager

# `model` is your nn.Module and `train` is your training function, both defined elsewhere.
trace_model_instance(model)

tracker = TrackerManager(interval_sec=1.0, mode="notebook")
tracker.start()

train(model)

tracker.stop()
tracker.log_summaries()

The notebook UI updates automatically.


⏱ Step Timing Example

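from traceml.decorators import trace_timestep

# `batch` is a dict of keyword arguments accepted by the model's forward().
@trace_timestep("forward", use_gpu=True)
def forward_pass(model, batch):
    return model(**batch)

# `scaler` is a gradient scaler such as torch.cuda.amp.GradScaler (mixed precision).
@trace_timestep("backward", use_gpu=True)
def backward_pass(loss, scaler):
    scaler.scale(loss).backward()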

Timings automatically appear in CLI, dashboard, and notebook summaries.
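
The general idea behind a step-timing decorator can be sketched with the standard library alone. This is an illustration, not the code behind `trace_timestep`: the names `timed_step`, `TIMINGS`, and `fake_forward` are hypothetical. The source does not say how `use_gpu=True` is handled; in general, CUDA kernels run asynchronously, so accurate GPU timings need a synchronisation point such as `torch.cuda.synchronize()` or CUDA events, which this CPU-only sketch leaves out.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store: step name -> list of durations in seconds.
TIMINGS = defaultdict(list)


def timed_step(name):
    """Decorator that records the wall-clock duration of each call under `name`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped function raises.
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed_step("forward")
def fake_forward(n):
    # Stand-in for a real forward pass.
    return sum(i * i for i in range(n))


for _ in range(3):
    fake_forward(10_000)

mean_ms = 1000 * sum(TIMINGS["forward"]) / len(TIMINGS["forward"])
print(f"forward: {len(TIMINGS['forward'])} calls, mean {mean_ms:.3f} ms")
```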


📤 Exporting Logs as JSON

Enable JSON logging:

traceml run your_script.py --enable-logging

Logs are stored in:

./logs/

Useful for plotting, analytics, or offline dashboards.
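
For offline analysis, a small loader along the following lines may be a reasonable starting point. The log schema and file naming are not documented here, so this sketch makes assumptions: it reads every `.json` or `.jsonl` file under the log directory and treats each as either a single JSON document or one JSON value per line. The function name `load_records` is hypothetical.

```python
import json
from pathlib import Path


def load_records(log_dir="logs"):
    """Collect JSON records from every .json / .jsonl file under `log_dir`.

    Assumption: each file is either one JSON document or JSON Lines.
    """
    records = []
    for path in sorted(Path(log_dir).rglob("*.json*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue
        try:
            data = json.loads(text)  # the whole file is one JSON document
            records.extend(data if isinstance(data, list) else [data])
        except json.JSONDecodeError:
            # Fall back to JSON Lines: one JSON value per non-empty line.
            records.extend(
                json.loads(line) for line in text.splitlines() if line.strip()
            )
    return records


records = load_records("logs")
print(f"loaded {len(records)} records")
```

The returned list of dictionaries can be passed to whatever plotting or analysis tool you prefer once you have inspected the actual fields in your logs.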


📊 How TraceML Works (Lightweight Samplers)

TraceML uses asynchronous samplers (NOT full tracing):

  • SystemSampler — CPU, RAM, GPU
  • LayerMemorySampler — Params
  • ActivationMemorySampler — Forward activations
  • GradientMemorySampler — Backward gradients
  • StepTimeSampler — Forward/backward/optimizer timings

This keeps overhead extremely low.
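
To illustrate why sampling is cheap compared with full tracing, here is a minimal background sampler built on the standard library. It is a sketch of the general pattern, not TraceML's sampler code: `PeriodicSampler` and its attributes are hypothetical names, and the sampled metric (the number of live threads) is only a stand-in for a real CPU, RAM, or GPU reading.

```python
import threading
import time


class PeriodicSampler:
    """Call `read_fn` every `interval_sec` on a daemon thread and keep the samples."""

    def __init__(self, read_fn, interval_sec=1.0):
        self.read_fn = read_fn
        self.interval_sec = interval_sec
        self.samples = []
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait acts as an interruptible sleep: it returns True once stop() is called.
        while not self._stop_event.wait(self.interval_sec):
            self.samples.append((time.time(), self.read_fn()))

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()


# Usage: sample a metric while the "training loop" runs on the main thread.
sampler = PeriodicSampler(threading.active_count, interval_sec=0.05)
sampler.start()
time.sleep(0.3)  # stand-in for training work
sampler.stop()
print(f"collected {len(sampler.samples)} samples")
```

The training thread does no extra work per operation; the cost is one cheap read per interval, which is the trade-off sampling makes against the completeness of a full trace.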


📦 Current Features

  • Live system usage (CPU, RAM, GPU)
  • Per-layer memory tracking
  • Activation & gradient memory
  • Step timing
  • Terminal UI
  • Notebook display
  • Local web dashboard
  • JSON logging

🛠 Coming Soon

  • Multi-node distributed tracing
  • PyTorch Lightning / Accelerate integration

🤝 Contribute

  • ⭐ the repo to support development
  • Open issues for improvements or bugs
  • Contributions welcome

📧 Contact: abhinavsriva@gmail.com


🧾 License

TraceML is released under the MIT License with the Commons Clause:

  • Free for personal, research, academic, and internal use
  • Not allowed for resale, SaaS, or commercial redistribution

For commercial licensing, contact abhinavsriva@gmail.com.


TraceML — Lightweight, real-time visibility for PyTorch training.
