
TraceML: Lightweight ML Profiler


TraceML

If you find it useful, consider giving it a ⭐ on GitHub — it helps others discover the project!



A lightweight, always-on profiler for PyTorch that makes memory, timing, and system usage visible in real time via:

  • Terminal dashboards
  • Jupyter notebooks
  • A lightweight local web dashboard/server
  • JSON logging for offline analysis

Minimal configuration. Minimal overhead. Plug-and-trace.


📊 Quick User Survey (2 min)

Using TraceML? Help shape the roadmap: https://forms.gle/vaDQao8L81oAoAkv9

🚨 The Problem

Training deep learning models often feels like debugging a black box:

  • CUDA OOM errors appear without warning
  • Training steps run slowly, with no visibility into where the time goes
  • Existing profilers are heavy, complicated, or lack activation/gradient memory details

TraceML provides continuous, lightweight observability with minimal impact on training speed.


💡 Why TraceML?

TraceML is designed to stay lightweight, always-on, and practical:

  • Module-level memory tracking (params, activations, gradients)
  • Step timing (forward, backward, optimizer, dataloader)
  • Terminal + Notebook + Local Web Dashboard (port 8765)
  • Minimal overhead (sampling-based — NOT full graph tracing)

A tool you can safely keep on in every training loop.


⭐ Quick Start

1. Installation

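From PyPI (the distribution is published as traceml-ai):

pip install traceml-ai

From a clone of the repository:

pip install .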

Developer mode:

pip install '.[dev]'

🔧 2. Model Registration (Required)

TraceML needs to attach hooks to your model. There are two ways to register it:

A. Decorator (recommended)

from traceml.decorators import trace_model
import torch.nn as nn

@trace_model()
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(100, 10)

    def forward(self, x):
        return self.fc(x)

B. Register a model instance

from traceml.decorators import trace_model_instance
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

trace_model_instance(model)

This is all you need to enable memory + timing tracing across all workflows.
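
For intuition about what hooks like these can observe, here is a minimal sketch built only on PyTorch's public `register_forward_hook`, `numel()`, and `element_size()` calls. It is not TraceML's implementation; the helper names `param_bytes` and `attach_activation_hooks` are hypothetical and exist only for this illustration.

```python
import torch
import torch.nn as nn


def param_bytes(module: nn.Module) -> int:
    """Bytes held by the parameters owned directly by this module."""
    return sum(p.numel() * p.element_size() for p in module.parameters(recurse=False))


def attach_activation_hooks(model: nn.Module, stats: dict) -> list:
    """Record the output size in bytes of every leaf module on each forward pass."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers; only leaf layers are hooked

        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                stats[name] = output.numel() * output.element_size()

        handles.append(module.register_forward_hook(hook))
    return handles


model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
activation_bytes = {}
handles = attach_activation_hooks(model, activation_bytes)

model(torch.randn(8, 100))  # one forward pass fires every hook

for name, module in model.named_modules():
    if name in activation_bytes:
        print(name, "params:", param_bytes(module), "activations:", activation_bytes[name])

for h in handles:
    h.remove()  # detach the hooks when done
```

Removing the handles at the end matters in long-lived sessions: a hook that stays attached keeps running on every forward pass.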


🚀 3. Running TraceML

You can run TraceML in three modes:


A. CLI Mode (Terminal Dashboard — default)

traceml run your_script.py

This launches a live terminal dashboard showing:

  • System metrics (CPU, RAM, GPU)
  • Layer memory
  • Activation + gradient memory
  • Step timings



B. Dashboard Mode (Local Web UI)

Run your training script with:

traceml run your_script.py --mode=dashboard

This opens a live dashboard at:

http://localhost:8765

Includes:

  • Real-time charts
  • Per-layer memory
  • Peaks and summaries



C. Notebook Mode

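from traceml.decorators import trace_model_instance
from traceml.manager.tracker_manager import TrackerManager

# `model` is your nn.Module; `train` is your own training function.
trace_model_instance(model)

# Start tracking in notebook mode.
tracker = TrackerManager(interval_sec=1.0, mode="notebook")
tracker.start()

train(model)

# Stop tracking, then log the summaries.
tracker.stop()
tracker.log_summaries()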

Notebook UI updates automatically.
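
Because `start()` and `stop()` are separate calls, an exception raised inside `train(model)` would skip the cleanup. One way to guard against that is a small context manager. The sketch below is an assumption, not part of TraceML: `traced` is a hypothetical helper that relies only on the `start`, `stop`, and `log_summaries` methods shown above, and `FakeTracker` is a stand-in so the block runs without TraceML installed.

```python
from contextlib import contextmanager


@contextmanager
def traced(tracker):
    """Start the tracker, and always stop it and log summaries on exit."""
    tracker.start()
    try:
        yield tracker
    finally:
        tracker.stop()
        tracker.log_summaries()


class FakeTracker:
    """Stand-in exposing the same three methods as the tracker above."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def log_summaries(self):
        self.calls.append("log_summaries")


fake = FakeTracker()
try:
    with traced(fake):
        raise RuntimeError("simulated training failure")
except RuntimeError:
    pass

print(fake.calls)  # ['start', 'stop', 'log_summaries']
```

In a real notebook you would pass a `TrackerManager` instance in place of `FakeTracker`, assuming its `stop()` and `log_summaries()` are safe to call after a failed run.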


⏱ Step Timing Example

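from traceml.decorators import trace_timestep

# Time the forward pass; `batch` is a dict of keyword arguments for the model.
@trace_timestep("forward", use_gpu=True)
def forward_pass(model, batch):
    return model(**batch)

# Time the backward pass; `scaler` is a torch.cuda.amp.GradScaler (mixed precision).
@trace_timestep("backward", use_gpu=True)
def backward_pass(loss, scaler):
    scaler.scale(loss).backward()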

Timings automatically appear in CLI, dashboard, and notebook summaries.
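
To make the idea of a step-timing decorator concrete, here is a minimal CPU-only sketch using the standard library. It is an illustration under stated assumptions, not how `trace_timestep` is implemented: `time_step` and the `timings` dictionary are hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)  # step name -> list of durations in seconds


def time_step(name):
    """Hypothetical decorator: record the wall-clock duration of each call under `name`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped function raises.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@time_step("forward")
def fake_forward(x):
    time.sleep(0.01)  # stand-in for real work
    return x * 2


for i in range(3):
    fake_forward(i)

avg_ms = 1000 * sum(timings["forward"]) / len(timings["forward"])
print(f"forward: {len(timings['forward'])} calls, avg {avg_ms:.1f} ms")
```

Wall-clock timing like this is only accurate for CPU work. CUDA kernels launch asynchronously, so GPU timing needs a `torch.cuda.synchronize()` call or CUDA events around the measured region, which is presumably what `use_gpu=True` accounts for.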


📤 Exporting Logs as JSON

Enable JSON logging:

traceml run your_script.py --enable-logging

Logs are stored in:

./logs/

Useful for plotting, analytics, or offline dashboards.


📊 How TraceML Works (Lightweight Samplers)

TraceML uses asynchronous samplers that collect measurements alongside training (NOT full tracing):

  • SystemSampler — CPU, RAM, GPU
  • LayerMemorySampler — Params
  • ActivationMemorySampler — Forward activations
  • GradientMemorySampler — Backward gradients
  • StepTimeSampler — Forward/backward/optimizer timings

This keeps overhead extremely low.
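
The general pattern behind a sampler can be sketched with a background thread that polls a probe function at a fixed interval. This is a simplified illustration, not TraceML's code: `PeriodicSampler` and its `probe` argument are hypothetical, and `time.monotonic` stands in for a real metric reader.

```python
import threading
import time


class PeriodicSampler:
    """Hypothetical sampler: calls `probe()` every `interval_sec` on a daemon thread."""

    def __init__(self, probe, interval_sec=1.0):
        self.probe = probe
        self.interval_sec = interval_sec
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns True once stop() is called.
        while not self._stop.wait(self.interval_sec):
            self.samples.append(self.probe())

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


sampler = PeriodicSampler(probe=time.monotonic, interval_sec=0.02)
sampler.start()
time.sleep(0.2)  # stand-in for the training loop
sampler.stop()

print(f"collected {len(sampler.samples)} samples")
```

The cost of this design scales with the sampling interval, not with the number of operations the model executes, which is the main reason sampling is cheaper than recording every operation.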


📦 Current Features

  • Live system usage (CPU, RAM, GPU)
  • Per-layer memory tracking
  • Activation & gradient memory
  • Step timing
  • Terminal UI
  • Notebook display
  • Local web dashboard
  • JSON logging

🛠 Coming Soon

  • Multi-node distributed tracing
  • PyTorch Lightning / Accelerate integration

🤝 Contribute

  • ⭐ the repo to support development
  • Open issues for improvements or bugs
  • Contributions welcome

📧 Contact: abhinavsriva@gmail.com


🧾 License

TraceML is released under the MIT License with the Commons Clause:

  • Free for personal, research, academic, and internal use
  • Not allowed for resale, SaaS, or commercial redistribution

For commercial licensing, contact abhinavsriva@gmail.com.


TraceML — Lightweight, real-time visibility for PyTorch training.

