
Cross-platform GPU profiling and replay system

Open-source, cross-platform GPU Profiling & Replay System
Designed for AI compilers, deep learning frameworks, and GPU driver engineers

Features

  • High-Performance Event Capture: Collect 10,000+ GPU instruction-level call stacks without interrupting execution
  • Lock-Free Ring Buffer: Minimal overhead event collection using SPSC (Single Producer Single Consumer) design
  • SBT Binary Trace Format: Compact, efficient binary format with string interning and delta timestamp encoding
  • Multi-Platform Support: NVIDIA CUDA (via CUPTI/nsys), Apple Metal + Instruments (xctrace), MetaX MACA (via MCPTI), AMD ROCm (coming soon)
  • Multi-GPU & Multi-Stream: Full support for complex GPU topologies and async execution
  • Multi-GPU Cluster Profiling (v0.7.x): GPUTopology discovery, TimeSync (NTP/PTP/CUDA), NCCLTracker for distributed training
  • Perfetto SDK Integration: Native protobuf export (85% smaller files) + JSON fallback
  • Real-time Tracing: Thread-safe TracingSession with lock-free buffers (9K+ events/sec)
  • Kineto-Compatible Schema: PyTorch profiler compatibility with thread tracking, flexible metadata, and structured flows
  • Memory & Counter Profiling: MemoryEvent and CounterEvent for detailed resource tracking
  • LLVM XRay Support: Import compiler-instrumented function traces
  • eBPF Types (Linux): Kernel-level GPU event tracing support
  • RenderDoc-style Frame Capture: F12-trigger capture with resource state snapshots
  • GPU Memory Profiler: Allocation tracking, leak detection, peak usage monitoring
  • CLI Tools: Easy-to-use command-line interface for recording and viewing traces

Architecture

Core Modules:

| Module | Description |
| --- | --- |
| Capture | GPU profiling backends (CUPTI, Metal, BPF, Memory) |
| Common | Core types, lock-free ring buffer, stack capture, XRay import |
| Format | SBT binary trace format (read/write) |
| State | GPU state machine, timeline builder, Perfetto exporters |
| Replay | Trace replay engine, stream scheduler, determinism checker |
| Cluster | Multi-GPU profiling, time sync, NCCL tracking (v0.7.x) |

Supported Backends:

| Platform | Backend | Status |
| --- | --- | --- |
| NVIDIA | CUPTI SDK | ✅ Production |
| Apple | Metal API | ✅ Production |
| Apple | Instruments (xctrace) | ✅ Production |
| MetaX | MCPTI SDK | ✅ Production |
| AMD | ROCm | 🔜 Coming Soon |
| Linux | eBPF | ✅ Available |

Output Formats:

  • .sbt - TraceSmith Binary Trace (compact, indexed)
  • .json - Perfetto JSON (chrome://tracing)
  • .perfetto - Perfetto Protobuf (85% smaller)
  • .dot - Graphviz dependency graph
  • ASCII Timeline - Terminal visualization
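
For orientation, the .json output targets the Chrome Trace Event format that chrome://tracing and ui.perfetto.dev can load. The sketch below writes such a document by hand. It is independent of TraceSmith's exporter: `to_chrome_trace` and the span tuples are hypothetical, and the exact fields TraceSmith emits are not shown in this README.

```python
import json

def to_chrome_trace(spans):
    """Convert (name, start_us, duration_us, stream_id) tuples to trace JSON."""
    events = []
    for name, start_us, duration_us, stream_id in spans:
        events.append({
            "name": name,
            "ph": "X",           # "X" marks a complete event (start + duration)
            "ts": start_us,      # start timestamp in microseconds
            "dur": duration_us,  # duration in microseconds
            "pid": 0,            # illustrative: a single process
            "tid": stream_id,    # illustrative: one track per stream
        })
    return json.dumps({"traceEvents": events}, indent=2)

# Made-up spans, only to show the shape of the output
spans = [("vectorAdd", 0, 5, 1), ("matrixMul", 10, 67, 1), ("memcpy_h2d", 0, 20, 2)]
trace_json = to_chrome_trace(spans)
print(trace_json)
```

Saving `trace_json` to a file and opening it in ui.perfetto.dev should show two tracks, one per `tid`.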

Prerequisites & Dependencies

Before installing TraceSmith, install the required dependencies for your platform.

Linux (Ubuntu/Debian)

# Core build tools
sudo apt update
sudo apt install -y cmake g++ make git

# Optional: libunwind for call stack capture
sudo apt install -y libunwind-dev

# Optional: Python development headers (for Python bindings)
sudo apt install -y python3-dev python3-pip

Linux with NVIDIA CUDA

# 1. Install CUDA Toolkit (includes CUPTI)
# Download from: https://developer.nvidia.com/cuda-downloads
# Or use package manager:
sudo apt install -y nvidia-cuda-toolkit

# 2. Install Nsight Systems (required for `tracesmith profile --nsys`)
# Option A: Install from CUDA repository (recommended)
sudo apt install -y cuda-nsight-systems-12-8  # Match your CUDA version

# Option B: Install standalone
sudo apt install -y nsight-systems

# 3. Verify installation
nvcc --version           # CUDA compiler
nsys --version           # Nsight Systems profiler (2024.x recommended)
ls /usr/local/cuda/include/cupti.h  # CUPTI headers

# 4. Set environment variables (add to ~/.bashrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Linux with MetaX MACA

# Install MACA SDK (includes MCPTI)
# Contact MetaX for SDK access: https://www.metax-tech.com

# Set MACA_ROOT environment variable
export MACA_ROOT=/opt/maca-3.0.0

# Install mcTracer (required for `tracesmith profile --mctracer`)
# mcTracer is included in MACA SDK

macOS

# Install Xcode Command Line Tools (includes Metal framework)
xcode-select --install

# Install CMake via Homebrew
brew install cmake

# Verify Metal support
xcrun metal --version

# Note: Instruments (xctrace) is included with Xcode
# Used for `tracesmith profile --xctrace`

Windows

# Install Visual Studio 2019+ with C++ workload
# Install CMake: https://cmake.org/download/

# For CUDA support:
# Install CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
# Install Nsight Systems: https://developer.nvidia.com/nsight-systems

Version Requirements

| Dependency | Minimum Version | Recommended | Notes |
| --- | --- | --- | --- |
| CMake | 3.16 | 3.22+ | Build system |
| GCC | 8.0 | 11+ | C++17 support |
| Clang | 8.0 | 14+ | C++17 support |
| CUDA Toolkit | 11.0 | 12.x | CUPTI included |
| Nsight Systems | 2022.x | 2024.x | GPU profiling |
| libunwind | 1.3 | 1.6+ | Stack capture |
| Python | 3.7 | 3.10+ | Python bindings |

Quick Start

Installation

Python (Recommended)

# Install from PyPI (auto-detects GPU platform)
pip install tracesmith

# Platform-specific installation:
# CUDA/CUPTI (NVIDIA GPU)
TRACESMITH_CUDA=1 pip install tracesmith

# ROCm (AMD GPU)
TRACESMITH_ROCM=1 pip install tracesmith

# Metal (Apple GPU)
TRACESMITH_METAL=1 pip install tracesmith

# Verify installation
python -c "import tracesmith; print(tracesmith.__version__, tracesmith.detect_platform())"

# Or install from source
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
TRACESMITH_CUDA=1 pip install .  # with CUDA support

C++ from Source

Prerequisites:

  • CMake 3.16+
  • C++17 compatible compiler (GCC 8+, Clang 8+, MSVC 2019+)
  • Python 3.7+ (for Python bindings)
  • (Optional) NVIDIA CUDA Toolkit with CUPTI
  • (Optional) Nsight Systems for --nsys profiling
  • (Optional) Xcode Command Line Tools (for Metal on macOS)

Basic Build:

git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
mkdir build && cd build
cmake ..
cmake --build . -j$(nproc)

CMake Build Options:

| Option | Default | Description |
| --- | --- | --- |
| CMAKE_BUILD_TYPE | Debug | Build type: Debug, Release, RelWithDebInfo |
| TRACESMITH_ENABLE_CUDA | OFF | Enable NVIDIA CUDA/CUPTI support |
| TRACESMITH_ENABLE_ROCM | OFF | Enable AMD ROCm support |
| TRACESMITH_ENABLE_METAL | OFF | Enable Apple Metal support |
| TRACESMITH_ENABLE_MACA | OFF | Enable MetaX MACA/MCPTI support |
| TRACESMITH_BUILD_PYTHON | OFF | Build Python bindings (pybind11) |
| TRACESMITH_BUILD_TESTS | ON | Build unit tests (Google Test) |
| TRACESMITH_BUILD_EXAMPLES | ON | Build example programs |
| TRACESMITH_BUILD_CLI | ON | Build command-line interface |
| TRACESMITH_USE_PERFETTO_SDK | OFF | Use Perfetto SDK for protobuf export |

Build Examples:

# Release build with Metal support (macOS has no nproc by default, so use sysctl)
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRACESMITH_ENABLE_METAL=ON
cmake --build . -j$(sysctl -n hw.ncpu)

# CUDA build (Linux/Windows with NVIDIA GPU)
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRACESMITH_ENABLE_CUDA=ON
cmake --build . -j$(nproc)

# Full build with all features (macOS shown, hence Metal and sysctl instead of nproc)
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DTRACESMITH_ENABLE_METAL=ON \
         -DTRACESMITH_BUILD_PYTHON=ON \
         -DTRACESMITH_USE_PERFETTO_SDK=ON
cmake --build . -j$(sysctl -n hw.ncpu)

# Minimal build (library only, no tests/examples/CLI)
cmake .. -DTRACESMITH_BUILD_TESTS=OFF \
         -DTRACESMITH_BUILD_EXAMPLES=OFF \
         -DTRACESMITH_BUILD_CLI=OFF
cmake --build . -j$(nproc)

Install:

# Install to default location (/usr/local)
sudo cmake --install .

# Install to custom prefix
cmake --install . --prefix /path/to/install

# Installed files:
#   bin/tracesmith          - CLI executable
#   include/tracesmith/     - Header files
#   lib/libtracesmith-*.a   - Static libraries

Run Tests:

# Run all tests
ctest --output-on-failure

# Run specific test
./bin/tracesmith_tests --gtest_filter="RingBuffer*"

Docker

docker build -t tracesmith .
docker run -it tracesmith

Usage

Python API (Recommended)

import tracesmith as ts

# Create profiler for your GPU platform
profiler = ts.create_profiler(ts.PlatformType.CUDA)  # or ROCm, Metal

# Configure and capture
config = ts.ProfilerConfig()
config.capture_kernels = True
config.capture_memcpy = True
profiler.initialize(config)

profiler.start_capture()
# ... your GPU code here (CUDA kernels, etc.) ...
profiler.stop_capture()

# Get captured events
events = profiler.get_events()
print(f"Captured {len(events)} events")

# Build timeline and analyze
timeline = ts.build_timeline(events)
print(f"GPU Utilization: {timeline.gpu_utilization * 100:.1f}%")
print(f"Max Concurrent Ops: {timeline.max_concurrent_ops}")

# Export to Perfetto (chrome://tracing or ui.perfetto.dev)
ts.export_perfetto(events, "trace.json")

# Save to TraceSmith binary format
writer = ts.SBTWriter("trace.sbt")
writer.write_events(events)
writer.finalize()
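
To make the two timeline statistics above concrete, here is a rough sketch of how a utilization figure and a maximum concurrency can be derived from start/end intervals with a sweep over sorted endpoints. It reflects an assumption about what these metrics mean, not TraceSmith's implementation; `Span` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start_ns: int
    end_ns: int

def summarize(spans):
    """Return (utilization, max_concurrent) for a list of spans."""
    if not spans:
        return 0.0, 0
    # Sweep line: +1 at every start, -1 at every end.
    points = []
    for s in spans:
        points.append((s.start_ns, 1))
        points.append((s.end_ns, -1))
    # At equal timestamps, ends (-1) sort before starts (+1).
    points.sort(key=lambda p: (p[0], p[1]))

    active = 0
    max_active = 0
    busy_ns = 0
    prev_t = points[0][0]
    for t, delta in points:
        if active > 0:
            busy_ns += t - prev_t  # time during which at least one op ran
        active += delta
        max_active = max(max_active, active)
        prev_t = t

    total_ns = points[-1][0] - points[0][0]
    utilization = busy_ns / total_ns if total_ns else 0.0
    return utilization, max_active

# Two overlapping operations followed by an idle gap and a third operation
print(summarize([Span(0, 10), Span(5, 15), Span(30, 40)]))
```

Here utilization is busy time divided by the span from the first start to the last end, so idle gaps between kernels lower it.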

Real-time Tracing (v0.3.0+)

import tracesmith as ts

# Create tracing session with custom config
config = ts.TracingConfig()
config.buffer_size_kb = 8192  # 8MB buffer
config.enable_counter_tracks = True

session = ts.TracingSession()
session.start(config)

# Emit events from your application (thread-safe!)
event = ts.TraceEvent()
event.type = ts.EventType.KernelLaunch
event.name = "my_kernel"
event.thread_id = 12345
event.metadata["grid_dim"] = "256x256x1"
session.emit(event)

# Emit counter metrics
session.emit_counter("GPU Memory (MB)", 1024.5)
session.emit_counter("SM Occupancy %", 85.2)

# Stop and export
session.stop()
session.export_to_file("realtime_trace.perfetto-trace")

# Get statistics
stats = session.get_statistics()
print(f"Duration: {stats.duration_ms():.1f}ms")
print(f"Events: {stats.events_emitted} emitted, {stats.events_dropped} dropped")
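
The `events_dropped` counter hints at how a bounded buffer behaves when the consumer falls behind. Below is a minimal single-producer/single-consumer ring buffer sketch showing the index protocol only. It is not TraceSmith's buffer: `SPSCRingBuffer` is a hypothetical class, the drop-on-full policy is an assumption, and plain Python cannot demonstrate real lock-free behaviour (a native implementation would use atomic loads and stores for the two indices).

```python
class SPSCRingBuffer:
    def __init__(self, capacity):
        # One slot stays empty so that head == tail always means "empty".
        self._slots = [None] * (capacity + 1)
        self._head = 0   # next write position; modified only by the producer
        self._tail = 0   # next read position; modified only by the consumer
        self.dropped = 0

    def try_push(self, item):
        next_head = (self._head + 1) % len(self._slots)
        if next_head == self._tail:   # full: drop instead of blocking
            self.dropped += 1
            return False
        self._slots[self._head] = item
        self._head = next_head        # publish only after the slot is written
        return True

    def try_pop(self):
        if self._tail == self._head:  # empty
            return None
        item = self._slots[self._tail]
        self._tail = (self._tail + 1) % len(self._slots)
        return item

buf = SPSCRingBuffer(capacity=2)
results = [buf.try_push(i) for i in range(3)]  # third push finds the buffer full
print(results, "dropped:", buf.dropped)
```

Because each index has exactly one writer, producer and consumer never contend on the same variable, which is what makes the SPSC design cheap.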

Command Line Interface

TraceSmith provides a comprehensive CLI with ASCII banner and colored output:

████████╗██████╗  █████╗  ██████╗███████╗███████╗███╗   ███╗██╗████████╗██╗  ██╗
╚══██╔══╝██╔══██╗██╔══██╗██╔════╝██╔════╝██╔════╝████╗ ████║██║╚══██╔══╝██║  ██║
   ██║   ██████╔╝███████║██║     █████╗  ███████╗██╔████╔██║██║   ██║   ███████║
   ██║   ██╔══██╗██╔══██║██║     ██╔══╝  ╚════██║██║╚██╔╝██║██║   ██║   ██╔══██║
   ██║   ██║  ██║██║  ██║╚██████╗███████╗███████║██║ ╚═╝ ██║██║   ██║   ██║  ██║
   ╚═╝   ╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚══════╝╚══════╝╚═╝     ╚═╝╚═╝   ╚═╝   ╚═╝  ╚═╝

                    GPU Profiling & Replay System v0.8.3

Available Commands:

| Command | Description |
| --- | --- |
| profile | Profile a command (record + execute in one step) |
| record | Record GPU events to a trace file |
| view | View contents of a trace file |
| info | Show detailed information about a trace file |
| export | Export trace to Perfetto or other formats |
| analyze | Analyze trace for performance insights |
| replay | Replay a captured trace |
| benchmark | Run 10K GPU call stacks benchmark |
| devices | List available GPU devices (CUDA, Metal, MACA, ROCm) |
| version | Show version information |
| help | Show help message |

IMPORTANT: GPU Profiling API Limitations

CUPTI and MCPTI can only profile the process that calls them, not its child processes. To profile a launched command on CUDA or MACA platforms, you MUST therefore go through a system-level profiler:

| Platform | Required Option | Tool | Min Version |
| --- | --- | --- | --- |
| NVIDIA CUDA | --nsys | Nsight Systems | 2022.x (2024.x recommended) |
| MetaX MACA | --mctracer | mcTracer | MACA SDK 3.0+ |
| Apple Metal | --xctrace | Instruments | Xcode 14+ |

The record command is not supported for CUDA/MACA platforms. Use profile --nsys or profile --mctracer instead.
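
If you launch profiling runs from a script, the platform-to-option mapping can be captured in a small helper. This is only a sketch: `build_profile_command` and the dictionary keys are hypothetical, while the flags and the `tracesmith-cli profile <option> -- <command>` shape are the ones used in this README.

```python
import shlex

# Flags as listed in this README; the keys are illustrative labels.
PROFILER_FLAG = {
    "cuda": "--nsys",
    "maca": "--mctracer",
    "metal": "--xctrace",
}

def build_profile_command(platform, target, cli="tracesmith-cli"):
    """Return the argv list for profiling `target` on `platform`."""
    try:
        flag = PROFILER_FLAG[platform]
    except KeyError:
        raise ValueError(f"no system-level profiler known for {platform!r}")
    return [cli, "profile", flag, "--", *target]

cmd = build_profile_command("cuda", ["python", "train.py"])
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the CLI is installed.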

⚠️ Important Notes:

  1. Nsight Systems Version: Use version 2024.x or later. Older versions (2021.x) may have compatibility issues with newer CUDA drivers.

    # Check your nsys version
    nsys --version
    
    # Install latest version (Ubuntu with CUDA 12.x)
    sudo apt install cuda-nsight-systems-12-8
    
  2. CUPTI Conflict: When using --nsys, your program should not use CUPTI directly, because nsys itself subscribes to CUPTI and CUPTI accepts only one subscriber at a time. If your program uses CUPTI, run it without --nsys:

    # For programs using CUPTI internally
    ./bin/cupti_example  # Run directly, not with --nsys
    
  3. Root/Admin Not Required: Modern nsys (2024.x) does not require root privileges for basic profiling.
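
The version guidance above can be checked from a script. The sketch below assumes that `nsys --version` prints a version string containing a `YYYY.N` release number; the function names and the three-way classification are hypothetical helpers, not part of TraceSmith.

```python
import re
import shutil
import subprocess

def parse_nsys_year(version_text):
    """Extract the release year from text such as '... version 2024.5.1 ...'."""
    match = re.search(r"\b(20\d{2})\.(\d+)", version_text)
    return int(match.group(1)) if match else None

def installed_nsys_year():
    """Return the release year of the nsys found on PATH, or None."""
    if shutil.which("nsys") is None:
        return None
    out = subprocess.run(["nsys", "--version"], capture_output=True, text=True)
    return parse_nsys_year(out.stdout)

def classify(year):
    """Map a release year onto the guidance in this README."""
    if year is None:
        return "missing"
    if year < 2022:
        return "too old"
    return "recommended" if year >= 2024 else "supported"

if __name__ == "__main__":
    print("nsys:", classify(installed_nsys_year()))
```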

C++ CLI Examples:

# NVIDIA CUDA - Use --nsys (REQUIRED for GPU profiling)
./bin/tracesmith profile --nsys -- python train.py
./bin/tracesmith profile --nsys --perfetto -- ./my_cuda_app
./bin/tracesmith profile --nsys -o model.sbt -- python train.py --epochs 10

# MetaX MACA - Use --mctracer (REQUIRED for GPU profiling)
./bin/tracesmith profile --mctracer -- ./my_maca_app
./bin/tracesmith profile --mctracer --perfetto -- python train.py

# Apple Metal - Use --xctrace for real Metal GPU events
./bin/tracesmith profile --xctrace -- python train.py
./bin/tracesmith profile --xctrace --keep-trace -- python mps_benchmark.py
./bin/tracesmith profile --xctrace --xctrace-template "GPU Driver" -- ./app

# View trace with statistics
./bin/tracesmith view trace.sbt --stats

# Show trace file info
./bin/tracesmith info trace.sbt

# Export to Perfetto (view at ui.perfetto.dev)
./bin/tracesmith export trace.sbt -f perfetto

# Analyze performance
./bin/tracesmith analyze trace.sbt

# Replay trace (dry-run)
./bin/tracesmith replay trace.sbt --mode dry-run

# List available GPUs
./bin/tracesmith devices

# Disable colored output
./bin/tracesmith --no-color help

Python CLI Examples:

# NVIDIA CUDA - Use --nsys (REQUIRED for GPU profiling)
tracesmith-cli profile --nsys -- python train.py
tracesmith-cli profile --nsys --perfetto -- ./my_cuda_app
tracesmith-cli profile --nsys -o model.sbt -- python train.py --epochs 10

# MetaX MACA - Use --mctracer (REQUIRED for GPU profiling)
tracesmith-cli profile --mctracer -- ./my_maca_app
tracesmith-cli profile --mctracer --perfetto -- python train.py

# Apple Metal - Use --xctrace for real Metal GPU events
tracesmith-cli profile --xctrace -- python train.py
tracesmith-cli profile --xctrace --keep-trace -- python mps_benchmark.py

# Other commands
tracesmith-cli info
tracesmith-cli devices

# View trace contents
tracesmith-cli view trace.sbt --stats

# Export to Perfetto
tracesmith-cli export trace.sbt -o trace.json

# Analyze trace
tracesmith-cli analyze trace.sbt

# Replay trace
tracesmith-cli replay trace.sbt --mode dry-run

NVIDIA Nsight Systems (nsys) Integration

TraceSmith integrates with NVIDIA Nsight Systems for system-wide GPU profiling, providing comprehensive CUDA kernel and memory operation tracing.

Usage:

# Profile with nsys (system-wide profiling)
./bin/tracesmith profile --nsys -- python train.py
./bin/tracesmith profile --nsys -o trace.sbt -- ./my_cuda_app

# With custom nsys options
./bin/tracesmith profile --nsys --nsys-args="-t cuda,nvtx" -- python benchmark.py

Features:

  • System-wide CUDA kernel tracing
  • Memory transfer profiling (H2D, D2H, D2D)
  • NVTX annotation support
  • Multi-GPU profiling
  • Automatic .nsys-rep to TraceSmith format conversion

macOS Metal GPU Profiling with xctrace

On macOS, TraceSmith integrates with Apple Instruments (xctrace) for capturing real Metal GPU events. This provides accurate GPU timing and event capture that the Metal Frame Capture API cannot achieve programmatically.

Why use xctrace?

  • Captures real Metal GPU execution events (kernel launches, command buffer submissions)
  • Accurate GPU timing from hardware counters
  • Works with any Metal application (PyTorch MPS, TensorFlow Metal, custom Metal apps)

Usage:

# Python CLI (recommended - includes event parsing)
tracesmith-cli profile --xctrace -- python train.py
tracesmith-cli profile --xctrace --keep-trace -o model.sbt -- python inference.py
tracesmith-cli profile --xctrace --perfetto -- python benchmark.py

# C++ CLI (calls xctrace, outputs raw .trace file)
./bin/tracesmith profile --xctrace -- python train.py
./bin/tracesmith profile --xctrace --xctrace-template "GPU Driver" -- ./app

Python API:

from tracesmith.xctrace import XCTraceProfiler, profile_with_xctrace

# Simple usage
events, trace_file = profile_with_xctrace(
    ["python", "train.py"],
    duration=60,
    template="Metal System Trace"
)

# Full control
profiler = XCTraceProfiler()
events = profiler.profile_command(["python", "train.py"])
profiler.export_perfetto("metal_trace.json")

Available Templates:

  • Metal System Trace - Most detailed Metal profiling (default)
  • GPU Driver - Driver-level analysis
  • Game Performance - Frame rate and GPU time
  • Animation Hitches - Animation performance

MetaX GPU Profiling with MCPTI

TraceSmith supports MetaX GPUs (C500, C550, etc.) using the MCPTI (MACA Profiling Tools Interface), which provides an API compatible with NVIDIA CUPTI.

📖 Full documentation: See docs/MACA_PROFILING.md for detailed setup and usage guide.

Tested Hardware:

| GPU | Memory | Compute Units | Driver | Status |
| --- | --- | --- | --- | --- |
| MetaX C500 | 64 GB | 104 CUs | 3.0.11 | ✅ Verified |
| MetaX C550 | - | - | - | 🔜 Planned |

Benchmark Results (MetaX C500, MACA 3.0.0):

| Test | Data Size | Bandwidth |
| --- | --- | --- |
| Host → Device | 256 MB | 10.1 GB/s |
| Device → Host | 256 MB | 9.9 GB/s |
| Device → Device | 256 MB | 608 GB/s |
| MCPTI Overhead | - | < 2% (negligible) |

CLI Device Detection:

$ ./bin/tracesmith devices

MetaX MACA:
✓ MACA available
  Devices: 1
  Driver:  3000

  Device 0: MetaX C500
    Vendor:     MetaX
    Compute:    10.0
    Memory:     63.62 GB
    SMs:        104
    Clock:      1600 MHz

System-Wide Profiling with mcTracer:

TraceSmith integrates with MetaX's mcTracer tool for comprehensive system-wide GPU profiling (similar to NVIDIA nsys):

# Profile with mcTracer
./bin/tracesmith profile --mctracer -- ./my_maca_app
./bin/tracesmith profile --mctracer --perfetto -- python train.py

# Output is Perfetto-compatible JSON
# View at: https://ui.perfetto.dev

mcTracer captures:

  • All MACA API calls (mcInit, mcMalloc, mcMemcpy, etc.)
  • GPU memory operations with bandwidth
  • Stream operations and synchronization
  • CPU-GPU launch flow arrows

Build with MetaX support:

# On MetaX system (MACA SDK auto-detected at /opt/maca-3.0.0)
cmake -DTRACESMITH_ENABLE_MACA=ON ..
make -j4

# Run examples
./bin/metax_example      # Basic profiling demo
./bin/metax_benchmark    # Memory bandwidth test

C++ API:

#include <iostream>
#include <vector>

#include <tracesmith/tracesmith.hpp>

int main() {
    // Check MetaX GPU availability
    if (tracesmith::isMACAAvailable()) {
        std::cout << "MetaX devices: " << tracesmith::getMACADeviceCount() << std::endl;
    }

    // Create MCPTI profiler
    auto profiler = tracesmith::createProfiler(tracesmith::PlatformType::MACA);

    // Configure
    tracesmith::ProfilerConfig config;
    config.capture_kernels = true;
    config.capture_memcpy = true;
    profiler->initialize(config);

    // Capture events
    profiler->startCapture();
    // ... GPU code using MACA runtime ...
    profiler->stopCapture();

    // Get events
    std::vector<tracesmith::TraceEvent> events;
    profiler->getEvents(events);

    // Export to Perfetto
    tracesmith::PerfettoExporter exporter;
    exporter.exportToFile(events, "metax_trace.json");

    return 0;
}

Python API:

import tracesmith as ts

# Check MetaX availability
if ts.is_maca_available():
    print(f"MetaX devices: {ts.get_maca_device_count()}")
    
    # Create profiler
    profiler = ts.create_profiler(ts.PlatformType.MACA)
    profiler.initialize(ts.ProfilerConfig())
    
    profiler.start_capture()
    # ... GPU code ...
    profiler.stop_capture()
    
    events = profiler.get_events()
    
    # Save trace
    writer = ts.SBTWriter("metax_trace.sbt")
    writer.write_events(events)
    writer.finalize()

MCPTI Captured Events:

| Event Type | Description |
| --- | --- |
| KernelLaunch/Complete | Kernel execution timing with grid/block dimensions |
| MemcpyH2D/D2H/D2D | Memory transfers with bandwidth calculation |
| MemsetDevice | Memory initialization operations |
| StreamSync/DeviceSync | Synchronization events with duration |
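
A bandwidth figure for a memory transfer is just bytes moved divided by elapsed time. The sketch below shows the arithmetic; `bandwidth_gb_per_s` is a hypothetical helper, the 26.5 ms duration is a made-up value, and whether TraceSmith reports decimal (1e9) or binary (2^30) gigabytes is not stated in this README, so decimal is assumed.

```python
def bandwidth_gb_per_s(num_bytes, duration_ns):
    """Bytes per nanosecond is numerically equal to GB/s when 1 GB = 1e9 bytes."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return num_bytes / duration_ns

size_bytes = 256 * 1024 * 1024   # a 256 MiB transfer
duration_ns = 26_500_000         # hypothetical measured duration: 26.5 ms
bw = bandwidth_gb_per_s(size_bytes, duration_ns)
print(f"{bw:.1f} GB/s")
```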

Output:

  • .sbt - TraceSmith Binary Trace format
  • .json - Perfetto JSON (view at https://ui.perfetto.dev)
  • Device info: name, memory, compute capability, clock rates

Python Examples with Cross-Platform Device Support

All Python examples support multiple GPU platforms with automatic device detection:

# Run examples on specific device
python examples/basic_usage.py --device cuda    # NVIDIA GPU
python examples/basic_usage.py --device mps     # Apple Silicon
python examples/basic_usage.py --device rocm    # AMD GPU
python examples/basic_usage.py --device cpu     # CPU fallback

# Run all examples with test runner
python examples/run_tests.py                    # Best available device
python examples/run_tests.py --all-devices      # Test on all devices
python examples/run_tests.py --test pytorch     # Run specific test
python examples/run_tests.py --list             # List available tests

Using DeviceManager for cross-platform code:

from examples.device_utils import DeviceManager, benchmark

# Auto-detect best device
dm = DeviceManager()  # or DeviceManager(prefer_device="mps")
print(f"Using: {dm.get_device_name()}")  # Apple Silicon GPU (mps:0, 25.2 GB)

# Create tensors on device
x = dm.randn(1000, 1000)
y = dm.randn(1000, 1000)

# Benchmark with proper synchronization
results = benchmark(lambda: x @ y, warmup=3, iterations=10, dm=dm)
print(f"Mean: {results['mean_ms']:.2f} ms")

# Device-agnostic operations
dm.synchronize()
print(f"Memory: {dm.memory_allocated() / 1024**2:.1f} MB")
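
Synchronization matters in such a benchmark because GPU work is launched asynchronously: without it, the timer only measures how long the launch took. The following is a minimal sketch of that pattern, not the code in examples/device_utils; `timed_runs` and its `synchronize` parameter are hypothetical, and the default no-op synchronize makes it runnable on a CPU-only machine.

```python
import statistics
import time

def timed_runs(fn, warmup=3, iterations=10, synchronize=lambda: None):
    """Time fn() over several iterations, synchronizing before each stop."""
    for _ in range(warmup):
        fn()
    synchronize()  # drain warm-up work before timing starts
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        synchronize()  # wait for asynchronous device work to finish
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }

stats = timed_runs(lambda: sum(range(10_000)), warmup=1, iterations=5)
print(f"Mean: {stats['mean_ms']:.3f} ms")
```

With PyTorch on CUDA, for example, `torch.cuda.synchronize` could be passed as the `synchronize` argument.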

C++ API

#include <vector>

#include <tracesmith/tracesmith.hpp>

using namespace tracesmith;

int main() {
    // Create profiler
    auto profiler = createProfiler(PlatformType::CUDA);
    
    // Configure
    ProfilerConfig config;
    config.buffer_size = 1000000;
    profiler->initialize(config);
    
    // Start capture
    profiler->startCapture();
    
    // ... run GPU code ...
    
    // Stop capture
    profiler->stopCapture();
    
    // Get events
    std::vector<TraceEvent> events;
    profiler->getEvents(events);
    
    // Write to file
    SBTWriter writer("trace.sbt");
    writer.writeEvents(events);
    writer.finalize();
    
    return 0;
}

Timeline Analysis (Phase 3)

#include <iostream>
#include <vector>

#include <tracesmith/tracesmith.hpp>
#include <tracesmith/state/timeline_builder.hpp>
#include <tracesmith/state/timeline_viewer.hpp>
#include <tracesmith/state/perfetto_exporter.hpp>

using namespace tracesmith;

int main() {
    // Capture events as in the C++ API example above;
    // captureEvents() stands in for that capture code
    std::vector<TraceEvent> events = captureEvents();
    
    // Build timeline
    TimelineBuilder builder;
    builder.addEvents(events);
    Timeline timeline = builder.build();
    
    // Print ASCII visualization
    TimelineViewer viewer;
    std::cout << viewer.render(timeline);
    
    // Export to Perfetto with enhanced GPU tracks
    PerfettoExporter exporter;
    exporter.setEnableGPUTracks(true);       // GPU-specific tracks
    exporter.setEnableFlowEvents(true);      // Dependency visualization
    exporter.exportToFile(events, "trace.json");
    // Open https://ui.perfetto.dev and load trace.json
    
    // Get statistics
    std::cout << "GPU Utilization: " << timeline.gpu_utilization << std::endl;
    std::cout << "Max Concurrent Ops: " << timeline.max_concurrent_ops << std::endl;
    
    return 0;
}

SBT File Format

TraceSmith uses a custom binary format (SBT - TraceSmith Binary Trace) optimized for:

  • Compactness: Variable-length integer encoding, string interning
  • Streaming: Support for streaming writes during capture
  • Fast Access: Indexed sections for random access
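
To illustrate why these techniques shrink a trace, here is a small sketch combining variable-length integers, delta timestamps, and string interning. It is not the SBT encoder: the LEB128-style varint and the record layout are assumptions made for illustration, and all function names are hypothetical.

```python
def encode_varint(value):
    """Unsigned LEB128-style varint: 7 payload bits per byte, high bit = more."""
    if value < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data, pos=0):
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_events(events):
    """events: list of (timestamp_ns, name), sorted by timestamp."""
    string_ids, table, payload = {}, [], bytearray()
    prev_ts = 0
    for ts, name in events:
        if name not in string_ids:              # string interning
            string_ids[name] = len(table)
            table.append(name)
        payload += encode_varint(ts - prev_ts)  # delta timestamp
        payload += encode_varint(string_ids[name])
        prev_ts = ts
    return bytes(payload), table

events = [(1_000_000, "vectorAdd"), (1_000_300, "vectorAdd"), (1_000_450, "relu")]
payload, table = encode_events(events)
print(len(payload), "payload bytes; string table:", table)
```

Small deltas fit in one or two bytes where an absolute 64-bit timestamp would take eight, and each repeated kernel name costs a small integer instead of the full string.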

File structure:

┌──────────────────┐
│ Header (64 bytes)│ Magic, version, offsets
├──────────────────┤
│ Metadata Section │ Application info, timestamps
├──────────────────┤
│ Device Info      │ GPU device details
├──────────────────┤
│ Events Section   │ Trace events (variable length)
├──────────────────┤
│ String Table     │ Deduplicated strings
├──────────────────┤
│ EOF Marker       │
└──────────────────┘

Development Roadmap

Phase 1: MVP ✅

  • Project structure and build system
  • Core data structures (TraceEvent, DeviceInfo)
  • SBT binary trace format
  • Lock-free ring buffer
  • Platform abstraction interface
  • CLI tools (record, view, info)

Phase 2: Instruction-Level Call Stack ✅

  • Cross-platform stack capture (macOS/Linux/Windows)
  • Symbol resolution with demangling
  • GPU kernel call chain capture
  • Instruction stream builder
  • Dependency analysis

Phase 3: GPU State Machine & Timeline Builder ✅

  • GPU state machine with stream tracking
  • Timeline builder with span generation
  • Perfetto export (chrome://tracing format)
  • ASCII timeline visualization
  • Concurrent operation analysis

Phase 4: Replay Engine ✅

  • Replay engine with full orchestration
  • Stream scheduler with dependency tracking
  • Determinism checker with validation
  • Partial replay (time/operation ranges)
  • Dry-run mode for analysis

Phase 5: Production Release ✅

  • Python bindings (pybind11)
  • pip-installable package
  • Comprehensive documentation
  • Docker support
  • Example programs
  • TraceSmith Studio GUI (future)
  • Homebrew formula (future)

Phase 6: Advanced Integrations ✅ (v0.4.0)

  • Perfetto SDK Integration (85% smaller traces)
  • Real-time TracingSession with lock-free buffers
  • Kineto-compatible schema (thread_id, metadata, FlowInfo)
  • Memory profiling (MemoryEvent, MemoryCategory)
  • Counter tracks (CounterEvent)
  • LLVM XRay import support
  • eBPF types for Linux kernel tracing

Contributing

Contributions are welcome! Please read our Contributing Guide before submitting PRs.

License

TraceSmith is licensed under the Apache License 2.0. See LICENSE for details.

Benchmark Results

Core Feature: 10,000+ GPU Instruction-Level Call Stacks

Tested on NVIDIA GeForce RTX 4090 D (24GB, CUDA 12.8, Driver 570.124.06)

╔═══════════════════════════════════════════════════════════════════════╗
║  Non-intrusive capture of 10,000+ instruction-level GPU call stacks   ║
║  ✅ VERIFIED!                                                         ║
╚═══════════════════════════════════════════════════════════════════════╝
| Metric | Result | Note |
| --- | --- | --- |
| CUDA Kernels Launched | 10,000 | Real __global__ kernels |
| GPU Events (CUPTI) | 20,011 | Instruction-level events |
| Kernel Launches | 10,000 | Each kernel captured |
| Kernel Completes | 10,000 | Full lifecycle |
| Host Call Stacks | 10,000 | 7 frames/stack avg |
| Events with Stacks | 19,989 (99.9%) | GPU + Host merged |
| Total Time | 107 ms | Non-intrusive |
| Throughput | 93,457 kernels/sec | High performance |
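
The derived figures in this table follow from the raw counts. The short calculation below reproduces them from the numbers reported above; the per-kernel average is an extra figure computed here, not one reported by the benchmark.

```python
kernels = 10_000
total_time_s = 0.107          # 107 ms
gpu_events = 20_011
events_with_stacks = 19_989

throughput = kernels / total_time_s                 # kernels per second
stack_coverage = events_with_stacks / gpu_events    # fraction of events with a stack
per_kernel_us = total_time_s / kernels * 1e6        # average wall time per kernel

print(f"{int(throughput):,} kernels/sec")
print(f"{stack_coverage:.1%} of events carry a host call stack")
print(f"{per_kernel_us:.1f} µs per kernel on average")
```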

Verified Capabilities:

  • ✅ Real CUDA kernels executed on GPU
  • ✅ CUPTI captured instruction-level GPU events
  • ✅ Host call stacks attached to GPU events
  • ✅ Non-intrusive profiling

How to Run the Benchmark

# On NVIDIA GPU server with CUDA
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
mkdir build && cd build

# Build with CUDA support
cmake .. -DTRACESMITH_ENABLE_CUDA=ON -DTRACESMITH_BUILD_EXAMPLES=ON
make benchmark_10k_stacks -j8

# Run the benchmark
./bin/benchmark_10k_stacks

CUPTI Real GPU Profiling Results

| Kernel | Duration (ns) | Duration (µs) | Duration (ms) |
| --- | --- | --- | --- |
| vectorAdd (1M elements) | 5,313 | 5.31 | 0.0053 |
| matrixMul (512×512) | 66,912 | 66.91 | 0.0669 |
| relu (1M elements) | 4,704 | 4.70 | 0.0047 |
| TOTAL | 76,929 | 76.93 | 0.0769 |

Real GPU Memory Profiling Results

| Phase | Operation | Memory |
| --- | --- | --- |
| Parameters | 5× cudaMalloc | 31 MB |
| Activations | 8× cudaMalloc | 72 MB |
| Gradients | 5× cudaMalloc | 31 MB |
| Workspace | 3× cudaMalloc | 96 MB |
| Total Allocated | 21 operations | 230 MB |
| Total Freed | 16 cudaFree | 199 MB |
| Test Duration | - | 5 ms |
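
In this run, 230 MB allocated minus 199 MB freed leaves 31 MB outstanding, which is the kind of difference a leak detector reports. The sketch below shows one simple way to track live allocations and peak usage. It is not TraceSmith's MemoryProfiler; `AllocationTracker` and its methods are hypothetical, and the addresses and sizes are made up.

```python
class AllocationTracker:
    def __init__(self):
        self._live = {}    # pointer -> size in bytes
        self.current = 0   # bytes currently allocated
        self.peak = 0      # highest value `current` has reached

    def on_alloc(self, ptr, size):
        self._live[ptr] = size
        self.current += size
        self.peak = max(self.peak, self.current)

    def on_free(self, ptr):
        size = self._live.pop(ptr, None)
        if size is None:
            return False   # unknown pointer or double free
        self.current -= size
        return True

    def leaks(self):
        """Allocations that were never freed."""
        return dict(self._live)

MB = 1024 * 1024
tracker = AllocationTracker()
tracker.on_alloc(0x1000, 96 * MB)   # e.g. a workspace buffer
tracker.on_alloc(0x2000, 31 * MB)   # e.g. parameters
tracker.on_free(0x1000)
print(f"peak {tracker.peak // MB} MB, outstanding {tracker.current // MB} MB")
```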

Performance Characteristics

| Feature | Performance |
| --- | --- |
| GPU Event Capture | 93K+ kernels/sec |
| Ring Buffer Throughput | 10K+ events/sec |
| Event Collection Overhead | < 1% |
| SBT File Compression | ~3x vs JSON |
| Perfetto Protobuf | 85% smaller than JSON |
| Stack Capture (no symbols) | ~5 µs/stack |
| Stack Capture (with symbols) | ~13 µs/stack |

Test Categories

✅ RingBuffer Tests      (9/9)   - Lock-free SPSC buffer
✅ SBT Format Tests      (7/7)   - Binary trace format
✅ Types Tests           (12/12) - Core data structures
✅ Kineto Schema Tests   (7/7)   - PyTorch compatibility
✅ Kineto V2 Tests       (6/6)   - Memory & Counter events
✅ TracingSession Tests  (10/10) - Real-time tracing
✅ XRay Importer Tests   (5/5)   - LLVM XRay support
✅ BPF Types Tests       (6/6)   - eBPF integration
✅ FrameCapture Tests    (12/12) - RenderDoc-style capture
✅ MemoryProfiler Tests  (12/12) - GPU memory tracking
✅ CUPTI Profiler        (14/14) - Real GPU profiling

PyPI Package

# Basic installation
pip install tracesmith==0.8.3

# With CuPy for real GPU profiling in Python CLI (choose one):
pip install tracesmith[cuda12]    # CUDA 12.x
pip install tracesmith[cuda11]    # CUDA 11.x
pip install tracesmith[cuda118]   # CUDA 11.8 specific
pip install tracesmith[cuda120]   # CUDA 12.0 specific

# With visualization tools
pip install tracesmith[visualization]

# With PyTorch integration
pip install tracesmith[torch]

# All optional dependencies
pip install tracesmith[all]

Python CLI Real GPU Benchmark

With CuPy installed, you can run real GPU profiling from Python:

# Install CuPy first
pip install tracesmith[cuda12]

# Run real GPU benchmark
tracesmith-cli benchmark --real-gpu -n 10000

Tested on NVIDIA GPU Server (RTX 4090):

Feature Status
Core Types (69 exports)
CUPTIProfiler
MemoryProfiler
Frame Capture
Stack Capture
BPF Tracing ✅ (Linux)
CLI Tools

Testing Methodology

Feature Validation

TraceSmith provides a comprehensive validation example that tests all features from PLANNING.md:

# Build and run feature validation
cd build
cmake .. -DTRACESMITH_ENABLE_CUDA=ON -DTRACESMITH_BUILD_EXAMPLES=ON
make goal_validation_example
./bin/goal_validation_example

Benchmark Testing

The benchmark_10k_stacks example launches real CUDA kernels under CUPTI profiling:

// Real CUDA kernel executed on GPU
__global__ void benchmark_kernel(float* data, int n, int kernel_id) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2.0f + static_cast<float>(kernel_id);
    }
}

// Launches 10,000 real kernels with CUPTI profiling
// (setup of d_data, n, blocks, and threads is omitted from this excerpt)
for (int i = 0; i < 10000; ++i) {
    benchmark_kernel<<<blocks, threads>>>(d_data, n, i);
}

Platform-Specific Testing

| Platform | Profiler | Test Command |
| --- | --- | --- |
| NVIDIA CUDA | CUPTIProfiler | ./bin/cupti_example |
| Apple Metal | MetalProfiler | ./bin/metal_example |
| MetaX MACA | MCPTIProfiler | ./bin/metax_example |
| CPU Fallback | StackCapture | ./bin/stack_capture_example |

Version History

| Version | Date | Highlights |
| --- | --- | --- |
| v0.8.2 | 2025-12 | CLI Breaking Change - Enforce --nsys/--mctracer for CUDA/MACA, record command blocked, clearer API limitation messages |
| v0.8.1 | 2025-12 | nsys & MACA Enhancement - NVIDIA Nsight Systems integration, MetaX CLI device detection, MACA cluster module support |
| v0.8.1 | 2025-12 | mcTracer Integration - MetaX system-wide profiling, Enhanced MACA CLI, Cluster module support |
| v0.8.0 | 2025-12 | xctrace Integration - Apple Instruments, Cross-Platform Device Utils, Enhanced Examples |
| v0.7.1 | 2025-12 | Multi-GPU Phase 2 - TimeSync, NCCLTracker, ClockCorrelator, CommAnalysis |
| v0.7.0 | 2025-12 | Multi-GPU Cluster - GPUTopology, MultiGPUProfiler, GitHub Actions CI/CD |
| v0.6.9 | 2025-12 | Include reorganization - Directory structure matches src/ layout |
| v0.6.8 | 2025-12 | Enhanced CLI - ASCII banner, all commands, Python CLI |
| v0.6.7 | 2025-12 | Real GPU benchmark - 10K+ CUDA kernels with CUPTI |
| v0.6.5 | 2025-12 | StackCapture bindings, OverflowPolicy, detect_leaks |
| v0.6.2 | 2025-12 | PyPI release, Native extension packaging fix |
| v0.6.0 | 2025-12 | NVIDIA CUPTI integration, Full GPU testing |
| v0.5.0 | 2025-12 | RenderDoc-style frame capture, Resource tracking |
| v0.4.0 | 2025-12 | LLVM XRay, eBPF types, TracingSession, Counter tracks |
| v0.3.0 | 2025-12 | Real-time tracing, Counter events, Memory events |
| v0.2.0 | 2025-12 | Perfetto SDK (85% smaller traces), Kineto schema |
| v0.1.1 | 2025-11 | libunwind, Enhanced Perfetto export, Flow events |
| v0.1.0 | 2025-11 | Initial release: SBT format, Ring buffer, Replay |

Acknowledgments

TraceSmith draws inspiration from:

Contact
