Cross-platform GPU profiling and replay system
Open-source, cross-platform GPU Profiling & Replay System
Designed for AI compilers, deep learning frameworks, and GPU driver engineers
Features
- High-Performance Event Capture: Collect 10,000+ GPU instruction-level call stacks without interrupting execution
- Lock-Free Ring Buffer: Minimal overhead event collection using SPSC (Single Producer Single Consumer) design
- SBT Binary Trace Format: Compact, efficient binary format with string interning and delta timestamp encoding
- Multi-Platform Support: NVIDIA CUDA (via CUPTI/nsys), AMD ROCm, Apple Metal + Instruments (xctrace), MetaX MACA (via MCPTI)
- Multi-GPU & Multi-Stream: Full support for complex GPU topologies and async execution
- Multi-GPU Cluster Profiling (v0.7.x): GPUTopology discovery, TimeSync (NTP/PTP/CUDA), NCCLTracker for distributed training
- Perfetto SDK Integration: Native protobuf export (85% smaller files) + JSON fallback
- Real-time Tracing: Thread-safe TracingSession with lock-free buffers (9K+ events/sec)
- Kineto-Compatible Schema: PyTorch profiler compatibility with thread tracking, flexible metadata, and structured flows
- Memory & Counter Profiling: MemoryEvent and CounterEvent for detailed resource tracking
- LLVM XRay Support: Import compiler-instrumented function traces
- eBPF Types (Linux): Kernel-level GPU event tracing support
- RenderDoc-style Frame Capture: F12-trigger capture with resource state snapshots
- GPU Memory Profiler: Allocation tracking, leak detection, peak usage monitoring
- CLI Tools: Easy-to-use command-line interface for recording and viewing traces
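To make the SPSC ring buffer idea from the feature list concrete, here is a minimal sketch in Python. It is not TraceSmith's implementation (which is C++ and relies on atomics); the class name `SpscRingBuffer` and its methods are hypothetical. It only illustrates the index discipline that lets one producer and one consumer share a buffer without a lock: the producer alone advances the tail, and the consumer alone advances the head.

```python
from typing import Any, List, Optional


class SpscRingBuffer:
    """Bounded single-producer/single-consumer queue (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # One slot stays empty so that head == tail always means "empty".
        self._slots: List[Any] = [None] * (capacity + 1)
        self._head = 0    # next slot to read; written only by the consumer
        self._tail = 0    # next slot to write; written only by the producer
        self.dropped = 0  # events rejected because the buffer was full

    def try_push(self, item: Any) -> bool:
        """Producer side: store item, or count a drop if the buffer is full."""
        next_tail = (self._tail + 1) % len(self._slots)
        if next_tail == self._head:
            self.dropped += 1
            return False
        self._slots[self._tail] = item
        self._tail = next_tail  # publish only after the slot is filled
        return True

    def try_pop(self) -> Optional[Any]:
        """Consumer side: return the oldest item, or None if empty."""
        if self._head == self._tail:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return item


buf = SpscRingBuffer(capacity=2)
pushed = [buf.try_push(name) for name in ("kernel_a", "kernel_b", "kernel_c")]
drained = [buf.try_pop(), buf.try_pop(), buf.try_pop()]
print(pushed, drained, buf.dropped)
```

Dropping on overflow is only one possible policy; a real buffer might instead overwrite the oldest entry or block the producer.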
Architecture
Core Modules:
| Module | Description |
|---|---|
| Capture | GPU profiling backends (CUPTI, Metal, BPF, Memory) |
| Common | Core types, lock-free ring buffer, stack capture, XRay import |
| Format | SBT binary trace format (read/write) |
| State | GPU state machine, timeline builder, Perfetto exporters |
| Replay | Trace replay engine, stream scheduler, determinism checker |
| Cluster | Multi-GPU profiling, time sync, NCCL tracking (v0.7.x) |
Supported Backends:
| Platform | Backend | Status |
|---|---|---|
| NVIDIA | CUPTI SDK | ✅ Production |
| Apple | Metal API | ✅ Production |
| Apple | Instruments (xctrace) | ✅ Production |
| MetaX | MCPTI SDK | ✅ Production |
| AMD | ROCm | 🔜 Coming Soon |
| Linux | eBPF | ✅ Available |
Output Formats:
- .sbt - TraceSmith Binary Trace (compact, indexed)
- .json - Perfetto JSON (chrome://tracing)
- .perfetto - Perfetto Protobuf (85% smaller)
- .dot - Graphviz dependency graph
- ASCII Timeline - Terminal visualization
Prerequisites & Dependencies
Before installing TraceSmith, install the required dependencies for your platform.
Linux (Ubuntu/Debian)
# Core build tools
sudo apt update
sudo apt install -y cmake g++ make git
# Optional: libunwind for call stack capture
sudo apt install -y libunwind-dev
# Optional: Python development headers (for Python bindings)
sudo apt install -y python3-dev python3-pip
Linux with NVIDIA CUDA
# 1. Install CUDA Toolkit (includes CUPTI)
# Download from: https://developer.nvidia.com/cuda-downloads
# Or use package manager:
sudo apt install -y nvidia-cuda-toolkit
# 2. Install Nsight Systems (required for `tracesmith profile --nsys`)
# Option A: Install from CUDA repository (recommended)
sudo apt install -y cuda-nsight-systems-12-8 # Match your CUDA version
# Option B: Install standalone
sudo apt install -y nsight-systems
# 3. Verify installation
nvcc --version # CUDA compiler
nsys --version # Nsight Systems profiler (2024.x recommended)
ls /usr/local/cuda/include/cupti.h # CUPTI headers
# 4. Set environment variables (add to ~/.bashrc)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Linux with MetaX MACA
# Install MACA SDK (includes MCPTI)
# Contact MetaX for SDK access: https://www.metax-tech.com
# Set MACA_ROOT environment variable
export MACA_ROOT=/opt/maca-3.0.0
# Install mcTracer (required for `tracesmith profile --mctracer`)
# mcTracer is included in MACA SDK
macOS
# Install Xcode Command Line Tools (includes Metal framework)
xcode-select --install
# Install CMake via Homebrew
brew install cmake
# Verify Metal support
xcrun metal --version
# Note: Instruments (xctrace) is included with Xcode
# Used for `tracesmith profile --xctrace`
Windows
# Install Visual Studio 2019+ with C++ workload
# Install CMake: https://cmake.org/download/
# For CUDA support:
# Install CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
# Install Nsight Systems: https://developer.nvidia.com/nsight-systems
Version Requirements
| Dependency | Minimum Version | Recommended | Notes |
|---|---|---|---|
| CMake | 3.16 | 3.22+ | Build system |
| GCC | 8.0 | 11+ | C++17 support |
| Clang | 8.0 | 14+ | C++17 support |
| CUDA Toolkit | 11.0 | 12.x | CUPTI included |
| Nsight Systems | 2022.x | 2024.x | GPU profiling |
| libunwind | 1.3 | 1.6+ | Stack capture |
| Python | 3.7 | 3.10+ | Python bindings |
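The minimums in this table can be checked mechanically. The helper below is a small sketch, not part of TraceSmith: `parse_version` and `meets_minimum` are hypothetical names, and the input strings are assumed to be the output of commands such as `cmake --version`. Nsight Systems is left out because the table gives its minimum only as "2022.x".

```python
import re
from typing import Tuple

# Minimum versions copied from the table above.
MINIMUM_VERSIONS = {
    "cmake": (3, 16),
    "gcc": (8, 0),
    "clang": (8, 0),
    "cuda": (11, 0),
    "libunwind": (1, 3),
    "python": (3, 7),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted version number found in text."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(tool: str, version_output: str) -> bool:
    """Compare a tool's reported version against the table's minimum."""
    return parse_version(version_output) >= MINIMUM_VERSIONS[tool]


print(meets_minimum("cmake", "cmake version 3.22.1"))
print(meets_minimum("gcc", "gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0"))
```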
Quick Start
Installation
Python (Recommended)
# Install from PyPI (auto-detects GPU platform)
pip install tracesmith
# Platform-specific installation:
# CUDA/CUPTI (NVIDIA GPU)
TRACESMITH_CUDA=1 pip install tracesmith
# ROCm (AMD GPU)
TRACESMITH_ROCM=1 pip install tracesmith
# Metal (Apple GPU)
TRACESMITH_METAL=1 pip install tracesmith
# Verify installation
python -c "import tracesmith; print(tracesmith.__version__, tracesmith.detect_platform())"
# Or install from source
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
TRACESMITH_CUDA=1 pip install . # with CUDA support
C++ from Source
Prerequisites:
- CMake 3.16+
- C++17 compatible compiler (GCC 8+, Clang 8+, MSVC 2019+)
- Python 3.7+ (for Python bindings)
- (Optional) NVIDIA CUDA Toolkit with CUPTI
- (Optional) Nsight Systems for --nsys profiling
- (Optional) Xcode Command Line Tools (for Metal on macOS)
Basic Build:
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
mkdir build && cd build
cmake ..
cmake --build . -j$(nproc)
CMake Build Options:
| Option | Default | Description |
|---|---|---|
| CMAKE_BUILD_TYPE | Debug | Build type: Debug, Release, RelWithDebInfo |
| TRACESMITH_ENABLE_CUDA | OFF | Enable NVIDIA CUDA/CUPTI support |
| TRACESMITH_ENABLE_ROCM | OFF | Enable AMD ROCm support |
| TRACESMITH_ENABLE_METAL | OFF | Enable Apple Metal support |
| TRACESMITH_ENABLE_MACA | OFF | Enable MetaX MACA/MCPTI support |
| TRACESMITH_BUILD_PYTHON | OFF | Build Python bindings (pybind11) |
| TRACESMITH_BUILD_TESTS | ON | Build unit tests (Google Test) |
| TRACESMITH_BUILD_EXAMPLES | ON | Build example programs |
| TRACESMITH_BUILD_CLI | ON | Build command-line interface |
| TRACESMITH_USE_PERFETTO_SDK | OFF | Use Perfetto SDK for protobuf export |
Build Examples:
# Release build with Metal support (macOS)
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRACESMITH_ENABLE_METAL=ON
cmake --build . -j$(sysctl -n hw.ncpu) # stock macOS has no nproc
# CUDA build (Linux/Windows with NVIDIA GPU)
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRACESMITH_ENABLE_CUDA=ON
cmake --build . -j$(nproc)
# Full build with all features
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DTRACESMITH_ENABLE_METAL=ON \
-DTRACESMITH_BUILD_PYTHON=ON \
-DTRACESMITH_USE_PERFETTO_SDK=ON
cmake --build . -j$(sysctl -n hw.ncpu) # Metal implies macOS, which has no nproc by default
# Minimal build (library only, no tests/examples/CLI)
cmake .. -DTRACESMITH_BUILD_TESTS=OFF \
-DTRACESMITH_BUILD_EXAMPLES=OFF \
-DTRACESMITH_BUILD_CLI=OFF
cmake --build . -j$(nproc)
Install:
# Install to default location (/usr/local)
sudo cmake --install .
# Install to custom prefix
cmake --install . --prefix /path/to/install
# Installed files:
# bin/tracesmith - CLI executable
# include/tracesmith/ - Header files
# lib/libtracesmith-*.a - Static libraries
Run Tests:
# Run all tests
ctest --output-on-failure
# Run specific test
./bin/tracesmith_tests --gtest_filter="RingBuffer*"
Docker
docker build -t tracesmith .
docker run -it tracesmith
Usage
Python API (Recommended)
import tracesmith as ts
# Create profiler for your GPU platform
profiler = ts.create_profiler(ts.PlatformType.CUDA) # or ROCm, Metal
# Configure and capture
config = ts.ProfilerConfig()
config.capture_kernels = True
config.capture_memcpy = True
profiler.initialize(config)
profiler.start_capture()
# ... your GPU code here (CUDA kernels, etc.) ...
profiler.stop_capture()
# Get captured events
events = profiler.get_events()
print(f"Captured {len(events)} events")
# Build timeline and analyze
timeline = ts.build_timeline(events)
print(f"GPU Utilization: {timeline.gpu_utilization * 100:.1f}%")
print(f"Max Concurrent Ops: {timeline.max_concurrent_ops}")
# Export to Perfetto (chrome://tracing or ui.perfetto.dev)
ts.export_perfetto(events, "trace.json")
# Save to TraceSmith binary format
writer = ts.SBTWriter("trace.sbt")
writer.write_events(events)
writer.finalize()
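For orientation, the JSON that chrome://tracing and ui.perfetto.dev load follows the Trace Event Format, in which a timed operation is a "complete" event (`"ph": "X"`) with `ts` and `dur` in microseconds. The sketch below writes such a document by hand. It does not show what `ts.export_perfetto` emits; the record fields (`start_ns`, `duration_ns`, `device_id`, `stream_id`) are hypothetical stand-ins, and mapping devices to `pid` and streams to `tid` is a choice made here for illustration.

```python
import json
from typing import Dict, List


def to_trace_event_json(kernels: List[Dict]) -> str:
    """Serialize kernel records as Trace Event Format "complete" events."""
    trace_events = []
    for k in kernels:
        trace_events.append({
            "name": k["name"],
            "cat": "kernel",
            "ph": "X",                         # complete event: start plus duration
            "ts": k["start_ns"] / 1000.0,      # the format expects microseconds
            "dur": k["duration_ns"] / 1000.0,
            "pid": k["device_id"],             # one "process" row per GPU (assumption)
            "tid": k["stream_id"],             # one "thread" row per stream (assumption)
        })
    return json.dumps({"traceEvents": trace_events})


# Hypothetical records; the durations reuse kernel timings quoted later in this README.
sample = [
    {"name": "vectorAdd", "start_ns": 0, "duration_ns": 5313,
     "device_id": 0, "stream_id": 1},
    {"name": "matrixMul", "start_ns": 6000, "duration_ns": 66912,
     "device_id": 0, "stream_id": 1},
]
document = json.loads(to_trace_event_json(sample))
print(len(document["traceEvents"]), "events")
```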
Real-time Tracing (v0.3.0+)
import tracesmith as ts
# Create tracing session with custom config
config = ts.TracingConfig()
config.buffer_size_kb = 8192 # 8MB buffer
config.enable_counter_tracks = True
session = ts.TracingSession()
session.start(config)
# Emit events from your application (thread-safe!)
event = ts.TraceEvent()
event.type = ts.EventType.KernelLaunch
event.name = "my_kernel"
event.thread_id = 12345
event.metadata["grid_dim"] = "256x256x1"
session.emit(event)
# Emit counter metrics
session.emit_counter("GPU Memory (MB)", 1024.5)
session.emit_counter("SM Occupancy %", 85.2)
# Stop and export
session.stop()
session.export_to_file("realtime_trace.perfetto-trace")
# Get statistics
stats = session.get_statistics()
print(f"Duration: {stats.duration_ms():.1f}ms")
print(f"Events: {stats.events_emitted} emitted, {stats.events_dropped} dropped")
Command Line Interface
TraceSmith provides a comprehensive CLI with ASCII banner and colored output:
████████╗██████╗ █████╗ ██████╗███████╗███████╗███╗ ███╗██╗████████╗██╗ ██╗
╚══██╔══╝██╔══██╗██╔══██╗██╔════╝██╔════╝██╔════╝████╗ ████║██║╚══██╔══╝██║ ██║
██║ ██████╔╝███████║██║ █████╗ ███████╗██╔████╔██║██║ ██║ ███████║
██║ ██╔══██╗██╔══██║██║ ██╔══╝ ╚════██║██║╚██╔╝██║██║ ██║ ██╔══██║
██║ ██║ ██║██║ ██║╚██████╗███████╗███████║██║ ╚═╝ ██║██║ ██║ ██║ ██║
╚═╝ ╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝ ╚═╝
GPU Profiling & Replay System v0.8.3
Available Commands:
| Command | Description |
|---|---|
| profile | Profile a command (record + execute in one step) |
| record | Record GPU events to a trace file |
| view | View contents of a trace file |
| info | Show detailed information about a trace file |
| export | Export trace to Perfetto or other formats |
| analyze | Analyze trace for performance insights |
| replay | Replay a captured trace |
| benchmark | Run 10K GPU call stacks benchmark |
| devices | List available GPU devices (CUDA, Metal, MACA, ROCm) |
| version | Show version information |
| help | Show help message |
IMPORTANT: GPU Profiling API Limitations
CUPTI/MCPTI can only profile the calling process, not child processes. For CUDA/MACA platforms, you MUST use system-level profilers:
| Platform | Required Option | Tool | Min Version |
|---|---|---|---|
| NVIDIA CUDA | --nsys | Nsight Systems | 2022.x (2024.x recommended) |
| MetaX MACA | --mctracer | mcTracer | MACA SDK 3.0+ |
| Apple Metal | --xctrace | Instruments | Xcode 14+ |
The record command is not supported for CUDA/MACA platforms. Use profile --nsys or profile --mctracer instead.
⚠️ Important Notes:
- Nsight Systems Version: Use version 2024.x or later. Older versions (2021.x) may have compatibility issues with newer CUDA drivers.
# Check your nsys version
nsys --version
# Install latest version (Ubuntu with CUDA 12.x)
sudo apt install cuda-nsight-systems-12-8
- CUPTI Conflict: When using --nsys, your program should not use CUPTI directly (nsys subscribes to CUPTI). If your program uses CUPTI, run it without --nsys:
# For programs using CUPTI internally
./bin/cupti_example # Run directly, not with --nsys
- Root/Admin Not Required: Modern nsys (2024.x) does not require root privileges for basic profiling.
C++ CLI Examples:
# NVIDIA CUDA - Use --nsys (REQUIRED for GPU profiling)
./bin/tracesmith profile --nsys -- python train.py
./bin/tracesmith profile --nsys --perfetto -- ./my_cuda_app
./bin/tracesmith profile --nsys -o model.sbt -- python train.py --epochs 10
# MetaX MACA - Use --mctracer (REQUIRED for GPU profiling)
./bin/tracesmith profile --mctracer -- ./my_maca_app
./bin/tracesmith profile --mctracer --perfetto -- python train.py
# Apple Metal - Use --xctrace for real Metal GPU events
./bin/tracesmith profile --xctrace -- python train.py
./bin/tracesmith profile --xctrace --keep-trace -- python mps_benchmark.py
./bin/tracesmith profile --xctrace --xctrace-template "GPU Driver" -- ./app
# View trace with statistics
./bin/tracesmith view trace.sbt --stats
# Show trace file info
./bin/tracesmith info trace.sbt
# Export to Perfetto (view at ui.perfetto.dev)
./bin/tracesmith export trace.sbt -f perfetto
# Analyze performance
./bin/tracesmith analyze trace.sbt
# Replay trace (dry-run)
./bin/tracesmith replay trace.sbt --mode dry-run
# List available GPUs
./bin/tracesmith devices
# Disable colored output
./bin/tracesmith --no-color help
Python CLI Examples:
# NVIDIA CUDA - Use --nsys (REQUIRED for GPU profiling)
tracesmith-cli profile --nsys -- python train.py
tracesmith-cli profile --nsys --perfetto -- ./my_cuda_app
tracesmith-cli profile --nsys -o model.sbt -- python train.py --epochs 10
# MetaX MACA - Use --mctracer (REQUIRED for GPU profiling)
tracesmith-cli profile --mctracer -- ./my_maca_app
tracesmith-cli profile --mctracer --perfetto -- python train.py
# Apple Metal - Use --xctrace for real Metal GPU events
tracesmith-cli profile --xctrace -- python train.py
tracesmith-cli profile --xctrace --keep-trace -- python mps_benchmark.py
# Other commands
tracesmith-cli info
tracesmith-cli devices
# View trace contents
tracesmith-cli view trace.sbt --stats
# Export to Perfetto
tracesmith-cli export trace.sbt -o trace.json
# Analyze trace
tracesmith-cli analyze trace.sbt
# Replay trace
tracesmith-cli replay trace.sbt --mode dry-run
NVIDIA Nsight Systems (nsys) Integration
TraceSmith integrates with NVIDIA Nsight Systems for system-wide GPU profiling, providing comprehensive CUDA kernel and memory operation tracing.
Usage:
# Profile with nsys (system-wide profiling)
./bin/tracesmith profile --nsys -- python train.py
./bin/tracesmith profile --nsys -o trace.sbt -- ./my_cuda_app
# With custom nsys options
./bin/tracesmith profile --nsys --nsys-args="-t cuda,nvtx" -- python benchmark.py
Features:
- System-wide CUDA kernel tracing
- Memory transfer profiling (H2D, D2H, D2D)
- NVTX annotation support
- Multi-GPU profiling
- Automatic .nsys-rep to TraceSmith format conversion
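Because CUPTI can only observe the process that loads it, profiling a child command means wrapping it in nsys. The sketch below shows one plausible way to assemble such an invocation in Python. How TraceSmith itself calls nsys is not documented here, so `build_nsys_command` and its defaults are assumptions; `--trace` and `--output` are standard `nsys profile` options.

```python
import shlex
from typing import List, Sequence


def build_nsys_command(target: Sequence[str],
                       output: str = "tracesmith_report",
                       trace: str = "cuda,nvtx") -> List[str]:
    """Wrap a target command line in an `nsys profile` invocation."""
    if not target:
        raise ValueError("target command must not be empty")
    # nsys treats everything after its own options as the application and its arguments.
    return ["nsys", "profile", f"--trace={trace}", f"--output={output}", *target]


command = build_nsys_command(["python", "train.py", "--epochs", "10"])
print(" ".join(shlex.quote(part) for part in command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the profile on a machine with nsys installed; the resulting report file is what a converter would then read.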
macOS Metal GPU Profiling with xctrace
On macOS, TraceSmith integrates with Apple Instruments (xctrace) for capturing real Metal GPU events. This provides accurate GPU timing and event capture that the Metal Frame Capture API cannot achieve programmatically.
Why use xctrace?
- Captures real Metal GPU execution events (kernel launches, command buffer submissions)
- Accurate GPU timing from hardware counters
- Works with any Metal application (PyTorch MPS, TensorFlow Metal, custom Metal apps)
Usage:
# Python CLI (recommended - includes event parsing)
tracesmith-cli profile --xctrace -- python train.py
tracesmith-cli profile --xctrace --keep-trace -o model.sbt -- python inference.py
tracesmith-cli profile --xctrace --perfetto -- python benchmark.py
# C++ CLI (calls xctrace, outputs raw .trace file)
./bin/tracesmith profile --xctrace -- python train.py
./bin/tracesmith profile --xctrace --xctrace-template "GPU Driver" -- ./app
# Python API
from tracesmith.xctrace import XCTraceProfiler, profile_with_xctrace
# Simple usage
events, trace_file = profile_with_xctrace(
["python", "train.py"],
duration=60,
template="Metal System Trace"
)
# Full control
profiler = XCTraceProfiler()
events = profiler.profile_command(["python", "train.py"])
profiler.export_perfetto("metal_trace.json")
Available Templates:
- Metal System Trace - Most detailed Metal profiling (default)
- GPU Driver - Driver-level analysis
- Game Performance - Frame rate and GPU time
- Animation Hitches - Animation performance
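A template name is ultimately passed to `xctrace record`. As a rough sketch of what a wrapper might run (the exact command TraceSmith builds is an assumption, and `build_xctrace_command` is a hypothetical helper), the standard form is `xcrun xctrace record --template ... --output ... --launch -- <command>`:

```python
from typing import List, Sequence


def build_xctrace_command(target: Sequence[str],
                          template: str = "Metal System Trace",
                          output: str = "capture.trace",
                          time_limit_s: int = 60) -> List[str]:
    """Assemble an `xctrace record` invocation that launches the target."""
    if not target:
        raise ValueError("target command must not be empty")
    return [
        "xcrun", "xctrace", "record",
        "--template", template,
        "--output", output,
        "--time-limit", f"{time_limit_s}s",
        "--launch", "--", *target,   # everything after "--" is the launched command
    ]


xctrace_cmd = build_xctrace_command(["python", "train.py"], template="GPU Driver")
print(xctrace_cmd)
```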
Output:
MetaX GPU Profiling with MCPTI
TraceSmith supports MetaX GPUs (C500, C550, etc.) using the MCPTI (MACA Profiling Tools Interface), which provides an API compatible with NVIDIA CUPTI.
📖 Full documentation: See docs/MACA_PROFILING.md for detailed setup and usage guide.
Tested Hardware:
| GPU | Memory | Compute Units | Driver | Status |
|---|---|---|---|---|
| MetaX C500 | 64 GB | 104 CUs | 3.0.11 | ✅ Verified |
| MetaX C550 | - | - | - | 🔜 Planned |
Benchmark Results (MetaX C500, MACA 3.0.0):
| Test | Data Size | Bandwidth |
|---|---|---|
| Host → Device | 256 MB | 10.1 GB/s |
| Device → Host | 256 MB | 9.9 GB/s |
| Device → Device | 256 MB | 608 GB/s |
| MCPTI Overhead | - | < 2% (negligible) |
CLI Device Detection:
$ ./bin/tracesmith devices
MetaX MACA:
✓ MACA available
Devices: 1
Driver: 3000
Device 0: MetaX C500
Vendor: MetaX
Compute: 10.0
Memory: 63.62 GB
SMs: 104
Clock: 1600 MHz
System-Wide Profiling with mcTracer:
TraceSmith integrates with MetaX's mcTracer tool for comprehensive system-wide GPU profiling (similar to NVIDIA nsys):
# Profile with mcTracer
./bin/tracesmith profile --mctracer -- ./my_maca_app
./bin/tracesmith profile --mctracer --perfetto -- python train.py
# Output is Perfetto-compatible JSON
# View at: https://ui.perfetto.dev
mcTracer captures:
- All MACA API calls (mcInit, mcMalloc, mcMemcpy, etc.)
- GPU memory operations with bandwidth
- Stream operations and synchronization
- CPU-GPU launch flow arrows
Build with MetaX support:
# On MetaX system (MACA SDK auto-detected at /opt/maca-3.0.0)
cmake -DTRACESMITH_ENABLE_MACA=ON ..
make -j4
# Run examples
./bin/metax_example # Basic profiling demo
./bin/metax_benchmark # Memory bandwidth test
C++ API:
#include <iostream>
#include <vector>
#include <tracesmith/tracesmith.hpp>
// Check MetaX GPU availability
if (tracesmith::isMACAAvailable()) {
std::cout << "MetaX devices: " << tracesmith::getMACADeviceCount() << std::endl;
}
// Create MCPTI profiler
auto profiler = tracesmith::createProfiler(tracesmith::PlatformType::MACA);
// Configure
tracesmith::ProfilerConfig config;
config.capture_kernels = true;
config.capture_memcpy = true;
profiler->initialize(config);
// Capture events
profiler->startCapture();
// ... GPU code using MACA runtime ...
profiler->stopCapture();
// Get events
std::vector<tracesmith::TraceEvent> events;
profiler->getEvents(events);
// Export to Perfetto
tracesmith::PerfettoExporter exporter;
exporter.exportToFile(events, "metax_trace.json");
Python API:
import tracesmith as ts
# Check MetaX availability
if ts.is_maca_available():
    print(f"MetaX devices: {ts.get_maca_device_count()}")
# Create profiler
profiler = ts.create_profiler(ts.PlatformType.MACA)
profiler.initialize(ts.ProfilerConfig())
profiler.start_capture()
# ... GPU code ...
profiler.stop_capture()
events = profiler.get_events()
# Save trace
writer = ts.SBTWriter("metax_trace.sbt")
writer.write_events(events)
writer.finalize()
MCPTI Captured Events:
| Event Type | Description |
|---|---|
| KernelLaunch/Complete | Kernel execution timing with grid/block dimensions |
| MemcpyH2D/D2H/D2D | Memory transfers with bandwidth calculation |
| MemsetDevice | Memory initialization operations |
| StreamSync/DeviceSync | Synchronization events with duration |
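The "bandwidth calculation" attached to memcpy events is, at its simplest, bytes moved divided by elapsed time. The sketch below assumes decimal gigabytes (10^9 bytes) for GB/s and binary megabytes (2^20 bytes) for the 256 MB buffer; the README does not say which convention the profiler uses, and both function names are hypothetical.

```python
def bandwidth_gb_per_s(num_bytes: int, duration_ns: int) -> float:
    """Bytes per nanosecond is numerically equal to 10^9 bytes per second."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return num_bytes / duration_ns


def transfer_time_ms(num_bytes: int, gb_per_s: float) -> float:
    """Invert the relation: how long a copy takes at a given bandwidth."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


size = 256 * 1024 * 1024                 # 256 MB buffer, assumed binary megabytes
h2d_ms = transfer_time_ms(size, 10.1)    # Host -> Device figure from the benchmark table
print(f"256 MB at 10.1 GB/s takes about {h2d_ms:.1f} ms")
```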
Output:
- .sbt - TraceSmith Binary Trace format
- .json - Perfetto JSON (view at https://ui.perfetto.dev)
- Device info: name, memory, compute capability, clock rates
Python Examples with Cross-Platform Device Support
All Python examples support multiple GPU platforms with automatic device detection:
# Run examples on specific device
python examples/basic_usage.py --device cuda # NVIDIA GPU
python examples/basic_usage.py --device mps # Apple Silicon
python examples/basic_usage.py --device rocm # AMD GPU
python examples/basic_usage.py --device cpu # CPU fallback
# Run all examples with test runner
python examples/run_tests.py # Best available device
python examples/run_tests.py --all-devices # Test on all devices
python examples/run_tests.py --test pytorch # Run specific test
python examples/run_tests.py --list # List available tests
Using DeviceManager for cross-platform code:
from examples.device_utils import DeviceManager, benchmark
# Auto-detect best device
dm = DeviceManager() # or DeviceManager(prefer_device="mps")
print(f"Using: {dm.get_device_name()}") # Apple Silicon GPU (mps:0, 25.2 GB)
# Create tensors on device
x = dm.randn(1000, 1000)
y = dm.randn(1000, 1000)
# Benchmark with proper synchronization
results = benchmark(lambda: x @ y, warmup=3, iterations=10, dm=dm)
print(f"Mean: {results['mean_ms']:.2f} ms")
# Device-agnostic operations
dm.synchronize()
print(f"Memory: {dm.memory_allocated() / 1024**2:.1f} MB")
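"Proper synchronization" matters because GPU work is launched asynchronously: without a sync, a timer measures only launch overhead. The following is a sketch of how such a helper could be structured, not the code in examples/device_utils.py. `benchmark_sketch` and its `synchronize` parameter are hypothetical, and the workload here is a CPU stand-in so the block runs anywhere.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_sketch(fn: Callable[[], object],
                     synchronize: Callable[[], None] = lambda: None,
                     warmup: int = 3,
                     iterations: int = 10) -> Dict[str, float]:
    """Time fn() after warm-up, synchronizing before each clock read."""
    for _ in range(warmup):
        fn()
    synchronize()  # drain warm-up work before timing starts
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        synchronize()  # wait for asynchronous device work to finish
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


results = benchmark_sketch(lambda: sum(range(10_000)))
print(f"Mean: {results['mean_ms']:.3f} ms")
```

With PyTorch, `synchronize` would typically be `torch.cuda.synchronize` or `torch.mps.synchronize`, depending on the device.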
C++ API
#include <tracesmith/tracesmith.hpp>
using namespace tracesmith;
int main() {
// Create profiler
auto profiler = createProfiler(PlatformType::CUDA);
// Configure
ProfilerConfig config;
config.buffer_size = 1000000;
profiler->initialize(config);
// Start capture
profiler->startCapture();
// ... run GPU code ...
// Stop capture
profiler->stopCapture();
// Get events
std::vector<TraceEvent> events;
profiler->getEvents(events);
// Write to file
SBTWriter writer("trace.sbt");
writer.writeEvents(events);
writer.finalize();
return 0;
}
Timeline Analysis (Phase 3)
#include <iostream>
#include <vector>
#include <tracesmith/tracesmith.hpp>
#include <tracesmith/state/timeline_builder.hpp>
#include <tracesmith/state/timeline_viewer.hpp>
#include <tracesmith/state/perfetto_exporter.hpp>
using namespace tracesmith;
int main() {
// Capture events (see above)
std::vector<TraceEvent> events = captureEvents();
// Build timeline
TimelineBuilder builder;
builder.addEvents(events);
Timeline timeline = builder.build();
// Print ASCII visualization
TimelineViewer viewer;
std::cout << viewer.render(timeline);
// Export to Perfetto with enhanced GPU tracks
PerfettoExporter exporter;
exporter.setEnableGPUTracks(true); // GPU-specific tracks
exporter.setEnableFlowEvents(true); // Dependency visualization
exporter.exportToFile(events, "trace.json");
// Open https://ui.perfetto.dev and load trace.json
// Get statistics
std::cout << "GPU Utilization: " << timeline.gpu_utilization << std::endl;
std::cout << "Max Concurrent Ops: " << timeline.max_concurrent_ops << std::endl;
return 0;
}
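The two statistics printed above can be understood with a small interval computation. The sketch below shows one common definition, assumed here rather than taken from TraceSmith's source: utilization is the time during which at least one operation is running divided by the span of the trace, and max concurrency is the peak number of overlapping operations found by a sweep over start and end edges.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns), end exclusive


def gpu_utilization(intervals: List[Interval]) -> float:
    """Fraction of the trace span during which at least one op is running."""
    if not intervals:
        return 0.0
    ordered = sorted(intervals)
    span = max(end for _, end in ordered) - ordered[0][0]
    if span <= 0:
        return 0.0
    busy = 0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:              # gap: close the current merged block
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                            # overlap: extend the merged block
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    return busy / span


def max_concurrent_ops(intervals: List[Interval]) -> int:
    """Peak number of simultaneously running ops (sweep-line)."""
    edges = []
    for start, end in intervals:
        edges.append((start, 1))
        edges.append((end, -1))
    # At equal timestamps -1 sorts before +1, so back-to-back ops do not count as overlapping.
    edges.sort()
    active = peak = 0
    for _, delta in edges:
        active += delta
        peak = max(peak, active)
    return peak


ops = [(0, 40), (20, 60), (80, 100)]
print(gpu_utilization(ops), max_concurrent_ops(ops))
```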
SBT File Format
TraceSmith uses a custom binary format (SBT - TraceSmith Binary Trace) optimized for:
- Compactness: Variable-length integer encoding, string interning
- Streaming: Support for streaming writes during capture
- Fast Access: Indexed sections for random access
File structure:
┌──────────────────┐
│ Header (64 bytes)│ Magic, version, offsets
├──────────────────┤
│ Metadata Section │ Application info, timestamps
├──────────────────┤
│ Device Info │ GPU device details
├──────────────────┤
│ Events Section │ Trace events (variable length)
├──────────────────┤
│ String Table │ Deduplicated strings
├──────────────────┤
│ EOF Marker │
└──────────────────┘
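The three compaction techniques named for SBT (variable-length integers, delta timestamps, string interning) can be illustrated in a few lines. This sketch is not the SBT wire format: the LEB128-style varint, the record layout of (timestamp delta, string index), and all function names are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def encode_varint(value: int) -> bytes:
    """LEB128-style encoding: 7 payload bits per byte, high bit means "more"."""
    if value < 0:
        raise ValueError("varint requires a non-negative value")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset)."""
    result = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_events(events: List[Tuple[int, str]]) -> Tuple[bytes, List[str]]:
    """Encode (timestamp_ns, name) pairs, sorted by timestamp."""
    index_of: Dict[str, int] = {}
    strings: List[str] = []
    payload = bytearray()
    previous = 0
    for timestamp, name in events:
        idx = index_of.setdefault(name, len(strings))
        if idx == len(strings):                        # first time this name is seen
            strings.append(name)
        payload += encode_varint(timestamp - previous)  # delta, not absolute time
        payload += encode_varint(idx)                   # string table index
        previous = timestamp
    return bytes(payload), strings


def decode_events(payload: bytes, strings: List[str]) -> List[Tuple[int, str]]:
    events, offset, timestamp = [], 0, 0
    while offset < len(payload):
        delta, offset = decode_varint(payload, offset)
        idx, offset = decode_varint(payload, offset)
        timestamp += delta
        events.append((timestamp, strings[idx]))
    return events


sample = [(1_000_000_000, "vectorAdd"), (1_000_005_313, "relu"),
          (1_000_010_000, "vectorAdd")]
payload, strings = encode_events(sample)
print(len(payload), "bytes,", strings)
```

Only the first timestamp costs five bytes; later deltas fit in two, and the repeated kernel name is stored once in the string table.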
Development Roadmap
Phase 1: MVP ✅
- Project structure and build system
- Core data structures (TraceEvent, DeviceInfo)
- SBT binary trace format
- Lock-free ring buffer
- Platform abstraction interface
- CLI tools (record, view, info)
Phase 2: Instruction-Level Call Stack ✅
- Cross-platform stack capture (macOS/Linux/Windows)
- Symbol resolution with demangling
- GPU kernel call chain capture
- Instruction stream builder
- Dependency analysis
Phase 3: GPU State Machine & Timeline Builder ✅
- GPU state machine with stream tracking
- Timeline builder with span generation
- Perfetto export (chrome://tracing format)
- ASCII timeline visualization
- Concurrent operation analysis
Phase 4: Replay Engine ✅
- Replay engine with full orchestration
- Stream scheduler with dependency tracking
- Determinism checker with validation
- Partial replay (time/operation ranges)
- Dry-run mode for analysis
Phase 5: Production Release ✅
- Python bindings (pybind11)
- pip-installable package
- Comprehensive documentation
- Docker support
- Example programs
- TraceSmith Studio GUI (future)
- Homebrew formula (future)
Phase 6: Advanced Integrations ✅ (v0.4.0)
- Perfetto SDK Integration (85% smaller traces)
- Real-time TracingSession with lock-free buffers
- Kineto-compatible schema (thread_id, metadata, FlowInfo)
- Memory profiling (MemoryEvent, MemoryCategory)
- Counter tracks (CounterEvent)
- LLVM XRay import support
- eBPF types for Linux kernel tracing
Contributing
Contributions are welcome! Please read our Contributing Guide before submitting PRs.
License
TraceSmith is licensed under the Apache License 2.0. See LICENSE for details.
Benchmark Results
Core Feature: 10,000+ GPU Instruction-Level Call Stacks
Tested on NVIDIA GeForce RTX 4090 D (24GB, CUDA 12.8, Driver 570.124.06)
╔═══════════════════════════════════════════════════════════════════════╗
║ Non-intrusive capture of 10,000+ instruction-level GPU call stacks ║
║ ✅ VERIFIED! ║
╚═══════════════════════════════════════════════════════════════════════╝
| Metric | Result | Note |
|---|---|---|
| CUDA Kernels Launched | 10,000 | Real __global__ kernels |
| GPU Events (CUPTI) | 20,011 | Instruction-level events |
| Kernel Launches | 10,000 | Each kernel captured |
| Kernel Completes | 10,000 | Full lifecycle |
| Host Call Stacks | 10,000 | 7 frames/stack avg |
| Events with Stacks | 19,989 (99.9%) | GPU + Host merged |
| Total Time | 107 ms | Non-intrusive |
| Throughput | 93,457 kernels/sec | High performance |
Verified Capabilities:
- ✅ Real CUDA kernels executed on GPU
- ✅ CUPTI captured instruction-level GPU events
- ✅ Host call stacks attached to GPU events
- ✅ Non-intrusive profiling
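The derived figures in the table follow directly from the raw counts. A quick check, using only numbers from the table:

```python
kernels = 10_000
total_time_s = 0.107          # 107 ms total
events_total = 20_011         # GPU events reported by CUPTI
events_with_stacks = 19_989

throughput = kernels / total_time_s             # kernels per second
per_kernel_us = total_time_s / kernels * 1e6    # average wall time per kernel
stack_coverage = events_with_stacks / events_total

print(f"{int(throughput):,} kernels/sec")
print(f"{per_kernel_us:.1f} us per kernel")
print(f"{stack_coverage:.1%} of events carry a call stack")
```

The per-kernel figure is an average over the whole run, including launch and capture, not a measurement of any single kernel.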
How to Run the Benchmark
# On NVIDIA GPU server with CUDA
git clone https://github.com/chenxingqiang/TraceSmith.git
cd TraceSmith
mkdir build && cd build
# Build with CUDA support
cmake .. -DTRACESMITH_ENABLE_CUDA=ON -DTRACESMITH_BUILD_EXAMPLES=ON
make benchmark_10k_stacks -j8
# Run the benchmark
./bin/benchmark_10k_stacks
CUPTI Real GPU Profiling Results
| Kernel | Duration (ns) | Duration (µs) | Duration (ms) |
|---|---|---|---|
| vectorAdd (1M elements) | 5,313 | 5.31 | 0.0053 |
| matrixMul (512×512) | 66,912 | 66.91 | 0.0669 |
| relu (1M elements) | 4,704 | 4.70 | 0.0047 |
| TOTAL | 76,929 | 76.93 | 0.0769 |
Real GPU Memory Profiling Results
| Phase | Operation | Memory |
|---|---|---|
| Parameters | 5× cudaMalloc | 31 MB |
| Activations | 8× cudaMalloc | 72 MB |
| Gradients | 5× cudaMalloc | 31 MB |
| Workspace | 3× cudaMalloc | 96 MB |
| Total Allocated | 21 operations | 230 MB |
| Total Freed | 16 cudaFree | 199 MB |
| Test Duration | - | 5 ms |
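In the table, 21 allocations totalling 230 MB against 16 frees totalling 199 MB leave 5 allocations (31 MB) outstanding at the end of the test. The bookkeeping behind allocation tracking, leak detection, and peak usage can be sketched as below. `AllocationTracker` is a hypothetical class, not the TraceSmith MemoryProfiler API, and the pointer values and sizes are made up.

```python
from typing import Dict


class AllocationTracker:
    """Track live allocations by pointer to report leaks and peak usage."""

    def __init__(self) -> None:
        self.live: Dict[int, int] = {}   # pointer -> size in bytes
        self.current_bytes = 0
        self.peak_bytes = 0
        self.total_allocated = 0
        self.total_freed = 0

    def on_alloc(self, ptr: int, size: int) -> None:
        self.live[ptr] = size
        self.current_bytes += size
        self.total_allocated += size
        self.peak_bytes = max(self.peak_bytes, self.current_bytes)

    def on_free(self, ptr: int) -> None:
        size = self.live.pop(ptr)        # KeyError signals a double or invalid free
        self.current_bytes -= size
        self.total_freed += size

    def outstanding(self) -> Dict[int, int]:
        """Allocations never freed: leak candidates at shutdown."""
        return dict(self.live)


MB = 1024 * 1024
tracker = AllocationTracker()
tracker.on_alloc(0x1000, 31 * MB)   # hypothetical buffer kept for the whole run
tracker.on_alloc(0x2000, 72 * MB)   # hypothetical temporary buffer
tracker.on_alloc(0x3000, 96 * MB)   # hypothetical temporary buffer
tracker.on_free(0x2000)
tracker.on_free(0x3000)
print(tracker.peak_bytes // MB, "MB peak;", len(tracker.outstanding()), "still live")
```

Whether an outstanding allocation is a leak or an intentionally retained buffer is a judgment the tool cannot make on its own; it can only report what was never freed.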
Performance Characteristics
| Feature | Performance |
|---|---|
| GPU Event Capture | 93K+ kernels/sec |
| Ring Buffer Throughput | 10K+ events/sec |
| Event Collection Overhead | < 1% |
| SBT File Compression | ~3x vs JSON |
| Perfetto Protobuf | 85% smaller than JSON |
| Stack Capture (no symbols) | ~5 µs/stack |
| Stack Capture (with symbols) | ~13 µs/stack |
Test Categories
✅ RingBuffer Tests (9/9) - Lock-free SPSC buffer
✅ SBT Format Tests (7/7) - Binary trace format
✅ Types Tests (12/12) - Core data structures
✅ Kineto Schema Tests (7/7) - PyTorch compatibility
✅ Kineto V2 Tests (6/6) - Memory & Counter events
✅ TracingSession Tests (10/10) - Real-time tracing
✅ XRay Importer Tests (5/5) - LLVM XRay support
✅ BPF Types Tests (6/6) - eBPF integration
✅ FrameCapture Tests (12/12) - RenderDoc-style capture
✅ MemoryProfiler Tests (12/12) - GPU memory tracking
✅ CUPTI Profiler (14/14) - Real GPU profiling
PyPI Package
# Basic installation
pip install tracesmith==0.8.3
# With CuPy for real GPU profiling in Python CLI (choose one):
pip install tracesmith[cuda12] # CUDA 12.x
pip install tracesmith[cuda11] # CUDA 11.x
pip install tracesmith[cuda118] # CUDA 11.8 specific
pip install tracesmith[cuda120] # CUDA 12.0 specific
# With visualization tools
pip install tracesmith[visualization]
# With PyTorch integration
pip install tracesmith[torch]
# All optional dependencies
pip install tracesmith[all]
Python CLI Real GPU Benchmark
With CuPy installed, you can run real GPU profiling from Python:
# Install CuPy first
pip install tracesmith[cuda12]
# Run real GPU benchmark
tracesmith-cli benchmark --real-gpu -n 10000
Tested on NVIDIA GPU Server (RTX 4090):
| Feature | Status |
|---|---|
| Core Types (69 exports) | ✅ |
| CUPTIProfiler | ✅ |
| MemoryProfiler | ✅ |
| Frame Capture | ✅ |
| Stack Capture | ✅ |
| BPF Tracing | ✅ (Linux) |
| CLI Tools | ✅ |
Testing Methodology
Feature Validation
TraceSmith provides a comprehensive validation example that tests all features from PLANNING.md:
# Build and run feature validation
cd build
cmake .. -DTRACESMITH_ENABLE_CUDA=ON -DTRACESMITH_BUILD_EXAMPLES=ON
make goal_validation_example
./bin/goal_validation_example
Benchmark Testing
The benchmark_10k_stacks example launches real CUDA kernels and profiles them with CUPTI:
// Real CUDA kernel executed on GPU
__global__ void benchmark_kernel(float* data, int n, int kernel_id) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
data[idx] = data[idx] * 2.0f + static_cast<float>(kernel_id);
}
}
// Launches 10,000 real kernels with CUPTI profiling
for (int i = 0; i < 10000; ++i) {
benchmark_kernel<<<blocks, threads>>>(d_data, n, i);
}
Platform-Specific Testing
| Platform | Profiler | Test Command |
|---|---|---|
| NVIDIA CUDA | CUPTIProfiler | ./bin/cupti_example |
| Apple Metal | MetalProfiler | ./bin/metal_example |
| MetaX MACA | MCPTIProfiler | ./bin/metax_example |
| CPU Fallback | StackCapture | ./bin/stack_capture_example |
Version History
| Version | Date | Highlights |
|---|---|---|
| v0.8.2 | 2025-12 | CLI Breaking Change - Enforce --nsys/--mctracer for CUDA/MACA, record command blocked, clearer API limitation messages |
| v0.8.1 | 2025-12 | nsys & MACA Enhancement - NVIDIA Nsight Systems integration, MetaX CLI device detection, MACA cluster module support |
| v0.8.1 | 2025-12 | mcTracer Integration - MetaX system-wide profiling, Enhanced MACA CLI, Cluster module support |
| v0.8.0 | 2025-12 | xctrace Integration - Apple Instruments, Cross-Platform Device Utils, Enhanced Examples |
| v0.7.1 | 2025-12 | Multi-GPU Phase 2 - TimeSync, NCCLTracker, ClockCorrelator, CommAnalysis |
| v0.7.0 | 2025-12 | Multi-GPU Cluster - GPUTopology, MultiGPUProfiler, GitHub Actions CI/CD |
| v0.6.9 | 2025-12 | Include reorganization - Directory structure matches src/ layout |
| v0.6.8 | 2025-12 | Enhanced CLI - ASCII banner, all commands, Python CLI |
| v0.6.7 | 2025-12 | Real GPU benchmark - 10K+ CUDA kernels with CUPTI |
| v0.6.5 | 2025-12 | StackCapture bindings, OverflowPolicy, detect_leaks |
| v0.6.2 | 2025-12 | PyPI release, Native extension packaging fix |
| v0.6.0 | 2025-12 | NVIDIA CUPTI integration, Full GPU testing |
| v0.5.0 | 2025-12 | RenderDoc-style frame capture, Resource tracking |
| v0.4.0 | 2025-12 | LLVM XRay, eBPF types, TracingSession, Counter tracks |
| v0.3.0 | 2025-12 | Real-time tracing, Counter events, Memory events |
| v0.2.0 | 2025-12 | Perfetto SDK (85% smaller traces), Kineto schema |
| v0.1.1 | 2025-11 | libunwind, Enhanced Perfetto export, Flow events |
| v0.1.0 | 2025-11 | Initial release: SBT format, Ring buffer, Replay |
Acknowledgments
TraceSmith draws inspiration from:
Contact
- GitHub Issues: Report a bug
- Discussions: Ask questions