
Probing - Dynamic Performance Profiler for Distributed AI


Uncover the Hidden Truth of AI Performance

Probing is a production-grade performance profiler designed specifically for distributed AI workloads. Built on dynamic probe injection, it delivers low-overhead runtime introspection with SQL-queryable performance metrics and cross-node correlation analysis.

What Probing delivers...

🔍 For AI Researchers & Algorithm Engineers

  • Debug Training Instabilities - Real-time insight into why training diverges or hangs
  • Optimize Model Performance - Identify bottlenecks in forward/backward passes
  • Memory Leak Detection - Track GPU/CPU memory usage across training steps
  • Live Variable Inspection - Check tensor values, gradients, and model states without stopping training

🛠️ For Framework & Library Developers

  • Runtime Framework Analysis - Understand how your framework performs in real-world usage
  • Zero-Intrusion Profiling - Profile framework internals without code modifications
  • Production Debugging - Debug issues reported by users in their actual environments
  • Performance Benchmarking - Collect real performance data for optimization decisions

⚙️ For System Engineers & MLOps

  • Production Monitoring - Monitor AI services without service restarts
  • Resource Optimization - Analyze resource usage patterns across the cluster
  • Custom Metrics Collection - Gather any application-specific performance data
  • Distributed Debugging - Correlate performance issues across multiple nodes

🚀 Core Technical Capabilities

  • Dynamic Probe Injection - Attach to running processes without code changes
  • SQL-Powered Analytics - Use standard SQL to query performance data
  • Live Code Execution - Run Python code directly in target processes
  • Real-time Stack Analysis - Capture execution context with variable values

Unlike traditional profilers, Probing does not...

  • Require Code Instrumentation - No need to add logging statements, insert timers, or modify your training scripts
  • Force "Break-Then-Fix" Workflow - No waiting for issues to occur, then spending days trying to reproduce them
  • Lock You Into Fixed Reports - No more deciphering pre-formatted tables; use SQL to create custom analysis reports that match your specific needs
  • Disrupt Your Workflow - Attach to running processes without stopping your training jobs or services
  • Force You to Learn New Tools - Use familiar SQL syntax and Python code for all your analysis needs

Getting Started

Installation

pip install probing

Quick Start (30 seconds)

# Enable instrumentation at startup
PROBING=1 python train.py

# Or inject into running process
probing -t <pid> inject

# Real-time stack trace analysis
probing -t <pid> backtrace

Core Features

  • Dynamic Probe Injection - Runtime instrumentation without target application modification
  • Distributed Performance Aggregation - Cross-node data collection with unified correlation analysis
  • SQL Analytics Interface - Apache DataFusion-powered query engine with standard SQL syntax
  • Interactive Python REPL - Live debugging and variable inspection in running processes
  • Production-Grade Overhead - Efficient sampling strategies maintaining <1% performance impact
  • Time-Series Storage - Columnar data storage with configurable compression and retention
  • Real-Time Introspection - Live performance metrics and runtime stack trace analysis
  • Advanced CLI - Comprehensive command-line interface with process monitoring and management
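Probing keeps overhead low by sampling rather than tracing every call. The general idea behind sampling profilers can be sketched in a few lines of plain Python (an illustration of the technique, not Probing's implementation; `sys._current_frames` is CPython-specific):

```python
import sys
import threading
import time
from collections import Counter

def sample_stacks(counts, interval=0.01, samples=30):
    """Snapshot every thread's stack `samples` times, tallying function names."""
    for _ in range(samples):
        for frame in sys._current_frames().values():
            while frame is not None:            # walk the call chain
                counts[frame.f_code.co_name] += 1
                frame = frame.f_back
        time.sleep(interval)

def busy_work(stop):
    while not stop.is_set():
        sum(i * i for i in range(1000))         # hot loop the sampler should catch

stop = threading.Event()
counts = Counter()
worker = threading.Thread(target=busy_work, args=(stop,), daemon=True)
worker.start()
sample_stacks(counts)
stop.set()
worker.join()
print("busy_work sampled:", counts["busy_work"] > 0)
```

Because the sampler only wakes up periodically, the hot loop runs almost entirely undisturbed, which is how sub-1% overhead is achievable.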

Basic Usage

# Inject performance monitoring (Linux only)
probing -t <pid> inject

# Real-time stack trace analysis
probing -t <pid> backtrace

# Query performance data with SQL
probing -t <pid> query "SELECT * FROM python.torch_trace LIMIT 10"

# Evaluate Python code in target process
probing -t <pid> eval "import torch; print(torch.cuda.is_available())"

# Interactive Python REPL (connect to running process)
probing -t <pid> repl

# RDMA Flow Analysis
probing -t <pid> rdma

# List all processes with injected probes
probing list

Advanced Features

SQL Analytics Interface

# Memory usage analysis
probing -t <pid> query "SELECT * FROM memory_usage WHERE timestamp > now() - interval '5 minutes'"

# Performance hotspot analysis
probing -t <pid> query "
  SELECT operation_name, avg(duration_ms), count(*)
  FROM profiling_data
  WHERE timestamp > now() - interval '5 minutes'
  GROUP BY operation_name
  ORDER BY avg(duration_ms) DESC
"

# Training progress tracking
probing -t <pid> query "
  SELECT epoch, avg(loss), min(loss), count(*) as steps
  FROM training_logs
  GROUP BY epoch
  ORDER BY epoch
"

Interactive Python REPL

Probing provides an interactive Python REPL that connects to running processes, allowing you to inspect variables, execute code, and debug in real-time:

# Connect to a process via REPL
probing -t <pid> repl

# For remote processes
probing -t <host|ip:port> repl

Example REPL session:

>>> import gc
>>> import torch
>>> # Inspect torch models in the target process
>>> models = [m for m in gc.get_objects() if isinstance(m, torch.nn.Module)]

The REPL provides:

  • Live Variable Inspection: Access all variables in the target process context
  • Code Execution: Run arbitrary Python code within the target process
  • Real-time Debugging: Set breakpoints and inspect state without stopping the process
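The model-discovery trick in the session above works because Python's garbage collector can enumerate every live object it tracks. A self-contained sketch of the same pattern with a stand-in class (no torch required):

```python
import gc

class Model:                     # stand-in for torch.nn.Module
    def __init__(self, name):
        self.name = name

m1, m2 = Model("encoder"), Model("decoder")

# Enumerate live GC-tracked objects and filter by type -- the same
# pattern used above to find nn.Module instances in a target process.
models = [obj for obj in gc.get_objects() if isinstance(obj, Model)]
print(sorted(m.name for m in models))   # ['decoder', 'encoder']
```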

Distributed Training Analysis

# Monitor all cluster nodes
probing cluster attach

# Inter-node communication latency
probing -t <pid> query "SELECT src_rank, dst_rank, avg(latency_ms) FROM comm_metrics"

# Cross-node stack trace comparison
probing -t <pid> query "SELECT * FROM python.backtrace"

# GPU utilization analysis
probing -t <pid> query "SELECT avg(gpu_util) FROM gpu_metrics WHERE timestamp > now() - interval '60 seconds'"

Memory Analysis

# Quick memory usage overview
probing -t <pid> memory

# Memory growth trend analysis
probing -t <pid> query "SELECT hour(timestamp), avg(memory_mb) FROM memory_usage GROUP BY hour(timestamp)"

# Memory leak detection
probing -t <pid> query "
  SELECT function_name, sum(allocated_bytes) as total_alloc
  FROM memory_allocations
  WHERE timestamp > now() - interval '1 hour'
  GROUP BY function_name
  ORDER BY total_alloc DESC
"

Configuration Options

# Environment variable configuration
export PROBING_SAMPLE_RATE=0.1      # Set sampling rate
export PROBING_RETENTION_DAYS=7     # Data retention period

# View current configuration
probing -t <pid> config

# Dynamic configuration updates
probing -t <pid> config probing.sample_rate=0.05
probing -t <pid> config probing.max_memory=1GB
probing -t <pid> config "probing.rdma.hca.name='mlx5_cx6_0'"
probing -t <pid> config "probing.rdma.sample.rate='5'"

Development

Prerequisites

Before building Probing from source, ensure you have the following dependencies installed:

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install nightly toolchain (required)
rustup toolchain install nightly
rustup default nightly

# Add WebAssembly target for web UI
rustup target add wasm32-unknown-unknown

# Install Dioxus CLI for building WebAssembly frontend
cargo install dioxus-cli

# Install cross-compilation tools (optional, for distribution builds)
cargo install cargo-zigbuild
pip install ziglang

Building from Source

# Clone repository
git clone https://github.com/reiase/probing.git
cd probing

# Development build (faster compilation)
make

# Production build with cross-platform compatibility
make ZIG=1

# Build web UI separately (optional)
cd web && dx build --release

# Build and install wheel package
make wheel
pip install dist/probing-*.whl --force-reinstall

Testing

Prepare your environment:

# Install dependencies
cargo install cargo-nextest --locked
# Run all tests
make test

# Test with a simple example
PROBING=1 python examples/test_probing.py

# Advanced testing with variable tracking
PROBING_TORCH_PROFILING="on,exprs=loss@train,acc1@train" PROBING=1 python examples/imagenet.py

Project Structure

  • probing/cli/ - Command-line interface
  • probing/core/ - Core profiling engine
  • probing/extensions/ - Language-specific extensions (Python, C++)
  • probing/server/ - HTTP API server
  • web/ - Web UI source and build output (Dioxus + WebAssembly)
    • web/dist/ - Web UI build output directory
  • python/ - Python hooks and integration
  • examples/ - Usage examples and demos

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run tests: make test
  5. Submit a pull request

License

Apache License 2.0

