Performance and Stability Diagnostic Tool for AI Applications
Probing - Dynamic Performance Profiler for Distributed AI
Uncover the Hidden Truth of AI Performance
Probing is a production-grade performance profiler designed specifically for distributed AI workloads. Built on dynamic probe injection, it provides low-overhead (<1%) runtime introspection, SQL-queryable performance metrics, and cross-node correlation analysis.
What probing delivers...
- Runtime Performance Visibility - Expose execution bottlenecks in real time without code modification
- Distributed System Observability - Cross-node performance correlation and bottleneck identification
- Production-Ready Monitoring - Continuous profiling with <1% overhead for large-scale training jobs
In contrast with traditional profilers, probing does not...
- Require Code Modification - No need to add logging, insert timers, or modify training scripts
- Force "Break-Then-Fix" Debugging - No waiting for issues to occur, then spending days reproducing them
- Force You to Decode Fixed Reports - No more deciphering pre-formatted tables where every row and column needs interpretation; use SQL to create custom analysis reports
Getting Started
Installation
pip install probing
Quick Start (30 seconds)
# Enable instrumentation at startup
PROBING=1 python train.py
# Or inject into running process
probing -t <pid> inject
# Real-time stack trace analysis
probing -t <pid> backtrace
Core Features
- Dynamic Probe Injection - Runtime instrumentation without target application modification
- Distributed Performance Aggregation - Cross-node data collection with unified correlation analysis
- SQL Analytics Interface - Apache DataFusion-powered query engine with standard SQL syntax
- Interactive Python REPL - Live debugging and variable inspection in running processes
- Production-Grade Overhead - Efficient sampling strategies maintaining <1% performance impact
- Time-Series Storage - Columnar data storage with configurable compression and retention
- Real-Time Introspection - Live performance metrics and runtime stack trace analysis
- Advanced CLI - Comprehensive command-line interface with process monitoring and management
Basic Usage
# Inject performance monitoring
probing -t <pid> inject
# Real-time stack trace analysis
probing -t <pid> backtrace
# Memory usage profiling
probing -t <pid> memory
# Generate flame graphs
probing -t <pid> flamegraph
# Interactive Python REPL (connect to running process)
probing -t <pid> repl
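The commands above can also be driven from a script, for example to capture a backtrace when a watchdog notices a stalled step. The sketch below only assembles and runs the `probing -t <pid> <subcommand>` invocations listed in this section. The helper names `probing_cmd` and `run_probing` are hypothetical, and nothing is assumed about the format of the CLI's output; it is returned as raw text.

```python
import shutil
import subprocess

# Non-interactive subcommands taken from the Basic Usage list above.
SUBCOMMANDS = {"inject", "backtrace", "memory", "flamegraph"}


def probing_cmd(pid, subcommand):
    """Build the argv list for `probing -t <pid> <subcommand>`."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return ["probing", "-t", str(int(pid)), subcommand]


def run_probing(pid, subcommand, timeout=30):
    """Run the CLI and return its raw stdout (hypothetical helper).

    Returns None when the `probing` executable is not on PATH.
    """
    if shutil.which("probing") is None:
        return None
    result = subprocess.run(
        probing_cmd(pid, subcommand),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `run_probing(pid, "backtrace")` from a monitoring loop would then store the same text you would see in the terminal.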
Advanced Features
SQL Analytics Interface
# Memory usage analysis
probing -t <pid> query "SELECT * FROM memory_usage WHERE timestamp > now() - interval '5 minutes'"
# Performance hotspot analysis
probing -t <pid> query "
SELECT operation_name, avg(duration_ms), count(*)
FROM profiling_data
WHERE timestamp > now() - interval '5 minutes'
GROUP BY operation_name
ORDER BY avg(duration_ms) DESC
"
# Training progress tracking
probing -t <pid> query "
SELECT epoch, avg(loss), min(loss), count(*) as steps
FROM training_logs
GROUP BY epoch
ORDER BY epoch
"
Interactive Python REPL
Probing provides an interactive Python REPL that connects to running processes, letting you inspect variables, execute code, and debug in real time:
# Connect to a process via REPL
probing -t <pid> repl
# For remote processes
probing -t <host|ip>:<port> repl
Example REPL session:
>>> import gc
>>> import torch
>>> # Inspect torch models in the target process
>>> models = [m for m in gc.get_objects() if isinstance(m, torch.nn.Module)]
The REPL provides:
- Live Variable Inspection: Access all variables in the target process context
- Code Execution: Run arbitrary Python code within the target process
- Real-time Debugging: Set breakpoints and inspect state without stopping the process
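The one-liner from the example session can be wrapped into small helpers to paste into the REPL. This is a minimal sketch using only the standard `gc` module and public PyTorch calls; the names `find_instances` and `summarize_modules` are hypothetical and not part of Probing.

```python
import gc


def find_instances(cls):
    """Return live, GC-tracked objects that are instances of cls."""
    found = []
    for obj in gc.get_objects():
        try:
            if isinstance(obj, cls):
                found.append(obj)
        except Exception:
            # Some objects (e.g. dead weakref proxies) can raise when
            # their class is inspected; skip them.
            continue
    return found


def summarize_modules():
    """Print class name and parameter count for each live torch.nn.Module.

    Requires PyTorch to be importable in the target process.
    """
    import torch  # imported lazily so find_instances works without PyTorch

    for module in find_instances(torch.nn.Module):
        n_params = sum(p.numel() for p in module.parameters())
        print(f"{type(module).__name__}: {n_params} parameters")
```

Note that submodules are `torch.nn.Module` instances too, so a nested model shows up once per layer as well as once for the top-level container.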
Distributed Training Analysis
# Monitor all cluster nodes
probing cluster attach
# Inter-node communication latency
probing -t <pid> query "SELECT src_rank, dst_rank, avg(latency_ms) FROM comm_metrics GROUP BY src_rank, dst_rank"
# Cross-node stack trace comparison
probing -t <pid> query "SELECT * FROM python.backtrace"
# GPU utilization analysis
probing -t <pid> query "SELECT avg(gpu_util) FROM gpu_metrics WHERE timestamp > now() - interval '60 seconds'"
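To make the inter-node latency query concrete, the sketch below runs a per-pair average over a handful of made-up rows using Python's built-in `sqlite3`. SQLite is only a stand-in for Probing's DataFusion engine, the `comm_metrics` columns are assumed from the query text, and the sample values are invented for illustration.

```python
import sqlite3

# Hypothetical sample rows: (src_rank, dst_rank, latency_ms).
ROWS = [
    (0, 1, 1.0), (0, 1, 3.0),
    (1, 0, 10.0), (1, 0, 14.0),
    (0, 2, 2.5),
]

LATENCY_SQL = """
SELECT src_rank, dst_rank, avg(latency_ms) AS avg_ms
FROM comm_metrics
GROUP BY src_rank, dst_rank
ORDER BY avg_ms DESC
"""


def pair_latency(rows):
    """Return (src_rank, dst_rank, avg_ms) with the slowest pair first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE comm_metrics "
            "(src_rank INTEGER, dst_rank INTEGER, latency_ms REAL)"
        )
        conn.executemany("INSERT INTO comm_metrics VALUES (?, ?, ?)", rows)
        return conn.execute(LATENCY_SQL).fetchall()
    finally:
        conn.close()


for src, dst, avg_ms in pair_latency(ROWS):
    print(f"rank {src} -> rank {dst}: {avg_ms:.1f} ms")
```

Sorting by the averaged latency puts the slowest link at the top, which is usually the first place to look when one rank is holding back a collective operation.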
Memory Analysis
# Quick memory usage overview
probing -t <pid> memory
# Memory growth trend analysis
probing -t <pid> query "SELECT date_part('hour', timestamp) AS hour, avg(memory_mb) FROM memory_usage GROUP BY date_part('hour', timestamp)"
# Memory leak detection
probing -t <pid> query "
SELECT function_name, sum(allocated_bytes) as total_alloc
FROM memory_allocations
WHERE timestamp > now() - interval '1 hour'
GROUP BY function_name
ORDER BY total_alloc DESC
"
Configuration Options
# Environment variable configuration
export PROBING_SAMPLE_RATE=0.1 # Set sampling rate
export PROBING_RETENTION_DAYS=7 # Data retention period
# View current configuration
probing -t <pid> config
# Dynamic configuration updates
probing -t <pid> config probing.sample_rate=0.05
probing -t <pid> config probing.max_memory=1GB
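The environment variables can also be set per launch instead of being exported shell-wide. The sketch below builds the environment for a child process from `PROBING=1` (see Quick Start) and the two variables above. `probing_env` and `launch` are hypothetical helpers, and the accepted range for the sampling rate (a fraction in (0, 1]) is an assumption rather than documented behavior.

```python
import os
import subprocess
import sys


def probing_env(sample_rate=0.1, retention_days=7, base=None):
    """Return a copy of the environment with the Probing variables set."""
    # Assumption: the sampling rate is a fraction in (0, 1].
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    env = dict(os.environ if base is None else base)
    env["PROBING"] = "1"
    env["PROBING_SAMPLE_RATE"] = str(sample_rate)
    env["PROBING_RETENTION_DAYS"] = str(int(retention_days))
    return env


def launch(script, *args, **settings):
    """Start `python <script> <args...>` with instrumentation enabled."""
    return subprocess.Popen(
        [sys.executable, script, *args],
        env=probing_env(**settings),
    )
```

For example, `launch("train.py", sample_rate=0.05)` would start the training script with a lower sampling rate without touching the parent shell's environment.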
Development
Prerequisites
Before building Probing from source, ensure you have the following dependencies installed:
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install nightly toolchain (required)
rustup toolchain install nightly
rustup default nightly
# Add WebAssembly target for web UI
rustup target add wasm32-unknown-unknown
# Install trunk for building WebAssembly frontend
cargo install trunk
# Install cross-compilation tools (optional, for distribution builds)
cargo install cargo-zigbuild
pip install ziglang
Building from Source
# Clone repository
git clone https://github.com/reiase/probing.git
cd probing
# Development build (faster compilation)
make
# Production build with cross-platform compatibility
make ZIG=1
# Build web UI separately (optional)
cd app && trunk build --release
# Build and install wheel package
make wheel
pip install dist/probing-*.whl --force-reinstall
Testing
# Run all tests
make test
# Test with a simple example
PROBING=1 python examples/test_probing.py
# Advanced testing with variable tracking
PROBE_TORCH_EXPRS="loss@train,acc1@train" PROBING=1 python examples/imagenet.py
Project Structure
- probing/cli/ - Command-line interface
- probing/core/ - Core profiling engine
- probing/extensions/ - Language-specific extensions (Python, C++)
- probing/server/ - HTTP API server
- app/ - Web UI (WebAssembly + Leptos)
- python/ - Python hooks and integration
- examples/ - Usage examples and demos
Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and add tests
- Run tests: make test
- Submit a pull request
License