PyTorch compute-communication overlap debugging toolkit with GPU hardware queue evaluation

AORTA

GPU performance benchmarking and debugging toolkit for PyTorch workloads on AMD ROCm.

What It Does

FSDP2 Compute-Communication Overlap Analysis

Debug why distributed training isn't overlapping compute with communication. Runs a synthetic transformer workload with explicit multi-stream execution, captures per-iteration timing, and generates overlap efficiency reports.
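
As a rough illustration of what an overlap efficiency figure can mean, the sketch below computes the fraction of communication time that is hidden behind compute, given start/end timestamps for each kind of kernel. The metric definition, the function names, and the sample intervals are assumptions made for this example; AORTA's reports may define overlap differently.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_time(intervals: List[Interval]) -> float:
    """Total time covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def overlap_time(a: List[Interval], b: List[Interval]) -> float:
    """Time during which at least one interval of `a` and one of `b` are both active."""
    a, b = merge(a), merge(b)
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            overlap += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def overlap_efficiency(compute: List[Interval], comm: List[Interval]) -> float:
    """Fraction of communication time hidden behind compute (assumed definition)."""
    comm_total = total_time(merge(comm))
    return overlap_time(compute, comm) / comm_total if comm_total else 0.0


# Illustrative timestamps only, not measurements.
compute_kernels = [(0, 10), (12, 20)]
collectives = [(5, 15)]
print(overlap_efficiency(compute_kernels, collectives))  # 0.8
```

A value near 1.0 would mean almost all communication ran concurrently with compute; a value near 0.0 would mean the collectives were effectively serialized with it.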

Hardware Queue Evaluation

Stress-test GPU queue scheduling with 8-64+ concurrent streams. Includes 15 workloads covering distributed training patterns (FSDP, MoE, activation checkpointing), inference (speculative decoding, continuous batching), and latency-sensitive scenarios (heterogeneous kernels, tiny kernel dispatch).
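
One way to read the results of a stream-count sweep is to compare throughput at each stream count against ideal linear scaling from the single-stream baseline. The sketch below is only that arithmetic: the function name, the input shape, and the throughput numbers are hypothetical placeholders, not the output format or metrics of `aorta.hw_queue_eval`.

```python
from typing import Dict


def scaling_efficiency(throughput_by_streams: Dict[int, float]) -> Dict[int, float]:
    """Return throughput(n) / (n * throughput(1)) for every stream count n.

    1.0 means perfect linear scaling; lower values suggest queue contention
    or another shared bottleneck.
    """
    if 1 not in throughput_by_streams:
        raise ValueError("a single-stream baseline (key 1) is required")
    baseline = throughput_by_streams[1]
    return {
        streams: throughput / (streams * baseline)
        for streams, throughput in sorted(throughput_by_streams.items())
    }


# Placeholder numbers for illustration only, not measurements.
sweep = {1: 100.0, 2: 190.0, 4: 340.0, 8: 520.0}
for streams, efficiency in scaling_efficiency(sweep).items():
    print(f"{streams:>3} streams: {efficiency:.2f}")
```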

Environment Snapshot for Reproducibility

Capture a versioned, schema-stable snapshot of the trial environment (ROCm / HIP / hipBLASLt / rocBLAS / MIOpen / RCCL identities, GPU arch, PyTorch build flags + cmake cache + per-target HIPCC defines, runtime SDPA backend state, ~30 numerics-relevant env vars) so cross-environment regressions become a jq diff instead of a multi-day investigation. It can be used standalone (aorta env probe) and is embedded automatically into every trial result.
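
For readers who prefer Python to jq, the same kind of snapshot comparison can be sketched by flattening two JSON documents into dotted keys and reporting the keys whose values differ. This is a minimal sketch: apart from `pytorch_build.git_commit`, the field names and values below are made up for the example and are not the actual env.json schema.

```python
from typing import Any, Dict, Iterator, Tuple

MISSING = "<missing>"  # marker for a key present in only one snapshot


def flatten(node: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_key, value) pairs; lists and scalars are treated as leaves."""
    if isinstance(node, dict):
        for key in sorted(node):
            yield from flatten(node[key], f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, node


def diff_snapshots(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing dotted key to its (value_in_a, value_in_b) pair."""
    flat_a, flat_b = dict(flatten(a)), dict(flatten(b))
    return {
        key: (flat_a.get(key, MISSING), flat_b.get(key, MISSING))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key, MISSING) != flat_b.get(key, MISSING)
    }


# Hypothetical snapshot contents. In practice, load real files with
# json.load(open("env_a.json")) and json.load(open("env_b.json")).
env_a = {"pytorch_build": {"git_commit": "aaa111"}, "gpu": {"arch": "example-arch"}}
env_b = {"pytorch_build": {"git_commit": "bbb222"}, "gpu": {"arch": "example-arch"}}

for key, (old, new) in diff_snapshots(env_a, env_b).items():
    print(f"{key}: {old} -> {new}")
```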

LLM Determinism Probe

Catch kernel-level nondeterminism / silent data corruption in a transformer training step. Runs the same forward+backward twice on the same inputs (params + RNG restored between runs) and compares bit-exact checksums of every per-block boundary activation, loss, logits, every grad, and every param. FSDP2-aware, RCCL-safe, optional MoE. See docs/llm-determinism.md for the quick start.
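
The run-twice-and-compare idea can be shown with a toy, CPU-only sketch: save the RNG state and parameters, run a step, restore both, run again, and compare SHA-256 checksums of each output's raw bytes. Everything here (the toy step, the function names, the use of Python's `random` module instead of PyTorch tensors) is an assumption for illustration and is not the probe's actual implementation.

```python
import hashlib
import random
import struct
from typing import Callable, Dict, List

Outputs = Dict[str, List[float]]


def checksum(values: List[float]) -> str:
    """SHA-256 over the exact IEEE-754 bytes, so any bit flip changes the digest."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()


def toy_step(params: List[float], rng: random.Random) -> Outputs:
    """Stand-in for forward+backward; the RNG plays the role of dropout noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in params]
    activations = [p * n for p, n in zip(params, noise)]
    loss = sum(a * a for a in activations)
    grads = [2.0 * a * n for a, n in zip(activations, noise)]
    return {"activations": activations, "loss": [loss], "grads": grads}


def run_twice_and_compare(
    step: Callable[[List[float], random.Random], Outputs],
    params: List[float],
    seed: int,
) -> List[str]:
    """Return the names of outputs whose checksums differ between two runs."""
    rng = random.Random(seed)
    rng_state = rng.getstate()
    saved_params = list(params)

    first = {name: checksum(v) for name, v in step(params, rng).items()}

    # Restore RNG state and parameters so the second run sees identical inputs.
    rng.setstate(rng_state)
    params[:] = saved_params

    second = {name: checksum(v) for name, v in step(params, rng).items()}
    return [name for name in first if first[name] != second[name]]


_calls = {"n": 0}


def flaky_step(params: List[float], rng: random.Random) -> Outputs:
    """A deliberately nondeterministic step: the loss depends on a call counter."""
    outputs = toy_step(params, rng)
    _calls["n"] += 1
    outputs["loss"] = [outputs["loss"][0] + _calls["n"]]
    return outputs


print(run_twice_and_compare(toy_step, [0.1, 0.2, 0.3], seed=0))    # []
print(run_twice_and_compare(flaky_step, [0.1, 0.2, 0.3], seed=0))  # ['loss']
```

Comparing checksums per named output, rather than one checksum over everything, narrows down where the two runs first diverge.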

Quick Start

# FSDP2 overlap benchmark
bash scripts/launch_rocm.sh config/default.yaml

# Hardware queue evaluation
python -m aorta.hw_queue_eval list                          # List workloads
python -m aorta.hw_queue_eval run hetero_kernels --streams 8
python -m aorta.hw_queue_eval sweep hetero_kernels --streams 1,2,4,8,16

# Comm-compute overlap (simulated collectives)
python -m aorta.hw_queue_eval run comms_compute_overlap --streams 4 --profile

# Comm-compute overlap (real NCCL collectives via torchrun)
torchrun --nproc_per_node=8 -m aorta.hw_queue_eval run comms_compute_overlap \
    --streams 4 --real-collectives --async-op --backend nccl \
    --process-groups "[0,1,2,3,4,5,6,7]" --profile --profile-dir traces/

# Environment snapshot for reproducibility
aorta env probe -o env.json                               # full snapshot to disk
aorta env probe --summary                                 # one-screen brief, no file write
aorta env probe --field pytorch_build.git_commit          # one field, JSON-typed
diff <(jq -S . env_a.json) <(jq -S . env_b.json)          # diff two snapshots

Example Analysis

AORTA generates comprehensive performance reports comparing ROCm versions across multiple configurations. See a full example report comparing rocm-7.0.8-meta vs rocm-7.0.10-meta:

  • 8 configurations tested: 256/512 threads × 28/42/56/70 RCCL channels
  • 96 visualizations: Overlap ratios, GEMM throughput, NCCL metrics, timeline comparisons
  • Side-by-side diffs: Identify regressions or improvements between driver/library versions
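
The 8 configurations are the cross product of the two thread counts and the four RCCL channel counts. A minimal sketch of enumerating that matrix follows; the dictionary keys are hypothetical labels chosen for this example, not AORTA config fields or RCCL environment variable names.

```python
from itertools import product
from typing import Dict, List

THREADS = (256, 512)
RCCL_CHANNELS = (28, 42, 56, 70)


def config_matrix() -> List[Dict[str, int]]:
    """Enumerate every (threads, channels) combination in a stable order."""
    return [
        {"threads": threads, "rccl_channels": channels}
        for threads, channels in product(THREADS, RCCL_CHANNELS)
    ]


configs = config_matrix()
print(len(configs))  # 8
for cfg in configs:
    print(cfg)
```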

Documentation

| Guide | Description |
| --- | --- |
| Getting Started | Prerequisites, Docker setup, installation |
| Running the Benchmark | Launch scripts, torch.compile, direct invocation |
| Hardware Queue Eval | Workloads, CLI usage, metrics |
| Configuration | FSDP tuning, RCCL variables, profiler settings |
| Profiling | Torch profiler, rocprofv3, overlap reports |
| Environment Probe | Capture / diff / query a versioned environment snapshot; jq cookbook |
| aorta probe | Wrap-and-collect opaque launch commands; matrix + classifier |
| aorta agent | Closed-loop mitigation search (optional LLM proposer) |
| aorta bundle | Package probe artifacts with recipe-driven redaction |
| Buck2 | Build / run the AORTA CLI via Buck2; hermetic Python, env probe + triage walkthroughs |
| Releasing | Cut a customer-installable release; maintainer + customer install flow |
| Troubleshooting | Common issues |

Repository Layout

src/aorta/
├── training/          # FSDP2 trainer with multi-stream overlap instrumentation
├── hw_queue_eval/     # Hardware queue evaluation framework
├── models/            # Synthetic ranking transformer
├── profiling/         # Stream profiler for overlap measurement
├── instrumentation/   # Environment probe (versioned env.json schema + capture)
├── registry/          # Mitigations + environments registry (extension points)
├── cli/               # `aorta` CLI command groups (run, env probe, ...)
└── utils/             # Config loading, timing, device detection

config/                # YAML configurations for different scenarios
scripts/               # Launch scripts, profiling, analysis tools
analysis/              # Overlap report generation

Installation

From PyPI (recommended for users)

PyTorch is installed separately from the ROCm index (it is not bundled in the wheel), so install it first, then AORTA from PyPI:

# Install PyTorch for your ROCm version (adjust the index URL accordingly)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/

# Install AORTA (distribution: amd-aorta; import package: aorta)
pip install amd-aorta                 # latest stable
pip install "amd-aorta[hw-queue]"     # with optional extras

amd-aorta lands on PyPI with the first stable release cut after this merges (PyPI Trusted Publishing is a one-time setup). Until then, or if pip install amd-aorta fails because PyPI isn't populated yet, install from the GitHub Release assets instead.

Prefer the GitHub Release assets, or need a pre-release nightly? See docs/releasing.md for the GitHub-Release install, the nightly dev-wheels channel, and how releases are cut.
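
Because the distribution name (amd-aorta) differs from the import package (aorta) and PyTorch is installed separately, a quick post-install check can confirm that all three pieces are visible to the interpreter. The helper names below are made up for this sketch; it uses only the standard library and does not import AORTA or PyTorch itself.

```python
from importlib import metadata
from importlib.util import find_spec
from typing import Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_install() -> Dict[str, object]:
    """Report whether the AORTA distribution, its import package, and torch are present."""
    return {
        "amd-aorta version": installed_version("amd-aorta"),
        "aorta importable": find_spec("aorta") is not None,
        "torch version": installed_version("torch"),
    }


for name, value in check_install().items():
    print(f"{name}: {value}")
```

A `None` version for torch usually means the PyTorch install step was skipped or ran in a different environment.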

From source (for development)

We recommend using uv for fast, reliable Python environment management.

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment
uv venv && source .venv/bin/activate

# Install PyTorch nightly for ROCm 7.1
uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/

# Install remaining dependencies
uv pip install -r requirements.txt

# For full installation including hw_queue_eval
uv pip install -e ".[hw-queue]"

Development

uv pip install -r requirements-dev.txt
pre-commit install
pytest tests/

The FSDP2 overlap workloads also run on NVIDIA CUDA for side-by-side comparison with ROCm.
