PyTorch compute-communication overlap debugging toolkit with GPU hardware queue evaluation
AORTA
GPU performance benchmarking and debugging toolkit for PyTorch workloads on AMD ROCm.
What It Does
FSDP2 Compute-Communication Overlap Analysis
Debug why distributed training isn't overlapping compute with communication. Runs a synthetic transformer workload with explicit multi-stream execution, captures per-iteration timing, and generates overlap efficiency reports.
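As a rough illustration of what an overlap efficiency figure can mean, the sketch below computes the fraction of communication time that runs concurrently with compute, given two lists of (start, end) intervals. The function names and this particular definition of efficiency are assumptions made for illustration; AORTA's reports may define the metric differently.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms), hypothetical units


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so time is not double-counted."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_efficiency(compute: List[Interval], comm: List[Interval]) -> float:
    """Fraction of communication time hidden under compute (assumed metric)."""
    compute_m, comm_m = merge(compute), merge(comm)
    comm_total = sum(end - start for start, end in comm_m)
    if comm_total == 0:
        return 0.0
    hidden = 0.0
    for c_start, c_end in comm_m:
        for k_start, k_end in compute_m:
            # Length of the intersection of the two intervals, if any.
            hidden += max(0.0, min(c_end, k_end) - max(c_start, k_start))
    return hidden / comm_total


compute = [(0.0, 10.0), (12.0, 20.0)]
comm = [(8.0, 14.0)]
print(f"{overlap_efficiency(compute, comm):.2f}")  # prints 0.67
```

In this example the collective spans 6 ms, of which 4 ms coincide with compute kernels, so two thirds of the communication cost is hidden.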
Hardware Queue Evaluation
Stress-test GPU queue scheduling with 8-64+ concurrent streams. Includes 15 workloads covering distributed training patterns (FSDP, MoE, activation checkpointing), inference (speculative decoding, continuous batching), and latency-sensitive scenarios (heterogeneous kernels, tiny kernel dispatch).
Environment Snapshot for Reproducibility
Capture a versioned, schema-stable snapshot of the trial environment — ROCm / HIP / hipBLASLt / rocBLAS / MIOpen / RCCL identities, GPU arch, PyTorch build flags + cmake cache + per-target HIPCC defines, runtime SDPA backend state, ~30 numerics-relevant env vars — so cross-environment regressions become a jq diff instead of a multi-day investigation. Used standalone (aorta env probe) and embedded automatically into every trial result.
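For cases where jq is not available, the same comparison can be approximated in Python. The sketch below flattens two snapshots into dotted keys and reports the fields that differ. The helper names and the `gpu.arch` key are hypothetical; only `pytorch_build.git_commit` is taken from this README, and the real env.json schema may be laid out differently.

```python
import json
from typing import Any, Dict, Tuple


def load(path: str) -> Dict[str, Any]:
    """Read a snapshot written by `aorta env probe -o <path>`."""
    with open(path) as fh:
        return json.load(fh)


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. pytorch_build.git_commit."""
    if not isinstance(obj, dict):
        return {prefix: obj}
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        dotted = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(value, dotted))
    return flat


def diff_snapshots(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing field."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key), flat_b.get(key))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key) != flat_b.get(key)
    }


# Illustrative in-memory snapshots; the values are made up.
snap_a = {"pytorch_build": {"git_commit": "aaaa111"}, "gpu": {"arch": "gfx942"}}
snap_b = {"pytorch_build": {"git_commit": "bbbb222"}, "gpu": {"arch": "gfx942"}}
for key, (old, new) in diff_snapshots(snap_a, snap_b).items():
    print(f"{key}: {old} -> {new}")
```

With real files, `diff_snapshots(load("env_a.json"), load("env_b.json"))` would take the place of the in-memory dictionaries.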
LLM Determinism Probe
Catch kernel-level nondeterminism / silent data corruption in a transformer training step. Runs the same forward+backward twice on the same inputs (params + RNG restored between runs) and compares bit-exact checksums of every per-block boundary activation, loss, logits, every grad, and every param. FSDP2-aware, RCCL-safe, optional MoE. See docs/llm-determinism.md for the quick start.
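The run-twice-and-compare idea can be shown with a toy, standard-library-only sketch. Everything here is a stand-in: `toy_step` replaces a real forward+backward pass, and the function names are hypothetical. It is not the probe's implementation, only the shape of the technique (restore state, rerun, compare checksums of the raw bytes).

```python
import hashlib
import random
import struct
from typing import Dict, List


def checksum(values: List[float]) -> str:
    """Bit-exact checksum: hash the raw IEEE-754 bytes, not a rounded repr."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()


def toy_step(params: List[float], rng: random.Random) -> Dict[str, List[float]]:
    """Stand-in for forward+backward; returns named tensors to checksum."""
    noise = [rng.gauss(0.0, 1.0) for _ in params]  # randomness that must replay
    activations = [p * n for p, n in zip(params, noise)]
    loss = [sum(a * a for a in activations)]
    grads = [2.0 * a * n for a, n in zip(activations, noise)]
    return {"activations": activations, "loss": loss, "grads": grads}


def probe(params: List[float], seed: int) -> List[str]:
    """Run the step twice from identical state; return the names that differ."""
    runs = []
    for _ in range(2):
        rng = random.Random(seed)  # restore RNG state before each run
        state = list(params)       # restore params before each run
        out = toy_step(state, rng)
        runs.append({name: checksum(vals) for name, vals in out.items()})
    return [name for name in runs[0] if runs[0][name] != runs[1][name]]


print(probe([0.1, 0.2, 0.3], seed=0))  # prints [] because the toy step is deterministic
```

Hashing the packed bytes rather than printed values matters: `0.0` and `-0.0` compare equal as floats but produce different checksums, which is the level of strictness a bit-exact comparison needs.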
Quick Start
# FSDP2 overlap benchmark
bash scripts/launch_rocm.sh config/default.yaml
# Hardware queue evaluation
python -m aorta.hw_queue_eval list # List workloads
python -m aorta.hw_queue_eval run hetero_kernels --streams 8
python -m aorta.hw_queue_eval sweep hetero_kernels --streams 1,2,4,8,16
# Comm-compute overlap (simulated collectives)
python -m aorta.hw_queue_eval run comms_compute_overlap --streams 4 --profile
# Comm-compute overlap (real NCCL collectives via torchrun)
torchrun --nproc_per_node=8 -m aorta.hw_queue_eval run comms_compute_overlap \
--streams 4 --real-collectives --async-op --backend nccl \
--process-groups "[0,1,2,3,4,5,6,7]" --profile --profile-dir traces/
# Environment snapshot for reproducibility
aorta env probe -o env.json # full snapshot to disk
aorta env probe --summary # one-screen brief, no file write
aorta env probe --field pytorch_build.git_commit # one field, JSON-typed
diff <(jq -S . env_a.json) <(jq -S . env_b.json) # diff two snapshots
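The built-in `sweep` subcommand covers the common case. When each stream count needs its own process (for example to capture separate logs), the `run` invocation shown above can be generated from Python. This is a minimal sketch that only builds the argument lists; `sweep_commands` is a hypothetical helper, not part of AORTA.

```python
import shlex
import sys
from typing import List, Sequence


def sweep_commands(workload: str, streams: Sequence[int]) -> List[List[str]]:
    """Build one `python -m aorta.hw_queue_eval run ...` argv per stream count."""
    return [
        [sys.executable, "-m", "aorta.hw_queue_eval", "run", workload, "--streams", str(n)]
        for n in streams
    ]


for argv in sweep_commands("hetero_kernels", [1, 2, 4, 8, 16]):
    # Print the command; hand argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```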
Example Analysis
AORTA generates comprehensive performance reports comparing ROCm versions across multiple configurations. See a full example report comparing rocm-7.0.8-meta vs rocm-7.0.10-meta:
- 8 configurations tested: 256/512 threads × 28/42/56/70 RCCL channels
- 96 visualizations: Overlap ratios, GEMM throughput, NCCL metrics, timeline comparisons
- Side-by-side diffs: Identify regressions or improvements between driver/library versions
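The eight configurations are the cross product of the two thread counts and the four RCCL channel counts listed above. A short sketch of enumerating that matrix follows; the dictionary keys are illustrative names, not AORTA configuration fields.

```python
from itertools import product

THREADS = (256, 512)
RCCL_CHANNELS = (28, 42, 56, 70)

# One entry per (threads, channels) pair: 2 x 4 = 8 configurations.
configs = [
    {"threads": threads, "rccl_channels": channels}
    for threads, channels in product(THREADS, RCCL_CHANNELS)
]

print(len(configs))  # prints 8
```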
Documentation
| Guide | Description |
|---|---|
| Getting Started | Prerequisites, Docker setup, installation |
| Running the Benchmark | Launch scripts, torch.compile, direct invocation |
| Hardware Queue Eval | Workloads, CLI usage, metrics |
| Configuration | FSDP tuning, RCCL variables, profiler settings |
| Profiling | Torch profiler, rocprofv3, overlap reports |
| Environment Probe | Capture / diff / query a versioned environment snapshot; jq cookbook |
| aorta probe | Wrap-and-collect opaque launch commands; matrix + classifier |
| aorta agent | Closed-loop mitigation search (optional LLM proposer) |
| aorta bundle | Package probe artifacts with recipe-driven redaction |
| Buck2 | Build / run the AORTA CLI via Buck2; hermetic Python, env probe + triage walkthroughs |
| Releasing | Cut a customer-installable release; maintainer + customer install flow |
| Troubleshooting | Common issues |
Repository Layout
src/aorta/
├── training/ # FSDP2 trainer with multi-stream overlap instrumentation
├── hw_queue_eval/ # Hardware queue evaluation framework
├── models/ # Synthetic ranking transformer
├── profiling/ # Stream profiler for overlap measurement
├── instrumentation/ # Environment probe (versioned env.json schema + capture)
├── registry/ # Mitigations + environments registry (extension points)
├── cli/ # `aorta` CLI command groups (run, env probe, ...)
└── utils/ # Config loading, timing, device detection
config/ # YAML configurations for different scenarios
scripts/ # Launch scripts, profiling, analysis tools
analysis/ # Overlap report generation
Installation
From PyPI (recommended for users)
PyTorch is installed separately from the ROCm index (it is not bundled in the wheel), so install it first, then AORTA from PyPI:
# Install PyTorch for your ROCm version (adjust the index URL accordingly)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/
# Install AORTA (distribution: amd-aorta; import package: aorta)
pip install amd-aorta # latest stable
pip install "amd-aorta[hw-queue]" # with optional extras
amd-aorta lands on PyPI with the first stable release cut after this merges
(PyPI Trusted Publishing is a one-time setup). Until then -- or if
pip install amd-aorta fails because PyPI isn't populated yet -- install from
the GitHub Release assets instead.
Prefer the GitHub Release assets, or need a pre-release nightly? See
docs/releasing.md for the GitHub-Release install, the
nightly dev-wheels channel, and how releases are cut.
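Because the distribution name (amd-aorta) differs from the import package (aorta), a quick way to confirm which install path succeeded is to query the installed distribution metadata. A minimal sketch using only the standard library follows; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "amd-aorta") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"amd-aorta {version}" if version else "amd-aorta is not installed")
```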
From source (for development)
We recommend using uv for fast, reliable Python environment management.
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate a virtual environment
uv venv && source .venv/bin/activate
# Install PyTorch nightly for ROCm 7.1
uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/
# Install remaining dependencies
uv pip install -r requirements.txt
# For full installation including hw_queue_eval
uv pip install -e ".[hw-queue]"
Development
uv pip install -r requirements-dev.txt
pre-commit install
pytest tests/
The FSDP2 overlap workloads also run on NVIDIA CUDA for side-by-side comparison with ROCm.
File details
Details for the file amd_aorta-0.2.0.tar.gz.
File metadata
- Download URL: amd_aorta-0.2.0.tar.gz
- Upload date:
- Size: 584.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7db4c03cce258b0cc7da2ad32ed47e51e39448099a909dab3a65c95ab95a951f |
| MD5 | acc477cb3683e26acc324672d6d54422 |
| BLAKE2b-256 | 0163bce0538dfe9efe39ccb4ffd386ba6a3ac215b8f9d98daf92b2bb5206aa66 |
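A downloaded file can be checked against the published SHA256 digest before installing. The sketch below is one way to do that with the standard library; the helper names are hypothetical, and the local file path is whatever the download was saved as.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `verify(Path("amd_aorta-0.2.0.tar.gz"), "7db4c03cce258b0cc7da2ad32ed47e51e39448099a909dab3a65c95ab95a951f")` should return `True` for an intact download.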
Provenance
The following attestation bundles were made for amd_aorta-0.2.0.tar.gz:
Publisher: release.yml on ROCm/aorta
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amd_aorta-0.2.0.tar.gz
- Subject digest: 7db4c03cce258b0cc7da2ad32ed47e51e39448099a909dab3a65c95ab95a951f
- Sigstore transparency entry: 1965520157
- Sigstore integration time:
- Permalink: ROCm/aorta@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ROCm
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Trigger Event: workflow_dispatch
File details
Details for the file amd_aorta-0.2.0-py3-none-any.whl.
File metadata
- Download URL: amd_aorta-0.2.0-py3-none-any.whl
- Upload date:
- Size: 695.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b72e4bbdcb4168223f2869f13ad1ccc286c0131cd3492191a56ade328faa82f6 |
| MD5 | e88d531736990fa9acc9dfb413ed09b3 |
| BLAKE2b-256 | 9eff196aca7c0f98193f9e71030b82fef7006583545500a684dc4382469cb71a |
Provenance
The following attestation bundles were made for amd_aorta-0.2.0-py3-none-any.whl:
Publisher: release.yml on ROCm/aorta
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amd_aorta-0.2.0-py3-none-any.whl
- Subject digest: b72e4bbdcb4168223f2869f13ad1ccc286c0131cd3492191a56ade328faa82f6
- Sigstore transparency entry: 1965520360
- Sigstore integration time:
- Permalink: ROCm/aorta@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ROCm
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Trigger Event: workflow_dispatch