PyTorch compute-communication overlap debugging toolkit with GPU hardware queue evaluation

These details have been verified by PyPI

Project links

Owner

Advanced Micro Devices

GitHub Statistics

Maintainers

vivekag

These details have not been verified by PyPI

Project description

AORTA

GPU performance benchmarking and debugging toolkit for PyTorch workloads on AMD ROCm.

[Image: Training Overlap Issue]

What It Does

FSDP2 Compute-Communication Overlap Analysis: Debug why distributed training isn't overlapping compute with communication. Runs a synthetic transformer workload with explicit multi-stream execution, captures per-iteration timing, and generates overlap efficiency reports.
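As a rough illustration of what an overlap efficiency figure can mean, the sketch below computes the fraction of communication time that runs concurrently with compute, given two lists of (start, end) intervals. The interval format and the function names `merge` and `overlap_ratio` are assumptions made for this example; they are not AORTA's report schema or API.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms); hypothetical format


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so no time span is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_ratio(compute: List[Interval], comm: List[Interval]) -> float:
    """Fraction of communication time that runs concurrently with compute."""
    compute_m, comm_m = merge(compute), merge(comm)
    comm_total = sum(end - start for start, end in comm_m)
    if comm_total == 0:
        return 0.0
    hidden = 0.0
    for c_start, c_end in comm_m:
        for k_start, k_end in compute_m:
            # Length of the intersection of the two spans (0 if disjoint).
            hidden += max(0.0, min(c_end, k_end) - max(c_start, k_start))
    return hidden / comm_total


# Made-up timings: 6 ms of communication, 4 ms of it hidden under compute.
compute_spans = [(0.0, 10.0), (12.0, 20.0)]
comm_spans = [(8.0, 14.0)]
print(f"{overlap_ratio(compute_spans, comm_spans):.2f}")  # → 0.67
```

A ratio near 1.0 would mean communication is almost fully hidden behind compute; a ratio near 0.0 would mean the two run back to back.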

[Image: param_sweep]

Hardware Queue Evaluation: Stress-test GPU queue scheduling with 8-64+ concurrent streams. Includes 15 workloads covering distributed training patterns (FSDP, MoE, activation checkpointing), inference (speculative decoding, continuous batching), and latency-sensitive scenarios (heterogeneous kernels, tiny kernel dispatch).

[Image: hw_queue_cmds]

Environment Snapshot for Reproducibility: Capture a versioned, schema-stable snapshot of the trial environment (ROCm / HIP / hipBLASLt / rocBLAS / MIOpen / RCCL identities, GPU arch, PyTorch build flags + cmake cache + per-target HIPCC defines, runtime SDPA backend state, ~30 numerics-relevant env vars) so that cross-environment regressions become a jq diff instead of a multi-day investigation. It can be used standalone (aorta env probe) and is embedded automatically into every trial result.

LLM Determinism Probe: Catch kernel-level nondeterminism / silent data corruption in a transformer training step. Runs the same forward+backward twice on the same inputs (params + RNG restored between runs) and compares bit-exact checksums of every per-block boundary activation, loss, logits, every grad, and every param. FSDP2-aware, RCCL-safe, optional MoE. See docs/llm-determinism.md for the quick start.
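The run-twice-and-compare idea can be sketched in plain Python. This is a toy stand-in under stated assumptions, not the probe's implementation: `toy_step` is a hypothetical function that imitates a forward+backward pass, and the checksums are SHA-256 digests of the raw float bytes rather than whatever the probe uses internally.

```python
import hashlib
import random
import struct
from typing import Callable, Dict, List


def checksum(values: List[float]) -> str:
    """Bit-exact digest: hashes the raw IEEE-754 bytes, not rounded text."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()


def toy_step(inputs: List[float], rng: random.Random) -> Dict[str, List[float]]:
    """Hypothetical stand-in for a forward+backward pass (not a real model)."""
    noise = [rng.random() for _ in inputs]
    activation = [x * 0.5 + n for x, n in zip(inputs, noise)]
    loss = [sum(a * a for a in activation)]
    grad = [2.0 * a * 0.5 for a in activation]
    return {"activation": activation, "loss": loss, "grad": grad}


def find_mismatches(step: Callable, inputs: List[float], seed: int) -> List[str]:
    """Run the step twice from the same RNG state; list outputs whose digests differ."""
    first = step(inputs, random.Random(seed))
    second = step(inputs, random.Random(seed))  # RNG state restored between runs
    return [name for name in first if checksum(first[name]) != checksum(second[name])]


print(find_mismatches(toy_step, [1.0, 2.0, 3.0], seed=0))  # → []
```

An empty list means every compared output matched bit for bit; any name in the list points at the first place to look for a nondeterministic kernel.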

Quick Start

# FSDP2 overlap benchmark
bash scripts/launch_rocm.sh config/default.yaml

# Hardware queue evaluation
python -m aorta.hw_queue_eval list                          # List workloads
python -m aorta.hw_queue_eval run hetero_kernels --streams 8
python -m aorta.hw_queue_eval sweep hetero_kernels --streams 1,2,4,8,16

# Comm-compute overlap (simulated collectives)
python -m aorta.hw_queue_eval run comms_compute_overlap --streams 4 --profile

# Comm-compute overlap (real NCCL collectives via torchrun)
torchrun --nproc_per_node=8 -m aorta.hw_queue_eval run comms_compute_overlap \
    --streams 4 --real-collectives --async-op --backend nccl \
    --process-groups "[0,1,2,3,4,5,6,7]" --profile --profile-dir traces/

# Environment snapshot for reproducibility
aorta env probe -o env.json                               # full snapshot to disk
aorta env probe --summary                                 # one-screen brief, no file write
aorta env probe --field pytorch_build.git_commit          # one field, JSON-typed
diff <(jq -S . env_a.json) <(jq -S . env_b.json)          # diff two snapshots
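Where jq is not available, the same comparison can be approximated in Python. The sketch below flattens two snapshots into dotted keys (matching the `pytorch_build.git_commit` style used by `--field`) and prints the fields that differ. The snapshot contents here are placeholders: apart from `pytorch_build.git_commit`, the key names and all values are invented for illustration and are not taken from the env.json schema.

```python
import json
from typing import Any, Dict, Tuple


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. pytorch_build.git_commit."""
    if not isinstance(obj, dict):
        return {prefix: obj}
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(value, path))
    return flat


def diff_snapshots(a: dict, b: dict) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing field."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key), flat_b.get(key))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key) != flat_b.get(key)
    }


# Placeholder snapshots; real ones would be loaded from env_a.json and env_b.json.
env_a = json.loads('{"pytorch_build": {"git_commit": "aaa111"}, "gpu_arch": "gfx942"}')
env_b = json.loads('{"pytorch_build": {"git_commit": "bbb222"}, "gpu_arch": "gfx942"}')
for key, (old, new) in diff_snapshots(env_a, env_b).items():
    print(f"{key}: {old} -> {new}")  # → pytorch_build.git_commit: aaa111 -> bbb222
```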

Example Analysis

AORTA generates performance reports that compare ROCm versions across multiple configurations. The full example report compares rocm-7.0.8-meta with rocm-7.0.10-meta and includes:

8 configurations tested: 256/512 threads × 28/42/56/70 RCCL channels
96 visualizations: Overlap ratios, GEMM throughput, NCCL metrics, timeline comparisons
Side-by-side diffs: Identify regressions or improvements between driver/library versions
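The 8 configurations are the cross product of the two thread counts and the four RCCL channel counts. A minimal sketch of enumerating that matrix follows; the dictionary key names are illustrative only and are not AORTA config fields.

```python
from itertools import product

threads = [256, 512]
rccl_channels = [28, 42, 56, 70]

# Key names below are hypothetical labels, not AORTA configuration keys.
configs = [
    {"threads": t, "rccl_channels": c}
    for t, c in product(threads, rccl_channels)
]

print(len(configs))  # → 8
for cfg in configs:
    print(cfg)
```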

[Image: Overlap Breakdown]

Documentation

Guide	Description
Getting Started	Prerequisites, Docker setup, installation
Running the Benchmark	Launch scripts, torch.compile, direct invocation
Hardware Queue Eval	Workloads, CLI usage, metrics
Configuration	FSDP tuning, RCCL variables, profiler settings
Profiling	Torch profiler, rocprofv3, overlap reports
Environment Probe	Capture / diff / query a versioned environment snapshot; jq cookbook
`aorta probe`	Wrap-and-collect opaque launch commands; matrix + classifier
`aorta agent`	Closed-loop mitigation search (optional LLM proposer)
`aorta bundle`	Package probe artifacts with recipe-driven redaction
Buck2	Build / run the AORTA CLI via Buck2; hermetic Python, `env probe` + `triage` walkthroughs
Releasing	Cut a customer-installable release; maintainer + customer install flow
Troubleshooting	Common issues

Repository Layout

src/aorta/
├── training/          # FSDP2 trainer with multi-stream overlap instrumentation
├── hw_queue_eval/     # Hardware queue evaluation framework
├── models/            # Synthetic ranking transformer
├── profiling/         # Stream profiler for overlap measurement
├── instrumentation/   # Environment probe (versioned env.json schema + capture)
├── registry/          # Mitigations + environments registry (extension points)
├── cli/               # `aorta` CLI command groups (run, env probe, ...)
└── utils/             # Config loading, timing, device detection

config/                # YAML configurations for different scenarios
scripts/               # Launch scripts, profiling, analysis tools
analysis/              # Overlap report generation

Installation

From PyPI (recommended for users)

PyTorch is installed separately from the ROCm index (it is not bundled in the wheel), so install it first, then AORTA from PyPI:

# Install PyTorch for your ROCm version (adjust the index URL accordingly)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/

# Install AORTA (distribution: amd-aorta; import package: aorta)
pip install amd-aorta                 # latest stable
pip install "amd-aorta[hw-queue]"     # with optional extras

amd-aorta lands on PyPI with the first stable release cut after this merges (PyPI Trusted Publishing is a one-time setup). Until then -- or if pip install amd-aorta fails because PyPI isn't populated yet -- install from the GitHub Release assets instead.

Prefer the GitHub Release assets, or need a pre-release nightly? See docs/releasing.md for the GitHub-Release install, the nightly dev-wheels channel, and how releases are cut.

From source (for development)

We recommend using uv for fast, reliable Python environment management.

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate a virtual environment
uv venv && source .venv/bin/activate

# Install PyTorch nightly for ROCm 7.1
uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1/

# Install remaining dependencies
uv pip install -r requirements.txt

# For full installation including hw_queue_eval
uv pip install -e ".[hw-queue]"

Development

uv pip install -r requirements-dev.txt
pre-commit install
pytest tests/

The FSDP2 overlap workloads also run on NVIDIA CUDA for side-by-side comparison with ROCm.

Project details

Release history

0.2.0 (this version): Jun 26, 2026

0.0.1: Jun 23, 2026

Download files

Source Distribution

amd_aorta-0.2.0.tar.gz (584.4 kB)

Uploaded Jun 26, 2026 Source

Built Distribution

amd_aorta-0.2.0-py3-none-any.whl (695.4 kB)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file amd_aorta-0.2.0.tar.gz.

File metadata

Download URL: amd_aorta-0.2.0.tar.gz
Upload date: Jun 26, 2026
Size: 584.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for amd_aorta-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7db4c03cce258b0cc7da2ad32ed47e51e39448099a909dab3a65c95ab95a951f`
MD5	`acc477cb3683e26acc324672d6d54422`
BLAKE2b-256	`0163bce0538dfe9efe39ccb4ffd386ba6a3ac215b8f9d98daf92b2bb5206aa66`
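A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch: the local path is an assumption about where the file was saved, and the helper names `sha256_of` and `verify` are made up for the example.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for amd_aorta-0.2.0.tar.gz
EXPECTED_SHA256 = "7db4c03cce258b0cc7da2ad32ed47e51e39448099a909dab3a65c95ab95a951f"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


sdist = Path("amd_aorta-0.2.0.tar.gz")  # assumed local download location
if sdist.exists():
    print("OK" if verify(sdist) else "MISMATCH")
```

pip's hash-checking mode (a requirements file with `--hash` entries, installed with `--require-hashes`) is the built-in alternative when the install itself should fail on a mismatch.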

Provenance

The following attestation bundles were made for amd_aorta-0.2.0.tar.gz:

Publisher: release.yml on ROCm/aorta

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amd_aorta-0.2.0.tar.gz
- Subject digest: 7db4c03cce258b0cc7da2ad32ed47e51e39448099a909dab3a65c95ab95a951f
- Sigstore transparency entry: 1965520157
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: ROCm/aorta@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ROCm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Trigger Event: workflow_dispatch

File details

Details for the file amd_aorta-0.2.0-py3-none-any.whl.

File metadata

Download URL: amd_aorta-0.2.0-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 695.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for amd_aorta-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b72e4bbdcb4168223f2869f13ad1ccc286c0131cd3492191a56ade328faa82f6`
MD5	`e88d531736990fa9acc9dfb413ed09b3`
BLAKE2b-256	`9eff196aca7c0f98193f9e71030b82fef7006583545500a684dc4382469cb71a`

Provenance

The following attestation bundles were made for amd_aorta-0.2.0-py3-none-any.whl:

Publisher: release.yml on ROCm/aorta

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amd_aorta-0.2.0-py3-none-any.whl
- Subject digest: b72e4bbdcb4168223f2869f13ad1ccc286c0131cd3492191a56ade328faa82f6
- Sigstore transparency entry: 1965520360
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: ROCm/aorta@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ROCm
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7ed0da1f498cbfb57cc83554c8c869f2701c886f
- Trigger Event: workflow_dispatch
