Differential debugger for CUDA/Triton GPU kernels
PRLX
Differential debugger for CUDA and Triton GPU kernels.
Run your kernel twice with different inputs (or a known-good vs buggy version). PRLX instruments every branch, captures per-warp execution traces, and diffs them — telling you exactly which warp diverged, at which instruction, and what each lane saw:
Site 0xfbe6edc1 (branch_kernel:12) 2 warps affected
Warp 1, event 3: Branch Direction
A: TAKEN B: NOT-TAKEN
Operand Snapshot (icmp sgt):
Lane A:lhs A:rhs B:lhs B:rhs
0 32 10 32 64 <<<
1 33 10 33 64 <<<
2 34 10 34 64 <<<
...
The threshold changed from 10 to 64. Lanes 0–31 of warp 1 each compared their value (lhs) against the threshold (rhs); in run A every lane passed the signed greater-than test, in run B none did. That's the bug.
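To make the snapshot concrete, the sketch below replays the comparison for one warp in plain Python. It assumes each lane holds the value 32 + lane (consistent with the three rows shown, but an assumption for the elided lanes) and that icmp sgt behaves as a signed greater-than. The function name is illustrative and not part of PRLX.

```python
def lanes_taken(base, threshold, lanes=32):
    """Return the per-lane outcome of `value > threshold` (signed greater-than)."""
    return [(base + lane) > threshold for lane in range(lanes)]

run_a = lanes_taken(base=32, threshold=10)  # run A: rhs = 10
run_b = lanes_taken(base=32, threshold=64)  # run B: rhs = 64

# Lanes whose branch outcome differs between the two runs.
diverged = [lane for lane, (a, b) in enumerate(zip(run_a, run_b)) if a != b]
print(f"{len(diverged)} of {len(run_a)} lanes diverged")
```

Under these assumptions every lane flips from taken to not-taken, which matches the `<<<` markers on each row of the snapshot.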
Install
pip install prlx
Needs CUDA 12+ and LLVM 18/19/20 on the host (for prlx compile and Triton integration). The differ and Python trace reader have no external deps.
From source (NVIDIA)
cmake -B build && cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .
Build deps: CMake 3.20+, LLVM/Clang 18–20, CUDA Toolkit, Rust stable.
From source (AMD ROCm)
cmake -B build -DPRLX_ENABLE_CUDA=OFF -DPRLX_ENABLE_HIP=ON
cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .
Build deps: CMake 3.20+, LLVM/Clang 18–20, ROCm 5.0+, Rust stable. The LLVM pass supports AMDGPU targets. The HIP runtime targets wave32 (RDNA) GPUs.
Usage
CUDA C
prlx compile kernel.cu -o kernel
PRLX_TRACE=a.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-a
PRLX_TRACE=b.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-b
prlx diff a.prlx b.prlx
Triton
import prlx
prlx.enable() # hooks the Triton compiler — no kernel changes needed
import os, triton
os.environ["PRLX_TRACE"] = "a.prlx"
my_kernel[grid](...)
os.environ["PRLX_TRACE"] = "b.prlx"
my_kernel[grid](...)
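If you switch trace files often, a small context manager keeps the environment tidy. This is a sketch, not part of the prlx package: trace_to is a hypothetical helper, and it assumes PRLX reads PRLX_TRACE at launch time, as the snippet above implies.

```python
import os
from contextlib import contextmanager

@contextmanager
def trace_to(path):
    """Temporarily point PRLX_TRACE at `path`, restoring the old value afterwards."""
    previous = os.environ.get("PRLX_TRACE")
    os.environ["PRLX_TRACE"] = path
    try:
        yield path
    finally:
        if previous is None:
            os.environ.pop("PRLX_TRACE", None)
        else:
            os.environ["PRLX_TRACE"] = previous

# Usage (my_kernel and grid are the placeholders from the snippet above):
# with trace_to("a.prlx"):
#     my_kernel[grid](...)
# with trace_to("b.prlx"):
#     my_kernel[grid](...)
```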
Python API
from prlx import read_trace, diff_traces
# Read traces directly
trace = read_trace("a.prlx")
print(trace.header.kernel_name, trace.num_warps, "warps")
for w in trace.warps():
    for ev in w.events:
        if ev.is_branch:
            print(f" warp {w.warp_idx}: site {ev.site_id:#x} {'T' if ev.branch_taken else 'NT'}")
# Or just run the differ
diff_traces("a.prlx", "b.prlx", history=True)
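The same iteration pattern can feed a quick summary. The sketch below counts branch outcomes per site using only the attributes shown above (warps(), events, is_branch, site_id, branch_taken). The helper name branch_histogram is hypothetical, and the stand-in objects exist only so the example runs without a real .prlx file.

```python
from collections import Counter
from types import SimpleNamespace as NS

def branch_histogram(trace):
    """Count (site_id, taken) pairs across all warps of a trace-like object."""
    counts = Counter()
    for warp in trace.warps():
        for ev in warp.events:
            if ev.is_branch:
                counts[(ev.site_id, bool(ev.branch_taken))] += 1
    return counts

# Stand-in for the object returned by read_trace (assumed shape, see above).
fake_trace = NS(warps=lambda: [
    NS(warp_idx=1, events=[
        NS(is_branch=True, site_id=0xfbe6edc1, branch_taken=True),
        NS(is_branch=False, site_id=0x1, branch_taken=False),
        NS(is_branch=True, site_id=0xfbe6edc1, branch_taken=True),
    ]),
])

hist = branch_histogram(fake_trace)
for (site, taken), n in sorted(hist.items()):
    print(f"site {site:#x} {'T' if taken else 'NT'}: {n}")
```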
Multi-Kernel Pipelines
Capture and diff entire GPU pipelines (multiple kernel launches):
// In your code:
prlx_session_begin(NULL);
kernel_A<<<grid, block>>>(...); // prlx_pre/post_launch called automatically
kernel_B<<<grid, block>>>(...);
prlx_session_end();
# Capture sessions
PRLX_SESSION=/tmp/session_a ./my_pipeline --param-a
PRLX_SESSION=/tmp/session_b ./my_pipeline --param-b
# Diff two sessions
prlx diff /tmp/session_a /tmp/session_b
# Or use the session subcommand:
prlx session diff /tmp/session_a /tmp/session_b
# Inspect a session manifest:
prlx session inspect /tmp/session_a
# Capture via CLI wrapper:
prlx session capture ./my_pipeline -o /tmp/session_a -- --param-a
Unmatched kernel launches between sessions are reported as warnings. Grid/block dimension mismatches are also flagged.
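As a rough illustration of what such a check involves, the sketch below pairs launches from two sessions and reports unmatched launches and dimension mismatches. The matching rule (kernel name plus n-th occurrence) and the dictionary layout are assumptions made for this example; the real session manifest format and matching logic may differ.

```python
from collections import defaultdict

def index_launches(launches):
    """Key each launch by (kernel name, n-th occurrence of that name)."""
    seen = defaultdict(int)
    keyed = {}
    for launch in launches:
        name = launch["kernel"]
        keyed[(name, seen[name])] = launch
        seen[name] += 1
    return keyed

def compare_sessions(a, b):
    """Return warning strings for unmatched launches and grid/block mismatches."""
    ka, kb = index_launches(a), index_launches(b)
    warnings = []
    for key in sorted(ka.keys() | kb.keys()):
        if key not in ka or key not in kb:
            side = "A" if key in ka else "B"
            warnings.append(f"unmatched launch {key} (only in session {side})")
        elif (ka[key]["grid"], ka[key]["block"]) != (kb[key]["grid"], kb[key]["block"]):
            warnings.append(f"dimension mismatch for {key}")
    return warnings

# Hypothetical launch records for two sessions.
session_a = [
    {"kernel": "kernel_A", "grid": (4, 1, 1), "block": (128, 1, 1)},
    {"kernel": "kernel_B", "grid": (4, 1, 1), "block": (128, 1, 1)},
]
session_b = [
    {"kernel": "kernel_A", "grid": (8, 1, 1), "block": (128, 1, 1)},
]
warnings = compare_sessions(session_a, session_b)
print("\n".join(warnings))
```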
PyTorch
import prlx
# Hooks Triton (torch.compile) + load_inline (C++ extensions) automatically
prlx.enable_pytorch()
model = MyModel().cuda()
output = model(input_tensor) # kernels are instrumented
# Or use the context manager for session tracing:
with prlx.pytorch_trace("my_model", output="/tmp/trace"):
    model(input_tensor)
# Run a script with PyTorch instrumentation
prlx pytorch run script.py
# NVBit fallback for pre-compiled ops (no recompilation needed)
prlx pytorch run --nvbit script.py
# Check integration status
prlx pytorch --info
Install the optional PyTorch dependency: pip install prlx[pytorch]
TUI
prlx diff a.prlx b.prlx --tui --map prlx-sites.json
Interactive terminal UI for navigating divergences across warps. Press s to toggle inline source view at divergence sites (requires --map for site-to-source mapping).
| Key | Action |
|---|---|
| j/k | Scroll up/down |
| n/N | Next/previous divergence |
| ]/[ | Next/previous warp |
| s | Toggle source view |
| Tab | Switch pane focus |
| / | Jump to warp by number |
| q | Quit |
CI Regression Gate
Automatically pass/fail based on divergence thresholds:
# Strict: zero divergences allowed (default)
prlx assert a.prlx b.prlx
# Tolerant: allow up to 5 divergences
prlx assert a.prlx b.prlx --max-divergences 5
# Golden mode: compare against a known-good trace
prlx assert --golden golden.prlx test.prlx
# JSON output for CI pipelines
prlx assert a.prlx b.prlx --json
# Ignore active mask differences (only count branch/path/value)
prlx assert a.prlx b.prlx --ignore-active-mask
Exit code 0 = pass, 1 = fail. Human-readable summary by default:
PRLX ASSERT: PASS (4 divergences, threshold: 5)
PRLX ASSERT: FAIL (4 divergences, threshold: 2)
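In a Python-based test harness, the gate can be driven through subprocess and the exit code. The sketch below uses only the flags shown above (--max-divergences, --ignore-active-mask, --json); the helper names are hypothetical, and it assumes prlx is on PATH when gate is called.

```python
import subprocess

def build_assert_cmd(trace_a, trace_b, max_divergences=None,
                     ignore_active_mask=False, json_output=False):
    """Assemble the `prlx assert` command line as an argument list."""
    cmd = ["prlx", "assert", trace_a, trace_b]
    if max_divergences is not None:
        cmd += ["--max-divergences", str(max_divergences)]
    if ignore_active_mask:
        cmd.append("--ignore-active-mask")
    if json_output:
        cmd.append("--json")
    return cmd

def gate(trace_a, trace_b, **options):
    """Return True when `prlx assert` exits with code 0 (pass)."""
    result = subprocess.run(build_assert_cmd(trace_a, trace_b, **options))
    return result.returncode == 0

# Show the command that a tolerant gate would run (nothing is executed here).
print(" ".join(build_assert_cmd("a.prlx", "b.prlx", max_divergences=5)))
```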
Flamegraph Export
Export divergences to Chrome Trace Format for visual analysis:
prlx flamegraph a.prlx b.prlx -o divergences.json --map prlx-sites.json
Open divergences.json in chrome://tracing or ui.perfetto.dev. Each row is a warp (grouped by block), colored bars show divergence events, and counter tracks show per-site frequency heatmaps.
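For readers unfamiliar with Chrome Trace Format, the sketch below shows how a divergence could be encoded as a "complete" event (ph "X"), with the block as the process id and the warp as the thread id so that each warp gets its own row. The input dictionary layout and the field mapping are assumptions for illustration; the JSON that prlx flamegraph actually emits may be organized differently.

```python
import json

def to_trace_events(divergences):
    """Map divergence records to a Chrome Trace Format document."""
    events = []
    for d in divergences:
        events.append({
            "name": d["kind"],
            "ph": "X",             # complete event: has a start (ts) and a duration (dur)
            "ts": d["event_idx"],  # event index stands in for the timestamp
            "dur": 1,
            "pid": d["block"],     # rows are grouped by block
            "tid": d["warp"],      # one row per warp
            "args": {"site": d["site"]},
        })
    return {"traceEvents": events}

# Hypothetical record modeled on the example output at the top (block 0 is assumed).
doc = to_trace_events([
    {"kind": "Branch Direction", "site": "0xfbe6edc1",
     "block": 0, "warp": 1, "event_idx": 3},
])
print(json.dumps(doc, indent=2))
```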
Environment Variables
| Variable | Default | What it does |
|---|---|---|
| PRLX_TRACE | trace.prlx | Output path |
| PRLX_SNAPSHOT_DEPTH | 0 | Per-lane operand ring buffer size |
| PRLX_HISTORY_DEPTH | 0 | Time-travel value ring buffer size |
| PRLX_SAMPLE_RATE | 1 | Record 1 in N events |
| PRLX_COMPRESS | 0 | zstd compress the trace |
| PRLX_ENABLED | 1 | Kill switch |
| PRLX_FILTER | (none) | Comma-separated glob patterns for kernel names to instrument |
| PRLX_SESSION | (none) | Directory path for multi-launch session mode |
| PRLX_SITES | prlx-sites.json | Output path for site map |
| PRLX_INSTRUMENT_STORES | 0 | Instrument global memory stores (opt-in, produces large traces) |
| PRLX_OPT_TIMEOUT | 120 | Timeout (seconds) for llvm-link/opt in Triton hook |
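To get a feel for PRLX_FILTER and PRLX_SAMPLE_RATE, the sketch below approximates both in Python. It assumes shell-style glob semantics (via fnmatch) and that sampling keeps the first of every N events; the runtime's exact matching and sampling rules are not specified here, so treat these helpers as hypothetical.

```python
from fnmatch import fnmatchcase

def kernel_selected(kernel_name, filter_value):
    """True if kernel_name matches any comma-separated glob; an empty filter selects all."""
    patterns = [p.strip() for p in (filter_value or "").split(",") if p.strip()]
    return not patterns or any(fnmatchcase(kernel_name, p) for p in patterns)

def keep_event(event_index, sample_rate):
    """Keep 1 in N events (assumption: the first event of each group of N)."""
    return event_index % max(sample_rate, 1) == 0

print(kernel_selected("matmul_kernel", "matmul*,softmax_*"))
print(kernel_selected("branch_kernel", "matmul*"))
print(sum(keep_event(i, 4) for i in range(16)), "of 16 events kept at rate 4")
```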
How It Works
PRLX has three backends for instrumenting GPU code:
- LLVM pass (lib/pass/): loaded as -fpass-plugin during compilation (clang) or injected between Triton's make_llir and make_ptx stages. Walks NVPTX or AMDGPU IR, inserts calls to __prlx_record_branch / __prlx_record_value at every branch and comparison. For Triton's branchless single-BB kernels, it detects predicated ops (icmp feeding inline PTX asm or select). Supports both NVIDIA (NVPTX) and AMD (AMDGPU) targets.
- NVBit tool (lib/nvbit_tool/) (experimental): SASS-level binary instrumentation via NVBit. Works on closed-source kernels where you don't have IR access. Less tested than the LLVM pass; use for cases where recompilation is not possible.
- Runtime (lib/runtime/): device-side ring buffers (one per warp) that record events, value history, and per-lane comparison operand snapshots. Host hooks (prlx_pre_launch / prlx_post_launch) manage allocation and readback.
Traces are written to .prlx files (custom binary format, optionally zstd-compressed). The differ (differ/, Rust) aligns event streams with bounded lookahead, classifies divergences (branch direction, path length, missing events), and can display per-lane operand diffs.
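The alignment step can be pictured with a small Python sketch. It walks two per-warp event lists in lockstep and, when the sites disagree, searches a bounded window ahead on each side for a point where they line up again. This illustrates the general technique only: the event representation, the classification names, and the default window size are assumptions, and the Rust differ may work differently.

```python
def align(a, b, lookahead=4):
    """a, b: lists of (site_id, taken). Returns a list of divergence tuples."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            if a[i][1] != b[j][1]:
                out.append(("branch_direction", i, j))
            i, j = i + 1, j + 1
            continue
        # Sites differ: look a bounded distance ahead on each side for a resync point.
        for d in range(1, lookahead + 1):
            if i + d < len(a) and a[i + d][0] == b[j][0]:
                out.append(("extra_in_a", i, d))  # d events present only in A
                i += d
                break
            if j + d < len(b) and b[j + d][0] == a[i][0]:
                out.append(("extra_in_b", j, d))  # d events present only in B
                j += d
                break
        else:
            out.append(("path", i, j))  # no resync point within the window
            i, j = i + 1, j + 1
    if len(a) - i != len(b) - j:
        out.append(("path_length", len(a) - i, len(b) - j))
    return out

trace_a = [(1, True), (2, True), (3, False)]
trace_b = [(1, True), (2, False), (9, True), (3, False)]
result = align(trace_a, trace_b)
print(result)
```

Bounding the lookahead keeps the comparison linear in trace length for a fixed window, at the cost of reporting a plain path divergence when the streams only realign further away.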
Layout
lib/pass/ LLVM instrumentation pass (libPrlxPass.so) — NVPTX + AMDGPU
lib/runtime/ device-side recording + host hooks (CUDA + HIP)
lib/nvbit_tool/ NVBit binary instrumentation backend (experimental)
lib/common/ shared trace format header
differ/ Rust differ + TUI + JSON/flamegraph export (prlx-diff)
python/prlx/ trace reader, Triton hook, PyTorch hook, runtime FFI, CLI
examples/ demo kernels (branch, loop, matmul, occupancy)
tools/ utilities (gen_demo_traces.py — synthetic trace generator)
License
MIT