Differential debugger for CUDA/Triton GPU kernels
PRLX
Differential debugger for CUDA and Triton GPU kernels.
Run your kernel twice with different inputs (or a known-good vs buggy version). PRLX instruments every branch, captures per-warp execution traces, and diffs them — telling you exactly which warp diverged, at which instruction, and what each lane saw:
Site 0xfbe6edc1 (branch_kernel:12)  2 warps affected
  Warp 1, event 3: Branch Direction
    A: TAKEN    B: NOT-TAKEN
    Operand Snapshot (icmp sgt):
      Lane   A:lhs   A:rhs   B:lhs   B:rhs
         0      32      10      32      64   <<<
         1      33      10      33      64   <<<
         2      34      10      34      64   <<<
       ...
The threshold (the rhs operand) changed from 10 to 64 between the two runs. Lanes 0–31 of warp 1 compared their value against it with a signed greater-than (icmp sgt): in run A they all passed, in run B they didn't. That's the bug.
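To make the snapshot concrete, here is a small host-side Python sketch that replays the comparison. It assumes lane i of warp 1 holds the value 32 + i (consistent with the three rows shown, but an assumption for the elided ones) and models icmp sgt as a plain signed greater-than. The function name is illustrative and not part of PRLX.

```python
# Host-side model of the per-lane comparison in the snapshot above.
# Assumption: lane i of warp 1 holds 32 + i (only lanes 0-2 are shown).

WARP_SIZE = 32

def branch_directions(threshold, base=32):
    """Return one bool per lane: True if value > threshold (icmp sgt)."""
    return [(base + lane) > threshold for lane in range(WARP_SIZE)]

run_a = branch_directions(threshold=10)  # run A: rhs = 10
run_b = branch_directions(threshold=64)  # run B: rhs = 64

# A lane "diverged" if the branch went different ways in the two runs.
diverged = [lane for lane in range(WARP_SIZE) if run_a[lane] != run_b[lane]]

print(f"run A taken: {sum(run_a)}/{WARP_SIZE}")
print(f"run B taken: {sum(run_b)}/{WARP_SIZE}")
print(f"lanes that diverged: {len(diverged)}")
```

Under that assumption every lane flips direction, which is why the differ reports the whole warp as TAKEN in A and NOT-TAKEN in B.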
Install
pip install prlx
Needs CUDA 12+ and LLVM 18/19/20 on the host (for prlx compile and Triton integration). The differ and Python trace reader have no external deps.
From source
cmake -B build && cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .
Build deps: CMake 3.20+, LLVM/Clang 18–20, CUDA Toolkit, Rust stable.
Usage
CUDA C
prlx compile kernel.cu -o kernel
PRLX_TRACE=a.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-a
PRLX_TRACE=b.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-b
prlx diff a.prlx b.prlx
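The two capture runs differ only in the PRLX_TRACE value and the program argument, so they are easy to script. Below is a minimal sketch using Python's subprocess module. The ./kernel binary and the --input-a / --input-b flags are the placeholders from the commands above, and the helper names are made up for illustration.

```python
import os
import subprocess

def capture_env(trace_path, snapshot_depth=32, base_env=None):
    """Build the environment for one instrumented run (launches nothing)."""
    env = dict(os.environ if base_env is None else base_env)
    env["PRLX_TRACE"] = trace_path
    env["PRLX_SNAPSHOT_DEPTH"] = str(snapshot_depth)
    return env

def capture_and_diff(binary="./kernel"):
    """Run the instrumented binary twice, then diff the traces with the CLI."""
    runs = (("a.prlx", "--input-a"), ("b.prlx", "--input-b"))
    for trace, flag in runs:
        subprocess.run([binary, flag], env=capture_env(trace), check=True)
    # How `prlx diff` signals "traces differ" via its exit code is not
    # documented here, so the code is returned rather than checked.
    return subprocess.run(["prlx", "diff", "a.prlx", "b.prlx"]).returncode

# Example (requires the compiled kernel and the prlx CLI on PATH):
# exit_code = capture_and_diff("./kernel")
```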
Triton
import prlx
prlx.enable() # hooks the Triton compiler — no kernel changes needed
import os, triton
os.environ["PRLX_TRACE"] = "a.prlx"
my_kernel[grid](...)
os.environ["PRLX_TRACE"] = "b.prlx"
my_kernel[grid](...)
Python API
from prlx import read_trace, diff_traces
# Read traces directly
trace = read_trace("a.prlx")
print(trace.header.kernel_name, trace.num_warps, "warps")
for w in trace.warps():
    for ev in w.events:
        if ev.is_branch:
            print(f" warp {w.warp_idx}: site {ev.site_id:#x} {'T' if ev.branch_taken else 'NT'}")
# Or just run the differ
diff_traces("a.prlx", "b.prlx", history=True)
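A common follow-up to reading a trace is summarising branch behaviour per site. The sketch below relies only on the attributes used in the snippet above (warps(), events, is_branch, site_id, branch_taken). The summarize_branches helper is illustrative and not part of the prlx package, and the demo uses stand-in objects so it runs without a trace file.

```python
from collections import Counter
from types import SimpleNamespace

def summarize_branches(trace):
    """Count (taken, not_taken) branch events per site across all warps."""
    taken, not_taken = Counter(), Counter()
    for warp in trace.warps():
        for ev in warp.events:
            if ev.is_branch:
                (taken if ev.branch_taken else not_taken)[ev.site_id] += 1
    sites = sorted(set(taken) | set(not_taken))
    return {site: (taken[site], not_taken[site]) for site in sites}

# Stand-in trace for demonstration only; with a real file you would pass
# the object returned by read_trace("a.prlx") instead.
def _ev(site, taken, is_branch=True):
    return SimpleNamespace(site_id=site, branch_taken=taken, is_branch=is_branch)

fake_trace = SimpleNamespace(warps=lambda: [
    SimpleNamespace(warp_idx=0, events=[_ev(0xFBE6EDC1, True), _ev(0x1234, False)]),
    SimpleNamespace(warp_idx=1, events=[_ev(0xFBE6EDC1, False),
                                        _ev(0x99, True, is_branch=False)]),
])

summary = summarize_branches(fake_trace)
for site, (t, nt) in summary.items():
    print(f"site {site:#x}: taken={t} not_taken={nt}")
```

Sites where both counts are non-zero are the ones where warps disagreed, which makes them a reasonable first place to look before running the full differ.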
Multi-Kernel Pipelines
Capture and diff entire GPU pipelines (multiple kernel launches):
// In your code:
prlx_session_begin(NULL);
kernel_A<<<grid, block>>>(...); // prlx_pre/post_launch called automatically
kernel_B<<<grid, block>>>(...);
prlx_session_end();
# Capture sessions
PRLX_SESSION=/tmp/session_a ./my_pipeline --param-a
PRLX_SESSION=/tmp/session_b ./my_pipeline --param-b
# Diff two sessions
prlx diff /tmp/session_a /tmp/session_b
# Or use the session subcommand:
prlx session diff /tmp/session_a /tmp/session_b
# Inspect a session manifest:
prlx session inspect /tmp/session_a
# Capture via CLI wrapper:
prlx session capture ./my_pipeline -o /tmp/session_a -- --param-a
Unmatched kernel launches between sessions are reported as warnings. Grid/block dimension mismatches are also flagged.
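The matching behaviour described above can be pictured with a small model. This sketch assumes a session is simply an ordered list of launches, each with a kernel name and grid/block dimensions, and pairs launches by kernel name and per-kernel launch order. PRLX's actual manifest format and matching strategy are not described here, so the Launch type and compare_sessions function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Launch:
    kernel: str
    grid: tuple
    block: tuple

def _index(launches):
    """Key each launch by (kernel name, n-th launch of that kernel)."""
    seen, out = defaultdict(int), {}
    for launch in launches:
        out[(launch.kernel, seen[launch.kernel])] = launch
        seen[launch.kernel] += 1
    return out

def compare_sessions(a, b):
    """Return (matched pairs, warnings) for two launch lists."""
    ia, ib = _index(a), _index(b)
    pairs, warnings = [], []
    for key, la in ia.items():
        if key not in ib:
            warnings.append(f"{key[0]}#{key[1]}: only in session A")
            continue
        lb = ib[key]
        pairs.append((la, lb))
        if (la.grid, la.block) != (lb.grid, lb.block):
            warnings.append(f"{key[0]}#{key[1]}: grid/block mismatch")
    for key in ib:
        if key not in ia:
            warnings.append(f"{key[0]}#{key[1]}: only in session B")
    return pairs, warnings

session_a = [Launch("kernel_A", (4, 1, 1), (128, 1, 1)),
             Launch("kernel_B", (4, 1, 1), (128, 1, 1))]
session_b = [Launch("kernel_A", (8, 1, 1), (128, 1, 1)),
             Launch("kernel_C", (4, 1, 1), (128, 1, 1))]

pairs, warnings = compare_sessions(session_a, session_b)
for w in warnings:
    print("warning:", w)
```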
TUI
prlx diff a.prlx b.prlx --tui
Interactive terminal UI for navigating divergences across warps.
Environment Variables
| Variable | Default | What it does |
|---|---|---|
| PRLX_TRACE | trace.prlx | Output path |
| PRLX_SNAPSHOT_DEPTH | 0 | Per-lane operand ring buffer size |
| PRLX_HISTORY_DEPTH | 0 | Time-travel value ring buffer size |
| PRLX_SAMPLE_RATE | 1 | Record 1 in N events |
| PRLX_COMPRESS | 0 | zstd compress the trace |
| PRLX_ENABLED | 1 | Kill switch |
| PRLX_FILTER | (none) | Comma-separated glob patterns for kernel names to instrument |
| PRLX_SESSION | (none) | Directory path for multi-launch session mode |
| PRLX_SITES | prlx-sites.json | Output path for site map |
| PRLX_INSTRUMENT_STORES | 0 | Instrument global memory stores (opt-in, produces large traces) |
| PRLX_OPT_TIMEOUT | 120 | Timeout (seconds) for llvm-link/opt in Triton hook |
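For wrapper scripts it can help to read these settings in one place. The following sketch mirrors the defaults from the table. The PrlxConfig class is illustrative and not part of PRLX, and two behaviours are assumptions: boolean variables are treated as on unless set to "0", and an empty PRLX_FILTER is taken to mean "instrument every kernel".

```python
import fnmatch
import os
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class PrlxConfig:
    # Defaults copied from the environment variable table.
    trace: str = "trace.prlx"
    snapshot_depth: int = 0
    history_depth: int = 0
    sample_rate: int = 1
    compress: bool = False
    enabled: bool = True
    kernel_filter: Tuple[str, ...] = ()
    session: Optional[str] = None
    sites: str = "prlx-sites.json"
    instrument_stores: bool = False
    opt_timeout: int = 120

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        d = cls()  # instance holding the table defaults

        def flag(name, default):
            # Assumption: any value other than "0" counts as "on".
            return env.get(name, "1" if default else "0") != "0"

        patterns = env.get("PRLX_FILTER", "")
        return cls(
            trace=env.get("PRLX_TRACE", d.trace),
            snapshot_depth=int(env.get("PRLX_SNAPSHOT_DEPTH", d.snapshot_depth)),
            history_depth=int(env.get("PRLX_HISTORY_DEPTH", d.history_depth)),
            sample_rate=int(env.get("PRLX_SAMPLE_RATE", d.sample_rate)),
            compress=flag("PRLX_COMPRESS", d.compress),
            enabled=flag("PRLX_ENABLED", d.enabled),
            kernel_filter=tuple(p for p in patterns.split(",") if p),
            session=env.get("PRLX_SESSION"),
            sites=env.get("PRLX_SITES", d.sites),
            instrument_stores=flag("PRLX_INSTRUMENT_STORES", d.instrument_stores),
            opt_timeout=int(env.get("PRLX_OPT_TIMEOUT", d.opt_timeout)),
        )

    def should_instrument(self, kernel_name):
        """Assumption: no filter means every kernel is instrumented."""
        if not self.enabled:
            return False
        if not self.kernel_filter:
            return True
        return any(fnmatch.fnmatchcase(kernel_name, p) for p in self.kernel_filter)

cfg = PrlxConfig.from_env({
    "PRLX_TRACE": "a.prlx",
    "PRLX_SNAPSHOT_DEPTH": "32",
    "PRLX_FILTER": "matmul*,reduce_*",
})
print(cfg)
```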
How It Works
PRLX has three main components: two backends for instrumenting GPU code (the LLVM pass and the NVBit tool) and a shared runtime:
- LLVM pass (lib/pass/): loaded as -fpass-plugin during compilation (clang) or injected between Triton's make_llir and make_ptx stages. Walks the NVPTX IR, inserts calls to __prlx_record_branch / __prlx_record_value at every branch and comparison. For Triton's branchless single-BB kernels, it detects predicated ops (icmp feeding inline PTX asm or select).
- NVBit tool (lib/nvbit_tool/) (experimental): SASS-level binary instrumentation via NVBit. Works on closed-source kernels where you don't have IR access. Less tested than the LLVM pass; use for cases where recompilation is not possible.
- Runtime (lib/runtime/): device-side ring buffers (one per warp) that record events, value history, and per-lane comparison operand snapshots. Host hooks (prlx_pre_launch / prlx_post_launch) manage allocation and readback.
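The per-warp ring buffers can be modelled on the host as fixed-capacity buffers that overwrite their oldest entry once full. This is a Python model only: the real device-side layout in lib/runtime/ is not shown here, and the WarpRing name is illustrative.

```python
class WarpRing:
    """Fixed-capacity ring buffer: once full, new entries overwrite the oldest."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.slots = [None] * capacity
        self.written = 0  # total number of entries ever recorded

    def record(self, entry):
        self.slots[self.written % len(self.slots)] = entry
        self.written += 1

    def snapshot(self):
        """Return the surviving entries, oldest first."""
        capacity = len(self.slots)
        if self.written <= capacity:
            return self.slots[:self.written]
        start = self.written % capacity  # index of the oldest surviving entry
        return self.slots[start:] + self.slots[:start]

# One ring per warp, as in the runtime description above.
rings = {warp: WarpRing(capacity=4) for warp in range(2)}
for step in range(6):
    rings[0].record(("value", step))  # six writes into four slots
rings[1].record(("value", 0))

for warp, ring in rings.items():
    print(f"warp {warp}: {ring.snapshot()}")
```

In this model a depth setting trades memory for how far back the recorded history reaches: warp 0 keeps only its last four entries.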
Traces are written to .prlx files (custom binary format, optionally zstd-compressed). The differ (differ/, Rust) aligns event streams with bounded lookahead, classifies divergences (branch direction, path length, missing events), and can display per-lane operand diffs.
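To illustrate what aligning with bounded lookahead means, here is a simplified Python sketch. It is not the Rust differ's actual algorithm: events are reduced to hypothetical (site_id, taken) tuples, the resync rule is a guess at one reasonable strategy, and the divergence labels only loosely follow the categories named above.

```python
def align(a, b, lookahead=4):
    """Align two event streams; return a list of (kind, index_a, index_b)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        (site_a, taken_a), (site_b, taken_b) = a[i], b[j]
        if site_a == site_b:
            if taken_a != taken_b:
                out.append(("branch_direction", i, j))
            i, j = i + 1, j + 1
            continue
        # Sites differ: look a bounded distance ahead for a resync point.
        skip_b = next((k for k in range(1, lookahead + 1)
                       if j + k < len(b) and b[j + k][0] == site_a), None)
        skip_a = next((k for k in range(1, lookahead + 1)
                       if i + k < len(a) and a[i + k][0] == site_b), None)
        if skip_b is not None and (skip_a is None or skip_b <= skip_a):
            out.append(("missing_in_a", i, j))  # B has events that A lacks
            j += skip_b
        elif skip_a is not None:
            out.append(("missing_in_b", i, j))  # A has events that B lacks
            i += skip_a
        else:
            out.append(("unaligned", i, j))     # no resync within the window
            i, j = i + 1, j + 1
    if len(a) - i != len(b) - j:
        out.append(("path_length", i, j))       # one stream has leftover events
    return out

run_a = [(1, True), (2, True), (3, False), (4, True)]
run_b = [(1, True), (2, False), (9, True), (3, False), (4, True), (5, True)]

divergences = align(run_a, run_b)
for kind, ia, ib in divergences:
    print(f"{kind}: A[{ia}] vs B[{ib}]")
```

Bounding the search window keeps the comparison linear in trace length, at the cost of reporting an unaligned event when the streams drift apart by more than the window.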
Layout
lib/pass/         LLVM instrumentation pass (libPrlxPass.so)
lib/runtime/      device-side recording + host hooks
lib/nvbit_tool/   NVBit binary instrumentation backend (experimental)
lib/common/       shared trace format header
differ/           Rust differ + TUI (prlx-diff)
python/prlx/      trace reader, Triton hook, runtime FFI, CLI
examples/         demo kernels (branch, loop, matmul, occupancy)
License
MIT