Skip to main content

Differential debugger for CUDA/Triton GPU kernels

Project description

PRLX

prlx

Differential debugger for CUDA and Triton GPU kernels.

Run your kernel twice with different inputs (or a known-good vs buggy version). PRLX instruments every branch, captures per-warp execution traces, and diffs them — telling you exactly which warp diverged, at which instruction, and what each lane saw:

Site 0xfbe6edc1  (branch_kernel:12)  2 warps affected

  Warp 1, event 3: Branch Direction
    A: TAKEN    B: NOT-TAKEN

    Operand Snapshot (icmp sgt):
    Lane       A:lhs    A:rhs       B:lhs    B:rhs
       0          32       10          32       64  <<<
       1          33       10          33       64  <<<
       2          34       10          34       64  <<<
      ...

The threshold changed from 10 to 64. Lanes 0–31 compared their value against the threshold; in run A they all passed, in run B they didn't. That's the bug.

Install

pip install prlx

Needs CUDA 12+ and LLVM 18/19/20 on the host (for prlx compile and Triton integration). The differ and Python trace reader have no external deps.

From source (NVIDIA)
cmake -B build && cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .

Build deps: CMake 3.20+, LLVM/Clang 18–20, CUDA Toolkit, Rust stable.

From source (AMD ROCm)
cmake -B build -DPRLX_ENABLE_CUDA=OFF -DPRLX_ENABLE_HIP=ON
cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .

Build deps: CMake 3.20+, LLVM/Clang 18–20, ROCm 5.0+, Rust stable. The LLVM pass supports AMDGPU targets. The HIP runtime targets wave32 (RDNA) GPUs.

Usage

CUDA C

prlx compile kernel.cu -o kernel
PRLX_TRACE=a.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-a
PRLX_TRACE=b.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-b
prlx diff a.prlx b.prlx

Triton

import prlx
prlx.enable()  # hooks the Triton compiler — no kernel changes needed

import os, triton

os.environ["PRLX_TRACE"] = "a.prlx"
my_kernel[grid](...)

os.environ["PRLX_TRACE"] = "b.prlx"
my_kernel[grid](...)

Python API

from prlx import read_trace, diff_traces

# Read traces directly
trace = read_trace("a.prlx")
print(trace.header.kernel_name, trace.num_warps, "warps")
for w in trace.warps():
    for ev in w.events:
        if ev.is_branch:
            print(f"  warp {w.warp_idx}: site {ev.site_id:#x} {'T' if ev.branch_taken else 'NT'}")

# Or just run the differ
diff_traces("a.prlx", "b.prlx", history=True)

Multi-Kernel Pipelines

Capture and diff entire GPU pipelines (multiple kernel launches):

// In your code:
prlx_session_begin(NULL);
kernel_A<<<grid, block>>>(...);  // prlx_pre/post_launch called automatically
kernel_B<<<grid, block>>>(...);
prlx_session_end();
# Capture sessions
PRLX_SESSION=/tmp/session_a ./my_pipeline --param-a
PRLX_SESSION=/tmp/session_b ./my_pipeline --param-b

# Diff two sessions
prlx diff /tmp/session_a /tmp/session_b

# Or use the session subcommand:
prlx session diff /tmp/session_a /tmp/session_b

# Inspect a session manifest:
prlx session inspect /tmp/session_a

# Capture via CLI wrapper:
prlx session capture ./my_pipeline -o /tmp/session_a -- --param-a

Unmatched kernel launches between sessions are reported as warnings. Grid/block dimension mismatches are also flagged.

PyTorch

import prlx

# Hooks Triton (torch.compile) + load_inline (C++ extensions) automatically
prlx.enable_pytorch()

model = MyModel().cuda()
output = model(input_tensor)  # kernels are instrumented

# Or use the context manager for session tracing:
with prlx.pytorch_trace("my_model", output="/tmp/trace"):
    model(input_tensor)
# Run a script with PyTorch instrumentation
prlx pytorch run script.py

# NVBit fallback for pre-compiled ops (no recompilation needed)
prlx pytorch run --nvbit script.py

# Check integration status
prlx pytorch --info

Install the optional PyTorch dependency: pip install prlx[pytorch]

TUI

prlx diff a.prlx b.prlx --tui --map prlx-sites.json

Interactive terminal UI for navigating divergences across warps. Press s to toggle inline source view at divergence sites (requires --map for site-to-source mapping).

Key Action
j/k Scroll up/down
n/N Next/previous divergence
]/[ Next/previous warp
s Toggle source view
Tab Switch pane focus
/ Jump to warp by number
q Quit

CI Regression Gate

Automatically pass/fail based on divergence thresholds:

# Strict: zero divergences allowed (default)
prlx assert a.prlx b.prlx

# Tolerant: allow up to 5 divergences
prlx assert a.prlx b.prlx --max-divergences 5

# Golden mode: compare against a known-good trace
prlx assert --golden golden.prlx test.prlx

# JSON output for CI pipelines
prlx assert a.prlx b.prlx --json

# Ignore active mask differences (only count branch/path/value)
prlx assert a.prlx b.prlx --ignore-active-mask

Exit code 0 = pass, 1 = fail. Human-readable summary by default:

PRLX ASSERT: PASS (4 divergences, threshold: 5)
PRLX ASSERT: FAIL (4 divergences, threshold: 2)

Flamegraph Export

Export divergences to Chrome Trace Format for visual analysis:

prlx flamegraph a.prlx b.prlx -o divergences.json --map prlx-sites.json

Open divergences.json in chrome://tracing or ui.perfetto.dev. Each row is a warp (grouped by block), colored bars show divergence events, and counter tracks show per-site frequency heatmaps.

Environment Variables

Variable Default What it does
PRLX_TRACE trace.prlx Output path
PRLX_SNAPSHOT_DEPTH 0 Per-lane operand ring buffer size
PRLX_HISTORY_DEPTH 0 Time-travel value ring buffer size
PRLX_SAMPLE_RATE 1 Record 1 in N events
PRLX_COMPRESS 0 zstd compress the trace
PRLX_ENABLED 1 Kill switch
PRLX_FILTER (none) Comma-separated glob patterns for kernel names to instrument
PRLX_SESSION (none) Directory path for multi-launch session mode
PRLX_SITES prlx-sites.json Output path for site map
PRLX_INSTRUMENT_STORES 0 Instrument global memory stores (opt-in, produces large traces)
PRLX_OPT_TIMEOUT 120 Timeout (seconds) for llvm-link/opt in Triton hook

How It Works

PRLX has three backends for instrumenting GPU code:

  1. LLVM pass (lib/pass/) — loaded as -fpass-plugin during compilation (clang) or injected between Triton's make_llir and make_ptx stages. Walks NVPTX or AMDGPU IR, inserts calls to __prlx_record_branch / __prlx_record_value at every branch and comparison. For Triton's branchless single-BB kernels, it detects predicated ops (icmp feeding inline PTX asm or select). Supports both NVIDIA (NVPTX) and AMD (AMDGPU) targets.

  2. NVBit tool (lib/nvbit_tool/) (experimental) — SASS-level binary instrumentation via NVBit. Works on closed-source kernels where you don't have IR access. Less tested than the LLVM pass; use for cases where recompilation is not possible.

  3. Runtime (lib/runtime/) — device-side ring buffers (one per warp) that record events, value history, and per-lane comparison operand snapshots. Host hooks (prlx_pre_launch / prlx_post_launch) manage allocation and readback.

Traces are written to .prlx files (custom binary format, optionally zstd-compressed). The differ (differ/, Rust) aligns event streams with bounded lookahead, classifies divergences (branch direction, path length, missing events), and can display per-lane operand diffs.

Layout

lib/pass/           LLVM instrumentation pass (libPrlxPass.so) — NVPTX + AMDGPU
lib/runtime/        device-side recording + host hooks (CUDA + HIP)
lib/nvbit_tool/     NVBit binary instrumentation backend (experimental)
lib/common/         shared trace format header
differ/             Rust differ + TUI + JSON/flamegraph export (prlx-diff)
python/prlx/        trace reader, Triton hook, PyTorch hook, runtime FFI, CLI
examples/           demo kernels (branch, loop, matmul, occupancy)
tools/              utilities (gen_demo_traces.py — synthetic trace generator)

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl (1.3 MB view details)

Uploaded Python 3manylinux: glibc 2.28+ x86-64

File details

Details for the file prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl.

File metadata

  • Download URL: prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl
  • Upload date:
  • Size: 1.3 MB
  • Tags: Python 3, manylinux: glibc 2.28+ x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e7f61d7d8eaf5ed6de25e83de39d2424e0fe92f97eb2f8355e9a1f99bd2f09ea
MD5 cb5d0e2253eeec74b439288a56248ae4
BLAKE2b-256 8511b27fa6fe8cd4583e607ba58da0d836f940886c87a6e6c0de529ee0334e52

See more details on using hashes here.

Provenance

The following attestation bundles were made for prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Khushiyant/parallax

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page