Differential debugger for CUDA/Triton GPU kernels

PRLX

Differential debugger for CUDA and Triton GPU kernels.

Run your kernel twice with different inputs (or run a known-good and a buggy version once each). PRLX instruments every branch, captures per-warp execution traces, and diffs them, telling you exactly which warp diverged, at which instruction, and what each lane saw:

Site 0xfbe6edc1  (branch_kernel:12)  2 warps affected

  Warp 1, event 3: Branch Direction
    A: TAKEN    B: NOT-TAKEN

    Operand Snapshot (icmp sgt):
    Lane       A:lhs    A:rhs       B:lhs    B:rhs
       0          32       10          32       64  <<<
       1          33       10          33       64  <<<
       2          34       10          34       64  <<<
      ...

The threshold changed from 10 to 64. Each of lanes 0–31 compared its value against the threshold with a signed greater-than (icmp sgt): in run A every lane passed, in run B none did. That's the bug.
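
To make the arithmetic concrete, here is a minimal Python sketch of what the snapshot shows. It assumes the elided lanes continue the visible pattern (lane i of warp 1 holds 32 + i), which the output above does not confirm. `branch_mask` is a hypothetical helper for illustration, not part of PRLX.

```python
WARP_SIZE = 32

def branch_mask(values, threshold):
    """Per-lane result of a signed greater-than (icmp sgt) against threshold."""
    return [v > threshold for v in values]

# Assumption: lane i of warp 1 holds 32 + i (only lanes 0-2 are shown above).
values = [WARP_SIZE + lane for lane in range(WARP_SIZE)]

taken_a = branch_mask(values, 10)   # run A: threshold 10
taken_b = branch_mask(values, 64)   # run B: threshold 64

# Lanes whose branch outcome differs between the two runs.
diverged = [lane for lane in range(WARP_SIZE) if taken_a[lane] != taken_b[lane]]
print(all(taken_a), any(taken_b), len(diverged))  # prints: True False 32
```

Under that assumption every lane flips from TAKEN to NOT-TAKEN, which matches the warp-level "A: TAKEN  B: NOT-TAKEN" line in the report.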

Install

pip install prlx

Needs CUDA 12+ and LLVM 18/19/20 on the host (for prlx compile and Triton integration). The differ and Python trace reader have no external deps.

From source
cmake -B build && cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .

Build deps: CMake 3.20+, LLVM/Clang 18–20, CUDA Toolkit, Rust stable.

Usage

CUDA C

prlx compile kernel.cu -o kernel
PRLX_TRACE=a.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-a
PRLX_TRACE=b.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-b
prlx diff a.prlx b.prlx

Triton

import prlx
prlx.enable()  # hooks the Triton compiler — no kernel changes needed

import os, triton

os.environ["PRLX_TRACE"] = "a.prlx"
my_kernel[grid](...)

os.environ["PRLX_TRACE"] = "b.prlx"
my_kernel[grid](...)

Python API

from prlx import read_trace, diff_traces

# Read traces directly
trace = read_trace("a.prlx")
print(trace.header.kernel_name, trace.num_warps, "warps")
for w in trace.warps():
    for ev in w.events:
        if ev.is_branch:
            print(f"  warp {w.warp_idx}: site {ev.site_id:#x} {'T' if ev.branch_taken else 'NT'}")

# Or just run the differ
diff_traces("a.prlx", "b.prlx", history=True)
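
The reader loop above can also feed a small summary. The sketch below counts taken/not-taken outcomes per branch site, using only the attributes shown in the example (`warps()`, `events`, `is_branch`, `site_id`, `branch_taken`). `branch_histogram` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without a trace file.

```python
from collections import Counter
from types import SimpleNamespace

def branch_histogram(trace):
    """Count outcomes per (site_id, taken) pair across all warps."""
    counts = Counter()
    for w in trace.warps():
        for ev in w.events:
            if ev.is_branch:
                counts[(ev.site_id, bool(ev.branch_taken))] += 1
    return counts

# Stand-in objects (not real PRLX types) that mimic the attributes used above.
def _ev(site, taken):
    return SimpleNamespace(is_branch=True, site_id=site, branch_taken=taken)

fake_warps = [
    SimpleNamespace(warp_idx=0, events=[_ev(0xfbe6edc1, True)]),
    SimpleNamespace(warp_idx=1, events=[_ev(0xfbe6edc1, False),
                                        _ev(0xfbe6edc1, False)]),
]
fake_trace = SimpleNamespace(warps=lambda: fake_warps)

hist = branch_histogram(fake_trace)
for (site, taken), n in sorted(hist.items()):
    print(f"site {site:#x} {'T' if taken else 'NT'}: {n}")
```

With PRLX installed, you would presumably pass the result of `read_trace("a.prlx")` in place of `fake_trace`.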

Multi-Kernel Pipelines

Capture and diff entire GPU pipelines (multiple kernel launches):

// In your code:
prlx_session_begin(NULL);
kernel_A<<<grid, block>>>(...);  // prlx_pre/post_launch called automatically
kernel_B<<<grid, block>>>(...);
prlx_session_end();

# Capture sessions
PRLX_SESSION=/tmp/session_a ./my_pipeline --param-a
PRLX_SESSION=/tmp/session_b ./my_pipeline --param-b

# Diff two sessions
prlx diff /tmp/session_a /tmp/session_b

# Or use the session subcommand:
prlx session diff /tmp/session_a /tmp/session_b

# Inspect a session manifest:
prlx session inspect /tmp/session_a

# Capture via CLI wrapper:
prlx session capture ./my_pipeline -o /tmp/session_a -- --param-a

Unmatched kernel launches between sessions are reported as warnings. Grid/block dimension mismatches are also flagged.
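
As a rough illustration of those two checks, the sketch below compares two launch lists. Everything in it is an assumption: the `Launch` record, the `compare_sessions` helper, and the choice to pair launches by position are hypothetical, since the real session manifest format and matching strategy are not described here.

```python
from collections import namedtuple

# Hypothetical launch record; not the actual PRLX manifest schema.
Launch = namedtuple("Launch", ["kernel", "grid", "block"])

def _name(launch):
    return launch.kernel if launch is not None else "<none>"

def compare_sessions(a, b):
    """Pair launches by position; return warning strings for any mismatch."""
    warnings = []
    for i in range(max(len(a), len(b))):
        la = a[i] if i < len(a) else None
        lb = b[i] if i < len(b) else None
        if la is None or lb is None or la.kernel != lb.kernel:
            warnings.append(f"launch {i}: unmatched ({_name(la)} vs {_name(lb)})")
        elif (la.grid, la.block) != (lb.grid, lb.block):
            warnings.append(f"launch {i}: {la.kernel} grid/block mismatch")
    return warnings

session_a = [Launch("kernel_A", (4, 1, 1), (128, 1, 1)),
             Launch("kernel_B", (4, 1, 1), (128, 1, 1))]
session_b = [Launch("kernel_A", (8, 1, 1), (128, 1, 1))]

for line in compare_sessions(session_a, session_b):
    print(line)
```

Here the first launch is flagged for a grid mismatch and the second as unmatched, because session B has no second launch.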

TUI

prlx diff a.prlx b.prlx --tui

Interactive terminal UI for navigating divergences across warps.

Environment Variables

Variable                Default          What it does
PRLX_TRACE              trace.prlx       Output path
PRLX_SNAPSHOT_DEPTH     0                Per-lane operand ring buffer size
PRLX_HISTORY_DEPTH      0                Time-travel value ring buffer size
PRLX_SAMPLE_RATE        1                Record 1 in N events
PRLX_COMPRESS           0                zstd compress the trace
PRLX_ENABLED            1                Kill switch
PRLX_FILTER             (none)           Comma-separated glob patterns for kernel names to instrument
PRLX_SESSION            (none)           Directory path for multi-launch session mode
PRLX_SITES              prlx-sites.json  Output path for site map
PRLX_INSTRUMENT_STORES  0                Instrument global memory stores (opt-in, produces large traces)
PRLX_OPT_TIMEOUT        120              Timeout (seconds) for llvm-link/opt in Triton hook
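
When driving captures from a script, it can help to build the environment in one place. The sketch below sets a few of the variables listed above and previews which kernel names a `PRLX_FILTER` value would select. `capture_env` and `would_match` are hypothetical helpers, and the preview assumes `fnmatch`-style glob semantics, which may differ from PRLX's actual matcher.

```python
import os
from fnmatch import fnmatchcase

def capture_env(trace_path, snapshot_depth=0, history_depth=0,
                sample_rate=1, kernel_filter=None, base=None):
    """Return a copy of the environment with PRLX_* variables set for one run."""
    env = dict(os.environ if base is None else base)
    env["PRLX_TRACE"] = trace_path
    env["PRLX_SNAPSHOT_DEPTH"] = str(snapshot_depth)
    env["PRLX_HISTORY_DEPTH"] = str(history_depth)
    env["PRLX_SAMPLE_RATE"] = str(sample_rate)
    if kernel_filter:
        env["PRLX_FILTER"] = ",".join(kernel_filter)
    return env

def would_match(kernel_name, filter_value):
    """Assumed semantics: instrument a kernel if any glob pattern matches it."""
    patterns = [p.strip() for p in filter_value.split(",") if p.strip()]
    return any(fnmatchcase(kernel_name, p) for p in patterns)

env_a = capture_env("a.prlx", snapshot_depth=32,
                    kernel_filter=["branch_*", "matmul*"], base={})
print(env_a["PRLX_FILTER"])
print(would_match("branch_kernel", env_a["PRLX_FILTER"]))
print(would_match("loop_kernel", env_a["PRLX_FILTER"]))
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the instrumented binary; omit `base={}` to inherit the current environment.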

How It Works

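PRLX has three main components: two instrumentation backends (the LLVM pass and the NVBit tool) and a runtime that records what the inserted hooks report: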

  1. LLVM pass (lib/pass/) — loaded as -fpass-plugin during compilation (clang) or injected between Triton's make_llir and make_ptx stages. Walks the NVPTX IR, inserts calls to __prlx_record_branch / __prlx_record_value at every branch and comparison. For Triton's branchless single-BB kernels, it detects predicated ops (icmp feeding inline PTX asm or select).

  2. NVBit tool (lib/nvbit_tool/) (experimental) — SASS-level binary instrumentation via NVBit. Works on closed-source kernels where you don't have IR access. Less tested than the LLVM pass; use for cases where recompilation is not possible.

  3. Runtime (lib/runtime/) — device-side ring buffers (one per warp) that record events, value history, and per-lane comparison operand snapshots. Host hooks (prlx_pre_launch / prlx_post_launch) manage allocation and readback.
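
The per-warp ring buffer idea can be sketched on the host side in a few lines of Python. This is only a model of the concept: the real buffers live in device memory, and the sketch assumes overwrite-oldest behavior when a buffer fills, which is not stated above. `WarpRing` is a hypothetical name.

```python
from collections import deque

class WarpRing:
    """Fixed-capacity event buffer: once full, the oldest entry is dropped."""
    def __init__(self, depth):
        self.events = deque(maxlen=depth)
        self.total = 0  # events ever recorded, including dropped ones

    def record(self, site_id, taken):
        self.events.append((site_id, taken))
        self.total += 1

# One buffer per warp, mirroring the runtime description above.
rings = {warp: WarpRing(depth=4) for warp in range(2)}
for i in range(6):
    rings[0].record(site_id=i, taken=(i % 2 == 0))

print(list(rings[0].events))
print(rings[0].total - len(rings[0].events), "events dropped")
```

The depth plays the same role as the ring buffer sizes in the environment variable table: a larger depth retains more history per warp at the cost of memory.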

Traces are written to .prlx files (custom binary format, optionally zstd-compressed). The differ (differ/, Rust) aligns event streams with bounded lookahead, classifies divergences (branch direction, path length, missing events), and can display per-lane operand diffs.
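
To give a feel for alignment with bounded lookahead, here is a simplified Python sketch. It is not the differ's actual algorithm: events are modeled as hypothetical `(site_id, taken)` tuples, and the divergence labels are illustrative names chosen to echo the categories mentioned above.

```python
def align(a, b, lookahead=4):
    """Walk two event lists; on a site mismatch, search up to `lookahead`
    events ahead in either stream for a point where they line up again."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        (site_a, taken_a), (site_b, taken_b) = a[i], b[j]
        if site_a == site_b:
            if taken_a != taken_b:
                out.append(("branch_direction", i, j))
            i, j = i + 1, j + 1
            continue
        # Sites differ: find the smallest skip that realigns the streams.
        for k in range(1, lookahead + 1):
            if j + k < len(b) and b[j + k][0] == site_a:
                out.append(("extra_in_b", i, j, k))  # B has k events A lacks
                j += k
                break
            if i + k < len(a) and a[i + k][0] == site_b:
                out.append(("extra_in_a", i, j, k))  # A has k events B lacks
                i += k
                break
        else:
            # No resync point within the lookahead window.
            out.append(("unaligned", i, j))
            i, j = i + 1, j + 1
    if len(a) - i != len(b) - j:
        # One stream still has events left after the other ended.
        out.append(("path_length", len(a) - i, len(b) - j))
    return out

run_a = [(1, True), (2, True), (3, False)]
run_b = [(1, True), (9, True), (2, False), (3, False), (4, True)]
for d in align(run_a, run_b):
    print(d)
```

Bounding the lookahead keeps the cost of each mismatch constant, at the price of reporting "unaligned" when the streams drift further apart than the window.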

Layout

lib/pass/           LLVM instrumentation pass (libPrlxPass.so)
lib/runtime/        device-side recording + host hooks
lib/nvbit_tool/     NVBit binary instrumentation backend (experimental)
lib/common/         shared trace format header
differ/             Rust differ + TUI (prlx-diff)
python/prlx/        trace reader, Triton hook, runtime FFI, CLI
examples/           demo kernels (branch, loop, matmul, occupancy)

License

MIT
