Differential debugger for CUDA/Triton GPU kernels

These details have not been verified by PyPI

Project description

PRLX

Differential debugger for CUDA and Triton GPU kernels.

Run your kernel twice with different inputs (or a known-good vs buggy version). PRLX instruments every branch, captures per-warp execution traces, and diffs them — telling you exactly which warp diverged, at which instruction, and what each lane saw:

Site 0xfbe6edc1  (branch_kernel:12)  2 warps affected

  Warp 1, event 3: Branch Direction
    A: TAKEN    B: NOT-TAKEN

    Operand Snapshot (icmp sgt):
    Lane       A:lhs    A:rhs       B:lhs    B:rhs
       0          32       10          32       64  <<<
       1          33       10          33       64  <<<
       2          34       10          34       64  <<<
      ...

The threshold changed from 10 to 64. Lanes 0–31 compared their value against the threshold; in run A they all passed, in run B they didn't. That's the bug.

Install

pip install prlx

Needs CUDA 12+ and LLVM 18/19/20 on the host (for prlx compile and Triton integration). The differ and Python trace reader have no external deps.

From source (NVIDIA)

cmake -B build && cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .

Build deps: CMake 3.20+, LLVM/Clang 18–20, CUDA Toolkit, Rust stable.

From source (AMD ROCm)

cmake -B build -DPRLX_ENABLE_CUDA=OFF -DPRLX_ENABLE_HIP=ON
cmake --build build
cd differ && cargo build --release && cd ..
pip install -e .

Build deps: CMake 3.20+, LLVM/Clang 18–20, ROCm 5.0+, Rust stable. The LLVM pass supports AMDGPU targets. The HIP runtime targets wave32 (RDNA) GPUs.

Usage

CUDA C

prlx compile kernel.cu -o kernel
PRLX_TRACE=a.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-a
PRLX_TRACE=b.prlx PRLX_SNAPSHOT_DEPTH=32 ./kernel --input-b
prlx diff a.prlx b.prlx

Triton

import prlx
prlx.enable()  # hooks the Triton compiler — no kernel changes needed

import os, triton

os.environ["PRLX_TRACE"] = "a.prlx"
my_kernel[grid](...)

os.environ["PRLX_TRACE"] = "b.prlx"
my_kernel[grid](...)

Python API

from prlx import read_trace, diff_traces

# Read traces directly
trace = read_trace("a.prlx")
print(trace.header.kernel_name, trace.num_warps, "warps")
for w in trace.warps():
    for ev in w.events:
        if ev.is_branch:
            print(f"  warp {w.warp_idx}: site {ev.site_id:#x} {'T' if ev.branch_taken else 'NT'}")

# Or just run the differ
diff_traces("a.prlx", "b.prlx", history=True)

Multi-Kernel Pipelines

Capture and diff entire GPU pipelines (multiple kernel launches):

// In your code:
prlx_session_begin(NULL);
kernel_A<<<grid, block>>>(...);  // prlx_pre/post_launch called automatically
kernel_B<<<grid, block>>>(...);
prlx_session_end();

# Capture sessions
PRLX_SESSION=/tmp/session_a ./my_pipeline --param-a
PRLX_SESSION=/tmp/session_b ./my_pipeline --param-b

# Diff two sessions
prlx diff /tmp/session_a /tmp/session_b

# Or use the session subcommand:
prlx session diff /tmp/session_a /tmp/session_b

# Inspect a session manifest:
prlx session inspect /tmp/session_a

# Capture via CLI wrapper:
prlx session capture ./my_pipeline -o /tmp/session_a -- --param-a

Unmatched kernel launches between sessions are reported as warnings. Grid/block dimension mismatches are also flagged.

PyTorch

import prlx

# Hooks Triton (torch.compile) + load_inline (C++ extensions) automatically
prlx.enable_pytorch()

model = MyModel().cuda()
output = model(input_tensor)  # kernels are instrumented

# Or use the context manager for session tracing:
with prlx.pytorch_trace("my_model", output="/tmp/trace"):
    model(input_tensor)

# Run a script with PyTorch instrumentation
prlx pytorch run script.py

# NVBit fallback for pre-compiled ops (no recompilation needed)
prlx pytorch run --nvbit script.py

# Check integration status
prlx pytorch --info

Install the optional PyTorch dependency: pip install prlx[pytorch]

TUI

prlx diff a.prlx b.prlx --tui --map prlx-sites.json

Interactive terminal UI for navigating divergences across warps. Press s to toggle inline source view at divergence sites (requires --map for site-to-source mapping).

Key	Action
`j`/`k`	Scroll up/down
`n`/`N`	Next/previous divergence
`]`/`[`	Next/previous warp
`s`	Toggle source view
`Tab`	Switch pane focus
`/`	Jump to warp by number
`q`	Quit

CI Regression Gate

Automatically pass/fail based on divergence thresholds:

# Strict: zero divergences allowed (default)
prlx assert a.prlx b.prlx

# Tolerant: allow up to 5 divergences
prlx assert a.prlx b.prlx --max-divergences 5

# Golden mode: compare against a known-good trace
prlx assert --golden golden.prlx test.prlx

# JSON output for CI pipelines
prlx assert a.prlx b.prlx --json

# Ignore active mask differences (only count branch/path/value)
prlx assert a.prlx b.prlx --ignore-active-mask

Exit code 0 = pass, 1 = fail. Human-readable summary by default:

PRLX ASSERT: PASS (4 divergences, threshold: 5)
PRLX ASSERT: FAIL (4 divergences, threshold: 2)

Flamegraph Export

Export divergences to Chrome Trace Format for visual analysis:

prlx flamegraph a.prlx b.prlx -o divergences.json --map prlx-sites.json

Open divergences.json in chrome://tracing or ui.perfetto.dev. Each row is a warp (grouped by block), colored bars show divergence events, and counter tracks show per-site frequency heatmaps.

Environment Variables

Variable	Default	What it does
`PRLX_TRACE`	`trace.prlx`	Output path
`PRLX_SNAPSHOT_DEPTH`	`0`	Per-lane operand ring buffer size
`PRLX_HISTORY_DEPTH`	`0`	Time-travel value ring buffer size
`PRLX_SAMPLE_RATE`	`1`	Record 1 in N events
`PRLX_COMPRESS`	`0`	zstd compress the trace
`PRLX_ENABLED`	`1`	Kill switch
`PRLX_FILTER`	(none)	Comma-separated glob patterns for kernel names to instrument
`PRLX_SESSION`	(none)	Directory path for multi-launch session mode
`PRLX_SITES`	`prlx-sites.json`	Output path for site map
`PRLX_INSTRUMENT_STORES`	`0`	Instrument global memory stores (opt-in, produces large traces)
`PRLX_OPT_TIMEOUT`	`120`	Timeout (seconds) for llvm-link/opt in Triton hook

How It Works

PRLX has three backends for instrumenting GPU code:

LLVM pass (lib/pass/) — loaded as -fpass-plugin during compilation (clang) or injected between Triton's make_llir and make_ptx stages. Walks NVPTX or AMDGPU IR, inserts calls to __prlx_record_branch / __prlx_record_value at every branch and comparison. For Triton's branchless single-BB kernels, it detects predicated ops (icmp feeding inline PTX asm or select). Supports both NVIDIA (NVPTX) and AMD (AMDGPU) targets.
NVBit tool (lib/nvbit_tool/) (experimental) — SASS-level binary instrumentation via NVBit. Works on closed-source kernels where you don't have IR access. Less tested than the LLVM pass; use for cases where recompilation is not possible.
Runtime (lib/runtime/) — device-side ring buffers (one per warp) that record events, value history, and per-lane comparison operand snapshots. Host hooks (prlx_pre_launch / prlx_post_launch) manage allocation and readback.

Traces are written to .prlx files (custom binary format, optionally zstd-compressed). The differ (differ/, Rust) aligns event streams with bounded lookahead, classifies divergences (branch direction, path length, missing events), and can display per-lane operand diffs.

Layout

lib/pass/           LLVM instrumentation pass (libPrlxPass.so) — NVPTX + AMDGPU
lib/runtime/        device-side recording + host hooks (CUDA + HIP)
lib/nvbit_tool/     NVBit binary instrumentation backend (experimental)
lib/common/         shared trace format header
differ/             Rust differ + TUI + JSON/flamegraph export (prlx-diff)
python/prlx/        trace reader, Triton hook, PyTorch hook, runtime FFI, CLI
examples/           demo kernels (branch, loop, matmul, occupancy)
tools/              utilities (gen_demo_traces.py — synthetic trace generator)

License

MIT

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.4.4

Feb 17, 2026

0.4.3

Feb 15, 2026

0.4.2

Feb 15, 2026

This version

0.4.1

Feb 15, 2026

0.4.0

Feb 15, 2026

0.3.3

Feb 15, 2026

0.3.1

Feb 15, 2026

0.3.0

Feb 15, 2026

0.2.4

Feb 15, 2026

0.2.3

Feb 15, 2026

0.2.2

Feb 14, 2026

0.2.1

Feb 14, 2026

0.1.2

Feb 14, 2026

0.1.0

Feb 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl (1.3 MB view details)

Uploaded Feb 15, 2026 Python 3manylinux: glibc 2.28+ x86-64

File details

Details for the file prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl.

File metadata

Download URL: prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl
Upload date: Feb 15, 2026
Size: 1.3 MB
Tags: Python 3, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`e7f61d7d8eaf5ed6de25e83de39d2424e0fe92f97eb2f8355e9a1f99bd2f09ea`
MD5	`cb5d0e2253eeec74b439288a56248ae4`
BLAKE2b-256	`8511b27fa6fe8cd4583e607ba58da0d836f940886c87a6e6c0de529ee0334e52`

See more details on using hashes here.

Provenance

The following attestation bundles were made for prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl:

Publisher: release.yml on Khushiyant/parallax

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: prlx-0.4.1-py3-none-manylinux_2_28_x86_64.whl
- Subject digest: e7f61d7d8eaf5ed6de25e83de39d2424e0fe92f97eb2f8355e9a1f99bd2f09ea
- Sigstore transparency entry: 953622958
- Sigstore integration time: Feb 15, 2026
Source repository:
- Permalink: Khushiyant/parallax@d69ab147e24c9559912de4249a7578ac2d69695c
- Branch / Tag: refs/tags/v0.4.1
- Owner: https://github.com/Khushiyant
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d69ab147e24c9559912de4249a7578ac2d69695c
- Trigger Event: push

prlx 0.4.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

PRLX

Install

Usage

CUDA C

Triton

Python API

Multi-Kernel Pipelines

PyTorch

TUI

CI Regression Gate

Flamegraph Export

Environment Variables

How It Works

Layout

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes

Provenance