Vendor-portable GPU decoders for quantum LDPC codes: Triton min-sum BP and Relay-BP on NVIDIA (CUDA) and AMD (ROCm), with CPU reference implementations, consuming any stim DetectorErrorModel or raw parity-check matrices.

tridec

Badge honesty: CI is CPU-only (ubuntu + macos arm64; the macos lane binds the strict exact-count receipt gates). There are no GPU runners — the CUDA/ROCm kernel paths are validated by the carried H200/MI300X receipts in bench/receipts/, and the experimental Metal tier runs on a local machine.

An open, vendor-portable GPU decoder library for quantum LDPC codes — Triton min-sum BP and Relay-BP decoders that consume any stim DetectorErrorModel or raw parity-check matrices, with CPU reference implementations, validated against the standard CPU references (ldpc, relay-bp), running on NVIDIA (CUDA) and AMD (ROCm) GPUs.

The same Triton kernels run unmodified on both vendors: the Relay-BP kernel reproduces its logical-error-rate validation numbers identically on an NVIDIA H200 (CUDA 12.4, triton 3.0) and an AMD MI300X (ROCm 7.0, triton 3.4) — see docs/benchmark.md and the raw receipts in bench/receipts/. Validated scope is NVIDIA + AMD; Apple silicon runs the same kernels through triton-metal as an experimental backend (see below).

v0.2: the megakernel backend (opt-in)

v0.2 adds a single-launch persistent megakernel: the entire Relay-BP decode (every BP iteration, every relay leg, in-kernel syndrome convergence, the nconv stop and lowest-weight selection) runs in one kernel launch per decode_batch, with per-shot early exit, replacing the thousands of launches issued by the v0.1 host loop. It is validated on all three platforms against the v0.1 two-kernel path and the relay-bp Rust oracle (14/14 gates at BLOCK 128/256 on CUDA and ROCm; barriers verified honored on both):

Platform	Relay-BP megakernel vs v0.1 two-kernel (speedup)
Apple M4 Max (Metal, triton-metal)	65× (30.0 s → 0.46 s per 2000 shots)
NVIDIA H200 (CUDA)	9–18× (batch-1: 62.5 → 3.44 ms; 34.6 µs/syndrome at batch 8192)
AMD MI300X (ROCm)	11–22× (batch-1: 8.48 ms; 46.0 µs/syndrome at batch 8192)

Opt-in today via tridec.backends.megakernel.{RelayBpMegaTriton, BpMegaTriton}; auto-dispatch from from_dem(..., backend="auto") lands in v0.2.1 once the public-API path is gated on a GPU (#5). Receipts: bench/receipts/megakernel_{h200,mi300x,metal}*.

Megakernel: honest limits + tuning

Plain-BP megakernel is a single-shot latency tool, not a throughput tool. At batch-1 it is ~1.7× faster than the two-kernel BP path (H200 0.61 vs 1.06 ms); at large batch it loses (plain BP has no early-exit lever) — the two-kernel BP path stays the throughput default. Use BpMegaTriton for low-latency bare BP, RelayBpMegaTriton for the accurate latency path.
Real-time / single-shot: H200 leads MI300X 2.47× at batch-1 (3.44 vs 8.48 ms) — wider than v0.1's ~9% two-kernel gap, because the single-CTA-per-shot design amplifies per-SM and codegen differences at batch-1. Batched, the gap is 1.25–1.33×; correctness is identical across vendors. (The pitch is vendor-portable + performant on both, never parity.)
Per-arch autotuning. v0.2 ships autotuned BLOCK/num_warps configs for H200, MI300X and M4 Max, pinned in _CUDA_TUNED keyed by gcnArchName/device name. AMD (wavefront-64) wants the opposite shape from NVIDIA warps — low warps for BP, max BLOCK+warps for relay.
Metal is BLOCK=32 only for now — a transient triton-metal barrier-drop bug (fix confirmed on its dev branch); the Metal autotune widens once that merges. fp32-only on Metal (no fp64), same as the two-kernel path, and the fp32 near-tie-flip caveat below applies to the megakernel unchanged.
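
The ratios quoted in this section can be recomputed from the raw timings in the table and the bullets above. The sketch below does only that arithmetic; the variable names are illustrative and the inputs are the figures as printed, not fresh measurements.

```python
# Timings copied from the speedup table and the "honest limits" bullets.
metal_two_kernel_s, metal_mega_s = 30.0, 0.46     # Relay-BP, per 2000 shots
h200_two_kernel_ms, h200_mega_ms = 62.5, 3.44     # Relay-BP, batch-1
mi300x_mega_ms = 8.48                             # Relay-BP, batch-1
h200_us_per_syn, mi300x_us_per_syn = 34.6, 46.0   # Relay-BP, batch 8192
bp_two_kernel_ms, bp_mega_ms = 1.06, 0.61         # plain BP on H200, batch-1

ratios = {
    "metal_speedup": metal_two_kernel_s / metal_mega_s,
    "h200_batch1_speedup": h200_two_kernel_ms / h200_mega_ms,
    "vendor_gap_batch1": mi300x_mega_ms / h200_mega_ms,
    "vendor_gap_batch8192": mi300x_us_per_syn / h200_us_per_syn,
    "plain_bp_batch1_speedup": bp_two_kernel_ms / bp_mega_ms,
}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

The batch-1 vendor gap comes out at about 2.47 and the batch-8192 gap at about 1.33, in line with the figures stated above.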

Install

Most users want pip install "tridec[torch,decoders]" (CPU+GPU torch backend plus the reference adapters). The bare install is the numpy CPU reference only — correct but slow.

pip install tridec                # numpy CPU reference only
pip install "tridec[torch]"       # + batched torch backend (CPU/GPU)
pip install "tridec[gpu]"         # + Triton GPU kernels (CUDA or ROCm)
pip install "tridec[decoders]"    # + ldpc / relay-bp reference adapters
pip install "tridec[sinter]"      # + sinter.collect integration

Quickstart

import stim
import tridec

circuit = stim.Circuit.from_file("memory.stim")
dem = circuit.detector_error_model(decompose_errors=False)

decoder = tridec.from_dem(dem, backend="auto")   # triton > torch > numpy

dets, obs = circuit.compile_detector_sampler(seed=0).sample(
    100_000, separate_observables=True)
pred = decoder.decode_batch(dets)                      # (shots, n_obs) bool
print("logical error rate:", (pred != obs).any(axis=1).mean())
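
The last line of the quickstart counts a shot as a logical failure when any predicted observable differs from the sampled one. A dependency-free sketch of the same rule, using plain lists instead of arrays (the function name is illustrative, not part of tridec):

```python
def logical_error_rate(pred, obs):
    """Fraction of shots in which at least one observable is mispredicted."""
    if not pred or len(pred) != len(obs):
        raise ValueError("pred and obs must be non-empty and equally long")
    failures = sum(
        any(p != o for p, o in zip(pred_row, obs_row))
        for pred_row, obs_row in zip(pred, obs)
    )
    return failures / len(pred)


# Four shots, two observables each; shots 1 and 3 contain a mismatch.
pred = [[False, False], [True, False], [False, True], [True, True]]
obs = [[False, False], [True, True], [False, True], [False, False]]
rate = logical_error_rate(pred, obs)
print("logical error rate:", rate)
```

A shot with two wrong observables still counts once, which is why the quickstart reduces with any(axis=1) before taking the mean.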

Raw matrices work too: tridec.from_matrices(H, priors, observables=Lo). Relay-BP: tridec.from_dem(dem, algorithm="relay") (Triton kernels only).
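
For readers new to the raw-matrix form, here is a small sketch of what H, priors and Lo encode, assuming the usual convention of one row per check (detector) and one column per error mechanism in H, and one row per logical observable in Lo. The exact array types and shapes from_matrices expects are not stated here, so the example stays in plain Python and does not call tridec.

```python
def mat_vec_mod2(matrix, vector):
    """Binary matrix-vector product over GF(2)."""
    return [sum(a * b for a, b in zip(row, vector)) % 2 for row in matrix]


# 3-bit repetition code: two parity checks, one logical observable.
H = [[1, 1, 0],
     [0, 1, 1]]
Lo = [[1, 0, 0]]
priors = [0.01, 0.01, 0.01]  # one prior error probability per column of H

error = [0, 1, 0]                      # the error that actually occurred
syndrome = mat_vec_mod2(H, error)      # what the decoder gets to see
true_obs = mat_vec_mod2(Lo, error)     # what the decoder must predict

# Two corrections with the same syndrome but different logical effect.
good_correction = [0, 1, 0]
bad_correction = [1, 0, 1]
for name, corr in (("good", good_correction), ("bad", bad_correction)):
    same_syndrome = mat_vec_mod2(H, corr) == syndrome
    logical_ok = mat_vec_mod2(Lo, corr) == true_obs
    print(name, "matches syndrome:", same_syndrome, "observable correct:", logical_ok)
```

Both corrections explain the syndrome; only the low-weight one predicts the observable correctly, which is the distinction the logical error rate measures.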

With sinter (the [sinter] extra):

import sinter
from tridec.sinter import sinter_decoders

# `tasks` is an iterable of sinter.Task objects built from your circuits.
stats = sinter.collect(
    num_workers=4, tasks=tasks,
    decoders=["tridec_bp", "pymatching"],
    custom_decoders=sinter_decoders(),
    max_shots=1_000_000)

Backend × algorithm matrix (honest availability)

Algorithm	`numpy`	`torch`	`triton`	`metal` (experimental)
min-sum BP	yes (CPU reference)	yes (CPU + CUDA/ROCm)	yes (CUDA + ROCm)	yes (fp32)
Relay-BP	no	no	yes (CUDA + ROCm)	yes (fp32, slow — see below)

There is no in-package CPU Relay-BP; its CPU reference is IBM's relay-bp Rust decoder, wrapped in tridec.adapters and used as the validation oracle for the Triton path.
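
The availability matrix combines with the triton > torch > numpy preference from the quickstart roughly as sketched below. This is a hypothetical illustration, not tridec's dispatch code: the "bp" key, the SUPPORT and PRIORITY names and resolve_backend are all made up for the example, and the experimental metal backend is left out of the automatic ordering.

```python
# Transcribed from the table above; "relay" matches algorithm="relay".
SUPPORT = {
    "bp": {"numpy", "torch", "triton", "metal"},
    "relay": {"triton", "metal"},
}
PRIORITY = ("triton", "torch", "numpy")  # preference order from the quickstart


def resolve_backend(algorithm, available):
    """Return the most preferred available backend that supports `algorithm`."""
    if algorithm not in SUPPORT:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    for backend in PRIORITY:
        if backend in available and backend in SUPPORT[algorithm]:
            return backend
    raise RuntimeError(f"no available backend supports {algorithm!r}")


print(resolve_backend("bp", {"numpy", "torch"}))              # torch
print(resolve_backend("relay", {"numpy", "torch", "triton"})) # triton
```

Requesting Relay-BP with only numpy and torch available raises in this sketch, mirroring the two "no" cells in the table.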

What's validated where

Environment	Status
CPU (any)	numpy BP reference; torch BP bit-identical to numpy at fp64 (one iteration), LER-identical full decode
NVIDIA H200, CUDA 12.4, torch 2.4.1, triton 3.0.0	Triton BP: ≥99.5% hard-decision agreement vs fp64 references, LER-identical (156 = 156 = 156 fails / 2000 shots vs numpy/torch). Triton Relay-BP: LER-matches the `relay-bp` Rust oracle (31 vs 38 fails / 2000, overlapping Wilson CIs) — carried source-repo receipts
AMD MI300X, ROCm 7.0.0, torch 2.9, triton 3.4.0	Same kernels, unmodified: identical primitive-identity numbers (pre-leg posterior max-diff 1.8e-15) and the same oracle-vs-Triton LER identity (carried receipts) — and validated through the installed package for v0.1.0 (`bench/receipts/mi300x_packaged.json`): full suite 88 passed / 10 skipped on gfx942 (GPU tiers bind, darwin-only strict tiers skip), packaged-API BP 166 = numpy 166 fails / 2000, Relay-BP fp32 34 vs Rust oracle 31 (overlapping CIs), throughput within ±2.2% of the carried receipt
Apple silicon (M4 Max), triton-metal	Experimental, spike-validated only (`bench/receipts/metal_spike.md`): both kernels pass the same correctness gates at fp32; see the section below

Experimental: Apple silicon (Metal)

The same Triton kernels run on Apple-silicon GPUs through triton-metal, with zero changes to the kernel source. This is experimental: validated at spike level on one machine (M4 Max), fp32 only (Metal has no fp64), and not part of the official receipt set.

# triton-metal + a triton >= 3.6 build + torch must be importable, then:
pip install tridec
python -c "import tridec; print(tridec.available_backends())"  # ['metal', ...]

backend="auto" detects the triton-metal environment (darwin, triton + triton_metal importable, no CUDA/ROCm device) and selects "metal"; backend="triton" resolves to "metal" there too, and backend="metal" asserts the environment is present. The execution pattern is triton-metal's documented one — CPU torch tensors (zero-copy via unified memory; not mps) — so no device arguments are needed.
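
The detection rule described above can be sketched with the standard library alone. This is a hypothetical illustration of the stated conditions, not tridec's implementation: the function names are made up, and whether a CUDA/ROCm device is present is passed in as a flag because probing it would require torch.

```python
import importlib.util
import sys


def module_importable(name):
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def looks_like_triton_metal_env(platform=None, importable=module_importable,
                                has_cuda_or_rocm=False):
    """darwin + triton and triton_metal importable + no CUDA/ROCm device."""
    platform = sys.platform if platform is None else platform
    return (
        platform == "darwin"
        and importable("triton")
        and importable("triton_metal")
        and not has_cuda_or_rocm
    )


# Injected facts make the rule easy to exercise on any machine.
print(looks_like_triton_metal_env("darwin", lambda name: True))
print(looks_like_triton_metal_env("linux", lambda name: True))
```

Called with no arguments, the function inspects the current interpreter instead of the injected values.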

What the spike measured (2000 canonical shots, seed 0, M4 Max — re-validated through this API path in tests/test_metal.py):

min-sum BP (fp32): all correctness gates pass — one-iteration hard agreement 1.000 vs the fp64 numpy reference on both the surface-code and BB-code fixtures; LER 76 = 76 / 2000 (surface) and 167 vs 168 / 2000 (BB). Batched decode of 2000 shots in 28 ms (surface) / 167 ms (BB) — 37–56× the per-shot numpy baseline on the same machine.
Relay-BP (fp32): correct but slow — LER matches the relay-bp Rust oracle (31 vs 39 fails / 2000, per-shot agreement 99.3%), but decode_batch(2000) takes 31 s vs 1.26 s for the Rust CPU oracle: relay's per-iteration host loop (~7k small kernel launches) is launch-overhead dominated on Metal. Use it for validation, not production.
Relay-BP on metal enforces fp32: dtype="float64" raises with a clear error; the default resolves to float32.
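
A little arithmetic on the spike numbers above makes the "launch-overhead dominated" point concrete. The sketch assumes the ~7k launches are per decode_batch(2000) call, which the text implies but does not state, so the per-launch figure is only an upper-bound estimate.

```python
shots = 2000
bp_surface_s, bp_bb_s = 0.028, 0.167        # batched min-sum BP decode times
relay_metal_s, relay_rust_s = 31.0, 1.26    # Relay-BP: Metal vs Rust CPU oracle
approx_launches = 7000                      # "~7k"; assumed per decode_batch call
per_shot_agreement = 0.993

us_per_shot_surface = bp_surface_s / shots * 1e6
us_per_shot_bb = bp_bb_s / shots * 1e6
relay_slowdown = relay_metal_s / relay_rust_s
# If all wall time were launch overhead, each launch would cost this much.
ms_per_launch_upper = relay_metal_s / approx_launches * 1e3
disagreeing_shots = round(shots * (1 - per_shot_agreement))

print(f"BP surface: {us_per_shot_surface:.1f} us/shot, BB: {us_per_shot_bb:.1f} us/shot")
print(f"Relay-BP on Metal is {relay_slowdown:.1f}x slower than the Rust oracle")
print(f"at most ~{ms_per_launch_upper:.1f} ms per launch")
print(f"shots disagreeing with the oracle: {disagreeing_shots}")
```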

No claims beyond the spike: no official LER receipts, no cross-machine validation, no performance tuning. CUDA/ROCm remain the supported GPU paths.

Compatibility floors are declared in pyproject.toml; known-good pins: stim 1.15.0, ldpc 2.4.1, relay-bp 0.2.2, torch 2.4.1 / 2.9, triton 3.0 / 3.4.

Validation discipline

tridec.validation ships the matched-protocol harness the numbers were produced with: dem_hash (sha256 of the DEM's canonical bytes), run_matched (one shared DEM, one shot set, fail-fast DEM-identity and tie-break gates), Wilson/TOST statistics and a paired per-shot gap-to-MLE bootstrap. The test suite pins the extraction byte-for-byte: 8 canonical BB-code fixture circuits must hash to the exact DEM sha256s recorded in the carried zoo_grid.json receipt, and a full 16,667-shot cell must reproduce the recorded logical-failure counts of the ldpc reference adapters exactly.
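
The hash-pinning idea can be illustrated with hashlib. What tridec treats as a DEM's canonical bytes is not specified here, so this sketch simply hashes the UTF-8 encoding of a DEM's text form; text_sha256 and check_pinned are hypothetical helpers, not the dem_hash function itself.

```python
import hashlib


def text_sha256(dem_text):
    """sha256 hex digest of the UTF-8 bytes of a DEM's text form (assumed)."""
    return hashlib.sha256(dem_text.encode("utf-8")).hexdigest()


def check_pinned(dem_text, expected_hex):
    """Fail fast if the DEM no longer hashes to the recorded digest."""
    actual = text_sha256(dem_text)
    if actual != expected_hex:
        raise ValueError(f"DEM hash mismatch: {actual} != {expected_hex}")
    return actual


dem_text = "error(0.01) D0 D1\nerror(0.01) D1 L0\n"
pinned = text_sha256(dem_text)       # in practice, read from a receipt file
check_pinned(dem_text, pinned)       # passes: same bytes, same digest
print(pinned)
```

Any change to the extracted model, even a single probability digit, changes the digest, which is what lets a test pin the extraction byte-for-byte.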

For v0.1.0 the WHOLE grid was re-decoded in the receipt environment (bench/full_grid_noregression.py): 31 of 32 (cell, decoder) failure counts reproduce exactly — all 24 BP / BP-OSD-0 / BP-OSD-10 counts, and 7 of 8 BPLSD counts. The single deviation (BPLSD, p=0.002/X: 879 vs 880, one shot in 200,000) is attributed by a same-environment repeat experiment to run-to-run nondeterminism inside ldpc's BpLsdDecoder itself (identical shots, fresh instances: 879/880/879) — documented in bench/receipts/full_grid_noregression.json.

Status

0.2.0: adds the opt-in megakernel backend (tri-platform validated; see above). v0.1.0 shipped the two-kernel BP/Relay-BP path and the validation discipline. The kernels and their receipts are stable; the public API surface is young and may still move before 1.0: minor 0.x releases may rename or remove public API, and 1.0 will lock the surface. Next (v0.2.1): megakernel auto-dispatch from from_dem(..., backend="auto"), gated on the public-API path running on a GPU (#5). GPU paths require triton and a CUDA/ROCm GPU (or the experimental triton-metal environment); the GPU/metal test tiers skip cleanly where unavailable.

License

Apache-2.0.

Release history

0.2.0 (this version): Jun 12, 2026
0.1.0: Jun 10, 2026
0.1.0a1 (pre-release): Jun 10, 2026


