Vendor-portable GPU decoders for quantum LDPC codes: Triton min-sum BP and Relay-BP on NVIDIA (CUDA) and AMD (ROCm), with CPU reference implementations, consuming any stim DetectorErrorModel or raw parity-check matrices.
tridec
Badge honesty: CI is CPU-only (ubuntu + macos arm64; the macos lane binds the
strict exact-count receipt gates). There are no GPU runners — the CUDA/ROCm
kernel paths are validated by the carried H200/MI300X receipts in
bench/receipts/, and the experimental Metal tier runs on a local machine.
An open, vendor-portable GPU decoder library for quantum LDPC codes — Triton
min-sum BP and Relay-BP decoders that consume any stim DetectorErrorModel or
raw parity-check matrices, with CPU reference implementations, validated
against the standard CPU references (ldpc, relay-bp), running on NVIDIA
(CUDA) and AMD (ROCm) GPUs.
The same Triton kernels run unmodified on both vendors: the Relay-BP kernel
reproduces its logical-error-rate validation numbers identically on an NVIDIA
H200 (CUDA 12.4, triton 3.0) and an AMD MI300X (ROCm 7.0, triton 3.4) — see
docs/benchmark.md and the raw receipts in
bench/receipts/. Validated scope is NVIDIA + AMD; Apple silicon runs the
same kernels through triton-metal
as an experimental backend (see below).
v0.2: the megakernel backend (opt-in)
A single-launch persistent megakernel — the entire Relay-BP decode (every
BP iteration, every relay leg, in-kernel syndrome convergence + nconv stop +
lowest-weight selection) in one kernel launch per decode_batch, with
per-shot early exit, instead of the v0.1 host loop's thousands of launches.
Validated on all three platforms against the v0.1 two-kernel path and the
relay-bp Rust oracle (14/14 gates at BLOCK 128/256 on CUDA and ROCm; barriers
verified honored on both):
| Relay-BP megakernel vs v0.1 two-kernel | speedup |
|---|---|
| Apple M4 Max (Metal, triton-metal) | 65× — 30.0 s → 0.46 s / 2000 shots |
| NVIDIA H200 (CUDA) | 9–18× — batch-1 62.5 → 3.44 ms; 34.6 µs/syn @8192 |
| AMD MI300X (ROCm) | 11–22× — batch-1 8.48 ms; 46.0 µs/syn @8192 |
Opt-in today via tridec.backends.megakernel.{RelayBpMegaTriton, BpMegaTriton};
auto-dispatch from from_dem(..., backend="auto") lands in v0.2.1 once the
public-API path is gated on a GPU
(#5). Receipts:
bench/receipts/megakernel_{h200,mi300x,metal}*.
Megakernel: honest limits + tuning
- Plain-BP megakernel is a single-shot latency tool, not a throughput
tool. At batch-1 it is ~1.7× faster than the two-kernel BP path
(H200 0.61 vs 1.06 ms); at large batch it loses (plain BP has no early-exit
lever), so the two-kernel BP path stays the throughput default. Use
BpMegaTriton for low-latency bare BP, RelayBpMegaTriton for the accurate latency path.
- Real-time / single-shot: H200 leads MI300X 2.47× at batch-1 (3.44 vs 8.48 ms), wider than v0.1's ~9% two-kernel gap, because the single-CTA-per-shot design amplifies per-SM and codegen differences at batch-1. Batched, the gap is 1.25–1.33×; correctness is identical across vendors. (The pitch is vendor-portable + performant on both, never parity.)
- Per-arch autotuning. v0.2 ships autotuned BLOCK/num_warps configs for H200, MI300X and M4 Max, pinned in _CUDA_TUNED keyed by gcnArchName/device name. AMD (wavefront-64) wants the opposite shape from NVIDIA warps: low warps for BP, max BLOCK+warps for relay.
- Metal is BLOCK=32 only for now, due to a transient triton-metal barrier-drop bug (fix confirmed on its dev branch); the Metal autotune widens once that merges. fp32-only on Metal (no fp64), same as the two-kernel path, and the fp32 near-tie-flip caveat below applies to the megakernel unchanged.
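The headline ratios can be rechecked from the raw timings quoted in the table and the list above. The snippet below is only arithmetic on those published numbers; the variable names are illustrative and nothing here calls tridec.

```python
# Timings copied from the megakernel table and tuning notes above.
metal_before_s, metal_after_s = 30.0, 0.46        # M4 Max, 2000 shots
h200_before_ms, h200_after_ms = 62.5, 3.44        # H200, batch-1
mi300x_batch1_ms = 8.48                           # MI300X, batch-1
h200_us_per_syn, mi300x_us_per_syn = 34.6, 46.0   # batch 8192

metal_speedup = metal_before_s / metal_after_s
h200_batch1_speedup = h200_before_ms / h200_after_ms
vendor_gap_batch1 = mi300x_batch1_ms / h200_after_ms
vendor_gap_batched = mi300x_us_per_syn / h200_us_per_syn

# Per-syndrome latency converts directly to throughput.
h200_syndromes_per_s = 1e6 / h200_us_per_syn

print(f"Metal speedup:        {metal_speedup:.1f}x")
print(f"H200 batch-1 speedup: {h200_batch1_speedup:.1f}x")
print(f"H200 vs MI300X:       {vendor_gap_batch1:.2f}x (batch-1), "
      f"{vendor_gap_batched:.2f}x (batch 8192)")
print(f"H200 throughput:      {h200_syndromes_per_s:,.0f} syndromes/s")
```

The batch-1 H200 figure lands at the upper end of the quoted 9–18× range, and the batch-8192 vendor gap at the upper end of 1.25–1.33×.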
Install
Most users want pip install "tridec[torch,decoders]" (CPU+GPU torch backend
plus the reference adapters). The bare install is the numpy CPU reference
only — correct but slow.
pip install tridec # numpy CPU reference only
pip install "tridec[torch]" # + batched torch backend (CPU/GPU)
pip install "tridec[gpu]" # + Triton GPU kernels (CUDA or ROCm)
pip install "tridec[decoders]" # + ldpc / relay-bp reference adapters
pip install "tridec[sinter]" # + sinter.collect integration
Quickstart
import stim
import tridec
circuit = stim.Circuit.from_file("memory.stim")
dem = circuit.detector_error_model(decompose_errors=False)
decoder = tridec.from_dem(dem, backend="auto") # triton > torch > numpy
dets, obs = circuit.compile_detector_sampler(seed=0).sample(
    100_000, separate_observables=True)
pred = decoder.decode_batch(dets) # (shots, n_obs) bool
print("logical error rate:", (pred != obs).any(axis=1).mean())
Raw matrices work too: tridec.from_matrices(H, priors, observables=Lo).
Relay-BP: tridec.from_dem(dem, algorithm="relay") (Triton kernels only).
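To make the raw-matrix inputs concrete, here is a minimal pure-Python sketch of syndrome-based min-sum BP on a 3-bit repetition code. It is an illustration only: the function name min_sum_bp is hypothetical, the schedule is plain flooding with unnormalized min-sum, and tridec's kernels may differ in scaling, clamping, and stopping rules. H, priors, and Lo play the roles of the from_matrices arguments.

```python
import math


def min_sum_bp(H, priors, syndrome, max_iter=20):
    """Toy flooding min-sum BP. Returns (error_estimate, converged)."""
    m, n = len(H), len(H[0])
    prior_llr = [math.log((1.0 - p) / p) for p in priors]
    checks = [[j for j in range(n) if H[i][j]] for i in range(m)]
    # Variable-to-check messages start at the prior log-likelihood ratios.
    v2c = {(i, j): prior_llr[j] for i in range(m) for j in checks[i]}
    error = [0] * n
    for _ in range(max_iter):
        c2v = {}
        for i, nbrs in enumerate(checks):
            for j in nbrs:
                others = [v2c[(i, k)] for k in nbrs if k != j]
                # Sign: syndrome bit times the signs of the other inputs.
                sign = -1.0 if syndrome[i] else 1.0
                for x in others:
                    if x < 0:
                        sign = -sign
                # Magnitude: smallest |message| among the other inputs.
                # (Degree-1 checks send 0 here, a simplification.)
                mag = min((abs(x) for x in others), default=0.0)
                c2v[(i, j)] = sign * mag
        posterior = list(prior_llr)
        for (i, j), msg in c2v.items():
            posterior[j] += msg
        error = [1 if llr < 0 else 0 for llr in posterior]
        if all(sum(error[j] for j in checks[i]) % 2 == syndrome[i]
               for i in range(m)):
            return error, True
        # Extrinsic update: remove each check's own contribution.
        for (i, j) in v2c:
            v2c[(i, j)] = posterior[j] - c2v[(i, j)]
    return error, False


H = [[1, 1, 0], [0, 1, 1]]   # parity checks of the 3-bit repetition code
priors = [0.1, 0.1, 0.1]     # per-mechanism error probabilities
Lo = [[1, 0, 0]]             # one logical observable
syndrome = [1, 0]            # produced by a single flip on bit 0

estimate, converged = min_sum_bp(H, priors, syndrome)
predicted_obs = [sum(r * e for r, e in zip(row, estimate)) % 2 for row in Lo]
print(estimate, converged, predicted_obs)
```

The predicted observable is the parity of the estimated error against each row of Lo, which is the quantity compared with obs in the quickstart.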
With sinter (the [sinter] extra):
import sinter
from tridec.sinter import sinter_decoders
stats = sinter.collect(
    num_workers=4, tasks=tasks,
    decoders=["tridec_bp", "pymatching"],
    custom_decoders=sinter_decoders(),
    max_shots=1_000_000)
Backend × algorithm matrix (honest availability)
| Algorithm | numpy | torch | triton | metal (experimental) |
|---|---|---|---|---|
| min-sum BP | yes (CPU reference) | yes (CPU + CUDA/ROCm) | yes (CUDA + ROCm) | yes (fp32) |
| Relay-BP | no | no | yes (CUDA + ROCm) | yes (fp32, slow — see below) |
There is no in-package CPU Relay-BP; its CPU reference is IBM's relay-bp
Rust decoder, wrapped in tridec.adapters and used as the validation
oracle for the Triton path.
What's validated where
| Environment | Status |
|---|---|
| CPU (any) | numpy BP reference; torch BP bit-identical to numpy at fp64 (one iteration), LER-identical full decode |
| NVIDIA H200, CUDA 12.4, torch 2.4.1, triton 3.0.0 | Triton BP: ≥99.5% hard-decision agreement vs fp64 references, LER-identical (156 = 156 = 156 fails / 2000 shots vs numpy/torch). Triton Relay-BP: LER-matches the relay-bp Rust oracle (31 vs 38 fails / 2000, overlapping Wilson CIs) — carried source-repo receipts |
| AMD MI300X, ROCm 7.0.0, torch 2.9, triton 3.4.0 | Same kernels, unmodified: identical primitive-identity numbers (pre-leg posterior max-diff 1.8e-15) and the same oracle-vs-Triton LER identity (carried receipts) — and validated through the installed package for v0.1.0 (bench/receipts/mi300x_packaged.json): full suite 88 passed / 10 skipped on gfx942 (GPU tiers bind, darwin-only strict tiers skip), packaged-API BP 166 = numpy 166 fails / 2000, Relay-BP fp32 34 vs Rust oracle 31 (overlapping CIs), throughput within ±2.2% of the carried receipt |
| Apple silicon (M4 Max), triton-metal | Experimental, spike-validated only (bench/receipts/metal_spike.md): both kernels pass the same correctness gates at fp32; see the section below |
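The "overlapping Wilson CIs" criterion in the table can be reproduced by hand. The sketch below assumes a 95% two-sided interval (z = 1.96); the confidence level used by tridec.validation is not stated here, and the helper names are illustrative rather than the package's API.

```python
import math


def wilson_interval(failures, shots, z=1.96):
    """Wilson score interval for a binomial failure rate."""
    p = failures / shots
    denom = 1.0 + z * z / shots
    center = (p + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots
                         + z * z / (4 * shots * shots)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# The H200 Relay-BP row: 31 vs 38 failures out of 2000 shots.
ci_a = wilson_interval(31, 2000)
ci_b = wilson_interval(38, 2000)
print(f"31/2000: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
print(f"38/2000: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
print("overlap:", intervals_overlap(ci_a, ci_b))
```

Overlapping intervals are a coarse consistency check, not an equivalence test, which is presumably why the harness also ships TOST statistics.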
Experimental: Apple silicon (Metal)
The same Triton kernels run on Apple-silicon GPUs through triton-metal, with zero changes to the kernel source. This is experimental: validated at spike level on one machine (M4 Max), fp32 only (Metal has no fp64), and not part of the official receipt set.
# triton-metal + a triton >= 3.6 build + torch must be importable, then:
pip install tridec
python -c "import tridec; print(tridec.available_backends())" # ['metal', ...]
backend="auto" detects the triton-metal environment (darwin, triton +
triton_metal importable, no CUDA/ROCm device) and selects "metal";
backend="triton" resolves to "metal" there too, and backend="metal"
asserts the environment is present. The execution pattern is triton-metal's
documented one — CPU torch tensors (zero-copy via unified memory; not
mps) — so no device arguments are needed.
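The resolution rules described above can be sketched as a small pure function. This is a guess at the logic based only on this README (triton > torch > numpy, with metal chosen on darwin when triton_metal is importable and no CUDA/ROCm device exists); pick_backend and importable are hypothetical names, not tridec's implementation.

```python
import importlib.util
import sys


def importable(name):
    """True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def pick_backend(platform, modules, has_cuda_or_rocm):
    """Hypothetical resolver for backend="auto"; not tridec's actual code."""
    if "triton" in modules:
        if has_cuda_or_rocm:
            return "triton"
        if platform == "darwin" and "triton_metal" in modules:
            return "metal"
    if "torch" in modules:
        return "torch"
    return "numpy"


found = {m for m in ("triton", "triton_metal", "torch") if importable(m)}
# GPU detection is passed in as a flag to keep the sketch dependency-free;
# a real check would query the installed torch build for a device.
print(pick_backend(sys.platform, found, has_cuda_or_rocm=False))
```

Keeping the decision a pure function of (platform, importable modules, device flag) makes each branch easy to test on a machine that has none of the GPU stacks installed.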
What the spike measured (2000 canonical shots, seed 0, M4 Max — re-validated
through this API path in tests/test_metal.py):
- min-sum BP (fp32): all correctness gates pass — one-iteration hard agreement 1.000 vs the fp64 numpy reference on both the surface-code and BB-code fixtures; LER 76 = 76 / 2000 (surface) and 167 vs 168 / 2000 (BB). Batched decode of 2000 shots in 28 ms (surface) / 167 ms (BB) — 37–56× the per-shot numpy baseline on the same machine.
- Relay-BP (fp32): correct but slow. LER matches the relay-bp Rust oracle (31 vs 39 fails / 2000, per-shot agreement 99.3%), but decode_batch(2000) takes 31 s vs 1.26 s for the Rust CPU oracle: relay's per-iteration host loop (~7k small kernel launches) is launch-overhead dominated on Metal. Use it for validation, not production.
- Relay-BP on metal enforces fp32: dtype="float64" raises with a clear error; the default resolves to float32.
No claims beyond the spike: no official LER receipts, no cross-machine validation, no performance tuning. CUDA/ROCm remain the supported GPU paths.
Compatibility floors are set in pyproject.toml; known-good pins: stim 1.15.0,
ldpc 2.4.1, relay-bp 0.2.2, torch 2.4.1 / 2.9, triton 3.0 / 3.4.
Validation discipline
tridec.validation ships the matched-protocol harness the numbers were
produced with: dem_hash (sha256 of the DEM's canonical bytes), run_matched
(one shared DEM, one shot set, fail-fast DEM-identity and tie-break gates),
Wilson/TOST statistics and a paired per-shot gap-to-MLE bootstrap. The test
suite pins the extraction byte-for-byte: 8 canonical BB-code fixture circuits
must hash to the exact DEM sha256s recorded in the carried zoo_grid.json
receipt, and a full 16,667-shot cell must reproduce the recorded
logical-failure counts of the ldpc reference adapters exactly.
For v0.1.0 the WHOLE grid was re-decoded in the receipt environment
(bench/full_grid_noregression.py): 31 of 32 (cell, decoder) failure counts
reproduce exactly — all 24 BP / BP-OSD-0 / BP-OSD-10 counts, and 7 of 8
BPLSD counts. The single deviation (BPLSD, p=0.002/X: 879 vs 880, one shot
in 200,000) is attributed by a same-environment repeat experiment to
run-to-run nondeterminism inside ldpc's BpLsdDecoder itself (identical
shots, fresh instances: 879/880/879) — documented in
bench/receipts/full_grid_noregression.json.
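The fail-fast DEM-identity gate amounts to comparing a content hash against a recorded value before any decoding happens. The sketch below shows the idea with hashlib; the canonicalization used here (normalized newlines, stripped trailing whitespace, no blank lines) is an assumption made for illustration, and the real dem_hash may define canonical bytes differently. The function names are hypothetical.

```python
import hashlib


def text_sha256(dem_text):
    """sha256 over an assumed canonical form of DEM text (illustrative)."""
    lines = [ln.rstrip() for ln in dem_text.replace("\r\n", "\n").split("\n")]
    canonical = "\n".join(ln for ln in lines if ln) + "\n"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_dem(expected_hash, dem_text):
    """Fail fast if the model being decoded is not the recorded one."""
    actual = text_sha256(dem_text)
    if actual != expected_hash:
        raise ValueError(f"DEM hash mismatch: {actual} != {expected_hash}")


recorded = text_sha256("error(0.1) D0 D1\nerror(0.1) D1 L0\n")

# Same model with Windows newlines and trailing spaces: passes the gate.
assert_same_dem(recorded, "error(0.1) D0 D1  \r\nerror(0.1) D1 L0\r\n")

# A changed probability is a different model: the gate raises.
try:
    assert_same_dem(recorded, "error(0.2) D0 D1\nerror(0.1) D1 L0\n")
    gate_tripped = False
except ValueError:
    gate_tripped = True
print("gate tripped on modified DEM:", gate_tripped)
```

Hashing before decoding means two decoders can only be compared on failure counts once they are known to have seen the same error model.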
Status
0.2.0 — adds the opt-in megakernel backend (tri-platform validated; see
above). v0.1.0 shipped the two-kernel BP/Relay-BP path + the validation
discipline. The kernels and their receipts are stable; the public API surface
is young and may still move before 1.0 — minor 0.x releases may rename or
remove public API; 1.0 will lock the surface. Next (v0.2.1):
megakernel auto-dispatch from from_dem(..., backend="auto"), gated on the
public-API path running on a GPU
(#5). GPU paths require triton + a CUDA/ROCm GPU (or the experimental triton-metal environment); the GPU/metal test tiers skip cleanly where unavailable.
License
Apache-2.0.