
Vendor-portable GPU decoders for quantum LDPC codes: Triton min-sum BP and Relay-BP on NVIDIA (CUDA) and AMD (ROCm), with CPU reference implementations, consuming any stim DetectorErrorModel or raw parity-check matrices.


tridec

ci

Badge honesty: CI is CPU-only (ubuntu + macos arm64; the macos lane binds the strict exact-count receipt gates). There are no GPU runners — the CUDA/ROCm kernel paths are validated by the carried H200/MI300X receipts in bench/receipts/, and the experimental Metal tier runs on a local machine.

An open, vendor-portable GPU decoder library for quantum LDPC codes. It provides Triton min-sum BP and Relay-BP decoders that consume any stim DetectorErrorModel or raw parity-check matrices and run on NVIDIA (CUDA) and AMD (ROCm) GPUs. It also includes CPU reference implementations, and the decoders are validated against the standard CPU references (ldpc, relay-bp).

The same Triton kernels run unmodified on both vendors: the Relay-BP kernel reproduces its logical-error-rate validation numbers identically on an NVIDIA H200 (CUDA 12.4, triton 3.0) and an AMD MI300X (ROCm 7.0, triton 3.4) — see docs/benchmark.md and the raw receipts in bench/receipts/. Validated scope is NVIDIA + AMD; Apple silicon runs the same kernels through triton-metal as an experimental backend (see below).

v0.2: the megakernel backend (opt-in)

A single-launch persistent megakernel: the entire Relay-BP decode (every BP iteration, every relay leg, in-kernel syndrome convergence, nconv stop, and lowest-weight selection) runs in one kernel launch per decode_batch, with per-shot early exit, replacing the thousands of launches issued by the v0.1 host loop. It is validated on all three platforms against the v0.1 two-kernel path and the relay-bp Rust oracle (14/14 gates at BLOCK 128/256 on CUDA and ROCm; barriers verified as honored on both):

Relay-BP megakernel | vs v0.1 two-kernel speedup
Apple M4 Max (Metal, triton-metal) | 65× (30.0 s → 0.46 s / 2000 shots)
NVIDIA H200 (CUDA) | 9–18× (batch-1 62.5 → 3.44 ms; 34.6 µs/syn @8192)
AMD MI300X (ROCm) | 11–22× (batch-1 8.48 ms; 46.0 µs/syn @8192)

Opt-in today via tridec.backends.megakernel.{RelayBpMegaTriton, BpMegaTriton}; auto-dispatch from from_dem(..., backend="auto") lands in v0.2.1 once the public-API path is gated on a GPU (#5). Receipts: bench/receipts/megakernel_{h200,mi300x,metal}*.

Megakernel: honest limits + tuning

  • Plain-BP megakernel is a single-shot latency tool, not a throughput tool. At batch-1 it is ~1.7× faster than the two-kernel BP path (H200 0.61 vs 1.06 ms); at large batch it loses (plain BP has no early-exit lever) — the two-kernel BP path stays the throughput default. Use BpMegaTriton for low-latency bare BP, RelayBpMegaTriton for the accurate latency path.
  • Real-time / single-shot: H200 leads MI300X 2.47× at batch-1 (3.44 vs 8.48 ms) — wider than v0.1's ~9% two-kernel gap, because the single-CTA-per-shot design amplifies per-SM and codegen differences at batch-1. Batched, the gap is 1.25–1.33×; correctness is identical across vendors. (The pitch is vendor-portable + performant on both, never parity.)
  • Per-arch autotuning. v0.2 ships autotuned BLOCK/num_warps configs for H200, MI300X and M4 Max, pinned in _CUDA_TUNED keyed by gcnArchName/device name. AMD (wavefront-64) wants the opposite shape from NVIDIA warps — low warps for BP, max BLOCK+warps for relay.
  • Metal is BLOCK=32 only for now — a transient triton-metal barrier-drop bug (fix confirmed on its dev branch); the Metal autotune widens once that merges. fp32-only on Metal (no fp64), same as the two-kernel path, and the fp32 near-tie-flip caveat below applies to the megakernel unchanged.

Install

Most users want pip install "tridec[torch,decoders]" (CPU+GPU torch backend plus the reference adapters). The bare install is the numpy CPU reference only — correct but slow.

pip install tridec                # numpy CPU reference only
pip install "tridec[torch]"       # + batched torch backend (CPU/GPU)
pip install "tridec[gpu]"         # + Triton GPU kernels (CUDA or ROCm)
pip install "tridec[decoders]"    # + ldpc / relay-bp reference adapters
pip install "tridec[sinter]"      # + sinter.collect integration

Quickstart

import stim
import tridec

circuit = stim.Circuit.from_file("memory.stim")
dem = circuit.detector_error_model(decompose_errors=False)

decoder = tridec.from_dem(dem, backend="auto")   # triton > torch > numpy

dets, obs = circuit.compile_detector_sampler(seed=0).sample(
    100_000, separate_observables=True)
pred = decoder.decode_batch(dets)                      # (shots, n_obs) bool
print("logical error rate:", (pred != obs).any(axis=1).mean())

Raw matrices work too: tridec.from_matrices(H, priors, observables=Lo). Relay-BP: tridec.from_dem(dem, algorithm="relay") (Triton kernels only).
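
To make the raw-matrix inputs concrete, here is a minimal pure-Python sketch of unscaled, flooding-schedule min-sum BP on a toy 3-bit repetition code. The toy H, priors and Lo, and the helper names matvec_mod2 and min_sum_decode, are illustrative assumptions and not part of tridec; the library's kernels may use a different schedule, scaling, or tie-breaking.

```python
import math

# Toy inputs (hypothetical, for illustration only): 3-bit repetition code.
H = [[1, 1, 0],
     [0, 1, 1]]                 # detectors x error mechanisms
priors = [0.01, 0.01, 0.01]     # prior probability of each mechanism
Lo = [[1, 0, 0]]                # observables x error mechanisms


def matvec_mod2(M, e):
    """Binary matrix-vector product: which rows of M are flipped by e."""
    return [sum(m * x for m, x in zip(row, e)) % 2 for row in M]


def min_sum_decode(H, priors, syndrome, max_iter=20):
    """Unscaled min-sum BP. Assumes every check touches >= 2 mechanisms."""
    n_checks, n_vars = len(H), len(H[0])
    edges = [[v for v in range(n_vars) if H[c][v]] for c in range(n_checks)]
    llr = [math.log((1 - p) / p) for p in priors]
    # Variable-to-check messages start at the prior log-likelihood ratios.
    v2c = {(c, v): llr[v] for c in range(n_checks) for v in edges[c]}
    guess = [0] * n_vars
    for _ in range(max_iter):
        # Check update: sign product (flipped by the syndrome bit) times
        # the smallest magnitude among the other incoming messages.
        c2v = {}
        for c in range(n_checks):
            for v in edges[c]:
                others = [v2c[(c, u)] for u in edges[c] if u != v]
                sign = -1.0 if syndrome[c] else 1.0
                for m in others:
                    if m < 0:
                        sign = -sign
                c2v[(c, v)] = sign * min(abs(m) for m in others)
        # Posterior LLRs and hard decision (negative LLR means "error").
        posterior = list(llr)
        for (c, v), m in c2v.items():
            posterior[v] += m
        guess = [1 if L < 0 else 0 for L in posterior]
        if matvec_mod2(H, guess) == list(syndrome):
            break  # the guess reproduces the syndrome
        # Variable update: posterior minus the message from that same check.
        v2c = {(c, v): posterior[v] - c2v[(c, v)] for (c, v) in c2v}
    return guess


error = [0, 1, 0]
syndrome = matvec_mod2(H, error)
correction = min_sum_decode(H, priors, syndrome)
# A logical error occurs when the predicted observable flip is wrong.
logical_error = matvec_mod2(Lo, correction) != matvec_mod2(Lo, error)
```

With tridec installed, the same three inputs would go to tridec.from_matrices(H, priors, observables=Lo); the array types that call accepts are not stated here, so check the package documentation before relying on plain lists.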

With sinter (the [sinter] extra):

import sinter
from tridec.sinter import sinter_decoders

stats = sinter.collect(
    num_workers=4, tasks=tasks,
    decoders=["tridec_bp", "pymatching"],
    custom_decoders=sinter_decoders(),
    max_shots=1_000_000)

Backend × algorithm matrix (honest availability)

Algorithm | numpy | torch | triton | metal (experimental)
min-sum BP | yes (CPU reference) | yes (CPU + CUDA/ROCm) | yes (CUDA + ROCm) | yes (fp32)
Relay-BP | no | no | yes (CUDA + ROCm) | yes (fp32, slow; see below)

There is no in-package CPU Relay-BP; its CPU reference is IBM's relay-bp Rust decoder, wrapped in tridec.adapters and used as the validation oracle for the Triton path.

What's validated where

Environment | Status
CPU (any) | numpy BP reference; torch BP bit-identical to numpy at fp64 (one iteration), LER-identical full decode
NVIDIA H200, CUDA 12.4, torch 2.4.1, triton 3.0.0 | Triton BP: ≥99.5% hard-decision agreement vs fp64 references, LER-identical (156 = 156 = 156 fails / 2000 shots vs numpy/torch). Triton Relay-BP: LER-matches the relay-bp Rust oracle (31 vs 38 fails / 2000, overlapping Wilson CIs). Carried source-repo receipts
AMD MI300X, ROCm 7.0.0, torch 2.9, triton 3.4.0 | Same kernels, unmodified: identical primitive-identity numbers (pre-leg posterior max-diff 1.8e-15) and the same oracle-vs-Triton LER identity (carried receipts). Also validated through the installed package for v0.1.0 (bench/receipts/mi300x_packaged.json): full suite 88 passed / 10 skipped on gfx942 (GPU tiers bind, darwin-only strict tiers skip), packaged-API BP 166 = numpy 166 fails / 2000, Relay-BP fp32 34 vs Rust oracle 31 (overlapping CIs), throughput within ±2.2% of the carried receipt
Apple silicon (M4 Max), triton-metal | Experimental, spike-validated only (bench/receipts/metal_spike.md): both kernels pass the same correctness gates at fp32; see the section below

Experimental: Apple silicon (Metal)

The same Triton kernels run on Apple-silicon GPUs through triton-metal, with zero changes to the kernel source. This is experimental: validated at spike level on one machine (M4 Max), fp32 only (Metal has no fp64), and not part of the official receipt set.

# triton-metal + a triton >= 3.6 build + torch must be importable, then:
pip install tridec
python -c "import tridec; print(tridec.available_backends())"  # ['metal', ...]

backend="auto" detects the triton-metal environment (darwin, triton + triton_metal importable, no CUDA/ROCm device) and selects "metal"; backend="triton" resolves to "metal" there too, and backend="metal" asserts the environment is present. The execution pattern is triton-metal's documented one — CPU torch tensors (zero-copy via unified memory; not mps) — so no device arguments are needed.
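
The selection rules described above can be sketched as a small pure function. This is only an illustration of the stated priority (triton > torch > numpy, with "metal" chosen in the triton-metal environment); the names resolve_auto_backend and importable_modules are hypothetical and this is not tridec's actual dispatch code. Whether a CUDA/ROCm device is present is passed in as a flag, because probing it depends on the installed torch build.

```python
import importlib.util


def importable_modules(names=("triton", "triton_metal", "torch")):
    """Return the subset of top-level module names that can be imported."""
    return {n for n in names if importlib.util.find_spec(n) is not None}


def resolve_auto_backend(platform, importable, gpu_present):
    """Hypothetical sketch of backend="auto" resolution.

    platform    -- value of sys.platform, e.g. "linux" or "darwin"
    importable  -- set of importable module names
    gpu_present -- True if a CUDA or ROCm device is visible
    """
    has_triton = "triton" in importable
    # Triton kernels need both the triton package and a CUDA/ROCm device.
    if has_triton and gpu_present:
        return "triton"
    # triton-metal environment: darwin, triton + triton_metal, no CUDA/ROCm.
    if (platform == "darwin" and has_triton
            and "triton_metal" in importable and not gpu_present):
        return "metal"
    # Batched torch backend runs on CPU when no GPU path applies.
    if "torch" in importable:
        return "torch"
    # The numpy CPU reference is always the last resort.
    return "numpy"
```

On a real machine one might call resolve_auto_backend(sys.platform, importable_modules(), gpu_present) after importing sys, with gpu_present taken from something like torch.cuda.is_available(); tridec.available_backends() remains the authoritative check.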

What the spike measured (2000 canonical shots, seed 0, M4 Max — re-validated through this API path in tests/test_metal.py):

  • min-sum BP (fp32): all correctness gates pass — one-iteration hard agreement 1.000 vs the fp64 numpy reference on both the surface-code and BB-code fixtures; LER 76 = 76 / 2000 (surface) and 167 vs 168 / 2000 (BB). Batched decode of 2000 shots in 28 ms (surface) / 167 ms (BB) — 37–56× the per-shot numpy baseline on the same machine.
  • Relay-BP (fp32): correct but slow — LER matches the relay-bp Rust oracle (31 vs 39 fails / 2000, per-shot agreement 99.3%), but decode_batch(2000) takes 31 s vs 1.26 s for the Rust CPU oracle: relay's per-iteration host loop (~7k small kernel launches) is launch-overhead dominated on Metal. Use it for validation, not production.
  • Relay-BP on metal enforces fp32: dtype="float64" raises with a clear error; the default resolves to float32.

No claims beyond the spike: no official LER receipts, no cross-machine validation, no performance tuning. CUDA/ROCm remain the supported GPU paths.

Compatibility floors are declared in pyproject.toml; known-good pins: stim 1.15.0, ldpc 2.4.1, relay-bp 0.2.2, torch 2.4.1 / 2.9, triton 3.0 / 3.4.

Validation discipline

tridec.validation ships the matched-protocol harness the numbers were produced with: dem_hash (sha256 of the DEM's canonical bytes), run_matched (one shared DEM, one shot set, fail-fast DEM-identity and tie-break gates), Wilson/TOST statistics and a paired per-shot gap-to-MLE bootstrap. The test suite pins the extraction byte-for-byte: 8 canonical BB-code fixture circuits must hash to the exact DEM sha256s recorded in the carried zoo_grid.json receipt, and a full 16,667-shot cell must reproduce the recorded logical-failure counts of the ldpc reference adapters exactly.
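
As a worked illustration of the "overlapping Wilson CIs" statements in this README, the following sketch computes the standard Wilson score interval for a failure count and checks whether two intervals overlap. It is a standalone example using only the standard library; the function names are assumptions, it is not the tridec.validation implementation, and overlap is a weaker check than the TOST equivalence test mentioned above.

```python
import math


def wilson_interval(fails, shots, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = fails / shots
    denom = 1.0 + z * z / shots
    centre = (p + z * z / (2.0 * shots)) / denom
    half = z * math.sqrt(p * (1.0 - p) / shots
                         + z * z / (4.0 * shots * shots)) / denom
    return centre - half, centre + half


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Counts quoted in this README: Rust oracle 31 vs Triton Relay-BP 38 of 2000.
oracle_ci = wilson_interval(31, 2000)
triton_ci = wilson_interval(38, 2000)
print("oracle:", oracle_ci)
print("triton:", triton_ci)
print("overlap:", intervals_overlap(oracle_ci, triton_ci))
```

For these counts the two intervals overlap, which is consistent with the LER-match claim but does not by itself establish equivalence.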

For v0.1.0 the WHOLE grid was re-decoded in the receipt environment (bench/full_grid_noregression.py): 31 of 32 (cell, decoder) failure counts reproduce exactly — all 24 BP / BP-OSD-0 / BP-OSD-10 counts, and 7 of 8 BPLSD counts. The single deviation (BPLSD, p=0.002/X: 879 vs 880, one shot in 200,000) is attributed by a same-environment repeat experiment to run-to-run nondeterminism inside ldpc's BpLsdDecoder itself (identical shots, fresh instances: 879/880/879) — documented in bench/receipts/full_grid_noregression.json.

Status

0.2.0: adds the opt-in megakernel backend (tri-platform validated; see above). v0.1.0 shipped the two-kernel BP/Relay-BP path and the validation discipline. The kernels and their receipts are stable; the public API surface is young and may still move before 1.0: minor 0.x releases may rename or remove public API, and 1.0 will lock the surface. Next (v0.2.1): megakernel auto-dispatch from from_dem(..., backend="auto"), gated on the public-API path running on a GPU (#5). GPU paths require triton + a CUDA/ROCm GPU (or the experimental triton-metal environment); the GPU/metal test tiers skip cleanly where unavailable.

License

Apache-2.0.
