b12x

Unapologetically SM120-only CuTe DSL kernels for NVFP4 GEMM and MoE.

b12x is an SM120-only CuTe DSL kernel library for Blackwell NVFP4 dense GEMM and routed Mixture-of-Experts inference.

It is intentionally narrow. This is not a generic CUDA kernel collection or a full model-serving stack, and it does not target any other GPU architecture, including SM100. It is a focused package for a small number of hand-tuned, high-performance SM120 kernels, plus the runtime glue needed to launch them cleanly from PyTorch and sglang.

Installation

Runtime install

python -m pip install b12x

Development install from source

git clone <repo-url>
cd b12x
python -m pip install -e '.[dev]'

Requirements

  • Blackwell SM120 GPU
  • CUDA 13 toolchain
  • Python >=3.10,<4.0
  • PyTorch built for CUDA 13, torch>=2.10.0
  • nvidia-cutlass-dsl[cu13]==4.4.1
  • FlashInfer (optional): needed only for reference and benchmark comparisons; it is not a runtime dependency
  • A Qwen3.5-397B A17B NVFP4 checkpoint, pointed to by B12X_MODEL_PATH, for the end-to-end MoE benchmark
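The SM120 requirement corresponds to CUDA compute capability 12.0. A sketch of a fail-fast guard (helper names here are illustrative, not part of b12x; in practice the capability tuple would come from torch.cuda.get_device_capability()):

```python
def sm_arch(capability):
    """Format a (major, minor) compute capability as an SM string."""
    major, minor = capability
    return f"SM{major}{minor}"

def is_sm120(capability):
    """b12x supports exactly SM120, i.e. compute capability (12, 0)."""
    return tuple(capability) == (12, 0)

# Hard-coded tuples for illustration; a real check would query the device.
print(sm_arch((12, 0)), is_sm120((12, 0)))  # SM120 True
print(sm_arch((10, 0)), is_sm120((10, 0)))  # SM100 False
```

A guard like this at import or launch time gives a clear error on SM100 or older GPUs instead of a kernel failure.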

Package layout

  • b12x.cute
    • Low-level CuTe and FP4 helpers
  • b12x.gemm
    • Standalone dense NVFP4 GEMM
  • b12x.integration
    • Public runtime entrypoints such as b12x_moe_fp4
  • b12x.moe.fused
    • Static and dynamic fused MoE kernels and reference paths
  • b12x.quant
    • Torch-side NVFP4 packing and quantization helpers
  • b12x.sglang
    • Thin sglang integration shims
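As background for b12x.quant: NVFP4 stores two 4-bit codes per byte. This pure-Python sketch illustrates one plausible nibble layout (low nibble first is an assumption for illustration, not a statement about b12x's internal packing, which is done Torch-side):

```python
def pack_fp4_pair(lo, hi):
    """Pack two 4-bit code values into one byte, low nibble first."""
    assert 0 <= lo < 16 and 0 <= hi < 16
    return lo | (hi << 4)

def unpack_fp4_pair(byte):
    """Recover the two 4-bit codes from a packed byte."""
    return byte & 0xF, byte >> 4

packed = pack_fp4_pair(0x3, 0xA)
print(hex(packed))              # 0xa3
print(unpack_fp4_pair(packed))  # (3, 10)
```

The 2x density over bytes is what makes FP4 weights attractive for bandwidth-bound GEMM and MoE inference.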

MoE runtime contract

  • b12x.integration.tp_moe.b12x_moe_fp4 requires a caller-owned workspace.
  • b12x selects its fused MoE backend from shape alone:
    • compact routed workloads use the static backend
    • all larger routed workloads use the dynamic backend
  • Use allocate_tp_moe_workspace(...) for one exact unchunked launch shape.
  • Use allocate_tp_moe_workspace_pool() for variable-size or chunked workloads.
  • Keep one workspace pool per process/device, and let the pool partition scratch by CUDA stream internally.
  • During CUDA graph capture, output= must also be caller-owned and stable across replays.
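The contract above can be sketched in plain Python. Everything here is illustrative: the 64-token threshold is an invented placeholder (b12x's real cutoff is internal and chosen by shape alone), and the pool class merely mimics the per-stream partitioning that allocate_tp_moe_workspace_pool() performs with CUDA memory:

```python
def select_moe_backend(num_routed_tokens, compact_threshold=64):
    """Shape-only dispatch: compact workloads -> static, larger -> dynamic.

    The threshold value is an illustrative assumption, not b12x's.
    """
    return "static" if num_routed_tokens <= compact_threshold else "dynamic"

class WorkspacePool:
    """Hypothetical per-process pool that partitions scratch by stream key."""

    def __init__(self):
        self._by_stream = {}

    def scratch_for(self, stream_key, nbytes):
        # Reuse the stream's buffer when it is large enough; grow otherwise.
        buf = self._by_stream.get(stream_key)
        if buf is None or len(buf) < nbytes:
            buf = bytearray(nbytes)
            self._by_stream[stream_key] = buf
        return buf

print(select_moe_backend(8))     # static
print(select_moe_backend(8192))  # dynamic
```

Keeping one pool per process/device, as recommended above, means concurrent streams never contend for the same scratch buffer.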

Benchmarks and tests

Benchmarks

  • benchmarks/benchmark_moe.py
    • End-to-end Qwen3.5-397B TP=4 MoE benchmark
    • micro batch profile: [1, 2, 4, 8]
    • sglang-single-request batch profile: [1, 23, 80]
    • chunked-prefill batch profile: [8192, 16384, 24576, 32768]
  • benchmarks/benchmark_dense_gemm.py
    • Dense FP4 GEMM vs FlashInfer/cuDNN/CUTLASS

Tests

  • tests/test_tp_moe_reference.py
    • Independent oracle-backed MoE correctness test
  • tests/test_moe_equivalence.py
    • Real-weight smoke and CUDA-graph replay routing-safety checks
  • tests/test_gemm_stack.py
    • Dense GEMM exactness vs FlashInfer/cuDNN

Common commands

# Graph-first benchmark defaults with auto-dispatch
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py

# Measure eager launches instead of CUDA graph replay
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --no-cuda-graph

# Include routing in the timed region
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --include-routing

# Use the recorded single-request sglang profile
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --batch-size-profile sglang-single-request

# Graph-first prefill-scale sweep aligned with chunked-prefill serving
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --batch-size-profile chunked-prefill

# Multi-layer CUDA-graph replay validation with real consecutive MoE layers
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --graph-mode multi-layer --reference none --validate none

# Dense GEMM microbenchmark
python benchmarks/benchmark_dense_gemm.py

# Oracle-backed MoE correctness
python tests/test_tp_moe_reference.py --impls b12x --scale-contract per-expert

# Real-weight CUDA-graph smoke
pytest tests/test_moe_equivalence.py
