Unapologetically SM120-only CuTe DSL kernels for NVFP4 GEMM and MoE.
b12x
b12x is an SM120-only CuTe DSL kernel library for Blackwell NVFP4 dense GEMM
and routed Mixture-of-Experts inference.
It is intentionally narrow. This is not a generic CUDA kernel collection or a
full model-serving stack, and it does not target any other GPU architecture,
including SM100. It is a focused package for a small number of hand-tuned, high-performance
SM120 kernels plus the runtime glue needed to launch them cleanly from PyTorch and sglang.
Installation
Runtime install
python -m pip install b12x
Development install from source
git clone <repo-url>
cd b12x
python -m pip install -e '.[dev]'
Requirements
- Blackwell SM120 GPU
- CUDA 13 toolchain
- Python >=3.10,<4.0
- CUDA 13 PyTorch, torch>=2.10.0
- nvidia-cutlass-dsl[cu13]==4.4.1
- FlashInfer available if you want reference and benchmark comparisons, but it's not a runtime dependency
- Qwen3.5-397B A17B NVFP4 checkpoint available through B12X_MODEL_PATH for the end-to-end MoE benchmark
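The interpreter and GPU requirements above can be checked up front. A minimal sketch, assuming SM120 corresponds to CUDA compute capability (12, 0) as reported by torch.cuda.get_device_capability(); the helper names python_supported, is_sm120, and check_environment are hypothetical and are not part of b12x:

```python
import sys


def python_supported(version=None):
    """Return True if the (major, minor) version satisfies >=3.10,<4.0."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (4, 0)


def is_sm120(capability):
    """Return True for compute capability (12, 0); assumed to mean SM120."""
    return tuple(capability) == (12, 0)


def check_environment():
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if not python_supported():
        problems.append("Python must satisfy >=3.10,<4.0")
    try:
        import torch  # imported lazily so the check runs without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if not torch.cuda.is_available():
        problems.append("no CUDA device is visible to PyTorch")
    elif not is_sm120(torch.cuda.get_device_capability()):
        problems.append("the current GPU does not report compute capability 12.0")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("requirement not met:", problem)
```

The sketch deliberately leaves out the CUDA toolchain and nvidia-cutlass-dsl version checks, since how those should be probed is not described here.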
Package layout
- b12x.cute: Low-level CuTe and FP4 helpers
- b12x.gemm: Standalone dense NVFP4 GEMM
- b12x.integration: Public runtime entrypoints such as b12x_moe_fp4
- b12x.moe.fused: Static and dynamic fused MoE kernels, scheduler, and reference paths
- b12x.quant: Torch-side NVFP4 packing and quantization helpers
- b12x.sglang: Thin sglang integration shims
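After installation, one quick sanity check is to confirm that the subpackages named in the layout can be located. A small sketch using only the standard library; the module names are copied from the list above, while the helper find_available is hypothetical and not part of b12x:

```python
import importlib.util

# Dotted names taken from the package layout list; nothing else is assumed.
B12X_SUBPACKAGES = [
    "b12x.cute",
    "b12x.gemm",
    "b12x.integration",
    "b12x.moe.fused",
    "b12x.quant",
    "b12x.sglang",
]


def find_available(names=B12X_SUBPACKAGES):
    """Map each dotted module name to True if it can be located, else False."""
    found = {}
    for name in names:
        try:
            # find_spec imports parent packages in order to resolve a submodule.
            found[name] = importlib.util.find_spec(name) is not None
        except ImportError:
            # A parent package (for example b12x itself) is missing or failed to import.
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in find_available().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

Locating a module does not prove that its kernels will compile or launch; it only confirms that the install put the packages on the import path.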
Benchmarks and tests
Benchmarks
- benchmarks/benchmark_moe.py: End-to-end Qwen3.5-397B TP=4 MoE benchmark
  - micro batch profile: [1, 2, 4, 8]
  - sglang-single-request batch profile: [1, 23, 80]
  - chunked-prefill batch profile: [8192, 16384, 24576, 32768]
- benchmarks/benchmark_dense_gemm.py: Dense FP4 GEMM vs FlashInfer/cuDNN/CUTLASS
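The batch-size profiles listed for the MoE benchmark amount to a small lookup table. A minimal sketch of that table as a Python mapping, assuming the first profile is named micro (the name is inferred from the list, not confirmed); the helper batch_sizes_for is hypothetical and is not taken from the benchmark script:

```python
# Batch sizes copied from the profile list; the "micro" key is an assumption.
BATCH_SIZE_PROFILES = {
    "micro": [1, 2, 4, 8],
    "sglang-single-request": [1, 23, 80],
    "chunked-prefill": [8192, 16384, 24576, 32768],
}


def batch_sizes_for(profile):
    """Return a copy of the batch sizes for a named profile."""
    try:
        return list(BATCH_SIZE_PROFILES[profile])
    except KeyError:
        known = ", ".join(sorted(BATCH_SIZE_PROFILES))
        raise ValueError(
            f"unknown profile {profile!r}; expected one of: {known}"
        ) from None


if __name__ == "__main__":
    for name in BATCH_SIZE_PROFILES:
        print(name, batch_sizes_for(name))
```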
Tests
- tests/test_tp_moe_reference.py: Independent oracle-backed MoE correctness test
- tests/test_moe_equivalence.py: Real-weight smoke and CUDA-graph replay routing-safety checks
- tests/test_gemm_stack.py: Dense GEMM exactness vs FlashInfer/cuDNN
Common commands
# Static backend, graph-first benchmark defaults
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static
# Dynamic backend, same benchmark harness
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend dynamic
# Measure eager launches instead of CUDA graph replay
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --no-cuda-graph
# Include routing in the timed region
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --include-routing
# Use the recorded single-request sglang profile
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --batch-size-profile sglang-single-request
# Graph-first prefill-scale sweep aligned with chunked-prefill serving
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --batch-size-profile chunked-prefill
# Multi-layer CUDA-graph replay validation with real consecutive MoE layers
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --graph-mode multi-layer --reference none --validate none
# Dense GEMM microbenchmark
python benchmarks/benchmark_dense_gemm.py
# Oracle-backed MoE correctness
python tests/test_tp_moe_reference.py --impls static dynamic --scale-contract per-expert
# Real-weight CUDA-graph smoke
pytest tests/test_moe_equivalence.py
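The MoE benchmark invocations above differ only in a few flags, so a sweep can assemble them programmatically. A sketch that uses only the flags and the B12X_MODEL_PATH variable shown in those commands; the function names moe_benchmark_command and run_moe_benchmark and their parameters are hypothetical, and the script may accept options not covered here:

```python
import os
import subprocess
import sys


def moe_benchmark_command(backend="static", cuda_graph=True,
                          include_routing=False, batch_size_profile=None):
    """Build the argv list for benchmarks/benchmark_moe.py."""
    if backend not in ("static", "dynamic"):
        raise ValueError(f"backend must be 'static' or 'dynamic', got {backend!r}")
    cmd = [sys.executable, "benchmarks/benchmark_moe.py", "--backend", backend]
    if not cuda_graph:
        cmd.append("--no-cuda-graph")  # time eager launches instead of graph replay
    if include_routing:
        cmd.append("--include-routing")  # include routing in the timed region
    if batch_size_profile is not None:
        cmd += ["--batch-size-profile", batch_size_profile]
    return cmd


def run_moe_benchmark(model_path, **options):
    """Run the benchmark with B12X_MODEL_PATH set; raises if the script fails."""
    env = dict(os.environ, B12X_MODEL_PATH=model_path)
    return subprocess.run(moe_benchmark_command(**options), env=env, check=True)


if __name__ == "__main__":
    # Print the command lines rather than launching them.
    for backend in ("static", "dynamic"):
        print(" ".join(moe_benchmark_command(backend=backend)))
```

Passing the model path through the environment mirrors the shell commands, so the benchmark script itself needs no changes.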
Download files
File details
Details for the file b12x-0.2.1.tar.gz.
File metadata
- Download URL: b12x-0.2.1.tar.gz
- Upload date:
- Size: 69.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f7258931239c881993a2448b69b5dbe3bb84ae0d8c8b42feabc613cd25b9879 |
| MD5 | de20464452cb9606a02f3f7115caef8f |
| BLAKE2b-256 | 15b99d6aad030648bb917de70b3663ffa429c157c9ad573a4badcf4d6227d40c |
File details
Details for the file b12x-0.2.1-py3-none-any.whl.
File metadata
- Download URL: b12x-0.2.1-py3-none-any.whl
- Upload date:
- Size: 71.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 84d73181446e00a09b9a15ed7ed484c714b0a65ce6aa40714f595e84f2acf277 |
| MD5 | da1d8fae5a534bef4598d083d6daeed8 |
| BLAKE2b-256 | 88f1673b6bfa509fc36104da928f280ca57600f66deaedb9f9f70b0a0cf829ec |