Unapologetically SM120-only CuTe DSL kernels for NVFP4 GEMM and MoE.

b12x

b12x is an SM120-only CuTe DSL kernel library for Blackwell NVFP4 dense GEMM and routed Mixture-of-Experts inference.

It is intentionally narrow: not a generic CUDA kernel collection, not a full model-serving stack, and not intended to target any other GPU architecture, including SM100. It is a focused package of a small number of hand-tuned, high-performance SM120 kernels plus the runtime glue needed to launch them cleanly from PyTorch and sglang.

Installation

Runtime install

python -m pip install b12x

Development install from source

git clone <repo-url>
cd b12x
python -m pip install -e '.[dev]'
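After either install, a quick way to confirm what actually landed in the environment is to ask the packaging metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are taken from this README's install command and Requirements list.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as they appear in this README.
    for name in ("b12x", "torch", "nvidia-cutlass-dsl"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Because nvidia-cutlass-dsl is pinned to an exact version, comparing the printed value against 4.4.1 is a cheap first check when a kernel fails to compile.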

Requirements

  • Blackwell SM120 GPU
  • CUDA 13 toolchain
  • Python >=3.10,<4.0
  • PyTorch built for CUDA 13, torch>=2.10.0
  • nvidia-cutlass-dsl[cu13]==4.4.1
  • FlashInfer, only if you want reference and benchmark comparisons; it is not a runtime dependency
  • Qwen3.5-397B-A17B NVFP4 checkpoint, with its path set in B12X_MODEL_PATH, for the end-to-end MoE benchmark
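The hardware and toolchain requirements above can be checked up front instead of discovered at kernel launch. The sketch below is not part of b12x; the function names are hypothetical, and it assumes that an SM120 device reports compute capability (12, 0) through `torch.cuda.get_device_capability`. The comparison logic is kept free of PyTorch so it can be exercised on any machine.

```python
import sys


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.10.0+cu130' or '13.0'."""
    parts = version.split("+", 1)[0].split(".")
    minor = int(parts[1]) if len(parts) > 1 else 0
    return int(parts[0]), minor


def check_requirements(python_version, torch_version, cuda_version, capability):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not (3, 10) <= tuple(python_version[:2]) < (4, 0):
        problems.append("Python must satisfy >=3.10,<4.0")
    if parse_major_minor(torch_version) < (2, 10):
        problems.append("torch must be >=2.10.0")
    if cuda_version is None or parse_major_minor(cuda_version)[0] != 13:
        problems.append("PyTorch must be a CUDA 13 build")
    # Assumption: SM120 corresponds to compute capability (12, 0).
    if tuple(capability) != (12, 0):
        problems.append("GPU must be SM120")
    return problems


def probe_local_environment():
    """Run the checks against the current interpreter and the first GPU."""
    import torch  # imported lazily so the pure checks above work without PyTorch

    if not torch.cuda.is_available():
        return ["no CUDA device is visible to PyTorch"]
    return check_requirements(
        sys.version_info,
        torch.__version__,
        torch.version.cuda,
        torch.cuda.get_device_capability(0),
    )
```

Calling `probe_local_environment()` on the target machine returns an empty list when these checks pass. The nvidia-cutlass-dsl pin and the checkpoint path are not covered here.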

Package layout

  • b12x.cute
    • Low-level CuTe and FP4 helpers
  • b12x.gemm
    • Standalone dense NVFP4 GEMM
  • b12x.integration
    • Public runtime entrypoints such as b12x_moe_fp4
  • b12x.moe.fused
    • Static and dynamic fused MoE kernels, scheduler, and reference paths
  • b12x.quant
    • Torch-side NVFP4 packing and quantization helpers
  • b12x.sglang
    • Thin sglang integration shims
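For readers new to the format that b12x.quant packs, here is a rough pure-Python illustration of block-scaled 4-bit quantization in the style of NVFP4. It is not b12x's implementation or memory layout. It assumes the commonly described scheme of 16-element blocks of E2M1 values sharing one scale, keeps that scale as a plain Python float rather than an FP8 value, omits any per-tensor scale, and picks an arbitrary low-nibble-first packing order. Ties between two grid points go to the smaller magnitude, which may differ from hardware rounding.

```python
# Magnitudes representable by a 4-bit E2M1 float, indexed by the low 3 bits.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of elements sharing one scale


def quantize_block(values):
    """Map one block of floats to 4-bit codes plus a shared scale."""
    amax = max(abs(v) for v in values)
    # Scale so the largest magnitude lands on the top grid point (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        sign_bit = 0x8 if v < 0 else 0x0
        codes.append(sign_bit | index)
    return codes, scale


def dequantize_block(codes, scale):
    """Invert quantize_block up to rounding error."""
    return [
        (-1.0 if c & 0x8 else 1.0) * E2M1_MAGNITUDES[c & 0x7] * scale
        for c in codes
    ]


def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first (illustrative order)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


if __name__ == "__main__":
    block = [6.0, -3.0, 0.5, 0.0] + [0.0] * (BLOCK_SIZE - 4)
    codes, scale = quantize_block(block)
    print(pack_nibbles(codes).hex(), scale)
```

Two codes fit in each byte, so a 16-element block occupies 8 bytes of payload plus its scale.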

Benchmarks and tests

Benchmarks

  • benchmarks/benchmark_moe.py
    • End-to-end Qwen3.5-397B TP=4 MoE benchmark
    • micro batch profile: [1, 2, 4, 8]
    • sglang-single-request batch profile: [1, 23, 80]
    • chunked-prefill batch profile: [8192, 16384, 24576, 32768]
  • benchmarks/benchmark_dense_gemm.py
    • Dense FP4 GEMM vs FlashInfer/cuDNN/CUTLASS

Tests

  • tests/test_tp_moe_reference.py
    • Independent oracle-backed MoE correctness test
  • tests/test_moe_equivalence.py
    • Real-weight smoke and CUDA-graph replay routing-safety checks
  • tests/test_gemm_stack.py
    • Dense GEMM exactness vs FlashInfer/cuDNN

Common commands

# Static backend, graph-first benchmark defaults
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static

# Dynamic backend, same benchmark harness
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend dynamic

# Measure eager launches instead of CUDA graph replay
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --no-cuda-graph

# Include routing in the timed region
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --include-routing

# Use the recorded single-request sglang profile
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --batch-size-profile sglang-single-request

# Graph-first prefill-scale sweep aligned with chunked-prefill serving
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --batch-size-profile chunked-prefill

# Multi-layer CUDA-graph replay validation with real consecutive MoE layers
B12X_MODEL_PATH=/path/to/Qwen3.5-397B-A17B-NVFP4 python benchmarks/benchmark_moe.py --backend static --graph-mode multi-layer --reference none --validate none

# Dense GEMM microbenchmark
python benchmarks/benchmark_dense_gemm.py

# Oracle-backed MoE correctness
python tests/test_tp_moe_reference.py --impls static dynamic --scale-contract per-expert

# Real-weight CUDA-graph smoke
pytest tests/test_moe_equivalence.py
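When sweeping several of the MoE benchmark configurations above, it can be easier to assemble the command line in Python than to repeat the environment prefix by hand. The sketch below uses only the script path, environment variable, and flags shown in the commands above. The helper names are hypothetical, and combining these flags in a single invocation is an assumption that the benchmark script may not support in every case.

```python
import os
import subprocess
import sys

BENCHMARK_SCRIPT = "benchmarks/benchmark_moe.py"


def build_moe_benchmark_argv(backend="static", cuda_graph=True,
                             include_routing=False, batch_size_profile=None):
    """Build the argv list for one benchmark run from the flags in this README."""
    if backend not in ("static", "dynamic"):
        raise ValueError(f"unknown backend: {backend!r}")
    argv = [sys.executable, BENCHMARK_SCRIPT, "--backend", backend]
    if not cuda_graph:
        argv.append("--no-cuda-graph")  # measure eager launches
    if include_routing:
        argv.append("--include-routing")  # time routing as well
    if batch_size_profile is not None:
        argv += ["--batch-size-profile", batch_size_profile]
    return argv


def run_moe_benchmark(model_path, **options):
    """Run the benchmark from the repository root with B12X_MODEL_PATH set."""
    env = dict(os.environ, B12X_MODEL_PATH=model_path)
    return subprocess.run(build_moe_benchmark_argv(**options), env=env, check=True)
```

For example, `run_moe_benchmark("/path/to/Qwen3.5-397B-A17B-NVFP4", backend="dynamic")` corresponds to the second command in the list above.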
