Quantization GEMM Kernel
Humming
Humming is a high-performance, lightweight, and highly flexible JIT (Just-In-Time) compiled GEMM kernel library specifically designed for quantized inference.
Key Features
- High Flexibility
  - Supports inference for any weight type of up to 8 bits with FP16 / BF16 / FP8 / FP4 / INT8 / INT4 activations (provided the activation's dynamic range covers the weight type).
  - Supports various quantization strategies.
  - Supports various scale types (BF16, FP16, E4M3, E5M2, and UE8M0).
  - Supports both Dense GEMM and MoE GEMM.
- High Compatibility: supports NVIDIA GPUs from SM75 (Turing architecture) onward.
- High Performance
  - Delivers state-of-the-art (SOTA) throughput and efficiency across a wide range of computational scenarios.
- Ultra-Lightweight
  - Minimal dependencies: requires only PyTorch and NVCC.
  - Compact footprint: the package size is only 100+ KB.
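Of the scale types listed above, UE8M0 is the least familiar: it is an unsigned, exponent-only 8-bit format, so every scale is a power of two. The sketch below follows the E8M0 convention from the OCP Microscaling (MX) formats (bias 127, code `0xFF` reserved for NaN). The helper names `ue8m0_decode` and `ue8m0_encode` are hypothetical and are not part of Humming's API, and rounding the scale up to the next power of two is an assumption made here for illustration.

```python
import math

UE8M0_BIAS = 127   # exponent bias of the E8M0 scale format
UE8M0_NAN = 0xFF   # the all-ones code is reserved for NaN


def ue8m0_decode(code: int) -> float:
    """Decode an 8-bit UE8M0 code into the power-of-two scale it represents."""
    if not 0 <= code <= 0xFF:
        raise ValueError("UE8M0 codes are 8-bit unsigned integers")
    if code == UE8M0_NAN:
        return math.nan
    return math.ldexp(1.0, code - UE8M0_BIAS)  # 2 ** (code - 127)


def ue8m0_encode(scale: float) -> int:
    """Encode a positive scale, rounding up to the next power of two (assumption)."""
    if not (scale > 0 and math.isfinite(scale)):
        raise ValueError("scale must be positive and finite")
    # frexp returns scale = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(scale)
    # mantissa == 0.5 means scale is already an exact power of two
    power = exponent - 1 if mantissa == 0.5 else exponent
    # Clamp to the representable, non-NaN code range 0..254
    return min(max(power + UE8M0_BIAS, 0), UE8M0_NAN - 1)


if __name__ == "__main__":
    for s in (0.25, 0.75, 1.0, 3.0):
        code = ue8m0_encode(s)
        print(f"scale={s} -> code={code} -> decoded={ue8m0_decode(code)}")
```

Rounding up keeps the scaled values inside the representable range at the cost of some resolution; a round-to-nearest variant would trade the other way.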
Support Matrix
| Activation Type | Supported Devices | Supported Weight Types |
|---|---|---|
| FP16 (e5m10) | SM75+ | • Symmetric INT1-8 • INT1-8 with dynamic zero point • Arbitrary signed FP (kBits ≤ 8, kExp ≤ 5) |
| BF16 (e8m7) | SM80+ | • Symmetric INT1-8 • INT1-8 with dynamic zero point • Arbitrary signed FP (kBits ≤ 8) |
| FP8 (e4m3) | SM89+ | • Symmetric INT1-5 • INT1-4 with dynamic zero point • Arbitrary signed FP (kExp ≤ 4, kMan ≤ 3) |
| FP8 (e5m2) | SM89+ | • Symmetric INT1-4 • INT1-3 with dynamic zero point • Arbitrary signed FP (kExp ≤ 5, kMan ≤ 2) |
| FP4 (e2m1) | SM120+ | • Symmetric INT1-3 • INT1-2 with dynamic zero point • Arbitrary signed FP (kExp ≤ 2, kMan ≤ 1) |
| INT8 | SM75+ | • Symmetric INT1-8 • INT1-7 with dynamic zero point |
| INT4 | SM80+ | • Symmetric INT1-4 • INT1-3 with dynamic zero point |
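One way to read the integer and floating-point limits in the matrix is as an exact-representability rule: a weight type is usable when every weight code can be held exactly in the activation type. The sketch below encodes that reading and reproduces the table rows. It is an interpretation rather than Humming's own logic, and it assumes that symmetric INTn spans `-(2**(n-1) - 1) .. 2**(n-1) - 1` and that INTn with a zero point is stored as an unsigned `0 .. 2**n - 1` code. All names in the block are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activation:
    name: str
    is_float: bool
    exp_bits: int = 0  # exponent bits (float formats only)
    man_bits: int = 0  # explicit mantissa bits (float formats only)
    int_bits: int = 0  # total bits including sign (integer formats only)


ACTIVATIONS = [
    Activation("FP16 (e5m10)", True, exp_bits=5, man_bits=10),
    Activation("BF16 (e8m7)", True, exp_bits=8, man_bits=7),
    Activation("FP8 (e4m3)", True, exp_bits=4, man_bits=3),
    Activation("FP8 (e5m2)", True, exp_bits=5, man_bits=2),
    Activation("FP4 (e2m1)", True, exp_bits=2, man_bits=1),
    Activation("INT8", False, int_bits=8),
    Activation("INT4", False, int_bits=4),
]

MAX_WEIGHT_BITS = 8  # the library targets weights of at most 8 bits


def exact_magnitude_bits(act: Activation) -> int:
    """Bits of integer magnitude the activation type can hold exactly."""
    if act.is_float:
        return act.man_bits + 1  # explicit mantissa plus the implicit leading one
    return act.int_bits - 1      # signed integer: one bit is the sign


def max_symmetric_int_bits(act: Activation) -> int:
    # Symmetric INTn needs n - 1 magnitude bits
    return min(MAX_WEIGHT_BITS, exact_magnitude_bits(act) + 1)


def max_zero_point_int_bits(act: Activation) -> int:
    # Unsigned INTn codes need n magnitude bits
    return min(MAX_WEIGHT_BITS, exact_magnitude_bits(act))


def fp_weight_fits(act: Activation, k_exp: int, k_man: int) -> bool:
    """Check a signed FP weight with k_exp exponent and k_man mantissa bits."""
    if not act.is_float:
        return False
    k_bits = 1 + k_exp + k_man  # sign + exponent + mantissa
    return (k_bits <= MAX_WEIGHT_BITS
            and k_exp <= act.exp_bits
            and k_man <= act.man_bits)


if __name__ == "__main__":
    for act in ACTIVATIONS:
        print(f"{act.name:13s} symmetric INT1-{max_symmetric_int_bits(act)}, "
              f"zero-point INT1-{max_zero_point_int_bits(act)}")
```

For example, `fp_weight_fits` accepts an e3m2 weight for FP8 (e4m3) activations but rejects an e5m2 weight, because its exponent field is wider than the activation's.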
Getting Started
Installation
pip install git+https://github.com/inclusionAI/humming.git
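Because the kernels are JIT-compiled, both dependencies need to be present at run time, not only at install time. A minimal preflight check might look like the sketch below. It assumes that NVCC is discovered through `PATH`, which this document does not confirm (Humming may locate the CUDA toolkit differently), and the function name `check_prerequisites` is hypothetical.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report whether PyTorch is importable and an nvcc binary is on PATH."""
    return {
        "torch": importlib.util.find_spec("torch") is not None,
        "nvcc": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```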
Usage Example
import torch
from humming.layer import HummingLayer
layer = HummingLayer(
    shape_n=8192,
    shape_k=8192,
    weight_config={"dtype": "int6"},
    torch_dtype=torch.float16,
).cuda()
weight = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
inputs = torch.randn((128, 8192), dtype=torch.float16, device="cuda:0")
# Load unquantized weight and quantize to layer quantization format
layer.load_from_unquantized(weight)
# Transform weight to humming format and prepare default kernels
layer.transform()
# Run quantized GEMM (tuning_config is optional, auto-selected by default)
output = layer(inputs)
print("Quantized GEMM Output:")
print(output)
print("\nReference Output:")
print(inputs.matmul(weight.T))
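The two printed outputs will not match exactly, because the weights pass through a 6-bit representation before the GEMM. The pure-Python sketch below illustrates where that gap comes from, using generic round-to-nearest symmetric quantization with a single scale per weight row. This scheme is an assumption chosen for illustration; the example above does not state which quantization strategy `{"dtype": "int6"}` selects, and the helper names are hypothetical.

```python
import random


def quantize_symmetric(row, bits):
    """Quantize one weight row to symmetric signed codes with a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6-bit weights
    # Map the largest magnitude onto qmax; fall back to 1.0 for an all-zero row
    scale = max(abs(w) for w in row) / qmax or 1.0
    # Round to the nearest code and clamp to the symmetric range
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to real-valued weights."""
    return [c * scale for c in codes]


random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(8192)]  # one output channel of the weight
x = [random.gauss(0.0, 1.0) for _ in range(8192)]    # one input vector

codes, scale = quantize_symmetric(row, bits=6)
row_q = dequantize(codes, scale)

exact = sum(a * w for a, w in zip(x, row))     # reference dot product
approx = sum(a * w for a, w in zip(x, row_q))  # dot product with quantized weights
print(f"scale={scale:.4f} exact={exact:.3f} quantized={approx:.3f} "
      f"abs_err={abs(exact - approx):.3f}")
```

Each weight moves by at most half a quantization step (`scale / 2`), so the output error grows with the reduction length `shape_k` and shrinks with finer-grained scales.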
Acknowledgement
This project is heavily inspired by:
- DeepGEMM
- Marlin Kernel and vLLM Marlin Kernel
- lmdeploy GEMM kernel
- CUTLASS