Quantization GEMM Kernel
Humming
Humming is a high-performance, lightweight, and highly flexible JIT (Just-In-Time) compiled GEMM kernel library specifically designed for quantized inference.
Key Features
- High Flexibility
  - Supports inference with any weight type of 8 bits or fewer across FP16 / BF16 / FP8 / FP4 / INT8 / INT4 activations (provided the activation's dynamic range covers the weight type).
  - Supports various quantization strategies.
  - Supports various scale types (BF16, FP16, E4M3, E5M2, and UE8M0).
  - Supports both dense GEMM and MoE GEMM.
- High Compatibility: supports all NVIDIA GPUs from SM75 (Turing architecture) onward.
- High Performance
  - Delivers state-of-the-art (SOTA) throughput and efficiency across a wide range of computational scenarios.
- Ultra-Lightweight
  - Minimal dependencies: requires only PyTorch and NVCC.
  - Compact footprint: the package is just over 100 KB in size.
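The SM75 requirement can be checked before any kernel is compiled. The sketch below is a hypothetical helper, not part of Humming's API: it converts a CUDA compute capability `(major, minor)` tuple, such as the one returned by PyTorch's `torch.cuda.get_device_capability()`, into an SM number and compares it with a minimum.

```python
def sm_number(capability):
    """Convert a (major, minor) compute capability into an SM number, e.g. (7, 5) -> 75."""
    major, minor = capability
    return major * 10 + minor


def meets_min_sm(capability, min_sm=75):
    """Return True if the device is at least as new as `min_sm` (75 = Turing)."""
    return sm_number(capability) >= min_sm


# On a real machine the tuple would come from PyTorch:
#     import torch
#     capability = torch.cuda.get_device_capability()
capability = (8, 9)  # example value standing in for an SM89 device
print(sm_number(capability), meets_min_sm(capability))
```

The same helper works for the higher per-activation minimums by passing a different `min_sm`.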
Support Matrix
| Activation Type | Supported Devices | Supported Weight Types |
|---|---|---|
| FP16 (e5m10) | SM75+ | • Symmetric INT1-8 • INT1-8 with dynamic zero point • Arbitrary signed FP (kBits ≤ 8, kExp ≤ 5) |
| BF16 (e8m7) | SM80+ | • Symmetric INT1-8 • INT1-8 with dynamic zero point • Arbitrary signed FP (kBits ≤ 8) |
| FP8 (e4m3) | SM89+ | • Symmetric INT1-5 • INT1-4 with dynamic zero point • Arbitrary signed FP (kExp ≤ 4, kMan ≤ 3) |
| FP8 (e5m2) | SM89+ | • Symmetric INT1-4 • INT1-3 with dynamic zero point • Arbitrary signed FP (kExp ≤ 5, kMan ≤ 2) |
| FP4 (e2m1) | SM120+ | • Symmetric INT1-3 • INT1-2 with dynamic zero point • Arbitrary signed FP (kExp ≤ 2, kMan ≤ 1) |
| INT8 | SM75+ | • Symmetric INT1-8 • INT1-7 with dynamic zero point |
| INT4 | SM80+ | • Symmetric INT1-4 • INT1-3 with dynamic zero point |
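The weight-type limits in the matrix can be transcribed into a small lookup for validating a configuration up front. This is only a sketch: the dictionary keys and function names are hypothetical and not part of Humming's API, and it assumes that `kBits` counts one sign bit plus the exponent and mantissa bits.

```python
# Limits transcribed from the support matrix above.
# int_sym / int_zp: maximum bit width for symmetric INT weights and for
# INT weights with a dynamic zero point.
# fp: limits for signed FP weights; None means the row lists no FP weights.
SUPPORT = {
    "fp16":     {"int_sym": 8, "int_zp": 8, "fp": {"bits": 8, "exp": 5}},
    "bf16":     {"int_sym": 8, "int_zp": 8, "fp": {"bits": 8}},
    "fp8_e4m3": {"int_sym": 5, "int_zp": 4, "fp": {"exp": 4, "man": 3}},
    "fp8_e5m2": {"int_sym": 4, "int_zp": 3, "fp": {"exp": 5, "man": 2}},
    "fp4_e2m1": {"int_sym": 3, "int_zp": 2, "fp": {"exp": 2, "man": 1}},
    "int8":     {"int_sym": 8, "int_zp": 7, "fp": None},
    "int4":     {"int_sym": 4, "int_zp": 3, "fp": None},
}


def int_weight_supported(activation, bits, zero_point=False):
    """Check an INT weight width against the row for `activation`."""
    limit = SUPPORT[activation]["int_zp" if zero_point else "int_sym"]
    return 1 <= bits <= limit


def fp_weight_supported(activation, exp_bits, man_bits):
    """Check a signed FP weight format against the row for `activation`."""
    limits = SUPPORT[activation]["fp"]
    if limits is None:
        return False
    total_bits = 1 + exp_bits + man_bits  # assumption: sign + exponent + mantissa
    # A limit that the row does not state is treated as non-binding.
    return (
        total_bits <= limits.get("bits", total_bits)
        and exp_bits <= limits.get("exp", exp_bits)
        and man_bits <= limits.get("man", man_bits)
    )


print(int_weight_supported("fp16", 6))           # INT6 weights, FP16 activations
print(fp_weight_supported("fp8_e4m3", 2, 1))     # e2m1 weights, e4m3 activations
print(fp_weight_supported("fp8_e5m2", 4, 3))     # e4m3 weights, e5m2 activations
```

Under these assumptions the first two checks pass and the third fails, because a 3-bit mantissa exceeds the `kMan ≤ 2` limit of the e5m2 row.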
Getting Started
Installation
pip install git+https://github.com/inclusionAI/humming.git
Usage Example
import torch
from humming.layer import HummingLayer
layer = HummingLayer(
    shape_n=8192,
    shape_k=8192,
    weight_config={"dtype": "int6"},
    torch_dtype=torch.float16,
).cuda()
weight = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
inputs = torch.randn((128, 8192), dtype=torch.float16, device="cuda:0")
# Load unquantized weight and quantize to layer quantization format
layer.load_from_unquantized(weight)
# Transform weight to humming format and prepare default kernels
layer.transform()
# Run quantized GEMM (tuning_config is optional, auto-selected by default)
output = layer(inputs)
print("Quantized GEMM Output:")
print(output)
print("\nReference Output:")
print(inputs.matmul(weight.T))
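The two printed results will not match exactly, because the weights are rounded to a low-bit grid before the GEMM runs. Humming's actual quantization scheme is not described here; purely as an illustration, the sketch below applies a generic symmetric INT6 quantization with one scale per row, in plain Python. The function names and the per-row scale are assumptions, not Humming's implementation.

```python
def quantize_symmetric(row, bits=6):
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [q * scale for q in codes]


row = [0.62, -1.55, 0.03, 0.91]
codes, scale = quantize_symmetric(row)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-element error by half a step.
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, scale)
print(max_error <= scale / 2 + 1e-12)
```

Each weight moves by at most half a quantization step, which is the source of the small differences between the quantized output and `inputs.matmul(weight.T)`.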
Acknowledgement
This project is heavily inspired by:
- DeepGEMM
- Marlin Kernel and vLLM Marlin Kernel
- lmdeploy GEMM kernel
- CUTLASS
File details
Details for the file humming_kernels-0.1.2.tar.gz.
File metadata
- Download URL: humming_kernels-0.1.2.tar.gz
- Upload date:
- Size: 117.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7894c80061c7866591bef12617da720ac4e925636ffc99464af433a5dcb035eb |
| MD5 | 308409f7a2d09f64f8c47e22c76d6f65 |
| BLAKE2b-256 | 06f4e141f45697b7d0d38bfaf8766a7362d8f0136e3cff2620624f24f68e2700 |
Provenance
The following attestation bundles were made for humming_kernels-0.1.2.tar.gz:
Publisher: publish.yml on inclusionAI/humming
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: humming_kernels-0.1.2.tar.gz
- Subject digest: 7894c80061c7866591bef12617da720ac4e925636ffc99464af433a5dcb035eb
- Sigstore transparency entry: 1615355595
- Sigstore integration time:
- Permalink: inclusionAI/humming@0fe5db59db346c170a52d8fe2a2942e08b970679
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/inclusionAI
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0fe5db59db346c170a52d8fe2a2942e08b970679
- Trigger Event: release
File details
Details for the file humming_kernels-0.1.2-py3-none-any.whl.
File metadata
- Download URL: humming_kernels-0.1.2-py3-none-any.whl
- Upload date:
- Size: 161.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7434b0424946445ef5ad5682bcabf309d97721818ed5bdc4c6f61de3c6b9d2f |
| MD5 | 956a38a50eb771ff8c0013f93bed491d |
| BLAKE2b-256 | 6d41288bf756d921dbe98982eeb3ec4c20e7cb5224ea6dcb164f2df3d2f68a7f |
Provenance
The following attestation bundles were made for humming_kernels-0.1.2-py3-none-any.whl:
Publisher: publish.yml on inclusionAI/humming
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: humming_kernels-0.1.2-py3-none-any.whl
- Subject digest: f7434b0424946445ef5ad5682bcabf309d97721818ed5bdc4c6f61de3c6b9d2f
- Sigstore transparency entry: 1615355606
- Sigstore integration time:
- Permalink: inclusionAI/humming@0fe5db59db346c170a52d8fe2a2942e08b970679
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/inclusionAI
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0fe5db59db346c170a52d8fe2a2942e08b970679
- Trigger Event: release