Quantization GEMM Kernel

Project description

Humming

Humming is a high-performance, lightweight, and highly flexible JIT (Just-In-Time) compiled GEMM kernel library specifically designed for quantized inference.

Key Features

High Flexibility
- Supports inference with any weight type of 8 bits or fewer across FP16 / BF16 / FP8 / FP4 / INT8 / INT4 activations (provided the activation type's dynamic range covers the weight type).
- Supports various quantization strategies.
- Supports several scale types (BF16, FP16, E4M3, E5M2, and UE8M0).
- Supports both dense GEMM and MoE GEMM.
High Compatibility
- Supports all NVIDIA GPUs of SM75 (Turing architecture) and newer.
High Performance
- Delivers state-of-the-art (SOTA) throughput and efficiency across a wide range of computational scenarios.
Ultra-Lightweight
- Minimal dependencies: requires only PyTorch and NVCC.
- Compact footprint: the package is on the order of 100 to 200 kB (the 0.1.1 wheel is 161.0 kB and the source distribution is 202.0 kB).

Support Matrix

| Activation Type | Supported Devices | Supported Weight Types |
| --- | --- | --- |
| FP16 (e5m10) | SM75+ | Symmetric INT1-8; INT1-8 with dynamic zero point; arbitrary signed FP (kBits ≤ 8, kExp ≤ 5) |
| BF16 (e8m7) | SM80+ | Symmetric INT1-8; INT1-8 with dynamic zero point; arbitrary signed FP (kBits ≤ 8) |
| FP8 (e4m3) | SM89+ | Symmetric INT1-5; INT1-4 with dynamic zero point; arbitrary signed FP (kExp ≤ 4, kMan ≤ 3) |
| FP8 (e5m2) | SM89+ | Symmetric INT1-4; INT1-3 with dynamic zero point; arbitrary signed FP (kExp ≤ 5, kMan ≤ 2) |
| FP4 (e2m1) | SM120+ | Symmetric INT1-3; INT1-2 with dynamic zero point; arbitrary signed FP (kExp ≤ 2, kMan ≤ 1) |
| INT8 | SM75+ | Symmetric INT1-8; INT1-7 with dynamic zero point |
| INT4 | SM80+ | Symmetric INT1-4; INT1-3 with dynamic zero point |
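
One way to read the integer limits in the floating-point rows is as an exact-representability condition: every integer weight value must be exactly representable in the activation format. The sketch below brute-forces that condition in plain Python. It is an illustration only, not Humming code: `FLOAT_FORMATS`, `exactly_representable`, and `max_int_bits` are hypothetical names, and treating "INT with dynamic zero point" as an unsigned range is an assumption about how the table should be interpreted.

```python
# Hypothetical helpers for illustration; none of this is part of the Humming API.

# name: (explicit mantissa bits, largest finite value of the format)
FLOAT_FORMATS = {
    "fp16": (10, 65504.0),
    "bf16": (7, (2 - 2**-7) * 2.0**127),
    "e4m3": (3, 448.0),
    "e5m2": (2, 57344.0),
    "e2m1": (1, 6.0),
}


def exactly_representable(value: int, man_bits: int, max_value: float) -> bool:
    """Return True if the integer `value` has an exact encoding in the format."""
    magnitude = abs(value)
    if magnitude == 0:
        return True
    if magnitude > max_value:
        return False
    # Trailing zero bits are absorbed by the exponent; what remains must fit
    # in the implicit leading one plus the explicit mantissa bits.
    while magnitude % 2 == 0:
        magnitude //= 2
    return magnitude.bit_length() <= man_bits + 1


def max_int_bits(fmt: str, signed: bool) -> int:
    """Largest integer width (up to 8 bits) whose full range is exact in `fmt`.

    Assumption: signed=True models "Symmetric INT" as a two's-complement range,
    signed=False models "INT with dynamic zero point" as an unsigned range.
    """
    man_bits, max_value = FLOAT_FORMATS[fmt]
    best = 0
    for bits in range(1, 9):
        if signed:
            lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        else:
            lo, hi = 0, (1 << bits) - 1
        if all(exactly_representable(v, man_bits, max_value) for v in range(lo, hi + 1)):
            best = bits
        else:
            break
    return best


for name in FLOAT_FORMATS:
    print(name, max_int_bits(name, signed=True), max_int_bits(name, signed=False))
```

Under these assumptions the check yields the same integer limits as the FP16, BF16, FP8, and FP4 rows above. That agreement suggests the rule is a reasonable mental model, but it does not confirm how the library actually decides what is supported.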

Getting Started

Installation

pip install git+https://github.com/inclusionAI/humming.git

Usage Example

import torch
from humming.layer import HummingLayer

layer = HummingLayer(
    shape_n=8192,
    shape_k=8192,
    weight_config={"dtype": "int6"},
    torch_dtype=torch.float16,
).cuda()

weight = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
inputs = torch.randn((128, 8192), dtype=torch.float16, device="cuda:0")

# Load unquantized weight and quantize to layer quantization format
layer.load_from_unquantized(weight)
# Transform weight to humming format and prepare default kernels
layer.transform()

# Run quantized GEMM (tuning_config is optional, auto-selected by default)
output = layer(inputs)

print("Quantized GEMM Output:")
print(output)
print("\nReference Output:")
print(inputs.matmul(weight.T))
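
The two printed tensors will not match exactly, because the weight passes through a 6-bit quantization step. To get a feel for the size of that gap without a GPU, the sketch below emulates symmetric INT6 quantization of a single weight row in pure Python and compares the resulting dot product with the unquantized one. It does not call Humming; `quantize_symmetric`, `dequantize`, and `dot` are hypothetical helpers, and the single per-row scale is an assumption (the library's actual grouping and scale layout may differ).

```python
import random


def quantize_symmetric(row, bits):
    """Quantize one row to signed `bits`-bit integer codes with a single scale."""
    qmax = (1 << (bits - 1)) - 1  # 31 for 6-bit weights
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the signed range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
k = 256
weight_row = [random.gauss(0.0, 1.0) for _ in range(k)]
inputs = [random.gauss(0.0, 1.0) for _ in range(k)]

codes, scale = quantize_symmetric(weight_row, bits=6)
reference = dot(inputs, weight_row)
quantized = dot(inputs, dequantize(codes, scale))

print(f"reference={reference:.4f} quantized={quantized:.4f}")
print(f"abs error={abs(reference - quantized):.4f}")
```

With round-to-nearest, each dequantized weight differs from the original by at most half a scale step, so the output error grows with the scale and with the magnitude of the inputs.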

Acknowledgement

This project is heavily inspired by:

- DeepGEMM
- Marlin Kernel and vLLM Marlin Kernel
- lmdeploy GEMM kernel
- CUTLASS

Project details

Release history

- 0.1.2 (May 23, 2026)
- 0.1.1 (May 22, 2026), this version
- 0.1.0 (May 13, 2026)

Download files

Source Distribution

humming_kernels-0.1.1.tar.gz (202.0 kB)

Uploaded May 22, 2026 Source

Built Distribution

humming_kernels-0.1.1-py3-none-any.whl (161.0 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file humming_kernels-0.1.1.tar.gz.

File metadata

Download URL: humming_kernels-0.1.1.tar.gz
Upload date: May 22, 2026
Size: 202.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for humming_kernels-0.1.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e598aac4fbfab7c224097af74102f8243c98fcf4f278caf75b308ecd2062e1f5` |
| MD5 | `293a6ec460bf54e87aa764530ce7895d` |
| BLAKE2b-256 | `9887a88dd957f3ce3b55e97f9d5062c982cd0cb27566d44835a8541c5fb9bd4c` |

File details

Details for the file humming_kernels-0.1.1-py3-none-any.whl.

File metadata

Download URL: humming_kernels-0.1.1-py3-none-any.whl
Upload date: May 22, 2026
Size: 161.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for humming_kernels-0.1.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `dbbbe658191752a9a929c54d6b2dd6e6034a314a0f807ccbb57c0202c1ea177f` |
| MD5 | `3a5d268d3a3c7752b987647261d91c62` |
| BLAKE2b-256 | `db2a586eb474ebfd3d89bb53c625ba02b066308baa03d862b6dfe55cc5de31e6` |
