Skip to main content

Quantization GEMM Kernel

Project description

Humming

Humming is a high-performance, lightweight, and highly flexible JIT (Just-In-Time) compiled GEMM kernel library specifically designed for quantized inference.

Key Features

  • High Flexibility
    • Supports inference for any weight type under 8-bit across FP16 / BF16 / FP8 / FP4 / INT8 / INT4 activations (provided the activation's dynamic range covers the weight type).
    • Supports various quantization strategies.
    • Supports various scale types (BF16, FP16, E4M3, E5M2, and UE8M0).
    • Supports both Dense GEMM and MoE GEMM.
  • High Compatibility: supports all NVIDIA GPUs from SM75+ (Turing architecture) and beyond.
  • High Performance
    • Delivers State-of-the-Art (SOTA) throughput and efficiency across a wide range of computational scenarios.
  • Ultra-Lightweight
    • Minimal dependencies: Requires only PyTorch and NVCC.
    • Compact footprint: The package size is only 100+KB.

Support Matrix

Activation Type Supported Devices Supported Weight Types
FP16 (e5m10) SM75+ • Symmetric INT1-8
• INT1-8 with dynamic zero point
• Arbitrary signed FP (kBits ≤ 8, kExp ≤ 5)
BF16 (e8m7) SM80+ • Symmetric INT1-8
• INT1-8 with dynamic zero point
• Arbitrary signed FP (kBits ≤ 8)
FP8 (e4m3) SM89+ • Symmetric INT1-5
• INT1-4 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 4, kMan ≤ 3)
FP8 (e5m2) SM89+ • Symmetric INT1-4
• INT1-3 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 5, kMan ≤ 2)
FP4 (e2m1) SM120+ • Symmetric INT1-3
• INT1-2 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 2, kMan ≤ 1)
INT8 SM75+ • Symmetric INT1-8
• INT1-7 with dynamic zero point
INT4 SM80+ • Symmetric INT1-4
• INT1-3 with dynamic zero point

Getting Started

Installation

pip install git+https://github.com/inclusionAI/humming.git

Usage Example

import torch
from humming.layer import HummingLayer

layer = HummingLayer(
    shape_n=8192,
    shape_k=8192,
    weight_config={"dtype": "int6"},
    torch_dtype=torch.float16,
).cuda()

weight = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
inputs = torch.randn((128, 8192), dtype=torch.float16, device="cuda:0")

# Load unquantized weight and quantize to layer quantization format
layer.load_from_unquantized(weight)
# Transform weight to humming format and prepare default kernels
layer.transform()

# Run quantized GEMM (tuning_config is optional, auto-selected by default)
output = layer(inputs)

print("Quantized GEMM Output:")
print(output)
print("\nReference Output:")
print(inputs.matmul(weight.T))

Acknowledgement

This project is highly inspired by

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

humming_kernels-0.1.2.tar.gz (117.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

humming_kernels-0.1.2-py3-none-any.whl (161.0 kB view details)

Uploaded Python 3

File details

Details for the file humming_kernels-0.1.2.tar.gz.

File metadata

  • Download URL: humming_kernels-0.1.2.tar.gz
  • Upload date:
  • Size: 117.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for humming_kernels-0.1.2.tar.gz
Algorithm Hash digest
SHA256 7894c80061c7866591bef12617da720ac4e925636ffc99464af433a5dcb035eb
MD5 308409f7a2d09f64f8c47e22c76d6f65
BLAKE2b-256 06f4e141f45697b7d0d38bfaf8766a7362d8f0136e3cff2620624f24f68e2700

See more details on using hashes here.

Provenance

The following attestation bundles were made for humming_kernels-0.1.2.tar.gz:

Publisher: publish.yml on inclusionAI/humming

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file humming_kernels-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: humming_kernels-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 161.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for humming_kernels-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 f7434b0424946445ef5ad5682bcabf309d97721818ed5bdc4c6f61de3c6b9d2f
MD5 956a38a50eb771ff8c0013f93bed491d
BLAKE2b-256 6d41288bf756d921dbe98982eeb3ec4c20e7cb5224ea6dcb164f2df3d2f68a7f

See more details on using hashes here.

Provenance

The following attestation bundles were made for humming_kernels-0.1.2-py3-none-any.whl:

Publisher: publish.yml on inclusionAI/humming

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page