Skip to main content

Quantization GEMM Kernel

Project description

Humming

Humming is a high-performance, lightweight, and highly flexible JIT (Just-In-Time) compiled GEMM kernel library specifically designed for quantized inference.

Key Features

  • High Flexibility
    • Supports inference for any weight type under 8-bit across FP16 / BF16 / FP8 / FP4 / INT8 / INT4 activations (provided the activation's dynamic range covers the weight type).
    • Supports various quantization strategies.
    • Supports various scale types (BF16, FP16, E4M3, E5M2, and UE8M0).
    • Supports both Dense GEMM and MoE GEMM.
  • High Compatibility: supports all NVIDIA GPUs from SM75+ (Turing architecture) and beyond.
  • High Performance
    • Delivers State-of-the-Art (SOTA) throughput and efficiency across a wide range of computational scenarios.
  • Ultra-Lightweight
    • Minimal dependencies: Requires only PyTorch and NVCC.
    • Compact footprint: The package size is only 100+KB.

Support Matrix

Activation Type Supported Devices Supported Weight Types
FP16 (e5m10) SM75+ • Symmetric INT1-8
• INT1-8 with dynamic zero point
• Arbitrary signed FP (kBits ≤ 8, kExp ≤ 5)
BF16 (e8m7) SM80+ • Symmetric INT1-8
• INT1-8 with dynamic zero point
• Arbitrary signed FP (kBits ≤ 8)
FP8 (e4m3) SM89+ • Symmetric INT1-5
• INT1-4 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 4, kMan ≤ 3)
FP8 (e5m2) SM89+ • Symmetric INT1-4
• INT1-3 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 5, kMan ≤ 2)
FP4 (e2m1) SM120+ • Symmetric INT1-3
• INT1-2 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 2, kMan ≤ 1)
INT8 SM75+ • Symmetric INT1-8
• INT1-7 with dynamic zero point
INT4 SM80+ • Symmetric INT1-4
• INT1-3 with dynamic zero point

Getting Started

Installation

pip install git+https://github.com/inclusionAI/humming.git

Usage Example

import torch
from humming.layer import HummingLayer

layer = HummingLayer(
    shape_n=8192,
    shape_k=8192,
    weight_config={"dtype": "int6"},
    torch_dtype=torch.float16,
).cuda()

weight = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
inputs = torch.randn((128, 8192), dtype=torch.float16, device="cuda:0")

# Load unquantized weight and quantize to layer quantization format
layer.load_from_unquantized(weight)
# Transform weight to humming format and prepare default kernels
layer.transform()

# Run quantized GEMM (tuning_config is optional, auto-selected by default)
output = layer(inputs)

print("Quantized GEMM Output:")
print(output)
print("\nReference Output:")
print(inputs.matmul(weight.T))

Acknowledgement

This project is highly inspired by

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

humming_kernels-0.1.3.tar.gz (117.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

humming_kernels-0.1.3-py3-none-any.whl (161.3 kB view details)

Uploaded Python 3

File details

Details for the file humming_kernels-0.1.3.tar.gz.

File metadata

  • Download URL: humming_kernels-0.1.3.tar.gz
  • Upload date:
  • Size: 117.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for humming_kernels-0.1.3.tar.gz
Algorithm Hash digest
SHA256 b16292d89ce15fee187de638cfba70637a1317e604d0040efc1b7891879f2e8d
MD5 a4e035dc30e65653b91378de7de45c06
BLAKE2b-256 b0610b4393184f165e75b727487104aa33cf06a9897fff5eae09386d1a5ec9e3

See more details on using hashes here.

Provenance

The following attestation bundles were made for humming_kernels-0.1.3.tar.gz:

Publisher: publish.yml on inclusionAI/humming

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file humming_kernels-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: humming_kernels-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 161.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for humming_kernels-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 1c7ffedac0959bbabe9e8def3340839104633328aa1d8d6c5f7a4f12ad5ed983
MD5 02c7815f259957480c14d6e47e63abf5
BLAKE2b-256 d9ab315515e7c087ee4b0d4c2ddf56bdf29c33b46d480376cd0cb3cd558441e6

See more details on using hashes here.

Provenance

The following attestation bundles were made for humming_kernels-0.1.3-py3-none-any.whl:

Publisher: publish.yml on inclusionAI/humming

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page