E8 lattice codebook quantization for LLM weights

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

GLQ

Post-training weight quantization for LLMs using E8 lattice codebooks.

GLQ encodes each group of 8 weights as a point of the 8-dimensional E8 lattice via nearest-neighbor lookup. A Randomized Hadamard Transform (RHT) is applied first so that the layer Hessian becomes approximately diagonal, which makes Euclidean nearest-neighbor search near-optimal under the Hessian-weighted proxy loss.
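
To make the role of the RHT concrete, here is a minimal pure-Python sketch (not GLQ's implementation). It applies random sign flips followed by an orthonormal fast Walsh-Hadamard transform to a vector with a single large entry. The function names `fwht`, `rht`, and `inverse_rht` and the toy length of 16 are assumptions made for illustration.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def rht(v, signs):
    """Random sign flips followed by the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, v)])

def inverse_rht(v, signs):
    """The orthonormal Hadamard matrix and the sign flips are each their own inverse."""
    return [s * x for s, x in zip(signs, fwht(v))]

rng = random.Random(0)
n = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

spiky = [0.0] * n
spiky[3] = 4.0  # one outlier carrying all of the magnitude

spread = rht(spiky, signs)
print(max(abs(x) for x in spread))          # largest entry after the transform
print(math.sqrt(sum(x * x for x in spread)))  # Euclidean norm is preserved
```

The single outlier of magnitude 4 becomes 16 entries of magnitude 1 while the Euclidean norm stays at 4, which is the sense in which the transform spreads magnitude evenly.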

Results

SmolLM3-3B-Base on WikiText-2 (128 calibration samples, NVIDIA A10G):

Method	Eff. BPW	Size (MB)	Perplexity	vs bf16	GPU MB	tok/s
bf16	16.00	6150	7.90	1.00x	6151	33.9
GLQ 4-bit	4.00	1538	8.11	1.03x	-	-
AWQ 4-bit	5.60	2152	8.15	1.03x	-	-
QuIP+GPTQ 4-bit	4.76	1829	8.17	1.03x	-	-
GLQ 3-bit	3.00	1153	8.91	1.13x	-	-
QuIP+GPTQ 3-bit	3.70	1423	9.30	1.18x	-	-
GLQ 2-bit	2.00	769	11.35	1.44x	2540	13.4

Mistral-7B-v0.3 on WikiText-2 (16 calibration samples, NVIDIA A10G):

Method	BPW	Perplexity	vs bf16	GPU MB	tok/s
bf16	16	4.20	1.00x	14505	28.1
GLQ 3-bit	3	4.41	1.05x	4436	9.7

Ministral-3-3B-Base-2512 on WikiText-2 (16 calibration samples, NVIDIA A10G):

Method	BPW	Perplexity	vs bf16	GPU MB	tok/s
bf16	16	5.91	1.00x	7348	37.0
GLQ 3-bit	3	6.47	1.09x	3788	11.4

Llama-3.2-3B on WikiText-2 (16 calibration samples, NVIDIA A10G):

Method	BPW	Perplexity	vs bf16	GPU MB	tok/s
bf16	16	6.17	1.00x	6137	37.6
GLQ 3-bit	3	6.78	1.10x	3529	10.8
GLQ 2-bit	2	8.49	1.38x	3526	11.0

SmolLM2-360M on WikiText-2 (128 calibration samples, NVIDIA A10G):

Method	Eff. BPW	Perplexity	vs bf16	GPU MB	tok/s
bf16 baseline	16.00	11.48	1.00x	724	37.6
GLQ 4-bit	4.00	11.82	1.03x	-	-
QuIP+GPTQ 4-bit	4.75	12.06	1.05x	-	-
GLQ 3-bit	3.00	13.38	1.17x	-	-
QuIP+GPTQ 3-bit	3.69	14.84	1.29x	-	-
GLQ 2-bit	2.00	17.70	1.54x	356	15.5
GPTQ 3-bit	9.48	18.61	1.62x	-	-

GLQ uses a single global scale per layer rather than per-group scales, so effective bit widths match the nominal rate exactly. GLQ 2-bit (17.70) beats GPTQ 3-bit (18.61) at less than 1/4 the storage. GLQ 4-bit (11.82) beats QuIP+GPTQ 4-bit (12.06) at lower effective bpw (4.00 vs 4.75).
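
As a quick consistency check on the SmolLM3-3B-Base table, the reported sizes follow from scaling the bf16 size by effective BPW / 16. The sketch below uses only numbers copied from the tables above; the helper name `estimated_size_mb` is an assumption.

```python
BF16_SIZE_MB = 6150  # SmolLM3-3B-Base bf16 size reported in the table

# (effective BPW, reported size in MB), copied from the table
reported = {
    "GLQ 4-bit": (4.00, 1538),
    "AWQ 4-bit": (5.60, 2152),
    "QuIP+GPTQ 4-bit": (4.76, 1829),
    "GLQ 3-bit": (3.00, 1153),
    "QuIP+GPTQ 3-bit": (3.70, 1423),
    "GLQ 2-bit": (2.00, 769),
}

def estimated_size_mb(eff_bpw, bf16_size_mb=BF16_SIZE_MB):
    """Size implied by the effective bits per weight, relative to 16-bit storage."""
    return bf16_size_mb * eff_bpw / 16.0

for name, (bpw, size_mb) in reported.items():
    print(f"{name:16s} estimated {estimated_size_mb(bpw):7.1f} MB, reported {size_mb} MB")

# SmolLM2-360M: GLQ 2-bit (2.00 BPW) versus GPTQ 3-bit (9.48 effective BPW)
storage_ratio = 2.00 / 9.48
print(f"storage ratio: {storage_ratio:.3f}")
```

Every estimate lands within 1 MB of the reported size, and the 2.00 / 9.48 ratio is about 0.21, consistent with the "less than 1/4 the storage" claim.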

How it works

- E8 lattice codebook: 65536 vectors drawn from the first 7 shells of the E8 lattice. Each group of 8 weights maps to a 16-bit index (2 bpw). For 3 or 4 bpw, a second-stage residual codebook adds 8 or 16 more bits per group.
- Randomized Hadamard Transform (RHT): random sign flips followed by a Fast Walsh-Hadamard Transform, applied to both the weights and the Hessian. This spreads weight magnitude evenly across dimensions and makes the diagonal blocks of the Hessian approximately proportional to the identity. After the RHT, Euclidean nearest-neighbor search in the codebook is close to Hessian-optimal.
- LDLQ error feedback: a block-LDL decomposition of the Hessian drives a sequential quantization sweep (like GPTQ, but over 8-dimensional blocks instead of scalar columns). The quantization error from each block propagates forward to correct subsequent blocks.
- Fused Triton inference kernel: on CUDA, a custom Triton kernel reads codebook indices directly from HBM and gathers from the L2-cached codebook (65536 x 8 fp16 = 1 MB) without ever materializing the full weight matrix. GPU memory use therefore shrinks in proportion to the compression ratio.
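
The codebook size and its memory footprint can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard, unshifted E8 lattice with minimum squared norm 2, for which the number of points of squared norm 2m is 240 * sigma_3(m) (the coefficients of the E8 theta series). Whether GLQ counts the origin as a shell, and how it picks exactly 65536 vectors, is not stated here, so treat the shell bookkeeping as an assumption.

```python
def sigma3(m):
    """Sum of the cubes of the positive divisors of m."""
    return sum(d ** 3 for d in range(1, m + 1) if m % d == 0)

def e8_shell_size(m):
    """Number of E8 lattice points with squared norm 2*m (m = 0 is the origin)."""
    return 1 if m == 0 else 240 * sigma3(m)

sizes = [e8_shell_size(m) for m in range(7)]
cumulative = [sum(sizes[:k + 1]) for k in range(len(sizes))]

for m, (size, total) in enumerate(zip(sizes, cumulative)):
    print(f"squared norm {2 * m:2d}: {size:6d} points, {total:6d} cumulative")

# Codebook footprint: 65536 entries x 8 values x 2 bytes (fp16)
codebook_bytes = 65536 * 8 * 2
print(codebook_bytes, "bytes =", codebook_bytes / 2 ** 20, "MiB")
```

Under this counting, the shells up to squared norm 10 hold 56881 points and those up to squared norm 12 hold 117361, so a 65536-entry codebook has to stop partway through the last shell it uses.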

Install

Requires Python 3.10+ and PyTorch 2.0+. Install PyTorch first (pytorch.org), then:

# Full install (includes transformers, datasets, etc. for glq-quantize CLI):
pip install 'glq[quantize]'

# Or minimal install (inference only, no quantization dependencies):
pip install glq

Triton is bundled with PyTorch on CUDA and will be used automatically when available. On CPU, GLQ falls back to a naive dequantize-then-matmul path.

Quickstart

Quantizing a model

Command line

# 2-bit quantization (smallest model, ~1.5x perplexity)
glq-quantize \
    --model HuggingFaceTB/SmolLM2-360M \
    --output ./smollm2-glq-2bpw \
    --bpw 2 \
    --nsamples 128 \
    --device cuda

# 3-bit quantization (good balance of size and quality)
glq-quantize \
    --model HuggingFaceTB/SmolLM2-360M \
    --output ./smollm2-glq-3bpw \
    --bpw 3 \
    --nsamples 128 \
    --device cuda

# 4-bit quantization (near-lossless, ~1.03x perplexity)
glq-quantize \
    --model HuggingFaceTB/SmolLM2-360M \
    --output ./smollm2-glq-4bpw \
    --bpw 4 \
    --nsamples 128 \
    --device cuda

All CLI options:

glq-quantize --help
  --model        HuggingFace model ID or local path (required)
  --output       Output directory for quantized model (required)
  --bpw          Bits per weight: 2, 3, or 4 (default: 2)
  --tune-iters   LDLQ refinement iterations (default: 0)
  --nsamples     Calibration samples from WikiText-2 (default: 16)
  --seqlen       Calibration sequence length (default: 2048)
  --device       cuda or cpu (default: cuda)
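
When scripting sweeps over bit widths, it can help to assemble the documented flags programmatically. The helper below is hypothetical (it is not part of glq); it uses only the flags and defaults listed above and validates the two constrained options before building an argument list.

```python
import shlex

def build_glq_quantize_command(model, output, bpw=2, tune_iters=0,
                               nsamples=16, seqlen=2048, device="cuda"):
    """Return an argv list for glq-quantize using the documented flags."""
    if bpw not in (2, 3, 4):
        raise ValueError("bpw must be 2, 3, or 4")
    if device not in ("cuda", "cpu"):
        raise ValueError("device must be 'cuda' or 'cpu'")
    return [
        "glq-quantize",
        "--model", model,
        "--output", output,
        "--bpw", str(bpw),
        "--tune-iters", str(tune_iters),
        "--nsamples", str(nsamples),
        "--seqlen", str(seqlen),
        "--device", device,
    ]

cmd = build_glq_quantize_command(
    "HuggingFaceTB/SmolLM2-360M", "./smollm2-glq-3bpw", bpw=3, nsamples=128
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```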

Python API

from glq import quantize

quantize(
    model_name="HuggingFaceTB/SmolLM2-360M",
    output_dir="./smollm2-glq-4bpw",
    bpw=4,
    nsamples=128,
    device="cuda",
)

The quantize() function handles the full pipeline: load model, capture Hessians via calibration data, quantize each linear layer with E8+RHT+LDLQ, and save the result as a standard HuggingFace model directory (safetensors + config.json + tokenizer).
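
The reason calibration Hessians matter can be shown on a toy example. The sketch below assumes the layer-wise setup common to GPTQ- and QuIP-style methods, where H is the average outer product of the layer's calibration inputs; GLQ's exact normalization or damping is not described here. With that H, the quadratic proxy loss of a quantized weight row equals its mean squared output error on the calibration data. All function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def accumulate_hessian(activations):
    """H = (1/n) * sum_i x_i x_i^T over the calibration inputs."""
    n = len(activations)
    d = len(activations[0])
    H = [[0.0] * d for _ in range(d)]
    for x in activations:
        for i in range(d):
            for j in range(d):
                H[i][j] += x[i] * x[j] / n
    return H

def proxy_loss(w, w_hat, H):
    """Quadratic proxy (w_hat - w) H (w_hat - w)^T for one weight row."""
    e = [a - b for a, b in zip(w_hat, w)]
    return sum(e[i] * H[i][j] * e[j] for i in range(len(e)) for j in range(len(e)))

def mean_squared_output_error(w, w_hat, activations):
    """Average squared difference between quantized and original outputs."""
    return sum((dot(w_hat, x) - dot(w, x)) ** 2 for x in activations) / len(activations)

activations = [
    [1.0, 2.0, 0.5],
    [0.0, -1.0, 1.5],
    [2.0, 0.5, -1.0],
    [1.0, 1.0, 1.0],
]
w = [0.3, -0.7, 1.2]      # original weight row
w_hat = [0.5, -0.5, 1.0]  # some quantized version of it

H = accumulate_hessian(activations)
print(proxy_loss(w, w_hat, H))
print(mean_squared_output_error(w, w_hat, activations))
```

The two printed values agree up to floating-point rounding, so minimizing the proxy loss is the same as minimizing the layer's output error on the calibration set.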

Loading and running a quantized model

import glq.hf_integration  # registers GLQ with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./smollm2-glq-4bpw",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./smollm2-glq-4bpw")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The import glq.hf_integration line registers GLQ as a quantization method with HuggingFace Transformers. After that, from_pretrained automatically:

- Reads quantization_config.quant_method = "glq" from config.json
- Replaces nn.Linear modules with E8RHTLinear
- Loads the quantized weights (codebook indices + sign vectors)
- Builds the E8 codebook and attaches it to all quantized layers

On CUDA, inference automatically uses the fused Triton kernel. On CPU, it falls back to dequantize-then-matmul.
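
Before loading, it can be useful to confirm that a directory actually holds a GLQ checkpoint. The sketch below reads the quantization_config.quant_method field mentioned above using only the standard library; the helper names are hypothetical and not part of glq or transformers.

```python
import json
from pathlib import Path

def read_quant_method(model_dir):
    """Return quantization_config.quant_method from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")

def is_glq_checkpoint(model_dir):
    """True if config.json declares the GLQ quantization method."""
    return read_quant_method(model_dir) == "glq"

# Example usage (path taken from the quickstart above):
# if is_glq_checkpoint("./smollm2-glq-4bpw"):
#     import glq.hf_integration  # must be imported before from_pretrained
```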

Bit widths

BPW	Encoding	Bits per 8 weights	Scales stored
2	16-bit codebook index	16	Global scale only
3	16-bit primary + 8-bit residual index	24	Global scale + residual scale
4	16-bit primary + 16-bit residual index	32	Global scale + residual scale

All bit widths use a single global scale per layer (no group-size parameter), so effective bit widths match the nominal rate exactly.

For 3/4 bpw, GLQ uses a two-stage residual vector quantization (RVQ): the primary codebook (65536 entries) encodes the bulk of the weight, and a secondary codebook (256 entries for 3 bpw, 65536 for 4 bpw) encodes the residual error scaled by a learned factor.
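
The two-stage scheme and its bit accounting can be illustrated on toy data. The sketch below uses small 2-D codebooks in place of GLQ's 8-D E8 codebooks, and it assumes the residual is multiplied by the residual scale before the second lookup and divided by it on decode (a guess based on the inv_resid_scale argument of the kernel). Function names and the scale value are illustrative.

```python
def index_bits(codebook_size):
    """Bits needed to address a codebook with the given number of entries."""
    return (codebook_size - 1).bit_length()

def bits_per_weight(codebook_sizes, group=8):
    """Total index bits per group divided by the number of weights in a group."""
    return sum(index_bits(k) for k in codebook_sizes) / group

def sq_err(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, v):
    """Index of the codebook row closest to v in squared Euclidean distance."""
    return min(range(len(codebook)), key=lambda i: sq_err(codebook[i], v))

def rvq_encode(v, primary, secondary, resid_scale):
    i1 = nearest(primary, v)
    scaled_residual = [(x - c) * resid_scale for x, c in zip(v, primary[i1])]
    i2 = nearest(secondary, scaled_residual)
    return i1, i2

def rvq_decode(i1, i2, primary, secondary, resid_scale):
    inv = 1.0 / resid_scale
    return [c1 + inv * c2 for c1, c2 in zip(primary[i1], secondary[i2])]

# Toy codebooks: a 3x3 integer grid that contains the zero vector.
grid = [[float(a), float(b)] for a in (-1, 0, 1) for b in (-1, 0, 1)]

v = [0.30, -0.70]
i1, i2 = rvq_encode(v, grid, grid, resid_scale=4.0)
one_stage = grid[i1]
two_stage = rvq_decode(i1, i2, grid, grid, resid_scale=4.0)

print("one stage:", one_stage, "squared error", sq_err(one_stage, v))
print("two stage:", two_stage, "squared error", sq_err(two_stage, v))
print([bits_per_weight(s) for s in ((65536,), (65536, 256), (65536, 65536))])
```

Because the toy secondary codebook contains the zero vector, the second stage can never make the reconstruction worse than the first stage alone. The last line reproduces the 2, 3, and 4 bpw rates from the index sizes.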

Triton inference kernel

The fused Triton kernel (glq/inference_kernel.py) is the core of GLQ's inference performance. It computes Y = X @ dequant(W)^T without materializing the full weight matrix.

How it works

Instead of the naive approach (decode all indices into a dense bf16 matrix, then matmul), the kernel:

- Iterates over N/8 codebook blocks per output row
- Loads int16 indices from HBM and gathers 8-element vectors from the L2-cached codebook
- Accumulates dot products (matvec) or Tensor Core matmuls (prefill) against the gathered codebook vectors
- Applies the global scale factor and writes the output

This means GPU memory holds only the compressed indices (2 bytes per 8 weights) rather than the full fp16 weight matrix (16 bytes per 8 weights) — an 8x reduction at 2 bpw.
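
The difference between the two approaches can be sketched in plain Python. This is a reference for the data flow only, not the Triton kernel: the toy codebook has 2-element vectors instead of 8, and all names are illustrative.

```python
def dense_dequant(indices, codebook, scale):
    """Naive path: expand every index into its vector, giving a full M x N matrix."""
    return [[scale * c for idx in row for c in codebook[idx]] for row in indices]

def naive_matvec(x, indices, codebook, scale):
    W = dense_dequant(indices, codebook, scale)
    return [sum(w * xv for w, xv in zip(row, x)) for row in W]

def fused_matvec(x, indices, codebook, scale):
    """Gather-and-accumulate path: the dense matrix is never stored."""
    group = len(codebook[0])
    y = []
    for row in indices:
        acc = 0.0
        for block, idx in enumerate(row):
            vec = codebook[idx]                            # gather one codebook vector
            xs = x[block * group:(block + 1) * group]      # matching slice of the input
            acc += sum(c * xv for c, xv in zip(vec, xs))   # partial dot product
        y.append(scale * acc)                              # apply the global scale once
    return y

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices = [[3, 1], [2, 0]]   # M = 2 output rows, N / group = 2 blocks each
x = [1.0, 2.0, 3.0, 4.0]
scale = 0.5

print(naive_matvec(x, indices, codebook, scale))
print(fused_matvec(x, indices, codebook, scale))

# Storage per 8 weights: one 16-bit index versus eight fp16 values
bytes_index, bytes_fp16 = 2, 8 * 2
print("reduction:", bytes_fp16 // bytes_index)
```

Both paths produce the same output; only the fused one avoids holding the M x N matrix in memory.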

Kernel variants

- Tensor Core matmul kernel (_glq_dequant_matmul_tc_kernel): For batch sizes >= 2 (prefill). Processes pairs of codebook blocks to form K=16 tiles for tl.dot (mma.m16n8k16 Tensor Core instructions). Autotuned over BLOCK_B and BLOCK_M.
- Matvec kernel (_glq_dequant_matvec_kernel): For B=1 (autoregressive decode). Autotuned over BLOCK_M and num_warps for the memory-bound single-token case.
- Fused RHT kernels (_input_rht_kernel, _output_rht_kernel): Fuse pad + sign vector + Fast Hadamard Transform into single kernel launches, eliminating per-layer Python overhead.

All kernels support two-stage RVQ for 3/4 bpw via a HAS_STAGE2 compile-time constant.

Using the kernel directly

The kernel is used automatically by E8RHTLinear.forward() when running on CUDA with Triton available. You can also call it directly:

from glq.inference_kernel import glq_dequant_matmul

# 2 bpw: single codebook
y = glq_dequant_matmul(
    x,         # (B, N) input activations, fp16/fp32
    Qidxs,     # (M, N//8) codebook indices, int16
    codebook,  # (65536, 8) codebook vectors, fp16
    Wscale,    # float, global scale factor
)

# 3/4 bpw: two-stage with residual codebook
y = glq_dequant_matmul(
    x, Qidxs, codebook, Wscale,
    Qidxs2=Qidxs2,           # (M, N//8) secondary indices, int16
    codebook2=codebook2,     # (K2, 8) secondary codebook, fp16
    inv_resid_scale=inv_rs,  # float, 1.0 / residual_scale
)

glq_dequant_matmul falls back to a naive dequantize-then-matmul path on CPU or when Triton is not available.

Requirements

- CUDA GPU
- Triton (bundled with pip install torch on CUDA, or pip install 'glq[cuda]')
- PyTorch 2.0+

Architecture

glq/
  codebook.py          # E8ShellCodebook: enumeration, encode/decode, make_small()
  hadamard.py          # Fast Walsh-Hadamard Transform
  rht.py               # Randomized Hadamard Transform (sign flips + FHT)
  ldlq.py              # Block-LDL quantization with error feedback
  quantize_model.py    # Full model quantization pipeline + CLI
  quantized_linear.py  # E8RHTLinear: drop-in nn.Linear replacement
  inference_kernel.py  # Fused Triton dequant+matmul kernels
  hf_integration.py    # HuggingFace Transformers integration
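
Of the modules listed above, the error-feedback sweep is the least obvious, so here is a scalar illustration of the idea. It is a sketch under stated assumptions, not the code in ldlq.py: it rounds single weights to integers instead of looking up 8-dimensional E8 blocks, and it assumes the usual LDLQ formulation in which H = M D M^T with M unit upper triangular and the strictly upper part of M supplies the feedback coefficients. All names are illustrative.

```python
def udu_decompose(H):
    """Return (M, D) with H = M diag(D) M^T and M unit upper triangular."""
    n = len(H)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n - 1, -1, -1):
        D[j] = H[j][j] - sum(M[j][k] ** 2 * D[k] for k in range(j + 1, n))
        for i in range(j):
            cross = sum(M[i][k] * M[j][k] * D[k] for k in range(j + 1, n))
            M[i][j] = (H[i][j] - cross) / D[j]
    return M, D

def ldlq_round(w, H):
    """Round one weight row to integers, feeding earlier errors forward."""
    M, _ = udu_decompose(H)
    w_hat = [0.0] * len(w)
    for k in range(len(w)):
        # Errors already committed on columns j < k shift the target for column k.
        feedback = sum((w[j] - w_hat[j]) * M[j][k] for j in range(k))
        w_hat[k] = float(round(w[k] + feedback))
    return w_hat

def proxy_loss(w, w_hat, H):
    e = [a - b for a, b in zip(w_hat, w)]
    return sum(e[i] * H[i][j] * e[j] for i in range(len(e)) for j in range(len(e)))

H = [[2.0, 1.9], [1.9, 2.0]]  # strongly correlated inputs
w = [0.4, 0.4]

nearest = [float(round(x)) for x in w]
with_feedback = ldlq_round(w, H)

print("nearest rounding:", nearest, "loss", proxy_loss(w, nearest, H))
print("with feedback:   ", with_feedback, "loss", proxy_loss(w, with_feedback, H))
```

In this example, plain rounding sends both weights to 0, whereas the feedback pushes the second weight up to 1 to compensate for the first. That gives a much lower proxy loss even though the second weight is individually further from its original value.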

Acknowledgments

- The RHT incoherence approach follows QuIP# (Tseng et al., 2024)
- E8 lattice geometry from Conway & Sloane, Sphere Packings, Lattices and Groups
- LDLQ error feedback from GPTQ (Frantar et al., 2022)

License

Apache 2.0
