
Nested-lattice KV-cache compression for LLM inference: Zamir-Feder D4 and E8 variants with shaping gain over scalar quantisation.

Project description

kakeyalattice

Nested-lattice KV-cache compression for LLM inference.

pip install kakeyalattice

Ready-to-use Python codec that reduces the KV-cache memory footprint of transformer-based LLM inference via a Zamir--Feder nested-lattice quantiser. Two variants are exposed: $D_4$ (block dim 4, $+0.37\,$dB shaping gain over $\mathbb{Z}^4$) and $E_8$ (block dim 8, $+0.65\,$dB over $\mathbb{Z}^8$, $+0.29\,$dB over $D_4$).
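For reference, these figures follow from the standard normalized second moments of the lattices, $G(\mathbb{Z}) = 1/12 \approx 0.0833$, $G(D_4) \approx 0.0766$ and $G(E_8) \approx 0.0717$: the per-dimension shaping gain over the integer lattice is $10\log_{10}\!\big(G(\mathbb{Z})/G(L)\big)$, giving $\approx 0.37\,$dB for $D_4$, $\approx 0.65\,$dB for $E_8$, and hence the $\approx 0.29\,$dB gap between the two variants. (These are textbook lattice-quantisation values; the paper may round slightly differently.)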

Quickstart — raw codec

import torch
from kakeyalattice import V15KakeyaZamirE8GPU

codec = V15KakeyaZamirE8GPU(D=128, q_range=38, device="cuda")

# x: any tensor whose last dim equals D
x = torch.randn(16, 128, device="cuda")
x_reconstructed = codec.roundtrip(x)

# Compression info
codec.bits_per_token_per_head   # 832 for D=128, q_range=38
codec.shaping_gain_db           # 0.65 (E8 vs Z^8)
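
A quick way to gauge reconstruction quality with the API above, continuing from the quickstart variables (the metric and formatting here are illustrative, not part of the package):

# Relative MSE of the roundtrip on the random block above (illustrative only)
err = x - x_reconstructed
rel_mse = (err.pow(2).sum() / x.pow(2).sum()).item()
print(f"relative MSE: {rel_mse:.4e}")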

HuggingFace transformers integration

pip install kakeyalattice[hf]

Drop-in replacement for DynamicCache on any HF causal LM whose head_dim is divisible by 4 (D4) or 8 (E8):

from transformers import AutoModelForCausalLM, AutoTokenizer
from kakeyalattice.hf import KakeyaLatticeCache

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", torch_dtype="bfloat16", device_map="cuda",
)

cache = KakeyaLatticeCache(
    variant="e8", q_range=38,                 # balanced operating point
    num_hidden_layers=model.config.num_hidden_layers,
    head_dim=model.config.head_dim,           # 128 for Qwen2-1.5B
    device="cuda",
)

inputs = tok("Explain nested-lattice quantisation:", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=200, past_key_values=cache)
print(tok.decode(out[0], skip_special_tokens=True))

Typical per-token KV memory vs bf16 baseline:

  • variant="e8" q_range=10 (aggressive): ~3.4× compression
  • variant="e8" q_range=38 (balanced, recommended): ~2.5× compression
  • variant="e8" q_range=152 (near-lossless): ~1.9× compression

What it is (and what it is not)

What it IS

  • A Python, GPU-first reference implementation of the $D_4$/$E_8$ nested-lattice codec from the paper KakeyaLattice: Nested-Lattice KV-Cache Compression with Kakeya-Style Discrete Codebooks.
  • Five engineering levers (unit-norm factorisation, Sylvester--Hadamard rotation, per-vector adaptive $q_\mathrm{max}$, joint scaling, clamp) plus one lattice closest-point search (D4 or E8), all implemented as GPU PyTorch tensor ops; no hand-written CUDA required. The rotation lever is sketched just after this list.
  • Bit-level reproducibility: benchmarks/e8_parity_and_smoke.py pins SHA256 hashes of codec output at fixed seeds, so regressions are caught in CI.
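
The Sylvester--Hadamard rotation lever can be illustrated in isolation. The sketch below builds a Sylvester-construction Hadamard matrix by recursive doubling and applies the orthonormal rotation to a block of vectors; it shows the general technique only and is not the package's internal implementation:

import torch

def sylvester_hadamard(n: int) -> torch.Tensor:
    # Sylvester construction: H_{2k} = [[H_k, H_k], [H_k, -H_k]]; n must be a power of two
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5   # orthonormal: H @ H.T == I

D = 128
H = sylvester_hadamard(D)          # symmetric, so H is its own transpose
x = torch.randn(16, D)
x_rot = x @ H                      # rotation spreads energy more evenly across coordinates
x_back = x_rot @ H                 # exact inverse: H @ H == I for the normalised matrix
print(torch.allclose(x, x_back, atol=1e-4))   # True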

What it is NOT

  • Not a fused decode kernel: each call runs PyTorch-level ops. A Triton-fused E8 closest-point kernel would reduce decode latency by ~3× on H200 but is out of scope for this release.
  • Not a drop-in replacement for FP8 on arbitrary attention kernels: FlashAttention / FlashMLA / paged-attention kernels expect FP8 or BF16 KV. KakeyaLatticeCache stores the codec-roundtripped tensor at the same dtype as the model's KV, which means the memory saving is currently nominal (bytes saved in storage format) unless you also change the cache dtype. A follow-up vLLM integration (see vllm_backend/kakeya_v1_4_snapshot/) changes the cache storage dtype for real HBM savings.
  • Not claiming to beat every quantisation baseline: it beats FP8 per-64-block on DeepSeek-V4-Flash's highly anisotropic KV by ~12% rel-MSE at ~22% fewer bits (see Stage 0.75 findings). On near-Gaussian KV from small dense models it is comparable to or slightly behind well-tuned scalar quantisers.

Head-dim compatibility

| Model family | head_dim | D4 compatible | E8 compatible |
|---|---|---|---|
| LLaMA-3.x | 128 | ✅ | ✅ |
| Qwen2/Qwen3 (hidden-size scale) | 64–256 (must be divisible by 8 for E8) | ✅ (all) | ✅ (128, 256); ✗ (96, 176) |
| Mistral / Mixtral | 128 | ✅ | ✅ |
| DeepSeek-V2/V3 (MLA) | 128 + 64 rope | ✅ | ✅ |
| DeepSeek-V4-Flash (MLA + CSA/HCA) | 512 shared-latent | ✅ | ✅ |
| Gemma-3 / Gemma-4 | 256 (full attn), 256 (sliding) | ✅ | ✅ |

If head_dim % 4 != 0, the codec raises ValueError by design (no silent fallback). Models like legacy GPT-NeoX variants with head_dim = 96 need a different block structure.
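
A minimal pre-flight check before constructing the cache, assuming only the divisibility rule stated above (the "d4" variant string is inferred from the "e8" naming shown earlier, and the head_dim fallback is the usual hidden_size / num_attention_heads convention):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2-1.5B")
# Some configs expose head_dim directly; otherwise derive it the standard way
head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
variant = "e8" if head_dim % 8 == 0 else "d4" if head_dim % 4 == 0 else None
if variant is None:
    raise ValueError(f"head_dim={head_dim} is not divisible by 4; codec unsupported")
print(head_dim, variant)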

Reproducibility

pip install kakeyalattice[hf,dev]
git clone https://github.com/FluffyAIcode/LLM-KV--Cache-compress
cd LLM-KV--Cache-compress
pytest benchmarks/e8_parity_and_smoke.py -v   # frozen SHA256 parity
pytest benchmarks/ablation_parity_check.py -v  # 6-variant ablation

benchmarks/frozen_parity.json contains 8 pinned SHA256 hashes. Any change to the codec that breaks bit-level parity fails CI.
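
The parity idea is easy to reproduce in spirit: run the codec on seeded input and hash the output bytes, then compare against a pinned digest. The sketch below is illustrative only; it does not reproduce the actual pinned hashes in benchmarks/frozen_parity.json, and bit-level reproducibility also depends on the GPU/driver stack:

import hashlib
import torch
from kakeyalattice import V15KakeyaZamirE8GPU

torch.manual_seed(0)                              # fixed seed for deterministic input
codec = V15KakeyaZamirE8GPU(D=128, q_range=38, device="cuda")
x = torch.randn(4, 128, device="cuda")
y = codec.roundtrip(x)

# Any bit-level change to the codec output changes this digest
digest = hashlib.sha256(y.detach().float().cpu().numpy().tobytes()).hexdigest()
print(digest)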

Citation

If you use this package, please cite:

@misc{li2026kakeyalattice,
  author = {Li, Allen},
  title  = {KakeyaLattice: Nested-Lattice KV-Cache Compression with
            Kakeya-Style Discrete Codebooks},
  year   = {2026},
  url    = {https://github.com/FluffyAIcode/LLM-KV--Cache-compress/tree/main/reports/paper},
}

License

Apache-2.0. See LICENSE.
