
Training-free KV cache compression via E8 lattice quantization and attention-aware token eviction


NexusQuant

Compress your LLM's KV cache 10-33x. Training-free. One line of code.



Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.

Install

pip install nexusquant-kv
pip install "nexusquant-kv[hf]"  # with HuggingFace transformers

Quickstart

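from nexusquant import nexusquant_evict

# `model` is an already-loaded HuggingFace causal LM; `input_ids` is the tokenized prompt tensor.
# `quality` takes one of the presets listed under "Quality presets": "high", "balanced", or "max".
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)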

Why

| Without NexusQuant                     | With NexusQuant                    |
|----------------------------------------|------------------------------------|
| 128K context → 80 GB KV cache          | 128K context → 5 GB KV cache (17x) |
| OOM at 32K on a single A100            | 500K+ tokens on one A100           |
| Needs 8× A100 cluster for long context | Single GPU, single machine         |
| Deploy a fine-tuned retrieval model    | One "with" block, no code changes  |

Quality presets

Measured on Mistral-7B, A100, FP16. Compression ratios include all overhead (scales, indices, metadata).

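| Preset   | Compression | PPL degradation | Context on 80 GB |
|----------|-------------|-----------------|------------------|
| high     | 10x         | +0.4%           | ~1.3M tokens     |
| balanced | 17x         | +1.3%           | ~2.2M tokens     |
| max      | 33x         | +2.6%           | ~4.2M tokens     |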

Validated on Mistral-7B, TinyLlama-1.1B, Llama-3-8B across academic, technical, and creative text.
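
The "Context on 80 GB" column looks consistent with scaling the uncompressed baseline from the "Why" table (128K tokens in roughly 80 GB) by each preset's ratio. The sketch below checks that arithmetic. It assumes the cache grows linearly with token count, reads 128K as 128,000, and ignores memory used by weights and activations; `tokens_on_80gb` is a hypothetical helper, not part of the library.

```python
# Assumption: uncompressed KV cache is linear in tokens, with 128K tokens ~ 80 GB
# (taken from the "Why" table). Weights and activations are ignored.
BASELINE_TOKENS = 128_000  # tokens that fit in 80 GB uncompressed (assumed)
PRESETS = {"high": 10, "balanced": 17, "max": 33}  # ratios from the presets table


def tokens_on_80gb(ratio: float) -> int:
    """Tokens that fit in 80 GB if the cache shrinks by `ratio`."""
    return int(BASELINE_TOKENS * ratio)


capacity = {name: tokens_on_80gb(ratio) for name, ratio in PRESETS.items()}
for name, tokens in capacity.items():
    print(f"{name:>8}: ~{tokens / 1e6:.1f}M tokens")
```

Under these assumptions the three presets round to the 1.3M, 2.2M, and 4.2M figures in the table.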

How it works

  1. Importance scoring — rank tokens by cross-head attention weight (key-key dot product)
  2. Token eviction — drop lowest-scoring tokens; always keep BOS and a recent sliding window
  3. RoPE removal — undo rotary embeddings on keys so they share a common subspace, reducing quantization error ~0.7 pp
  4. Hadamard rotation — spread energy uniformly across dimensions so no outlier dominates the quantization scale
  5. E8 lattice quantization — quantize 8-float groups onto the E8 root lattice (densest sphere packing in 8D), 2 bits/dim
  6. Delta coding + zstd — consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x on the index stream
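
Steps 4 and 5 above can be illustrated with a short NumPy sketch. It is not the library's implementation: it uses the textbook nearest-point search for E8 (E8 as the union of D8 and D8 shifted by 1/2), an unbounded lattice rather than a 2 bits/dim codebook, and a hand-picked `scale`. All function names here are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def nearest_d8(x: np.ndarray) -> np.ndarray:
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # Odd parity: re-round the coordinate with the largest rounding error the other way.
        i = int(np.argmax(np.abs(x - r)))
        r[i] += 1.0 if x[i] > r[i] else -1.0
    return r


def nearest_e8(x: np.ndarray) -> np.ndarray:
    """Nearest point of E8, taken as the closer of the D8 and D8 + 1/2 candidates."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b


def quantize_dequantize(vec: np.ndarray, scale: float) -> np.ndarray:
    """Rotate, snap each 8-float group to E8, then undo the scale and rotation."""
    H = hadamard(vec.size)
    groups = (H @ vec).reshape(-1, 8) / scale
    snapped = np.stack([nearest_e8(g) for g in groups])
    return H.T @ (snapped * scale).reshape(-1)


rng = np.random.default_rng(0)
key = rng.normal(size=64)      # one synthetic key vector (head_dim = 64)
key[3] = 12.0                  # an outlier dimension that the rotation spreads out
recon = quantize_dequantize(key, scale=0.25)
print("relative error:", np.linalg.norm(recon - key) / np.linalg.norm(key))
```

Because the rotation is orthonormal, the reconstruction error in the original space equals the quantization error in the rotated space, so the rotation only changes how the error is distributed across dimensions.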

Token eviction reduces the token count: evicting 60% keeps 40% of tokens, a 1 / 0.4 = 2.5x reduction. E8 quantization reduces precision (~7x after entropy coding). Combined: 2.5 × 7 = 17.5, quoted as 17x.
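
The entropy-coding stage (step 6) relies on neighbouring tokens producing similar indices. The sketch below shows the idea on synthetic data only. It assumes a slowly drifting integer index stream and uses `zlib` from the standard library as a stand-in for zstd, so the sizes it prints say nothing about real caches; `encode` and `decode` are hypothetical helpers.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a lattice index stream: 4096 tokens x 8 coordinates,
# where each token differs from the previous one by at most one step per coordinate.
steps = rng.integers(-1, 2, size=(4096, 8))  # values in {-1, 0, 1}
indices = np.cumsum(steps, axis=0).astype(np.int16)


def encode(idx: np.ndarray) -> bytes:
    """Delta-code along the token axis, then compress (zlib here, not zstd)."""
    first = np.zeros((1, idx.shape[1]), dtype=idx.dtype)
    deltas = np.diff(idx, axis=0, prepend=first)
    return zlib.compress(deltas.tobytes(), level=9)


def decode(blob: bytes, n_coords: int) -> np.ndarray:
    """Invert encode(): decompress, then cumulative-sum the deltas."""
    deltas = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(-1, n_coords)
    return np.cumsum(deltas, axis=0, dtype=np.int16)


raw_blob = zlib.compress(indices.tobytes(), level=9)
delta_blob = encode(indices)
restored = decode(delta_blob, n_coords=8)

print("compressed without deltas:", len(raw_blob), "bytes")
print("compressed with deltas:   ", len(delta_blob), "bytes")
```

Delta coding is lossless, so this stage adds compression without adding quantization error.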

Compared to

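| Method              | Compression | PPL degradation | Training required          |
|---------------------|-------------|-----------------|----------------------------|
| NexusQuant          | 10-33x      | +0.4-2.6%       | No                         |
| TurboQuant (Google) | ~5-6x       | ~0%             | No                         |
| KVTC (NVIDIA)       | up to 20x   | <1%             | Yes (calibration, ~10 min) |
| CommVQ (Apple)      | ~8x         | ~0%             | Yes (full retraining)      |
| Palu                | 11x         | ~25% rel        | Yes (calibration)          |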

Among the methods compared here, NexusQuant reaches the highest compression without training or calibration. KVTC achieves comparable ratios with better quality but requires calibration data. Competitor numbers are taken from their published papers and were not reproduced on our hardware.

Supported models

Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):

  • Llama family (Llama-2, Llama-3, Llama-3.1)
  • Mistral / Mixtral
  • Qwen
  • Phi
  • Gemma

Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
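
The layout matters because RoPE removal (step 3 in "How it works") has to undo exactly the rotation the model applied. The NumPy sketch below illustrates the two pairing conventions and shows that a split-half inverse does not recover keys rotated with the interleaved layout. It is a standalone illustration with hypothetical function names and the common base of 10000, not code from the library.

```python
import numpy as np


def rope_angles(positions: np.ndarray, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angle per (position, frequency); shape (seq_len, head_dim // 2)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)


def apply_rope_split_half(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Split-half layout: dimension i is paired with dimension i + head_dim // 2."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)


def remove_rope_split_half(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Undo split-half RoPE by rotating through the negated angles."""
    return apply_rope_split_half(x, -angles)


def apply_rope_interleaved(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Interleaved layout: dimension 2i is paired with dimension 2i + 1."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x2 * cos + x1 * sin
    return out


rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 64))                      # (seq_len, head_dim), synthetic
angles = rope_angles(np.arange(16), head_dim=64)

recovered = remove_rope_split_half(apply_rope_split_half(keys, angles), angles)
mismatched = remove_rope_split_half(apply_rope_interleaved(keys, angles), angles)

print("split-half round trip exact:", np.allclose(recovered, keys))
print("interleaved keys recovered: ", np.allclose(mismatched, keys))
```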

Limitations

  • Quality is text-dependent. Creative/narrative text degrades more than structured/technical text at the same compression ratio. Test on your actual workload before deploying.
  • Short prefixes hurt. Prefixes under 500 tokens see more degradation than the numbers above, which were measured at 1600-3500 tokens. The importance scorer needs enough tokens to distinguish signal from noise.
  • E8 quantization is CPU-bound. A production deployment needs Triton/CUDA kernels for the quantization step. The current implementation writes dequantized values back to the cache for compatibility — actual GPU memory savings require native compact storage.
  • Eviction is permanent. Evicted tokens are gone. If your task requires precise recall of a specific token, measure eviction sensitivity on that task first.
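
To make the eviction limitation concrete, the selection rule from "How it works" (drop the lowest-scoring tokens, always keep BOS and a recent window) can be sketched as a keep-mask. The scores here are random placeholders rather than the library's importance scorer, and `keep_mask`, `evict_frac`, and `window` are hypothetical names.

```python
import numpy as np


def keep_mask(scores: np.ndarray, evict_frac: float, window: int) -> np.ndarray:
    """Boolean mask of tokens to keep.

    Token 0 (BOS) and the last `window` tokens are always kept; the remaining
    budget goes to the highest-scoring tokens.
    """
    n = scores.shape[0]
    protected = np.zeros(n, dtype=bool)
    protected[0] = True
    protected[max(n - window, 0):] = True
    # Never keep fewer tokens than the protected set requires.
    n_keep = max(int(round(n * (1.0 - evict_frac))), int(protected.sum()))
    ranked = np.where(protected, np.inf, scores)  # protected tokens sort first
    keep_idx = np.argsort(-ranked, kind="stable")[:n_keep]
    mask = np.zeros(n, dtype=bool)
    mask[keep_idx] = True
    return mask


rng = np.random.default_rng(0)
scores = rng.random(10)              # placeholder importance scores for 10 tokens
keys = rng.normal(size=(10, 4))      # synthetic cached keys, one row per token

mask = keep_mask(scores, evict_frac=0.6, window=2)
kept_keys = keys[mask]               # evicted rows are not stored anywhere after this

print("kept positions:", np.flatnonzero(mask).tolist())
print("cache rows before/after:", keys.shape[0], kept_keys.shape[0])
```

Once the compacted cache replaces the original, nothing about an evicted position can be recovered, which is why a task that depends on one specific token should be tested at the eviction rate you plan to use.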

Citation

@software{nexusquant2025,
  author  = {Marques, Joao},
  title   = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization},
  year    = {2025},
  url     = {https://github.com/jagmarques/nexusquant},
  license = {Apache-2.0},
}

License

Apache 2.0. See LICENSE.
