Training-free KV cache compression via E8 lattice quantization and attention-aware token eviction
NexusQuant
Compress your LLM's KV cache 10-33x. Training-free. One line of code.
Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.
Install
pip install nexusquant
pip install "nexusquant[hf]" # with HuggingFace transformers
Quickstart
from nexusquant import nexusquant_evict
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
Why
| Without NexusQuant | With NexusQuant |
|---|---|
| 128K context → 80 GB KV cache | 128K context → 5 GB KV cache (17x) |
| OOM at 32K on a single A100 | 500K+ tokens on one A100 |
| Needs 8× A100 cluster for long context | Single GPU, single machine |
| Deploy a fine-tuned retrieval model | One `with` block, no model code changes |
Quality presets
Measured on Mistral-7B, A100, FP16. Compression ratios include all overhead (scales, indices, metadata).
| Preset | Compression | PPL degradation | Context on 80 GB |
|---|---|---|---|
| `high` | 10x | +0.4% | ~1.3M tokens |
| `balanced` | 17x | +1.3% | ~2.2M tokens |
| `max` | 33x | +2.6% | ~4.2M tokens |
Validated on Mistral-7B, TinyLlama-1.1B, Llama-3-8B across academic, technical, and creative text.
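To reproduce a "PPL degradation" figure on your own text, one plausible reading is the relative increase in perplexity over the same continuation tokens, with and without compression. The sketch below assumes that definition; the function names and the per-token loss values are hypothetical and are not part of the NexusQuant API.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_degradation_pct(baseline_nlls, compressed_nlls):
    """Relative perplexity increase of the compressed run, in percent."""
    base = perplexity(baseline_nlls)
    comp = perplexity(compressed_nlls)
    return 100.0 * (comp / base - 1.0)

# Hypothetical per-token NLLs for the same continuation, scored twice:
# once with the uncompressed cache, once with the compressed cache.
baseline = [2.10, 1.95, 2.30, 2.05]
compressed = [2.12, 1.97, 2.31, 2.06]

print(f"PPL degradation: {ppl_degradation_pct(baseline, compressed):+.2f}%")
```

In practice the two NLL lists would come from scoring identical target tokens with the same model, so any difference is attributable to the cache.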
How it works
- Importance scoring — rank tokens by cross-head attention weight (key-key dot product)
- Token eviction — drop lowest-scoring tokens; always keep BOS and a recent sliding window
- RoPE removal — undo rotary embeddings on keys so they share a common subspace, reducing quantization error ~0.7 pp
- Hadamard rotation — spread energy uniformly across dimensions so no outlier dominates the quantization scale
- E8 lattice quantization — quantize 8-float groups onto the E8 root lattice (densest sphere packing in 8D), 2 bits/dim
- Delta coding + zstd — consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x on the index stream
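The rotation and lattice steps above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not NexusQuant's implementation: it uses the standard decomposition E8 = D8 ∪ (D8 + ½) to find the nearest lattice point, an orthonormal Walsh-Hadamard transform for the rotation, and a hypothetical fixed `scale`. It does not restrict points to a finite codebook, so it does not by itself reach 2 bits/dim.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in x]

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [math.floor(v + 0.5) for v in x]
    if sum(r) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding
        # error in the other direction, which is the cheapest possible fix.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1 if x[k] > r[k] else -1
    return r

def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2), for an 8-float group."""
    a = nearest_d8(x)
    b = [c + 0.5 for c in nearest_d8([v - 0.5 for v in x])]
    dist_a = sum((p - q) ** 2 for p, q in zip(x, a))
    dist_b = sum((p - q) ** 2 for p, q in zip(x, b))
    return a if dist_a <= dist_b else b

def is_e8_point(p):
    """True if p is all-integer or all-half-integer with an even coordinate sum."""
    doubled = [2 * c for c in p]
    if any(d != int(d) for d in doubled):
        return False
    same_parity = len({int(d) % 2 for d in doubled}) == 1
    return same_parity and int(sum(p)) % 2 == 0

def quantize(vec, scale):
    """Rotate, then snap each 8-float group to its nearest E8 point."""
    rotated = fwht(vec)
    return [nearest_e8([v / scale for v in rotated[g:g + 8]])
            for g in range(0, len(rotated), 8)]

def dequantize(points, scale):
    """Rescale the lattice points and undo the rotation."""
    return fwht([c * scale for p in points for c in p])

scale = 0.5                   # hypothetical step size, not a NexusQuant default
vec = [8.0] + [0.1] * 15      # one outlier channel dominates the raw vector
rotated = fwht(vec)           # energy is now spread across all 16 channels
codes = quantize(vec, scale)
restored = dequantize(codes, scale)

print("max |x| before/after rotation:",
      max(abs(v) for v in vec), max(abs(v) for v in rotated))
print("max reconstruction error:",
      max(abs(a - b) for a, b in zip(vec, restored)))
```

Because the rotation is orthonormal, the squared reconstruction error equals the squared lattice rounding error, which for E8 at unit scale is at most 1 per 8-float group.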
Token eviction reduces the token count (2.5x at 60% eviction). E8 quantization reduces precision (~7x after entropy coding). Combined: 2.5 × 7 ≈ 17x.
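The arithmetic behind those ratios can be checked with a small eviction sketch. It assumes per-token importance scores are already available; `keep_indices`, the toy scores, and the window size of 32 are hypothetical, and the ~7x quantization factor is simply taken from the figure quoted above.

```python
def keep_indices(scores, evict_frac, recent_window):
    """Indices of tokens kept after eviction.

    Token 0 (BOS) and the last `recent_window` tokens are always kept; the
    rest of the budget goes to the highest-scoring remaining tokens.
    """
    n = len(scores)
    budget = n - round(n * evict_frac)  # number of tokens to keep
    protected = {0} | set(range(max(0, n - recent_window), n))
    candidates = sorted((i for i in range(n) if i not in protected),
                        key=lambda i: scores[i], reverse=True)
    extra = max(0, budget - len(protected))
    return sorted(protected | set(candidates[:extra]))

n = 1000
scores = [(i * 37) % 101 for i in range(n)]  # toy scores, for illustration only
kept = keep_indices(scores, evict_frac=0.6, recent_window=32)

eviction_ratio = n / len(kept)  # 1 / (1 - 0.6) = 2.5
quant_ratio = 7.0               # "~7x after entropy coding", from the text above
print(eviction_ratio, eviction_ratio * quant_ratio)  # 2.5 17.5
```

The product is 17.5 rather than exactly 17 because the 7x factor is itself approximate.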
Compared to
| Method | Compression | PPL degradation | Training required |
|---|---|---|---|
| NexusQuant | 10-33x | +0.4-2.6% | No |
| TurboQuant (Google) | ~5-6x | ~0% | No |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration, ~10 min) |
| CommVQ (Apple) | ~8x | ~0% | Yes (full retraining) |
| Palu | 11x | ~25% rel | Yes (calibration) |
Among the methods listed, NexusQuant reports the highest compression ratio without training or calibration. KVTC achieves comparable ratios with better quality but requires calibration data. Competitor numbers are from their published papers, not reproduced on our hardware.
Supported models
Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):
- Llama family (Llama-2, Llama-3, Llama-3.1)
- Mistral / Mixtral
- Qwen
- Phi
- Gemma
Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
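The layout matters because undoing RoPE requires pairing the same channels that the model rotated. The sketch below contrasts the two conventions on a single 8-dimensional key. It is an illustration only: the function names are hypothetical, and `base=10000.0` is the common default rather than a value read from any particular model config.

```python
import math

def rope_angles(pos, dim, base=10000.0):
    """One rotation angle per channel pair: pos * base^(-2i/dim)."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate_split_half(x, angles):
    """Pair channel i with channel i + dim/2 (Llama-style layout)."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i], out[i + half] = a * c - b * s, a * s + b * c
    return out

def rotate_interleaved(x, angles):
    """Pair channel 2i with channel 2i + 1 (GPT-J-style layout)."""
    out = list(x)
    for i, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

key = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 1.0]
angles = rope_angles(pos=7, dim=len(key))
inverse = [-t for t in angles]  # rotating by the negated angles undoes RoPE

rotated = rotate_split_half(key, angles)
restored = rotate_split_half(rotated, inverse)  # matching layout: recovers the key
wrong = rotate_interleaved(rotated, inverse)    # mismatched layout: does not

print(max(abs(a - b) for a, b in zip(key, restored)))
print(max(abs(a - b) for a, b in zip(key, wrong)))
```

Applying the inverse rotation with the wrong pairing leaves the keys scrambled instead of position-free, which is why a model's RoPE layout has to be known before this step can be applied.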
Limitations
- Quality is text-dependent. Creative/narrative text degrades more than structured/technical text at the same compression ratio. Test on your actual workload before deploying.
- Short prefixes hurt. Prefixes under 500 tokens see more degradation than the numbers above, which were measured at 1600-3500 tokens. The importance scorer needs enough tokens to distinguish signal from noise.
- E8 quantization is CPU-bound. A production deployment needs Triton/CUDA kernels for the quantization step. The current implementation also writes dequantized values back to the cache for compatibility, so actual GPU memory savings require native compact storage.
- Eviction is permanent. Evicted tokens are gone. If your task requires precise recall of a specific token, measure eviction sensitivity on that task first.
Citation
@software{nexusquant2025,
author = {Marques, Joao},
title = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization},
year = {2025},
url = {https://github.com/jagmarques/nexusquant},
license = {Apache-2.0},
}
License
Apache 2.0. See LICENSE.