nexusquant-kv

Training-free KV cache compression via E8 lattice quantization and attention-aware token eviction

NexusQuant

Compress your LLM's KV cache 10-33x. Training-free. One line of code.

Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.

Install

pip install nexusquant-kv
pip install "nexusquant-kv[hf]"  # with HuggingFace transformers

Quickstart

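from nexusquant import nexusquant_evict

# `model` is an already-loaded HuggingFace causal LM; `input_ids` is the tokenized prompt.
# Compression is applied once, after prefill, for generation inside the `with` block.
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)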

Why

Without NexusQuant	With NexusQuant
128K context → 80 GB KV cache	128K context → 5 GB KV cache (17x)
OOM at 32K on a single A100	500K+ tokens on one A100
Needs 8× A100 cluster for long context	Single GPU, single machine
Deploy a fine-tuned retrieval model	One `with` block, no model modifications

Quality presets

Measured on Mistral-7B, A100, FP16. Compression ratios include all overhead (scales, indices, metadata).

Preset	Compression	PPL degradation	Context on 80 GB
`high`	10x	+0.4%	~1.3M tokens
`balanced`	17x	+1.3%	~2.2M tokens
`max`	33x	+2.6%	~4.2M tokens

Validated on Mistral-7B, TinyLlama-1.1B, Llama-3-8B across academic, technical, and creative text.
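The "Context on 80 GB" column follows from scaling the uncompressed baseline by the compression ratio. The sketch below reproduces that arithmetic, assuming the baseline from the "Why" table (128K tokens filling an 80 GB KV cache) and taking 128K as 128,000. The function name `max_context_tokens` is illustrative and not part of the package.

```python
def max_context_tokens(baseline_tokens: int, compression_ratio: float) -> int:
    """Tokens that fit in the same memory once the KV cache shrinks by `compression_ratio`."""
    return int(baseline_tokens * compression_ratio)

# Assumed baseline, taken from the "Why" table: 128K tokens fill 80 GB uncompressed.
BASELINE_TOKENS = 128_000
PRESETS = {"high": 10, "balanced": 17, "max": 33}

capacity = {name: max_context_tokens(BASELINE_TOKENS, ratio)
            for name, ratio in PRESETS.items()}

for name, tokens in capacity.items():
    print(f"{name:>8}: ~{tokens / 1e6:.1f}M tokens")
```

These are capacity estimates derived from the stated ratios, not separate measurements.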

How it works

Importance scoring — rank tokens by cross-head attention weight (key-key dot product)
Token eviction — drop lowest-scoring tokens; always keep BOS and a recent sliding window
RoPE removal — undo rotary embeddings on keys so they share a common subspace, reducing quantization error ~0.7 pp
Hadamard rotation — spread energy uniformly across dimensions so no outlier dominates the quantization scale
E8 lattice quantization — quantize 8-float groups onto the E8 root lattice (densest sphere packing in 8D), 2 bits/dim
Delta coding + zstd — consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x on the index stream
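The Hadamard rotation and E8 quantization steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: it uses the textbook nearest-point search for E8 (E8 as the union of D8 and D8 shifted by 1/2), a single hypothetical `scale` per vector, and it skips the finite 2 bits/dim codebook, index packing, and entropy coding. All function names are made up for the example.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in x]

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [math.floor(v + 0.5) for v in x]
    if sum(r) % 2 != 0:
        # Fix the parity by re-rounding the coordinate with the largest rounding error.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1 if x[k] > r[k] else -1
    return r

def nearest_e8(x):
    """Nearest point of E8, taken as the union of D8 and D8 + 1/2."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]
    dist_a = sum((p - q) ** 2 for p, q in zip(x, a))
    dist_b = sum((p - q) ** 2 for p, q in zip(x, b))
    return [float(v) for v in a] if dist_a <= dist_b else b

def quantize_vector(vec, scale):
    """Hadamard-rotate `vec`, snap each 8-float group to the scaled E8 lattice, rotate back."""
    rotated = fwht(vec)
    snapped = []
    for start in range(0, len(rotated), 8):
        group = [v / scale for v in rotated[start:start + 8]]
        snapped.extend(p * scale for p in nearest_e8(group))
    return fwht(snapped)  # the orthonormal transform is its own inverse

# Toy 16-dimensional key vector with one large entry (values are illustrative).
key = [0.9, -0.1, 0.05, 0.2, -0.3, 0.0, 0.1, -0.05,
       0.4, 0.1, -0.2, 0.3, 0.0, -0.6, 0.2, 0.1]
approx = quantize_vector(key, scale=0.25)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(key, approx)))
print(f"reconstruction error: {error:.3f}")
```

Because the rotation is orthonormal, the reconstruction error equals the quantization error in the rotated space, which is why spreading outlier energy before choosing the scale helps.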

Token eviction reduces the token count (2.5x at 60% eviction, since 1 / (1 - 0.6) = 2.5). E8 quantization reduces precision (~7x after entropy coding). Combined: 2.5 × 7 ≈ 17x.
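A minimal sketch of the eviction rule and the combined ratio, assuming importance scores are already available as one number per token. The function `select_kept_tokens`, the `recent_window` parameter, and the toy scores are hypothetical; the package computes its own scores from attention.

```python
def select_kept_tokens(scores, evict_fraction, recent_window):
    """Return sorted indices to keep: BOS (index 0), the last `recent_window`
    tokens, and the highest-scoring tokens among the rest."""
    n = len(scores)
    n_keep = n - round(n * evict_fraction)
    protected = {0} | set(range(max(0, n - recent_window), n))
    candidates = sorted((i for i in range(n) if i not in protected),
                        key=lambda i: scores[i], reverse=True)
    n_extra = max(0, n_keep - len(protected))
    return sorted(protected | set(candidates[:n_extra]))

# Toy importance scores for a 10-token prefix (illustrative values only).
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.2, 0.05, 0.05]
kept = select_kept_tokens(scores, evict_fraction=0.6, recent_window=2)

count_ratio = len(scores) / len(kept)   # reduction from eviction alone
precision_ratio = 7                     # the README's ~7x figure for E8 + entropy coding
combined = count_ratio * precision_ratio

print(kept, count_ratio, combined)
```

With 60% eviction, 4 of the 10 tokens survive: BOS, the two most recent tokens, and the single highest-scoring token among the others.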

Compared to

Method	Compression	PPL degradation	Training required
NexusQuant	10-33x	+0.4-2.6%	No
TurboQuant (Google)	~5-6x	~0%	No
KVTC (NVIDIA)	up to 20x	<1%	Yes (calibration, ~10 min)
CommVQ (Apple)	~8x	~0%	Yes (full retraining)
Palu	11x	~25% (relative)	Yes (calibration)

Among the methods listed, NexusQuant reaches the highest compression without training or calibration. KVTC achieves comparable ratios with better quality but requires calibration data. Competitor numbers are taken from their published papers and were not reproduced on our hardware.

Supported models

Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):

Llama family (Llama-2, Llama-3, Llama-3.1)
Mistral / Mixtral
Qwen
Phi
Gemma

Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
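To make the split-half convention concrete: dimension i is paired with dimension i + d/2, and each pair is rotated by a position-dependent angle. The sketch below applies that rotation to a single key vector and then undoes it, which is the idea behind the RoPE removal step. It assumes the common base of 10000 and is an illustration only; the names are not the package's API.

```python
import math

def rotate_half(x):
    """Split-half pairing: dimension i is paired with dimension i + d/2."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

def rope_angles(position, dim, base=10000.0):
    """Rotation angle per dimension; both members of a pair share the same angle."""
    half = dim // 2
    freqs = [position / (base ** (2 * i / dim)) for i in range(half)]
    return freqs + freqs

def apply_rope(x, position, inverse=False):
    """Rotate `x` by its position's angles, or undo that rotation when `inverse` is True."""
    angles = rope_angles(position, len(x))
    sign = -1.0 if inverse else 1.0
    rotated_half = rotate_half(x)
    return [v * math.cos(a) + sign * r * math.sin(a)
            for v, r, a in zip(x, rotated_half, angles)]

key = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 3.0]
rotated = apply_rope(key, position=7)
restored = apply_rope(rotated, position=7, inverse=True)
print(max(abs(a - b) for a, b in zip(key, restored)))
```

An interleaved layout pairs dimensions (2i, 2i + 1) instead, so the same inverse would mix the wrong coordinates.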

Limitations

Quality is text-dependent. Creative/narrative text degrades more than structured/technical text at the same compression ratio. Test on your actual workload before deploying.
Short prefixes hurt. Prefixes under 500 tokens see more degradation than the numbers above, which were measured at 1600-3500 tokens. The importance scorer needs enough tokens to distinguish signal from noise.
E8 quantization is CPU-bound. A production deployment needs Triton/CUDA kernels for the quantization step. For compatibility, the current implementation writes dequantized values back to the cache, so realizing the GPU memory savings requires native compact storage.
Eviction is permanent. Evicted tokens are gone. If your task requires precise recall of a specific token, measure eviction sensitivity on that task first.
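When testing on your own workload, one way to express the result is the relative perplexity change between an uncompressed and a compressed run. The helper below is a sketch that assumes "PPL degradation" means that relative increase and that you have per-token negative log-likelihoods (natural log) from both runs. The function names and the numbers are placeholders, not measurements.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_degradation_pct(baseline_nlls, compressed_nlls):
    """Relative perplexity increase of the compressed run, in percent."""
    base = perplexity(baseline_nlls)
    comp = perplexity(compressed_nlls)
    return 100.0 * (comp / base - 1.0)

# Placeholder values: replace with NLLs collected from your own evaluation text.
baseline_nlls = [2.0, 2.0, 2.0, 2.0]
compressed_nlls = [2.0 + math.log(1.013)] * 4

degradation = ppl_degradation_pct(baseline_nlls, compressed_nlls)
print(f"PPL degradation: +{degradation:.1f}%")
```

Running this per text type (for example narrative versus technical) and per prefix length makes the text-dependence and short-prefix effects described above visible for your data.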

Citation

@software{nexusquant2025,
  author  = {Marques, Joao},
  title   = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization},
  year    = {2025},
  url     = {https://github.com/jagmarques/nexusquant},
  license = {Apache-2.0},
}

License

Apache 2.0. See LICENSE.

This version

0.4.0

Apr 7, 2026
