SpectralQuant
Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression of the KV cache with FP16-equivalent output quality, drop-in for HuggingFace LLMs and vision transformers.
pip install spectralquant
What it does
Modern LLM inference is bottlenecked by the size of the KV cache. The cache grows linearly with sequence length and, at long context, can consume more memory than the model weights themselves. SpectralQuant compresses that cache by exploiting the fact that, after a per-head spectral rotation, only a small number of dimensions actually carry information.
A short calibration step measures the eigenstructure of each attention head. Each head's keys and values are then split into a high-variance "semantic" band and a low-variance "tail" band. The semantic band gets a generous bit budget; the tail gets one or two bits. Total cache size shrinks by up to 6.55x (the high preset) while output quality stays indistinguishable from FP16 on the models tested.
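To see why the tail band can live on one or two bits, it helps to look at what a scalar quantizer costs at those rates. The sketch below fits a Lloyd-Max codebook to a unit-variance Gaussian and reports the mean squared error as a fraction of the coordinate's variance. The quantizer family and the Gaussian model are the ones named under "How it works"; the function names `lloyd_max_gaussian` and `relative_distortion` are illustrative and not part of the spectralquant API, and this is a textbook iteration rather than the package's kernel.

```python
import math


def _pdf(x):
    """Standard normal density; evaluates to 0.0 at +/-inf."""
    return 0.0 if math.isinf(x) else math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _edges(levels):
    """Decision thresholds: midpoints between neighbouring levels."""
    mids = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
    return [-math.inf] + mids + [math.inf]


def lloyd_max_gaussian(bits, iters=500):
    """Lloyd-Max reconstruction levels for N(0, 1) with 2**bits levels."""
    n = 2 ** bits
    levels = [-2.0 + 4.0 * (i + 0.5) / n for i in range(n)]  # uniform start
    for _ in range(iters):
        e = _edges(levels)
        # Replace each level by the centroid of N(0, 1) restricted to its cell.
        levels = [(_pdf(a) - _pdf(b)) / (_cdf(b) - _cdf(a)) for a, b in zip(e, e[1:])]
    return levels


def relative_distortion(levels):
    """Mean squared error of the quantizer, as a fraction of the variance."""
    e = _edges(levels)
    err = 1.0  # E[x^2] for a unit-variance Gaussian
    for c, a, b in zip(levels, e, e[1:]):
        mass = _cdf(b) - _cdf(a)
        first_moment = _pdf(a) - _pdf(b)
        err += c * c * mass - 2.0 * c * first_moment
    return err


for bits in (1, 2, 3):
    d = relative_distortion(lloyd_max_gaussian(bits))
    print(f"{bits} bit(s): error is {d:.3f} of the coordinate variance")
```

At 1 bit the error is 1 - 2/π, roughly 0.36 of the variance. The absolute error is that fraction multiplied by the coordinate's eigenvalue, so a tail coordinate with a small eigenvalue adds little error even at the lowest rate.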
The package ships pure-PyTorch kernels and HuggingFace integrations. There are no custom CUDA dependencies. It runs anywhere torch runs.
Quickstart
import torch
import spectralquant as sq
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
engine = sq.SpectralQuant(compression="high")  # 6.55x preset
out = engine.generate(
    model, tok,
    "Explain water-filling bit allocation in two sentences.",
    max_new_tokens=120,
)
print(out["text"])
print(f"{out['stats']['ratio']:.2f}x compression, "
      f"{out['stats']['tokens_per_second']:.1f} tok/s")
The first call to engine.generate(...) runs a one-time calibration with a
bundled 64-sentence corpus. Subsequent calls reuse it. You can also pass your
own domain-specific corpus.
Compression presets
print(sq.describe_presets())
| preset | ratio | risk | notes |
|---|---|---|---|
| standard | 5.95x | safe | Paper baseline. Production default. |
| high | 6.55x | safe | Validated on Mistral 7B and Qwen 2.5 7B. |
| max | 6.68x | edge | First paragraph clean. Light repetition possible. |
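As a rough way to read the ratios, each one can be converted into an all-in average number of bits per cached value, since the baseline stores 16 bits per value. The sketch below does only that arithmetic; the helper name `average_bits` is illustrative, and the result folds in whatever metadata overhead the measured ratio already includes, so it is not the same thing as the avg_bits dial.

```python
FP16_BITS = 16  # bits per cached value in the FP16 baseline

# Ratios copied from the preset table above.
PRESET_RATIOS = {"standard": 5.95, "high": 6.55, "max": 6.68}


def average_bits(ratio, baseline_bits=FP16_BITS):
    """All-in average bits per cached value implied by a compression ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return baseline_bits / ratio


for name, ratio in PRESET_RATIOS.items():
    print(f"{name:>8}: {ratio:.2f}x -> {average_bits(ratio):.2f} bits per value")
```

All three presets land between roughly 2.4 and 2.7 bits per value, which is why the gap between them in ratio terms is small.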
You can also override individual dials when you need them:
engine = sq.SpectralQuant(
    compression="high",
    d_eff_variance=0.93,  # override one knob
)
The dials are avg_bits, noise_bits, value_noise_bits, and
d_eff_variance. Anything unset falls back to the named preset.
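This README does not define d_eff_variance precisely. One plausible reading, and it is only an assumption, is that it is the fraction of total variance the high-variance "semantic" band must capture, which in turn fixes how many rotated dimensions fall into that band. The sketch below illustrates that reading with a made-up spectrum; `effective_dim` is a hypothetical helper, not a spectralquant function.

```python
def effective_dim(eigenvalues, variance_fraction):
    """Smallest k such that the k largest eigenvalues hold at least
    variance_fraction of the total variance."""
    if not 0.0 < variance_fraction <= 1.0:
        raise ValueError("variance_fraction must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    target = variance_fraction * sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running >= target:
            return k
    return len(ordered)


# Hypothetical, fast-decaying spectrum for one head (not measured data).
spectrum = [2.0 ** -i for i in range(16)]

for fraction in (0.93, 0.99):
    k = effective_dim(spectrum, fraction)
    print(f"variance fraction {fraction}: {k} of {len(spectrum)} dims in the semantic band")
```

Under this reading, raising the threshold widens the semantic band and so spends more bits, which trades compression for fidelity.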
Supported models
Tested and verified:
| family | example | works |
|---|---|---|
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 | yes |
| Qwen 2.5 | Qwen/Qwen2.5-7B-Instruct | yes |
| Llama 3.x | NousResearch/Meta-Llama-3.1-8B-Instruct | yes |
| SmolLM2 | HuggingFaceTB/SmolLM2-135M | yes |
| Gemma 2 | google/gemma-2-9b | expected |
The cache-level integration works with any HuggingFace causal LM that uses
DynamicCache (transformers >= 4.40). RoPE-based architectures with grouped
query attention are the primary target.
For non-LLM transformers (ViT, ESMFold, VideoMAE, AlphaFold) see the modules
in spectralquant.integrations. Vision transformers can actually see a
quality improvement over FP16 because the eigenspectral filtering removes
noise in the low-variance directions.
Hardware
| GPU | memory | recommended for |
|---|---|---|
| H100 / H200 | 80–141 GB | 7B, 13B, 70B inference, batch decode |
| A100 80 GB | 80 GB | 7B and 13B inference |
| A100 40 GB / A6000 | 40–48 GB | 7B inference, short context |
| RTX 4090 / 4080 / 3090 | 24 GB | 7B inference at FP16, short context |
| T4 / RTX 3060 | 12–16 GB | smaller models, demo runs |
| CPU | n/a | works, but slow |
The compression ratios above were measured on an H200 with Mistral 7B and Qwen 2.5 7B at sequence length 512. Compression is sequence-length agnostic, so the ratios hold at longer contexts. Speed gains grow with context length because the FP16 baseline gets slower while SpectralQuant decode stays linear.
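For sizing a GPU against the table above, a back-of-the-envelope estimate of the KV cache footprint is often enough. The sketch below assumes the public Mistral 7B configuration (32 layers, 8 key/value heads under grouped query attention, head dimension 128) and applies the 6.55x ratio as a flat divisor; `kv_cache_bytes` is an illustrative helper, and real usage will also include weights, activations, and allocator overhead.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2, batch=1):
    """FP16 KV cache size in bytes; the leading 2 counts keys and values."""
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


RATIO = 6.55  # "high" preset, taken from the presets table
MIB = 2 ** 20

for seq_len in (512, 8192, 32768):
    fp16 = kv_cache_bytes(seq_len)
    print(f"{seq_len:>6} tokens: FP16 {fp16 / MIB:8.1f} MiB, "
          f"at {RATIO}x about {fp16 / RATIO / MIB:7.1f} MiB")
```

With these assumptions the FP16 cache costs 128 KiB per token, so a single 32k-token sequence takes 4 GiB before compression, and that figure multiplies with batch size.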
Generating with a pre-compressed prefix
Useful when you want to keep one compressed cache and reuse it across many completions of the same long prefix.
result = engine.compress_prefill(model, tok, long_prefix)
cache = result["cache"] # a fresh DynamicCache, FP16 surface
print(f"prefix compression: {result['stats']['ratio']:.2f}x")
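# Use cache as past_key_values for any number of follow-ups.
# Note: model.generate() extends a DynamicCache in place, so pass
# copy.deepcopy(cache) if every follow-up must start from the same prefix.
inputs = tok(question, return_tensors="pt").to(model.device)
ids = model.generate(
    **inputs,
    past_key_values=cache,
    max_new_tokens=200,
)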
Custom calibration
The bundled corpus works for general English. For domain-specific workloads (code, biomedical text, legal filings), pass your own:
my_corpus = [...] # 32–128 representative samples
engine = sq.SpectralQuant(compression="high")
engine.calibrate(model, tok, my_corpus)
Calibration takes a few seconds on an H200. You can persist it once and reload it in any future process:
engine.save_calibration("/path/to/calib")
fresh = sq.SpectralQuant(compression="high")
fresh.load_calibration("/path/to/calib", head_dim=128)
How it works (one paragraph)
For each attention head, calibration accumulates the key and value covariance
matrices and eigendecomposes them. The eigenvectors define a per-head
rotation that aligns coordinates with directions of decreasing variance.
After rotation, a water-filling allocator distributes bits across
coordinates so that high-variance dimensions get more bits and tail
dimensions get fewer. Two bit budgets are used: a "semantic" budget
(avg_bits) for the high-variance band and a "tail" budget (noise_bits,
value_noise_bits) for the rest. Each coordinate is quantized with a
Lloyd-Max scalar codebook fit to a Gaussian whose variance equals that
coordinate's eigenvalue. Decode rotates back, dequantizes, and the rest of
attention proceeds at full FP16. The math is in
engine.py.
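The allocation step can be illustrated with the textbook reverse water-filling rule for Gaussian sources: each coordinate receives max(0, 0.5 * log2(variance / theta)) bits, with the water level theta chosen so that the bits sum to the budget. The sketch below is that generic rule, not necessarily the exact allocator in engine.py. The function name `water_fill_bits` and the example spectrum are illustrative, and it assumes strictly positive variances.

```python
import math


def water_fill_bits(variances, total_bits, iters=200):
    """Split total_bits across coordinates by reverse water-filling.

    Each coordinate gets max(0, 0.5 * log2(variance / theta)) bits, where the
    water level theta is found by bisection so the bits sum to total_bits.
    """
    def bits_at(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    hi = max(variances)                               # every coordinate gets 0 bits
    lo = min(variances) * 2.0 ** (-2.0 * total_bits)  # every coordinate gets >= total_bits
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if sum(bits_at(mid)) > total_bits:
            lo = mid  # budget exceeded: raise the water level
        else:
            hi = mid
    return bits_at(hi)


# Hypothetical per-head eigenvalues after rotation (not measured data).
eigenvalues = [16.0, 4.0, 1.0, 0.25]
bits = water_fill_bits(eigenvalues, total_bits=4.0)
print([round(b, 3) for b in bits])
```

For this spectrum the rule assigns about 2.33, 1.33, 0.33 and 0 bits: the smallest eigenvalue falls below the water level and receives nothing. A practical allocator has to round these fractional values to whole bits, which is one reason a fixed small tail budget such as noise_bits can be convenient.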
Demo notebook
A full end-to-end notebook is included at
notebooks/spectralquant_demo.ipynb.
It walks through:
- Install + GPU sanity check
- The three presets
- Loading Mistral 7B
- Side-by-side FP16 vs SpectralQuant on four diverse prompts, for each preset
- Power-user override
- Custom calibration
- Final summary table
- Save / load round-trip
To run it on a fresh GPU instance:
unzip -oq spectralquant.zip -d spectralquant
pip install -e ./spectralquant
jupyter notebook notebooks/spectralquant_demo.ipynb
API surface
sq.SpectralQuant(
    compression="standard" | "high" | "max",
    device=None,        # "cuda" | "mps" | "cpu" | None (auto)
    head_dim=None,      # inferred from model
    avg_bits=None, noise_bits=None,
    value_noise_bits=None,
    d_eff_variance=None,
)
engine.generate(model, tokenizer, prompt, *, max_new_tokens=128, ...)
engine.compress_prefill(model, tokenizer, prompt)
engine.calibrate(model, tokenizer, calibration_texts=None)
engine.compression_stats()
engine.save_calibration(path)
engine.load_calibration(path, head_dim=128)
The lower-level sq.SpectralQuantEngine is also exported for users who want
direct access to per-head bit allocations or to use the legacy
attention-level monkey-patch path.
Measuring quality
The package reports four metrics in engine.compression_stats() and in the
stats field returned by .generate(...):
- ratio: observed prefix-cache compression vs FP16 (bytes / bytes)
- tokens_per_second: measured decode throughput
- seconds: wall clock for the decode step
- compressed_bytes, fp16_bytes: raw byte counts
For independent quality validation you can run perplexity on WikiText:
python examples/run_perplexity.py --model mistralai/Mistral-7B-Instruct-v0.3
Or sweep parameters to find the sweet spot for a model not in our test set:
python examples/sweep_compression.py --model <hf_repo>
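When comparing perplexity between an FP16 run and a compressed run, it helps to keep the definition in view: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition and a relative-change helper. Both function names are illustrative, the inputs are assumed to be natural-log token probabilities collected by your own evaluation loop, and this is not a description of what examples/run_perplexity.py does internally.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(baseline_ppl, compressed_ppl):
    """Fractional perplexity change of the compressed run vs the baseline."""
    return (compressed_ppl - baseline_ppl) / baseline_ppl


# Sanity check: a model that assigns probability 1/4 to every token
# has perplexity 4 (up to floating-point rounding).
uniform = [math.log(0.25)] * 10
print(f"perplexity of a uniform 4-way guesser: {perplexity(uniform):.3f}")
```

Reporting the relative increase rather than the raw difference makes results comparable across models whose baseline perplexities differ.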
Authors
- Anirudh Bharadwaj Vangara — anirudh@sentra.app
- Ashwin Gopinath — ashwin@sentra.app
Bug reports, feature requests, and pull requests are welcome on GitHub.
License
MIT.
Citation
@misc{spectralquant2026,
  title  = {SpectralQuant: Eigenspectral KV Cache Compression},
  author = {Vangara, Anirudh Bharadwaj and Gopinath, Ashwin},
  year   = {2026},
}