Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression with FP16-equivalent quality, drop-in for HuggingFace LLMs and vision transformers.

SpectralQuant

Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression of the KV cache with FP16-equivalent output quality.

pip install spectralquant

What it does

Modern LLM inference is often bottlenecked by the size of the KV cache. The cache grows linearly with sequence length and, at long context, can consume more memory than the model weights themselves. SpectralQuant compresses that cache by exploiting the fact that, after a per-head spectral rotation, only a small number of dimensions actually carry information.
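
To make the linear growth concrete, here is a back-of-the-envelope sketch of uncompressed KV cache size. The helper name kv_cache_bytes is hypothetical and not part of spectralquant; the layer, head, and head_dim figures are assumed from the published Mistral 7B configuration (32 layers, 8 KV heads under grouped query attention, head_dim 128).

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes held by an uncompressed KV cache (keys plus values)."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token


# Assumed Mistral 7B shape: 32 layers, 8 KV heads, head_dim 128, FP16 elements.
fp16 = kv_cache_bytes(32_768, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"FP16 cache at 32k tokens: {fp16 / 2**30:.2f} GiB")
print(f"same cache at 6.55x:      {fp16 / 6.55 / 2**30:.2f} GiB")
```

The per-token cost is fixed by the model shape, so doubling the context or the batch size doubles the cache.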

A short calibration step measures the eigenstructure of each attention head. Each head's keys and values are then split into a high-variance "semantic" band and a low-variance "tail" band. The semantic band gets a generous bit budget; the tail gets one or two bits. Total cache size shrinks by up to 6.55x, depending on the preset, with output quality reported as indistinguishable from FP16.
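
The arithmetic behind a two-band split can be sketched as follows. This is an illustration only: the function names are hypothetical, the numbers (32 semantic coordinates at 4 bits, 96 tail coordinates at 2 bits) are not the library's preset values, and the estimate ignores any storage overhead for codebooks, scales, or rotation matrices.

```python
def effective_bits(head_dim, d_eff, semantic_bits, tail_bits):
    """Average bits per coordinate when d_eff coordinates form the semantic band."""
    tail_dims = head_dim - d_eff
    return (d_eff * semantic_bits + tail_dims * tail_bits) / head_dim


def ratio_vs_fp16(bits_per_coord):
    """Idealized compression ratio against 16-bit storage."""
    return 16.0 / bits_per_coord


# Illustrative numbers, not the values behind any preset.
bits = effective_bits(head_dim=128, d_eff=32, semantic_bits=4, tail_bits=2)
print(f"{bits:.2f} bits/coord, about {ratio_vs_fp16(bits):.2f}x vs FP16")
```

The fewer coordinates the semantic band needs, the closer the average falls to the tail budget and the higher the ratio climbs.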

The package ships pure-PyTorch kernels and HuggingFace integrations. There are no custom CUDA dependencies. It runs anywhere torch runs.

Quickstart

import torch
import spectralquant as sq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

engine = sq.SpectralQuant(compression="high")  # 6.55x preset

out = engine.generate(
    model, tok,
    "Explain water-filling bit allocation in two sentences.",
    max_new_tokens=120,
)

print(out["text"])
print(f"{out['stats']['ratio']:.2f}x compression, "
      f"{out['stats']['tokens_per_second']:.1f} tok/s")

The first call to engine.generate(...) runs a one-time calibration with a bundled 64-sentence corpus. Subsequent calls reuse it. You can also pass your own domain-specific corpus.

Compression presets

print(sq.describe_presets())
preset    ratio  risk  notes
standard  5.95x  safe  Paper baseline. Production default.
high      6.55x  safe  Validated on Mistral 7B and Qwen 2.5 7B.
max       6.68x  edge  First paragraph clean. Light repetition possible.

You can also override individual dials when you need them:

engine = sq.SpectralQuant(
    compression="high",
    d_eff_variance=0.93,   # override one knob
)

The dials are avg_bits, noise_bits, value_noise_bits, and d_eff_variance. Anything unset falls back to the named preset.
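
The text does not define d_eff_variance. One plausible reading, assumed here, is that it is the fraction of total variance the semantic band must capture, so the band width follows from the sorted eigenvalues. The sketch below illustrates that reading with a hypothetical helper and a made-up spectrum; it is not the library's code.

```python
def effective_dim(eigenvalues, variance_fraction):
    """Smallest k such that the top-k eigenvalues hold variance_fraction of the total."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    if total <= 0:
        raise ValueError("eigenvalues must have positive total variance")
    running = 0.0
    for k, v in enumerate(vals, start=1):
        running += v
        if running >= variance_fraction * total:
            return k
    return len(vals)


# Made-up spectrum that halves at every coordinate.
spectrum = [0.5 ** i for i in range(16)]
print(effective_dim(spectrum, 0.93))  # width of the semantic band at 93%
print(effective_dim(spectrum, 0.99))  # a stricter threshold widens the band
```

Under this reading, raising d_eff_variance moves more coordinates into the semantic band, trading compression for fidelity.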

Supported models

Tested and verified, except where marked "expected" (anticipated to work but not yet verified):

family     example                                  works
Mistral    mistralai/Mistral-7B-Instruct-v0.3       yes
Qwen 2.5   Qwen/Qwen2.5-7B-Instruct                 yes
Llama 3.x  NousResearch/Meta-Llama-3.1-8B-Instruct  yes
SmolLM2    HuggingFaceTB/SmolLM2-135M               yes
Gemma 2    google/gemma-2-9b                        expected

The cache-level integration works with any HuggingFace causal LM that uses DynamicCache (transformers >= 4.40). RoPE-based architectures with grouped query attention are the primary target.

For non-LLM transformers (ViT, ESMFold, VideoMAE, AlphaFold) see the modules in spectralquant.integrations. Vision transformers can actually see a quality improvement over FP16 because the eigenspectral filtering removes noise in the low-variance directions.

Hardware

GPU                     memory     recommended for
H100 / H200             80–141 GB  7B, 13B, 70B inference, batch decode
A100 80 GB              80 GB      7B and 13B inference
A100 40 GB / A6000      40–48 GB   7B inference, short context
RTX 4090 / 4080 / 3090  16–24 GB   7B inference at FP16, short context
T4 / RTX 3060           12–16 GB   smaller models, demo runs
CPU                     n/a        works, but slow

The compression ratios above were measured on H200 with Mistral 7B and Qwen 2.5 7B at sequence length 512. Compression is sequence-length agnostic so ratios hold at longer contexts; speed gains scale with context length because the FP16 baseline gets slower while the SQ decode stays linear.

Generating with a pre-compressed prefix

Useful when you want to keep one compressed cache and reuse it across many completions of the same long prefix.

result = engine.compress_prefill(model, tok, long_prefix)
cache  = result["cache"]                 # a fresh DynamicCache, FP16 surface
print(f"prefix compression: {result['stats']['ratio']:.2f}x")

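# Use the cache as past_key_values for any number of follow-ups.
# HuggingFace generate() appends to a DynamicCache in place, so hand each
# follow-up its own copy to keep the compressed prefix reusable.
import copy

inputs = tok(question, return_tensors="pt").to(model.device)
ids = model.generate(
    **inputs,
    past_key_values=copy.deepcopy(cache),
    max_new_tokens=200,
)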

Custom calibration

The bundled corpus works for general English. For domain-specific workloads (code, biomedical text, legal filings), pass your own:

my_corpus = [...]   # 32–128 representative samples
engine = sq.SpectralQuant(compression="high")
engine.calibrate(model, tok, my_corpus)

Calibration takes a few seconds on H200. You can persist it once and reload in any future process:

engine.save_calibration("/path/to/calib")
fresh = sq.SpectralQuant(compression="high")
fresh.load_calibration("/path/to/calib", head_dim=128)

How it works (one paragraph)

For each attention head, calibration accumulates the key and value covariance matrices and eigendecomposes them. The eigenvectors define a per-head rotation that aligns coordinates with directions of decreasing variance. After rotation, a water-filling allocator distributes bits across coordinates so that high-variance dimensions get more bits and tail dimensions get fewer. Two bit budgets are used: a "semantic" budget (avg_bits) for the high-variance band and a "tail" budget (noise_bits, value_noise_bits) for the rest. Each coordinate is quantized with a Lloyd-Max scalar codebook fit to a Gaussian whose variance equals that coordinate's eigenvalue. Decode dequantizes and rotates back, and the rest of attention proceeds at full FP16. The math is in engine.py.
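
The water-filling step can be sketched with the textbook reverse water-filling rule for independent Gaussian coordinates: coordinate i gets max(0, 0.5 * log2(eigenvalue_i / theta)) bits, with the water level theta chosen to meet the budget. This is a minimal sketch under that assumption, not the allocator in engine.py; the function names are hypothetical, the bits are left fractional, and the tail band here drops to zero bits instead of the one or two bits described above.

```python
import math


def bits_at_level(eigenvalues, theta):
    """Bits per coordinate for a given water level theta."""
    return [max(0.0, 0.5 * math.log2(lam / theta)) for lam in eigenvalues]


def water_fill_bits(eigenvalues, avg_bits, iters=200):
    """Find the water level whose allocation averages avg_bits per coordinate."""
    if avg_bits <= 0 or min(eigenvalues) <= 0:
        raise ValueError("need avg_bits > 0 and strictly positive eigenvalues")
    budget = avg_bits * len(eigenvalues)
    lo = min(eigenvalues) / 4.0 ** avg_bits  # every coordinate gets >= avg_bits
    hi = max(eigenvalues)                    # every coordinate gets 0 bits
    for _ in range(iters):
        mid = math.sqrt(lo * hi)             # bisect on a log scale
        if sum(bits_at_level(eigenvalues, mid)) > budget:
            lo = mid                         # over budget: raise the water level
        else:
            hi = mid
    return bits_at_level(eigenvalues, hi)


spectrum = [16.0, 4.0, 1.0, 0.25]            # made-up per-head eigenvalues
print([round(b, 3) for b in water_fill_bits(spectrum, avg_bits=2.0)])
print([round(b, 3) for b in water_fill_bits(spectrum, avg_bits=0.5)])
```

With a generous budget every coordinate receives bits in proportion to its log-variance; with a tight budget the low-variance coordinates fall below the water level and receive nothing, which is the behavior the separate tail budget is there to soften.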

Demo notebook

A full end-to-end notebook is included at notebooks/spectralquant_demo.ipynb. It walks through:

  1. Install + GPU sanity check
  2. The three presets
  3. Loading Mistral 7B
  4. Side-by-side FP16 vs SpectralQuant on four diverse prompts, for each preset
  5. Power-user override
  6. Custom calibration
  7. Final summary table
  8. Save / load round-trip

To run it on a fresh GPU instance:

unzip -oq spectralquant.zip -d spectralquant
pip install -e ./spectralquant
jupyter notebook notebooks/spectralquant_demo.ipynb

API surface

sq.SpectralQuant(
    compression="standard" | "high" | "max",
    device=None,                       # "cuda" | "mps" | "cpu" | None (auto)
    head_dim=None,                     # inferred from model
    avg_bits=None, noise_bits=None,
    value_noise_bits=None,
    d_eff_variance=None,
)

engine.generate(model, tokenizer, prompt, *, max_new_tokens=128, ...)
engine.compress_prefill(model, tokenizer, prompt)
engine.calibrate(model, tokenizer, calibration_texts=None)
engine.compression_stats()
engine.save_calibration(path)
engine.load_calibration(path, head_dim=128)

The lower-level sq.SpectralQuantEngine is also exported for users who want direct access to per-head bit allocations or to use the legacy attention-level monkey-patch path.

Measuring quality

The package reports the following metrics in engine.compression_stats() and in the stats field returned by .generate(...):

  • ratio — observed prefix-cache compression vs FP16 (bytes / bytes)
  • tokens_per_second — measured decode throughput
  • seconds — wall clock for the decode step
  • compressed_bytes, fp16_bytes — raw byte counts

For independent quality validation you can run perplexity on WikiText:

python examples/run_perplexity.py --model mistralai/Mistral-7B-Instruct-v0.3
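
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition on made-up log-probabilities; it is not the contents of examples/run_perplexity.py, and both function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_increase(ppl_compressed, ppl_baseline):
    """Fractional perplexity change of a compressed run against the FP16 baseline."""
    return ppl_compressed / ppl_baseline - 1.0


# Made-up example: a model that assigns probability 0.25 to every token.
logprobs = [math.log(0.25)] * 8
print(f"perplexity: {perplexity(logprobs):.2f}")
```

Comparing the compressed and FP16 runs on the same tokens with relative_increase gives a single number to track across presets.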

Or sweep parameters to find the sweet spot for a model not in our test set:

python examples/sweep_compression.py --model <hf_repo>

Authors

Bug reports, feature requests, and pull requests are welcome on GitHub.

License

MIT.

Citation

@misc{spectralquant2026,
  title  = {SpectralQuant: Eigenspectral KV Cache Compression},
  author = {Vangara, Anirudh Bharadwaj and Gopinath, Ashwin},
  year   = {2026},
}
