SpectralQuant

Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression of the KV cache with FP16-equivalent output quality.

pip install spectralquant

What it does

Modern LLM inference is often bottlenecked by the size of the KV cache. The cache grows linearly with sequence length and, at long context or large batch sizes, can consume more memory than the model weights themselves. SpectralQuant compresses that cache by exploiting the fact that, after a per-head spectral rotation, only a small number of dimensions actually carry information.
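For a rough sense of scale, the FP16 cache size follows directly from the model shape. The sketch below assumes the published Mistral 7B configuration (32 layers, 8 key/value heads under grouped query attention, head dimension 128). The helper kv_cache_bytes is illustrative and not part of spectralquant.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch=1):
    """FP16 KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

for seq_len in (512, 8_192, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>7} tokens: {gib:6.2f} GiB FP16, {gib / 6.55:6.2f} GiB at 6.55x")
```

Under these assumptions each token costs 128 KiB of cache, so a batch of four 32K-token sequences needs 16 GiB, which is more than the FP16 weights of a 7B model. A 6.55x ratio corresponds to an average of about 16 / 6.55 ≈ 2.44 bits per cached element.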

A short calibration step measures the eigenstructure of each attention head. Each head's keys and values are then split into a high-variance "semantic" band and a low-variance "tail" band. The semantic band gets a generous bit budget; the tail gets one or two bits. With the high preset, total cache size shrinks by 6.55x with output quality indistinguishable from FP16.
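To make the spectral rotation concrete, here is a minimal NumPy sketch on synthetic data. It is not the package's calibration code: it assumes the rotation is the eigenbasis of the per-head covariance, and the sizes (16 dimensions, 4 of them carrying signal) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, n_tokens, rank = 16, 4096, 4   # illustrative sizes, not package defaults

# Synthetic "keys": a strong 4-dimensional signal mixed into 16 dims, plus weak noise.
mixing = rng.normal(size=(rank, head_dim))
keys = (rng.normal(size=(n_tokens, rank)) @ mixing
        + 0.05 * rng.normal(size=(n_tokens, head_dim)))

# Calibration: covariance of the keys and its eigendecomposition.
cov = np.cov(keys, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotation: project centred keys onto the eigenvectors, highest variance first.
rotated = (keys - keys.mean(axis=0)) @ eigvecs
explained = np.cumsum(eigvals) / eigvals.sum()

print("variance per rotated coordinate:", np.round(rotated.var(axis=0, ddof=1), 3))
print("cumulative variance after 4 coordinates:", round(float(explained[rank - 1]), 4))
```

After the rotation, the variance of each coordinate equals the corresponding eigenvalue, and almost all of it sits in the first few coordinates. Those leading coordinates play the role of the semantic band; the rest form the tail.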

The package ships pure-PyTorch kernels and HuggingFace integrations. There are no custom CUDA dependencies. It runs anywhere torch runs.

Quickstart

import torch
import spectralquant as sq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

engine = sq.SpectralQuant(compression="high")  # 6.55x preset

out = engine.generate(
    model, tok,
    "Explain water-filling bit allocation in two sentences.",
    max_new_tokens=120,
)

print(out["text"])
print(f"{out['stats']['ratio']:.2f}x compression, "
      f"{out['stats']['tokens_per_second']:.1f} tok/s")

The first call to engine.generate(...) runs a one-time calibration with a bundled 64-sentence corpus. Subsequent calls reuse it. You can also calibrate on your own domain-specific corpus with engine.calibrate(...); see Custom calibration.

Compression presets

print(sq.describe_presets())
preset     ratio   risk   notes
standard   5.95x   safe   Paper baseline. Production default.
high       6.55x   safe   Validated on Mistral 7B and Qwen 2.5 7B.
max        6.68x   edge   First paragraph clean. Light repetition possible.

You can also override individual dials when you need them:

engine = sq.SpectralQuant(
    compression="high",
    d_eff_variance=0.93,   # override one knob
)

The dials are avg_bits, noise_bits, value_noise_bits, and d_eff_variance. Anything unset falls back to the named preset.
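The name d_eff_variance suggests a cumulative-variance threshold that decides how many rotated coordinates fall into the semantic band. That reading is an assumption, not something stated here. Under it, the split could be sketched as follows, with effective_dim as a hypothetical helper and a toy spectrum.

```python
from itertools import accumulate

def effective_dim(eigenvalues, variance_fraction):
    """Smallest number of leading coordinates whose variance share reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running >= variance_fraction * total:
            return count
    return len(ordered)

# Toy spectrum (made up): variance halves with each coordinate.
spectrum = [2.0 ** -i for i in range(16)]
for fraction in (0.80, 0.93, 0.99):
    print(fraction, "->", effective_dim(spectrum, fraction), "semantic coordinates")
# -> 3, 4 and 7 coordinates respectively
```

On this reading, raising the threshold widens the semantic band, which spends more bits and lowers the compression ratio.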

Supported models

Tested and verified, except where marked "expected" (anticipated to work, not yet verified):

family      example                                     works
Mistral     mistralai/Mistral-7B-Instruct-v0.3          yes
Qwen 2.5    Qwen/Qwen2.5-7B-Instruct                    yes
Llama 3.x   NousResearch/Meta-Llama-3.1-8B-Instruct     yes
SmolLM2     HuggingFaceTB/SmolLM2-135M                  yes
Gemma 2     google/gemma-2-9b                           expected

The cache-level integration works with any HuggingFace causal LM that uses DynamicCache (transformers >= 4.40). RoPE-based architectures with grouped query attention are the primary target.

For non-LLM transformers (ViT, ESMFold, VideoMAE, AlphaFold) see the modules in spectralquant.integrations. Vision transformers can actually see a quality improvement over FP16 because the eigenspectral filtering removes noise in the low-variance directions.

Hardware

GPU                          memory      recommended for
H100 / H200                  80–141 GB   7B, 13B, 70B inference, batch decode
A100 80 GB                   80 GB       7B and 13B inference
A100 40 GB / A6000           40–48 GB    7B inference, short context
RTX 4090 / 3090              24 GB       7B inference at FP16, short context
T4 / RTX 4080 / RTX 3060     12–16 GB    smaller models, demo runs
CPU                          n/a         works, but slow

The compression ratios above were measured on an H200 with Mistral 7B and Qwen 2.5 7B at sequence length 512. Compression is sequence-length agnostic, so the ratios hold at longer contexts. Speed gains grow with context length because the FP16 baseline slows down while SpectralQuant decode cost stays linear in sequence length.

Generating with a pre-compressed prefix

Useful when you want to keep one compressed cache and reuse it across many completions of the same long prefix.

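import copy

result = engine.compress_prefill(model, tok, long_prefix)
cache  = result["cache"]                 # a fresh DynamicCache, FP16 surface
print(f"prefix compression: {result['stats']['ratio']:.2f}x")

# Reuse the cache as past_key_values for any number of follow-ups.
# generate() extends a DynamicCache in place, so pass a copy each time,
# and tokenize the prefix together with the new text so positions line up
# (this mirrors the prefix-caching pattern in the transformers docs).
inputs = tok(long_prefix + question, return_tensors="pt").to(model.device)
ids = model.generate(
    **inputs,
    past_key_values=copy.deepcopy(cache),
    max_new_tokens=200,
)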

Custom calibration

The bundled corpus works for general English. For domain-specific workloads (code, biomedical text, legal filings), pass your own:

my_corpus = [...]   # 32–128 representative samples
engine = sq.SpectralQuant(compression="high")
engine.calibrate(model, tok, my_corpus)

Calibration takes a few seconds on an H200. You can persist it once and reload it in any future process:

engine.save_calibration("/path/to/calib")
fresh = sq.SpectralQuant(compression="high")
fresh.load_calibration("/path/to/calib", head_dim=128)

How it works (one paragraph)

For each attention head, calibration accumulates the key and value covariance matrices and eigendecomposes them. The eigenvectors define a per-head rotation that aligns coordinates with directions of decreasing variance. After rotation, a water-filling allocator distributes bits across coordinates so that high-variance dimensions get more bits and tail dimensions get fewer. Two bit budgets are used: a "semantic" budget (avg_bits) for the high-variance band and a "tail" budget (noise_bits for keys, value_noise_bits for values) for the rest. Each coordinate is quantized with a Lloyd-Max scalar codebook fit to a Gaussian whose variance equals that coordinate's eigenvalue. Decode dequantizes and rotates back, and the rest of attention proceeds at full FP16. The math is in engine.py.
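Two ingredients of this paragraph are textbook techniques, and a standalone sketch may help. The code below is not taken from engine.py. It assumes classic reverse water-filling for the bit allocation and a plain Lloyd iteration on Gaussian samples for the codebook; the function names and the toy eigenvalue spectrum are made up. The package's allocator also works with separate band budgets and integer bit widths, which this sketch ignores.

```python
import math
import random

def water_fill_bits(variances, total_bits, iters=100):
    """Reverse water-filling: bits_i = max(0, 0.5 * log2(var_i / theta))."""
    def bits_at(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    # Bits spent fall as the water level theta rises, so bisect on theta.
    lo, hi = 1e-12 * max(variances), max(variances)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # geometric midpoint; theta spans decades
        if sum(bits_at(mid)) > total_bits:
            lo = mid                      # over budget: raise the water level
        else:
            hi = mid
    return bits_at(hi)

def lloyd_max_levels(samples, n_levels, iters=30):
    """1-D Lloyd iteration: assign samples to the nearest level, then re-centre."""
    data = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    levels = [data[(2 * k + 1) * len(data) // (2 * n_levels)] for k in range(n_levels)]
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for x in data:
            nearest = min(range(n_levels), key=lambda k: abs(x - levels[k]))
            buckets[nearest].append(x)
        levels = [sum(b) / len(b) if b else old for b, old in zip(buckets, levels)]
    return levels

# Toy eigenvalue spectrum (made up): four strong directions and a flat tail.
eigenvalues = [4.0, 1.0, 0.25, 0.0625, 0.001, 0.001, 0.001, 0.001]
bits = water_fill_bits(eigenvalues, total_bits=8.0)
print(" ".join(f"{b:.2f}" for b in bits))
# -> 3.50 2.50 1.50 0.50 0.00 0.00 0.00 0.00

# A 2-bit (4-level) codebook for the second coordinate, fit to Gaussian samples
# whose variance equals that coordinate's eigenvalue.
rng = random.Random(0)
sigma = math.sqrt(eigenvalues[1])
samples = [rng.gauss(0.0, sigma) for _ in range(5000)]
codebook = lloyd_max_levels(samples, n_levels=4)
print([round(level, 3) for level in codebook])
```

Each factor of four in variance earns one extra bit, and coordinates whose variance falls below the water level get none. For a unit-variance Gaussian, the four Lloyd-Max levels settle near ±0.45 and ±1.51.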

Demo notebook

A full end-to-end notebook is included at notebooks/spectralquant_demo.ipynb. It walks through:

  1. Install + GPU sanity check
  2. The three presets
  3. Loading Mistral 7B
  4. Side-by-side FP16 vs SpectralQuant on four diverse prompts, for each preset
  5. Power-user override
  6. Custom calibration
  7. Final summary table
  8. Save / load round-trip

To run it on a fresh GPU instance:

unzip -oq spectralquant.zip -d spectralquant
pip install -e ./spectralquant
jupyter notebook spectralquant/notebooks/spectralquant_demo.ipynb

API surface

sq.SpectralQuant(
    compression="standard" | "high" | "max",
    device=None,                       # "cuda" | "mps" | "cpu" | None (auto)
    head_dim=None,                     # inferred from model
    avg_bits=None, noise_bits=None,
    value_noise_bits=None,
    d_eff_variance=None,
)

engine.generate(model, tokenizer, prompt, *, max_new_tokens=128, ...)
engine.compress_prefill(model, tokenizer, prompt)
engine.calibrate(model, tokenizer, calibration_texts=None)
engine.compression_stats()
engine.save_calibration(path)
engine.load_calibration(path, head_dim=128)

The lower-level sq.SpectralQuantEngine is also exported for users who want direct access to per-head bit allocations or to use the legacy attention-level monkey-patch path.

Measuring quality

The package reports the following metrics in engine.compression_stats() and in the stats field returned by .generate(...):

  • ratio — observed prefix-cache compression vs FP16 (bytes / bytes)
  • tokens_per_second — measured decode throughput
  • seconds — wall clock for the decode step
  • compressed_bytes, fp16_bytes — raw byte counts
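As a small illustration of how these fields relate, the sketch below recomputes the headline numbers from the raw ones. It assumes that ratio is fp16_bytes divided by compressed_bytes and that throughput is new tokens divided by seconds. The summarize helper and the input values are made up, not measured results.

```python
def summarize(stats, new_tokens):
    """Derive headline numbers from raw byte counts and decode wall-clock time."""
    return {
        "ratio": stats["fp16_bytes"] / stats["compressed_bytes"],
        "tokens_per_second": new_tokens / stats["seconds"],
        "saved_mib": (stats["fp16_bytes"] - stats["compressed_bytes"]) / 2**20,
    }

# Made-up values for illustration only.
example = {"fp16_bytes": 64 * 2**20, "compressed_bytes": 10 * 2**20, "seconds": 4.0}
summary = summarize(example, new_tokens=120)
print(f"{summary['ratio']:.2f}x, {summary['tokens_per_second']:.1f} tok/s, "
      f"{summary['saved_mib']:.0f} MiB saved")
# -> 6.40x, 30.0 tok/s, 54 MiB saved
```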

For independent quality validation you can run perplexity on WikiText:

python examples/run_perplexity.py --model mistralai/Mistral-7B-Instruct-v0.3

Or sweep parameters to find the sweet spot for a model not in our test set:

python examples/sweep_compression.py --model <hf_repo>
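When comparing runs, it helps to recall that perplexity is the exponential of the mean per-token negative log-likelihood, so small shifts in log-probability show up as small relative changes. The sketch below uses made-up log-probabilities and a hypothetical perplexity helper; it does not reproduce what examples/run_perplexity.py does internally.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up per-token log-probabilities for the same text under two cache settings.
fp16_lp = [-1.20, -0.35, -2.10, -0.80]
compressed_lp = [-1.22, -0.36, -2.15, -0.80]

ppl_fp16, ppl_sq = perplexity(fp16_lp), perplexity(compressed_lp)
print(f"FP16 {ppl_fp16:.3f}, compressed {ppl_sq:.3f}, "
      f"relative change {100 * (ppl_sq / ppl_fp16 - 1):+.2f}%")
```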

Authors

Anirudh Bharadwaj Vangara and Ashwin Gopinath.

Bug reports, feature requests, and pull requests are welcome on GitHub.

License

MIT.

Citation

@misc{spectralquant2026,
  title  = {SpectralQuant: Eigenspectral KV Cache Compression},
  author = {Vangara, Anirudh Bharadwaj and Gopinath, Ashwin},
  year   = {2026},
}
