spectralquant

Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression with FP16-equivalent quality, drop-in for HuggingFace LLMs and vision transformers.

These details have not been verified by PyPI

SpectralQuant

Eigenspectral KV cache compression for transformer inference. Up to 6.55x compression of the KV cache with FP16-equivalent output quality.

pip install spectralquant

What it does

Modern LLM inference is often bottlenecked by the size of the KV cache. The cache grows linearly with sequence length and, at long context, can consume more memory than the model weights themselves. SpectralQuant compresses that cache by exploiting the fact that, after a per-head spectral rotation, only a small number of dimensions carry most of the information.

A short calibration step measures the eigenstructure of each attention head. Each head's keys and values are then split into a high-variance "semantic" band and a low-variance "tail" band. The semantic band gets a generous bit budget; the tail gets one or two bits. Total cache size shrinks by up to 6.55x, with output quality indistinguishable from FP16.
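The band split described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the package's implementation: the function name `spectral_bands`, the synthetic data, and the 0.93 variance threshold are all chosen for the example.

```python
import numpy as np

def spectral_bands(x, variance_kept=0.93):
    """Return (rotation, eigenvalues, d_eff) for one attention head.

    x has shape (num_tokens, head_dim) and holds that head's keys or values.
    d_eff is the smallest number of leading directions whose eigenvalues
    cover at least `variance_kept` of the total variance.
    """
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # flip to descending
    eigvals = np.clip(eigvals, 0.0, None)             # guard tiny negative values
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    d_eff = min(int(np.searchsorted(cumulative, variance_kept)) + 1, len(eigvals))
    return eigvecs, eigvals, d_eff

# Synthetic head: 8 strong directions hidden in 64 dims by a random rotation.
rng = np.random.default_rng(0)
scales = np.concatenate([np.full(8, 4.0), np.full(56, 0.1)])
mix, _ = np.linalg.qr(rng.normal(size=(64, 64)))
keys = (rng.normal(size=(4096, 64)) * scales) @ mix.T

rotation, eigvals, d_eff = spectral_bands(keys)
rotated = keys @ rotation      # columns are now ordered by decreasing variance
print(d_eff)                   # size of the "semantic" band for this toy head
```

In this toy setup the first `d_eff` columns of `rotated` play the role of the semantic band and the remaining columns form the tail.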

The package ships pure-PyTorch kernels and HuggingFace integrations. There are no custom CUDA dependencies. It runs anywhere torch runs.

Quickstart

import torch
import spectralquant as sq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

engine = sq.SpectralQuant(compression="high")  # 6.55x preset

out = engine.generate(
    model, tok,
    "Explain water-filling bit allocation in two sentences.",
    max_new_tokens=120,
)

print(out["text"])
print(f"{out['stats']['ratio']:.2f}x compression, "
      f"{out['stats']['tokens_per_second']:.1f} tok/s")

The first call to engine.generate(...) runs a one-time calibration with a bundled 64-sentence corpus. Subsequent calls reuse it. You can also pass your own domain-specific corpus.

Compression presets

print(sq.describe_presets())

preset	ratio	risk	notes
`standard`	5.95x	safe	Paper baseline. Production default.
`high`	6.55x	safe	Validated on Mistral 7B and Qwen 2.5 7B.
`max`	6.68x	edge	First paragraph clean. Light repetition possible.

You can also override individual dials when you need them:

engine = sq.SpectralQuant(
    compression="high",
    d_eff_variance=0.93,   # override one knob
)

The dials are avg_bits, noise_bits, value_noise_bits, and d_eff_variance. Anything unset falls back to the named preset.
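As a rough guide to how the dials interact, the arithmetic below estimates the compression ratio for one head from the size of the semantic band and the two bit budgets. It is a back-of-the-envelope sketch: `estimated_ratio` is a hypothetical helper, the dial values are made up for readability, and codebook or metadata overhead is ignored.

```python
def estimated_ratio(head_dim, d_eff, avg_bits, noise_bits, baseline_bits=16):
    """Estimate cache compression for one head, ignoring metadata overhead.

    d_eff coordinates are stored at avg_bits each (semantic band);
    the remaining coordinates are stored at noise_bits each (tail band).
    """
    if not 0 <= d_eff <= head_dim:
        raise ValueError("d_eff must lie between 0 and head_dim")
    tail_dims = head_dim - d_eff
    total_bits = d_eff * avg_bits + tail_dims * noise_bits
    return baseline_bits * head_dim / total_bits

# Hypothetical dial values, chosen only to make the arithmetic easy to follow.
print(f"{estimated_ratio(head_dim=128, d_eff=32, avg_bits=4, noise_bits=2):.2f}x")
# prints 6.40x
```

Under this model, raising d_eff_variance widens the semantic band (a larger `d_eff`) and therefore lowers the ratio, while lowering noise_bits raises it.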

Supported models

Tested and verified (Gemma 2 is expected to work but has not been verified):

family	example	works
Mistral	`mistralai/Mistral-7B-Instruct-v0.3`	yes
Qwen 2.5	`Qwen/Qwen2.5-7B-Instruct`	yes
Llama 3.x	`NousResearch/Meta-Llama-3.1-8B-Instruct`	yes
SmolLM2	`HuggingFaceTB/SmolLM2-135M`	yes
Gemma 2	`google/gemma-2-9b`	expected

The cache-level integration works with any HuggingFace causal LM that uses DynamicCache (transformers >= 4.40). RoPE-based architectures with grouped query attention are the primary target.

For non-LLM transformers (ViT, ESMFold, VideoMAE, AlphaFold) see the modules in spectralquant.integrations. Vision transformers can actually see a quality improvement over FP16 because the eigenspectral filtering removes noise in the low-variance directions.

Hardware

GPU	memory	recommended for
H100 / H200	80–141 GB	7B, 13B, 70B inference, batch decode
A100 80 GB	80 GB	7B and 13B inference
A100 40 GB / A6000	40–48 GB	7B inference, short context
RTX 4090 / 3090	24 GB	7B inference at FP16, short context
T4 / RTX 4080 / RTX 3060	12–16 GB	smaller models, demo runs
CPU	n/a	works, but slow

The compression ratios above were measured on an H200 with Mistral 7B and Qwen 2.5 7B at sequence length 512. Compression is sequence-length agnostic, so the ratios hold at longer contexts. Speed gains grow with context length because the FP16 baseline slows down as the cache grows, while SpectralQuant decode stays linear.
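To see what a fixed ratio means in memory terms, the sketch below computes the FP16 KV cache size for a Mistral 7B shaped model (32 layers, 8 KV heads, head dimension 128) at a few context lengths and divides by the 6.55x ratio of the high preset. `kv_cache_bytes` is a hypothetical helper written for this example, and the compressed figure is a simple division rather than a measurement.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """FP16 KV cache size: keys plus values for every layer, KV head, and token."""
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)

# Mistral 7B shape: 32 layers, 8 KV heads (grouped query attention), head_dim 128.
for seq_len in (512, 8192, 32768):
    fp16 = kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128)
    compressed = fp16 / 6.55   # ratio of the "high" preset, applied as-is
    print(f"{seq_len:>6} tokens: {fp16 / 2**20:8.1f} MiB FP16 "
          f"-> {compressed / 2**20:7.1f} MiB")
```

At this shape the FP16 cache costs 128 KiB per token, so it reaches 4 GiB at 32768 tokens for a single sequence, and batch decoding multiplies that by the batch size.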

Generating with a pre-compressed prefix

Useful when you want to keep one compressed cache and reuse it across many completions of the same long prefix.

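import copy

result = engine.compress_prefill(model, tok, long_prefix)
cache  = result["cache"]                 # a fresh DynamicCache, FP16 surface
print(f"prefix compression: {result['stats']['ratio']:.2f}x")

# Use cache as past_key_values for any number of follow-ups.
# generate() extends the cache in place, so hand each follow-up its own copy.
# The inputs repeat the prefix so that input_ids line up with the cached
# positions (the pattern shown in the transformers prefix-caching docs).
inputs = tok(long_prefix + question, return_tensors="pt").to(model.device)
ids = model.generate(
    **inputs,
    past_key_values=copy.deepcopy(cache),
    max_new_tokens=200,
)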

Custom calibration

The bundled corpus works for general English. For domain-specific workloads (code, biomedical text, legal filings), pass your own:

my_corpus = [...]   # 32–128 representative samples
engine = sq.SpectralQuant(compression="high")
engine.calibrate(model, tok, my_corpus)
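One way to assemble such a corpus from a larger pile of documents is sketched below. `build_calibration_corpus` is a hypothetical helper, not part of the package, and the length limits are arbitrary defaults; the only constraint taken from this README is the 32 to 128 sample range.

```python
import random

def build_calibration_corpus(documents, num_samples=64, min_chars=200,
                             max_chars=2000, seed=0):
    """Pick a reproducible, de-duplicated sample of documents for calibration."""
    seen = set()
    candidates = []
    for doc in documents:
        text = " ".join(doc.split())      # collapse runs of whitespace
        if len(text) < min_chars:
            continue                      # too short to be representative
        text = text[:max_chars]           # cap very long documents
        if text in seen:
            continue                      # skip exact duplicates
        seen.add(text)
        candidates.append(text)
    rng = random.Random(seed)             # fixed seed keeps the sample stable
    rng.shuffle(candidates)
    return candidates[:num_samples]

# Usage (documents is your own list of strings):
# my_corpus = build_calibration_corpus(documents, num_samples=64)
```

A fixed seed makes the selection repeatable, which helps when comparing calibrations across runs.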

Calibration takes a few seconds on an H200. You can persist it once and reload it in any future process:

engine.save_calibration("/path/to/calib")
fresh = sq.SpectralQuant(compression="high")
fresh.load_calibration("/path/to/calib", head_dim=128)

How it works (one paragraph)

For each attention head, calibration accumulates the key and value covariance matrices and eigendecomposes them. The eigenvectors define a per-head rotation that aligns coordinates with directions of decreasing variance. After rotation, a water-filling allocator distributes bits across coordinates so that high-variance dimensions get more bits and tail dimensions get fewer. Two bit budgets are used: a "semantic" budget (avg_bits) for the high-variance band and a "tail" budget (noise_bits, value_noise_bits) for the rest. Each coordinate is quantized with a Lloyd-Max scalar codebook fit to a Gaussian whose variance equals that coordinate's eigenvalue. Decode dequantizes, rotates back, and the rest of attention proceeds at full FP16. The math is in engine.py.
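The allocation and codebook steps can be illustrated with a small standard-library sketch. It is not the code in engine.py: `allocate_bits` is a greedy integer form of water-filling that assumes the usual high-resolution model (b bits on a coordinate of variance v leave distortion about v * 4**-b), and `lloyd_max_gaussian` runs the textbook Lloyd iteration with closed-form Gaussian centroids. Both function names and the toy eigenvalues are assumptions made for the example.

```python
import heapq
import math

def allocate_bits(eigenvalues, total_bits, max_bits=8):
    """Greedy integer water-filling: each step gives one more bit to the
    coordinate whose modeled distortion v * 4**(-bits) is currently largest."""
    bits = [0] * len(eigenvalues)
    heap = [(-v, i) for i, v in enumerate(eigenvalues) if v > 0]
    heapq.heapify(heap)
    for _ in range(total_bits):
        if not heap:
            break
        neg_distortion, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (neg_distortion / 4.0, i))
    return bits

def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def lloyd_max_gaussian(bits, variance=1.0, iterations=200):
    """Lloyd-Max reconstruction levels for a zero-mean Gaussian."""
    n = 2 ** bits
    levels = [-2.0 + 4.0 * (k + 0.5) / n for k in range(n)]  # initial guess
    for _ in range(iterations):
        # Decision edges are midpoints between neighbouring levels.
        mids = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Each level moves to the centroid of the Gaussian mass in its cell.
        levels = [(_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
                  for lo, hi in zip(edges, edges[1:])]
    scale = math.sqrt(variance)
    return [scale * v for v in levels]

def quantize(x, levels):
    """Map x to its nearest reconstruction level."""
    return min(levels, key=lambda v: abs(v - x))

eigenvalues = [16.0, 4.0, 1.0, 0.25]           # toy spectrum after rotation
bits = allocate_bits(eigenvalues, total_bits=10)
print(bits)                                     # [4, 3, 2, 1]
codebooks = [lloyd_max_gaussian(b, v) for b, v in zip(bits, eigenvalues)]
print([round(v, 3) for v in codebooks[3]])      # [-0.399, 0.399]
```

With this toy spectrum every coordinate ends at the same modeled distortion (0.0625), which is the "water level" the allocator is aiming for.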

Demo notebook

A full end-to-end notebook is included at notebooks/spectralquant_demo.ipynb. It walks through:

Install + GPU sanity check
The three presets
Loading Mistral 7B
Side-by-side FP16 vs SpectralQuant on four diverse prompts, for each preset
Power-user override
Custom calibration
Final summary table
Save / load round-trip

To run it on a fresh GPU instance:

unzip -oq spectralquant.zip -d spectralquant
pip install -e ./spectralquant
jupyter notebook spectralquant/notebooks/spectralquant_demo.ipynb

API surface

sq.SpectralQuant(
    compression="standard" | "high" | "max",
    device=None,                       # "cuda" | "mps" | "cpu" | None (auto)
    head_dim=None,                     # inferred from model
    avg_bits=None, noise_bits=None,
    value_noise_bits=None,
    d_eff_variance=None,
)

engine.generate(model, tokenizer, prompt, *, max_new_tokens=128, ...)
engine.compress_prefill(model, tokenizer, prompt)
engine.calibrate(model, tokenizer, calibration_texts=None)
engine.compression_stats()
engine.save_calibration(path)
engine.load_calibration(path, head_dim=128)

The lower-level sq.SpectralQuantEngine is also exported for users who want direct access to per-head bit allocations or to use the legacy attention-level monkey-patch path.

Measuring quality

The package reports the following metrics in engine.compression_stats() and in the stats field returned by .generate(...):

ratio — observed prefix-cache compression vs FP16 (bytes / bytes)
tokens_per_second — measured decode throughput
seconds — wall clock for the decode step
compressed_bytes, fp16_bytes — raw byte counts
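If you want to sanity-check the derived numbers against the raw counts, the relationships can be written out as below. This is an assumption about how the fields relate (ratio as FP16 bytes over compressed bytes, throughput as new tokens over decode seconds), and `summarize_run` is a hypothetical helper with made-up inputs.

```python
def summarize_run(fp16_bytes, compressed_bytes, new_tokens, seconds):
    """Recompute the derived stats from raw counts."""
    return {
        "ratio": fp16_bytes / compressed_bytes,        # bytes / bytes
        "tokens_per_second": new_tokens / seconds,     # decode throughput
        "seconds": seconds,
        "compressed_bytes": compressed_bytes,
        "fp16_bytes": fp16_bytes,
    }

# Hypothetical numbers, for illustration only.
stats = summarize_run(fp16_bytes=65_536_000, compressed_bytes=10_000_000,
                      new_tokens=120, seconds=4.0)
print(f"{stats['ratio']:.2f}x, {stats['tokens_per_second']:.1f} tok/s")
# prints 6.55x, 30.0 tok/s
```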

For independent quality validation you can run perplexity on WikiText:

python examples/run_perplexity.py --model mistralai/Mistral-7B-Instruct-v0.3

Or sweep parameters to find the sweet spot for a model not in our test set:

python examples/sweep_compression.py --model <hf_repo>
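For reference when reading the output of these scripts, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses that standard definition; it does not reproduce the example scripts, and both helper names and the perplexity values are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase(ppl_compressed, ppl_fp16):
    """Fractional perplexity change of the compressed run versus FP16."""
    return (ppl_compressed - ppl_fp16) / ppl_fp16

# A model that assigns probability 0.25 to every token has perplexity 4.
print(round(perplexity([math.log(0.25)] * 4), 6))

# Hypothetical perplexities, to show how the comparison is read.
print(f"{relative_increase(5.05, 5.00):.1%}")
```

Comparing the relative increase across presets gives a single number per preset that is easy to track when sweeping parameters.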

Authors

Anirudh Bharadwaj Vangara — anirudh@sentra.app
Ashwin Gopinath — ashwin@sentra.app

Bug reports, feature requests, and pull requests are welcome on GitHub.

License

MIT.

Citation

@misc{spectralquant2026,
  title  = {SpectralQuant: Eigenspectral KV Cache Compression},
  author = {Vangara, Anirudh Bharadwaj and Gopinath, Ashwin},
  year   = {2026},
}

Release history

0.3.1 (May 31, 2026)

0.3.0 (May 31, 2026), this version

Download files

Source Distribution

spectralquant-0.3.0.tar.gz (224.9 kB)

Uploaded May 31, 2026 Source

Built Distribution

spectralquant-0.3.0-py3-none-any.whl (74.3 kB)

Uploaded May 31, 2026 Python 3

File details

Details for the file spectralquant-0.3.0.tar.gz.

File metadata

Download URL: spectralquant-0.3.0.tar.gz
Upload date: May 31, 2026
Size: 224.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for spectralquant-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`f767d70d94f7afca54d08b53a7d740da837346b3bf86a588556cc1164172ee91`
MD5	`d7f4c5092df7e5dcbc3ec9012dd67dac`
BLAKE2b-256	`59e556a118e213fd2d1492d2c0382d23b3b468c10cb4bab99944cde9a6b37aff`

File details

Details for the file spectralquant-0.3.0-py3-none-any.whl.

File metadata

Download URL: spectralquant-0.3.0-py3-none-any.whl
Upload date: May 31, 2026
Size: 74.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for spectralquant-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ec11532604d86d7dbc9c87f5ba9fb208c0eba456f9f7014327532625b47f5e36`
MD5	`1fb904df4dff8af0e12c5ff51ebe5984`
BLAKE2b-256	`6a1fd04e72471ccc918fd49103804cd2a512b9b2484f456332824ebe1a410cb3`
