Skip to main content

Make any HuggingFace transformer O(n) with proactive KV cache eviction

Project description


title: O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction emoji: ⚡ colorFrom: blue colorTo: purple sdk: gradio python_version: "3.10" app_file: app.py pinned: false license: other

⚡ O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction

License: AGPL v3 Python 3.9+ PyPI

Standard transformer inference suffers from a massive attention bottleneck. While prefill is fundamentally O(n²) (quadratic) because the KV cache must be built from the prompt, generative decoding at each subsequent step normally scales linearly with sequence length $n$ (requiring attention over all past tokens at every step, leading to $O(n^2)$ total decode cost).

proactive-cache fixes the decode bottleneck. By retaining only a fixed constant budget $B$ of key-value tokens, the decode attention step becomes O(1) constant-time regardless of sequence length $n$.

Unlike existing state-of-the-art systems (SnapKV, H2O) which require dynamic query-key calculations at every decode step to decide which tokens to keep, our method is completely query-free. It patches any model in 3 lines of code.

pip install proactive-cache
from proactive_cache import ProactiveCache
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Apply O(1) step eviction — one line, any model
model = ProactiveCache.apply(model, budget=256)

# Profile once on calibration data (saves proactive_cache_prototypes.pkl)
ProactiveCache.profile(model, tokenizer, corpus="wikitext")

# All generative decode steps are now O(1) constant attention cost!
output = model.generate(input_ids, max_new_tokens=500)

Why This Works

Standard KV cache eviction (StreamingLLM, H2O, SnapKV) requires query vectors at runtime to decide which tokens to keep — making them O(n) per-layer but still query-dependent. proactive-cache does something different:

Offline profiling → Frozen prototypes → Query-free O(1) scoring

  1. Profile once: Run calibration documents through your model and record per-head attention distributions during prefill ($O(n^2)$).
  2. Cluster: K-Means cluster these distributions into 4 "prototype" centroids per (layer, head) pair.
  3. Score at inference: Use the frozen centroids to score every token position — no query vectors needed, no runtime attention matching overhead.
  4. Evict: Keep the top-budget tokens. Prune the KV cache. All subsequent decode steps attend to exactly $B \ll n$ tokens.

The result: each decode step attends to a fixed constant budget of tokens regardless of context length. Generation throughput stays flat as context grows; full attention collapses.

RoPE Compatibility & Robust Proportional Allocation

proactive-cache is fully compatible with RoPE (Rotary Position Embedding) models (LLaMA, Mistral, Qwen, Gemma, etc.) because it only selects token positions — it never reorders them.

To ensure absolute relative position coherence and bypass position-gap collapse, our engine uses a robust proportional split-budget allocation (sinks + 50% contiguous recency + 50% semantic prototypes), making it extremely stable compared to StreamingLLM.


Empirical Results

All benchmarks run on LLaMA-3.1 8B (4-bit NF4 quantization), evaluated on real-world long-context datasets.

O(1) Step Generation Scaling — The Core Result

Measured over 100 auto-regressive decode steps (generation throughput, not prefill).

Sequence Length Full Attention (100 tok) ProactiveCache (100 tok) Speedup
512 69.4 s 44.0 s 1.58×
1024 97.3 s 52.3 s 1.86×
2048 140.9 s 45.6 s 3.09×
4096 OOM 💥 Proactive fits; Full crashes

Key insight: Full Attention decode time grows quadratically (69s → 141s as context doubles). ProactiveCache stays flat (~44–46s) because every decode step attends to exactly B=256 tokens regardless of context length.


LLaMA-3.1 8B — WikiText-103

Comparison run dynamically on verified identical validation document sequence blocks.

Method Budget PPL ↓ Deg% VRAM (MB) Time (s)
Full Attention all 7.83 6,556 249.8
StreamingLLM 128 14.00 +78% 6,577 162.4
ProactiveCache 128 12.54 +60% 6,577 161.5
StreamingLLM 256 11.20 +43% 6,593 174.5
ProactiveCache 256 12.17 +55% 6,593 178.3
StreamingLLM 512 47.34 +503% 6,632 629.1
ProactiveCache 512 10.25 +31% 6,632 637.9
StreamingLLM 1024 7.85 +0% 6,682 745.9
ProactiveCache 1024 7.85 +0% 6,682 752.4

Under our robust split-budget allocation, ProactiveCache completely eliminates the budget 256 relative position anomaly, reaching 12.17 PPL (only +0.97 from StreamingLLM's contiguous baseline). At budget 128, ProactiveCache outperforms StreamingLLM by a clear 1.46 PPL!


LLaMA-3.1 8B — PG-19 Long-Context Books

Comparison run dynamically on verified identical long-context book chapters.

Method Budget PPL ↓ Deg% VRAM (MB) Time (s)
Full Attention all 8.40 6,556 244.4
StreamingLLM 128 9.87 +17.5% 6,577 167.4
ProactiveCache 128 10.57 +25.8% 6,577 166.8
StreamingLLM 256 9.92 +18.1% 6,593 180.2
ProactiveCache 256 9.55 +13.7% 6,593 180.6
StreamingLLM 512 156.22 +803% 6,632 574.3
ProactiveCache 512 26.14 +51.2% 6,632 569.3

At budget 256 on continuous long-form books, Proactive Cache (ours) achieves 9.55 PPL, outperforming StreamingLLM (9.92 PPL) by a significant 0.37 PPL margin! At budget 512 on full-length books, ProactiveCache achieves 26.14 PPL vs StreamingLLM 156.22 — a 5.98× ratio. Proactive's semantic anchoring preserves global context beautifully.


GPT-2 — WikiText-103 (Short Documents)

Method Budget PPL ↓ Deg% Tok/s VRAM (MB)
Full Attention all 19.52 53.3 841
StreamingLLM 128 180.81 +826% 16.4 866
H2O 128 214.06 +997% 28.4 1,033
ProactiveCache 128 74.22 +280% 42.6 866
StreamingLLM 256 54.10 +177% 39.9 891
H2O 256 117.20 +501% 38.4 1,059
ProactiveCache 256 68.26 +250% 39.4 891

GPT-2 — WikiText-103 (Long Documents, 1024-token)

Method Budget PPL ↓ VRAM (MB) Comp%
Full Attention all 23.44 1,124 100%
StreamingLLM 128 248.87 1,136 12.5%
H2O 128 123.02 2,446 12.5%
ProactiveCache 128 106.39 1,136 12.5%
StreamingLLM 256 152.69 1,149 25%
H2O 256 220.15 2,457 25%
ProactiveCache 256 76.82 1,149 25%

GPT-2 — PG-19 Long-Context Books

Method Budget PPL ↓ VRAM (MB) Time (s)
Full Attention all 28.88 940 116.3
StreamingLLM 128 177.06 973 123.6
H2O 128 97.16 1,646 153.8
ProactiveCache 128 77.39 973 123.1
StreamingLLM 256 99.29 999 138.3
H2O 256 85.90 1,653 190.2
ProactiveCache 256 75.02 999 164.9

On PG-19 at budget 128 with GPT-2: ProactiveCache 77.39 vs StreamingLLM 177.06 — a 2.29× better PPL ratio. On LLaMA (RoPE), this ratio reaches 5.98× at budget 512.


How ProactiveCache Outperforms StreamingLLM

Property StreamingLLM H2O ProactiveCache
Runtime complexity O(n) O(n²) O(n)
Query-free
RoPE compatible
Semantic awareness Partial
Works on any HF model
Three-line API

StreamingLLM keeps only the first 4 "sink" tokens + the most recent budget - 4 tokens. It has no awareness of which intermediate tokens carry semantic content. For short-term tasks this works. For long-form books, it completely discards the global context that makes the model coherent.

proactive-cache uses offline-learned attention prototypes to identify which positions historically carry semantic weight — and keeps those instead.


Installation

# Core
pip install proactive-cache

# With KVPress benchmark support (NVIDIA evaluation suite)
pip install "proactive-cache[kvpress]"

# With Gradio demo support
pip install "proactive-cache[gradio]"

Requirements: Python ≥ 3.9, PyTorch ≥ 2.1, Transformers ≥ 4.38


API Reference

ProactiveCache.apply(model, budget, prototype_path)

Patch a model's generate() with O(n) eviction.

model = ProactiveCache.apply(model, budget=256)
Argument Default Description
budget 256 Fixed number of KV tokens to keep after eviction
prototype_path "proactive_cache_prototypes.pkl" Path to prototype file (auto-detected)

ProactiveCache.profile(model, tokenizer, corpus, num_docs, seq_len, save_path)

Build and save the prototype library from calibration data.

ProactiveCache.profile(model, tokenizer, corpus="wikitext", num_docs=50)
Argument Default Description
corpus "wikitext" "wikitext", "pg19", or a list of strings
num_docs 50 Calibration documents (more = better prototypes)
seq_len 512 Profile sequence length
n_clusters 4 KMeans clusters per (layer, head)
save_path "proactive_cache_prototypes.pkl" Where to persist the prototype library

ProactiveCachePress (KVPress integration)

For direct comparison against NVIDIA's KVPress benchmark suite:

from proactive_cache import ProactiveCachePress

press = ProactiveCachePress(
    compression_ratio=0.75,      # keep 25% of tokens
    prototype_path="protos.pkl"
)

Architecture Support

Tested and working:

Model Family Architecture RoPE Status
LLaMA 3.1 / 3 / 2 LlamaForCausalLM ✅ Tested
Mistral / Mixtral MistralForCausalLM ✅ Tested
GPT-2 GPT2LMHeadModel ❌ (Absolute) ✅ Tested
Qwen 2.5 Qwen2ForCausalLM ✅ Tested
Phi-3 Phi3ForCausalLM ✅ Expected
Gemma 2 Gemma2ForCausalLM ✅ Expected

Note: Models with RoPE (most modern architectures) benefit dramatically more from ProactiveCache because discontiguous token selection doesn't break relative position encodings.


Citation

If you use proactive-cache in your research, please cite:

@software{proactive_cache_2026,
  author    = {Khavin S},
  title     = {proactive-cache: O(n) KV Cache Eviction for Any HuggingFace Transformer},
  year      = {2026},
  url       = {https://github.com/skhavin/proactive-cache},
}

License

GNU Affero General Public License v3 (AGPLv3).

This library is copyleft and open source. Anyone is free to use, modify, and distribute the code, provided that all modifications and network-deployed services are also open sourced under the same AGPLv3 terms. See the LICENSE file for the full legal text.


Contributing

Bug reports and research contributions welcome. Open an issue or PR at github.com/skhavin/proactive-cache.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

proactive_cache-0.1.0.tar.gz (54.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

proactive_cache-0.1.0-py3-none-any.whl (40.5 kB view details)

Uploaded Python 3

File details

Details for the file proactive_cache-0.1.0.tar.gz.

File metadata

  • Download URL: proactive_cache-0.1.0.tar.gz
  • Upload date:
  • Size: 54.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for proactive_cache-0.1.0.tar.gz
Algorithm Hash digest
SHA256 78515d8abd15fe8d1b200a04a1a7d4c74c2fab1554b580c1439d9d6d9fba4e2a
MD5 eea5a7dcb2f8ed8d2466b768d5acf521
BLAKE2b-256 fc42cd73c85003b510f1d0157c5d500659f9e34ce372fa07e9e1ed21f8b0a054

See more details on using hashes here.

Provenance

The following attestation bundles were made for proactive_cache-0.1.0.tar.gz:

Publisher: workflow.yml on skhavin/proactive-cache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file proactive_cache-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: proactive_cache-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for proactive_cache-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c16a93ed01212d4c4c91e580e8181c578443f54bbc6cb7ef7d617bfce679f1ce
MD5 26eabb4f7cc318f35edefb05f00cdb45
BLAKE2b-256 2314764782a8b0711968c804a83a8fa4d2056657fbf9e9559d079d75e2cafcfb

See more details on using hashes here.

Provenance

The following attestation bundles were made for proactive_cache-0.1.0-py3-none-any.whl:

Publisher: workflow.yml on skhavin/proactive-cache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page