Make any HuggingFace transformer O(n) with proactive KV cache eviction
Project description
title: O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction emoji: ⚡ colorFrom: blue colorTo: purple sdk: gradio python_version: "3.10" app_file: app.py pinned: false license: other
⚡ O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction
Standard transformer inference suffers from a massive attention bottleneck. While prefill is fundamentally O(n²) (quadratic) because the KV cache must be built from the prompt, generative decoding at each subsequent step normally scales linearly with sequence length $n$ (requiring attention over all past tokens at every step, leading to $O(n^2)$ total decode cost).
proactive-cache fixes the decode bottleneck. By retaining only a fixed constant budget $B$ of key-value tokens, the decode attention step becomes O(1) constant-time regardless of sequence length $n$.
Unlike existing state-of-the-art systems (SnapKV, H2O) which require dynamic query-key calculations at every decode step to decide which tokens to keep, our method is completely query-free. It patches any model in 3 lines of code.
pip install proactive-cache
from proactive_cache import ProactiveCache
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# Apply O(1) step eviction — one line, any model
model = ProactiveCache.apply(model, budget=256)
# Profile once on calibration data (saves proactive_cache_prototypes.pkl)
ProactiveCache.profile(model, tokenizer, corpus="wikitext")
# All generative decode steps are now O(1) constant attention cost!
output = model.generate(input_ids, max_new_tokens=500)
Why This Works
Standard KV cache eviction (StreamingLLM, H2O, SnapKV) requires query vectors at runtime to decide which tokens to keep — making them O(n) per-layer but still query-dependent. proactive-cache does something different:
Offline profiling → Frozen prototypes → Query-free O(1) scoring
- Profile once: Run calibration documents through your model and record per-head attention distributions during prefill ($O(n^2)$).
- Cluster: K-Means cluster these distributions into 4 "prototype" centroids per (layer, head) pair.
- Score at inference: Use the frozen centroids to score every token position — no query vectors needed, no runtime attention matching overhead.
- Evict: Keep the top-budget tokens. Prune the KV cache. All subsequent decode steps attend to exactly $B \ll n$ tokens.
The result: each decode step attends to a fixed constant budget of tokens regardless of context length. Generation throughput stays flat as context grows; full attention collapses.
RoPE Compatibility & Robust Proportional Allocation
proactive-cache is fully compatible with RoPE (Rotary Position Embedding) models (LLaMA, Mistral, Qwen, Gemma, etc.) because it only selects token positions — it never reorders them.
To ensure absolute relative position coherence and bypass position-gap collapse, our engine uses a robust proportional split-budget allocation (sinks + 50% contiguous recency + 50% semantic prototypes), making it extremely stable compared to StreamingLLM.
Empirical Results
All benchmarks run on LLaMA-3.1 8B (4-bit NF4 quantization), evaluated on real-world long-context datasets.
O(1) Step Generation Scaling — The Core Result
Measured over 100 auto-regressive decode steps (generation throughput, not prefill).
| Sequence Length | Full Attention (100 tok) | ProactiveCache (100 tok) | Speedup |
|---|---|---|---|
| 512 | 69.4 s | 44.0 s | 1.58× |
| 1024 | 97.3 s | 52.3 s | 1.86× |
| 2048 | 140.9 s | 45.6 s | 3.09× |
| 4096 | OOM 💥 | — | Proactive fits; Full crashes |
Key insight: Full Attention decode time grows quadratically (69s → 141s as context doubles). ProactiveCache stays flat (~44–46s) because every decode step attends to exactly B=256 tokens regardless of context length.
LLaMA-3.1 8B — WikiText-103
Comparison run dynamically on verified identical validation document sequence blocks.
| Method | Budget | PPL ↓ | Deg% | VRAM (MB) | Time (s) |
|---|---|---|---|---|---|
| Full Attention | all | 7.83 | — | 6,556 | 249.8 |
| StreamingLLM | 128 | 14.00 | +78% | 6,577 | 162.4 |
| ProactiveCache | 128 | 12.54 | +60% | 6,577 | 161.5 |
| StreamingLLM | 256 | 11.20 | +43% | 6,593 | 174.5 |
| ProactiveCache | 256 | 12.17 | +55% | 6,593 | 178.3 |
| StreamingLLM | 512 | 47.34 | +503% | 6,632 | 629.1 |
| ProactiveCache | 512 | 10.25 | +31% | 6,632 | 637.9 |
| StreamingLLM | 1024 | 7.85 | +0% | 6,682 | 745.9 |
| ProactiveCache | 1024 | 7.85 | +0% | 6,682 | 752.4 |
Under our robust split-budget allocation, ProactiveCache completely eliminates the budget 256 relative position anomaly, reaching
12.17 PPL(only +0.97 from StreamingLLM's contiguous baseline). At budget 128, ProactiveCache outperforms StreamingLLM by a clear 1.46 PPL!
LLaMA-3.1 8B — PG-19 Long-Context Books
Comparison run dynamically on verified identical long-context book chapters.
| Method | Budget | PPL ↓ | Deg% | VRAM (MB) | Time (s) |
|---|---|---|---|---|---|
| Full Attention | all | 8.40 | — | 6,556 | 244.4 |
| StreamingLLM | 128 | 9.87 | +17.5% | 6,577 | 167.4 |
| ProactiveCache | 128 | 10.57 | +25.8% | 6,577 | 166.8 |
| StreamingLLM | 256 | 9.92 | +18.1% | 6,593 | 180.2 |
| ProactiveCache | 256 | 9.55 | +13.7% | 6,593 | 180.6 |
| StreamingLLM | 512 | 156.22 | +803% | 6,632 | 574.3 |
| ProactiveCache | 512 | 26.14 | +51.2% | 6,632 | 569.3 |
At budget 256 on continuous long-form books, Proactive Cache (ours) achieves 9.55 PPL, outperforming StreamingLLM (9.92 PPL) by a significant 0.37 PPL margin! At budget 512 on full-length books, ProactiveCache achieves 26.14 PPL vs StreamingLLM 156.22 — a 5.98× ratio. Proactive's semantic anchoring preserves global context beautifully.
GPT-2 — WikiText-103 (Short Documents)
| Method | Budget | PPL ↓ | Deg% | Tok/s | VRAM (MB) |
|---|---|---|---|---|---|
| Full Attention | all | 19.52 | — | 53.3 | 841 |
| StreamingLLM | 128 | 180.81 | +826% | 16.4 | 866 |
| H2O | 128 | 214.06 | +997% | 28.4 | 1,033 |
| ProactiveCache | 128 | 74.22 | +280% | 42.6 | 866 |
| StreamingLLM | 256 | 54.10 | +177% | 39.9 | 891 |
| H2O | 256 | 117.20 | +501% | 38.4 | 1,059 |
| ProactiveCache | 256 | 68.26 | +250% | 39.4 | 891 |
GPT-2 — WikiText-103 (Long Documents, 1024-token)
| Method | Budget | PPL ↓ | VRAM (MB) | Comp% |
|---|---|---|---|---|
| Full Attention | all | 23.44 | 1,124 | 100% |
| StreamingLLM | 128 | 248.87 | 1,136 | 12.5% |
| H2O | 128 | 123.02 | 2,446 | 12.5% |
| ProactiveCache | 128 | 106.39 | 1,136 | 12.5% |
| StreamingLLM | 256 | 152.69 | 1,149 | 25% |
| H2O | 256 | 220.15 | 2,457 | 25% |
| ProactiveCache | 256 | 76.82 | 1,149 | 25% |
GPT-2 — PG-19 Long-Context Books
| Method | Budget | PPL ↓ | VRAM (MB) | Time (s) |
|---|---|---|---|---|
| Full Attention | all | 28.88 | 940 | 116.3 |
| StreamingLLM | 128 | 177.06 | 973 | 123.6 |
| H2O | 128 | 97.16 | 1,646 | 153.8 |
| ProactiveCache | 128 | 77.39 | 973 | 123.1 |
| StreamingLLM | 256 | 99.29 | 999 | 138.3 |
| H2O | 256 | 85.90 | 1,653 | 190.2 |
| ProactiveCache | 256 | 75.02 | 999 | 164.9 |
On PG-19 at budget 128 with GPT-2: ProactiveCache 77.39 vs StreamingLLM 177.06 — a 2.29× better PPL ratio. On LLaMA (RoPE), this ratio reaches 5.98× at budget 512.
How ProactiveCache Outperforms StreamingLLM
| Property | StreamingLLM | H2O | ProactiveCache |
|---|---|---|---|
| Runtime complexity | O(n) | O(n²) | O(n) |
| Query-free | ✅ | ❌ | ✅ |
| RoPE compatible | ✅ | ✅ | ✅ |
| Semantic awareness | ❌ | Partial | ✅ |
| Works on any HF model | ✅ | ✅ | ✅ |
| Three-line API | ❌ | ❌ | ✅ |
StreamingLLM keeps only the first 4 "sink" tokens + the most recent budget - 4 tokens. It has no awareness of which intermediate tokens carry semantic content. For short-term tasks this works. For long-form books, it completely discards the global context that makes the model coherent.
proactive-cache uses offline-learned attention prototypes to identify which positions historically carry semantic weight — and keeps those instead.
Installation
# Core
pip install proactive-cache
# With KVPress benchmark support (NVIDIA evaluation suite)
pip install "proactive-cache[kvpress]"
# With Gradio demo support
pip install "proactive-cache[gradio]"
Requirements: Python ≥ 3.9, PyTorch ≥ 2.1, Transformers ≥ 4.38
API Reference
ProactiveCache.apply(model, budget, prototype_path)
Patch a model's generate() with O(n) eviction.
model = ProactiveCache.apply(model, budget=256)
| Argument | Default | Description |
|---|---|---|
budget |
256 |
Fixed number of KV tokens to keep after eviction |
prototype_path |
"proactive_cache_prototypes.pkl" |
Path to prototype file (auto-detected) |
ProactiveCache.profile(model, tokenizer, corpus, num_docs, seq_len, save_path)
Build and save the prototype library from calibration data.
ProactiveCache.profile(model, tokenizer, corpus="wikitext", num_docs=50)
| Argument | Default | Description |
|---|---|---|
corpus |
"wikitext" |
"wikitext", "pg19", or a list of strings |
num_docs |
50 |
Calibration documents (more = better prototypes) |
seq_len |
512 |
Profile sequence length |
n_clusters |
4 |
KMeans clusters per (layer, head) |
save_path |
"proactive_cache_prototypes.pkl" |
Where to persist the prototype library |
ProactiveCachePress (KVPress integration)
For direct comparison against NVIDIA's KVPress benchmark suite:
from proactive_cache import ProactiveCachePress
press = ProactiveCachePress(
compression_ratio=0.75, # keep 25% of tokens
prototype_path="protos.pkl"
)
Architecture Support
Tested and working:
| Model Family | Architecture | RoPE | Status |
|---|---|---|---|
| LLaMA 3.1 / 3 / 2 | LlamaForCausalLM | ✅ | ✅ Tested |
| Mistral / Mixtral | MistralForCausalLM | ✅ | ✅ Tested |
| GPT-2 | GPT2LMHeadModel | ❌ (Absolute) | ✅ Tested |
| Qwen 2.5 | Qwen2ForCausalLM | ✅ | ✅ Tested |
| Phi-3 | Phi3ForCausalLM | ✅ | ✅ Expected |
| Gemma 2 | Gemma2ForCausalLM | ✅ | ✅ Expected |
Note: Models with RoPE (most modern architectures) benefit dramatically more from ProactiveCache because discontiguous token selection doesn't break relative position encodings.
Citation
If you use proactive-cache in your research, please cite:
@software{proactive_cache_2026,
author = {Khavin S},
title = {proactive-cache: O(n) KV Cache Eviction for Any HuggingFace Transformer},
year = {2026},
url = {https://github.com/skhavin/proactive-cache},
}
License
GNU Affero General Public License v3 (AGPLv3).
This library is copyleft and open source. Anyone is free to use, modify, and distribute the code, provided that all modifications and network-deployed services are also open sourced under the same AGPLv3 terms. See the LICENSE file for the full legal text.
Contributing
Bug reports and research contributions welcome. Open an issue or PR at github.com/skhavin/proactive-cache.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file proactive_cache-0.3.1.tar.gz.
File metadata
- Download URL: proactive_cache-0.3.1.tar.gz
- Upload date:
- Size: 57.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
664b1bcf6ec590aec96c2dca4cd0892837daedad8058575dad028a90ff04b014
|
|
| MD5 |
2d20415f65e8a428221592e29da859d3
|
|
| BLAKE2b-256 |
3da0df1ed39af6ca6eb983a29246b2f99f71407f00d0a9609591333738a8659a
|
Provenance
The following attestation bundles were made for proactive_cache-0.3.1.tar.gz:
Publisher:
workflow.yml on skhavin/proactive-cache
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
proactive_cache-0.3.1.tar.gz -
Subject digest:
664b1bcf6ec590aec96c2dca4cd0892837daedad8058575dad028a90ff04b014 - Sigstore transparency entry: 1684135528
- Sigstore integration time:
-
Permalink:
skhavin/proactive-cache@5f6cc08873728bc0b2c5cd03be9b949d0eced842 -
Branch / Tag:
refs/tags/v0.3.1 - Owner: https://github.com/skhavin
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
workflow.yml@5f6cc08873728bc0b2c5cd03be9b949d0eced842 -
Trigger Event:
push
-
Statement type:
File details
Details for the file proactive_cache-0.3.1-py3-none-any.whl.
File metadata
- Download URL: proactive_cache-0.3.1-py3-none-any.whl
- Upload date:
- Size: 43.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7e82e04a8447016b728e3815dfeab14662502305979a20ceaed44ed90ea7fc54
|
|
| MD5 |
dd99b8067157ab973fe1b9a5348e36f9
|
|
| BLAKE2b-256 |
14a84a256b2e70c9a4e4b6d754af3b4e8329150d0ddf7df9a367cf291184335a
|
Provenance
The following attestation bundles were made for proactive_cache-0.3.1-py3-none-any.whl:
Publisher:
workflow.yml on skhavin/proactive-cache
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
proactive_cache-0.3.1-py3-none-any.whl -
Subject digest:
7e82e04a8447016b728e3815dfeab14662502305979a20ceaed44ed90ea7fc54 - Sigstore transparency entry: 1684135609
- Sigstore integration time:
-
Permalink:
skhavin/proactive-cache@5f6cc08873728bc0b2c5cd03be9b949d0eced842 -
Branch / Tag:
refs/tags/v0.3.1 - Owner: https://github.com/skhavin
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
workflow.yml@5f6cc08873728bc0b2c5cd03be9b949d0eced842 -
Trigger Event:
push
-
Statement type: