Make any HuggingFace transformer O(n) with proactive KV cache eviction

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

S-Khavin

These details have not been verified by PyPI

Project links

Paper

Project description

title: O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction emoji: ⚡ colorFrom: blue colorTo: purple sdk: gradio python_version: "3.10" app_file: app.py pinned: false license: other

⚡ O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction

Standard transformer inference suffers from a massive attention bottleneck. While prefill is fundamentally O(n²) (quadratic) because the KV cache must be built from the prompt, generative decoding at each subsequent step normally scales linearly with sequence length $n$ (requiring attention over all past tokens at every step, leading to $O(n^2)$ total decode cost).

proactive-cache fixes the decode bottleneck. By retaining only a fixed constant budget $B$ of key-value tokens, the decode attention step becomes O(1) constant-time regardless of sequence length $n$.

Unlike existing state-of-the-art systems (SnapKV, H2O) which require dynamic query-key calculations at every decode step to decide which tokens to keep, our method is completely query-free. It patches any model in 3 lines of code.

pip install proactive-cache

from proactive_cache import ProactiveCache
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Apply O(1) step eviction — one line, any model
model = ProactiveCache.apply(model, budget=256)

# Profile once on calibration data (saves proactive_cache_prototypes.pkl)
ProactiveCache.profile(model, tokenizer, corpus="wikitext")

# All generative decode steps are now O(1) constant attention cost!
output = model.generate(input_ids, max_new_tokens=500)

Why This Works

Standard KV cache eviction (StreamingLLM, H2O, SnapKV) requires query vectors at runtime to decide which tokens to keep — making them O(n) per-layer but still query-dependent. proactive-cache does something different:

Offline profiling → Frozen prototypes → Query-free O(1) scoring

Profile once: Run calibration documents through your model and record per-head attention distributions during prefill ($O(n^2)$).
Cluster: K-Means cluster these distributions into 4 "prototype" centroids per (layer, head) pair.
Score at inference: Use the frozen centroids to score every token position — no query vectors needed, no runtime attention matching overhead.
Evict: Keep the top-budget tokens. Prune the KV cache. All subsequent decode steps attend to exactly $B \ll n$ tokens.

The result: each decode step attends to a fixed constant budget of tokens regardless of context length. Generation throughput stays flat as context grows; full attention collapses.

RoPE Compatibility & Robust Proportional Allocation

proactive-cache is fully compatible with RoPE (Rotary Position Embedding) models (LLaMA, Mistral, Qwen, Gemma, etc.) because it only selects token positions — it never reorders them.

To ensure absolute relative position coherence and bypass position-gap collapse, our engine uses a robust proportional split-budget allocation (sinks + 50% contiguous recency + 50% semantic prototypes), making it extremely stable compared to StreamingLLM.

Empirical Results

All benchmarks run on LLaMA-3.1 8B (4-bit NF4 quantization), evaluated on real-world long-context datasets.

O(1) Step Generation Scaling — The Core Result

Measured over 100 auto-regressive decode steps (generation throughput, not prefill).

Sequence Length	Full Attention (100 tok)	ProactiveCache (100 tok)	Speedup
512	69.4 s	44.0 s	1.58×
1024	97.3 s	52.3 s	1.86×
2048	140.9 s	45.6 s	3.09×
4096	OOM 💥	—	Proactive fits; Full crashes

Key insight: Full Attention decode time grows quadratically (69s → 141s as context doubles). ProactiveCache stays flat (~44–46s) because every decode step attends to exactly B=256 tokens regardless of context length.

LLaMA-3.1 8B — WikiText-103

Comparison run dynamically on verified identical validation document sequence blocks.

Method	Budget	PPL ↓	Deg%	VRAM (MB)	Time (s)
Full Attention	all	7.83	—	6,556	249.8

StreamingLLM	128	14.00	+78%	6,577	162.4
ProactiveCache	128	12.54	+60%	6,577	161.5

StreamingLLM	256	11.20	+43%	6,593	174.5
ProactiveCache	256	12.17	+55%	6,593	178.3

StreamingLLM	512	47.34	+503%	6,632	629.1
ProactiveCache	512	10.25	+31%	6,632	637.9

StreamingLLM	1024	7.85	+0%	6,682	745.9
ProactiveCache	1024	7.85	+0%	6,682	752.4

Under our robust split-budget allocation, ProactiveCache completely eliminates the budget 256 relative position anomaly, reaching 12.17 PPL (only +0.97 from StreamingLLM's contiguous baseline). At budget 128, ProactiveCache outperforms StreamingLLM by a clear 1.46 PPL!

LLaMA-3.1 8B — PG-19 Long-Context Books

Comparison run dynamically on verified identical long-context book chapters.

Method	Budget	PPL ↓	Deg%	VRAM (MB)	Time (s)
Full Attention	all	8.40	—	6,556	244.4

StreamingLLM	128	9.87	+17.5%	6,577	167.4
ProactiveCache	128	10.57	+25.8%	6,577	166.8

StreamingLLM	256	9.92	+18.1%	6,593	180.2
ProactiveCache	256	9.55	+13.7%	6,593	180.6

StreamingLLM	512	156.22	+803%	6,632	574.3
ProactiveCache	512	26.14	+51.2%	6,632	569.3

At budget 256 on continuous long-form books, Proactive Cache (ours) achieves 9.55 PPL, outperforming StreamingLLM (9.92 PPL) by a significant 0.37 PPL margin! At budget 512 on full-length books, ProactiveCache achieves 26.14 PPL vs StreamingLLM 156.22 — a 5.98× ratio. Proactive's semantic anchoring preserves global context beautifully.

GPT-2 — WikiText-103 (Short Documents)

Method	Budget	PPL ↓	Deg%	Tok/s	VRAM (MB)
Full Attention	all	19.52	—	53.3	841
StreamingLLM	128	180.81	+826%	16.4	866
H2O	128	214.06	+997%	28.4	1,033
ProactiveCache	128	74.22	+280%	42.6	866
StreamingLLM	256	54.10	+177%	39.9	891
H2O	256	117.20	+501%	38.4	1,059
ProactiveCache	256	68.26	+250%	39.4	891

GPT-2 — WikiText-103 (Long Documents, 1024-token)

Method	Budget	PPL ↓	VRAM (MB)	Comp%
Full Attention	all	23.44	1,124	100%
StreamingLLM	128	248.87	1,136	12.5%
H2O	128	123.02	2,446	12.5%
ProactiveCache	128	106.39	1,136	12.5%
StreamingLLM	256	152.69	1,149	25%
H2O	256	220.15	2,457	25%
ProactiveCache	256	76.82	1,149	25%

GPT-2 — PG-19 Long-Context Books

Method	Budget	PPL ↓	VRAM (MB)	Time (s)
Full Attention	all	28.88	940	116.3
StreamingLLM	128	177.06	973	123.6
H2O	128	97.16	1,646	153.8
ProactiveCache	128	77.39	973	123.1
StreamingLLM	256	99.29	999	138.3
H2O	256	85.90	1,653	190.2
ProactiveCache	256	75.02	999	164.9

On PG-19 at budget 128 with GPT-2: ProactiveCache 77.39 vs StreamingLLM 177.06 — a 2.29× better PPL ratio. On LLaMA (RoPE), this ratio reaches 5.98× at budget 512.

How ProactiveCache Outperforms StreamingLLM

Property	StreamingLLM	H2O	ProactiveCache
Runtime complexity	O(n)	O(n²)	O(n)
Query-free	✅	❌	✅
RoPE compatible	✅	✅	✅
Semantic awareness	❌	Partial	✅
Works on any HF model	✅	✅	✅
Three-line API	❌	❌	✅

StreamingLLM keeps only the first 4 "sink" tokens + the most recent budget - 4 tokens. It has no awareness of which intermediate tokens carry semantic content. For short-term tasks this works. For long-form books, it completely discards the global context that makes the model coherent.

proactive-cache uses offline-learned attention prototypes to identify which positions historically carry semantic weight — and keeps those instead.

Installation

# Core
pip install proactive-cache

# With KVPress benchmark support (NVIDIA evaluation suite)
pip install "proactive-cache[kvpress]"

# With Gradio demo support
pip install "proactive-cache[gradio]"

Requirements: Python ≥ 3.9, PyTorch ≥ 2.1, Transformers ≥ 4.38

API Reference

`ProactiveCache.apply(model, budget, prototype_path)`

Patch a model's generate() with O(n) eviction.

model = ProactiveCache.apply(model, budget=256)

Argument	Default	Description
`budget`	`256`	Fixed number of KV tokens to keep after eviction
`prototype_path`	`"proactive_cache_prototypes.pkl"`	Path to prototype file (auto-detected)

`ProactiveCache.profile(model, tokenizer, corpus, num_docs, seq_len, save_path)`

Build and save the prototype library from calibration data.

ProactiveCache.profile(model, tokenizer, corpus="wikitext", num_docs=50)

Argument	Default	Description
`corpus`	`"wikitext"`	`"wikitext"`, `"pg19"`, or a list of strings
`num_docs`	`50`	Calibration documents (more = better prototypes)
`seq_len`	`512`	Profile sequence length
`n_clusters`	`4`	KMeans clusters per (layer, head)
`save_path`	`"proactive_cache_prototypes.pkl"`	Where to persist the prototype library

`ProactiveCachePress` (KVPress integration)

For direct comparison against NVIDIA's KVPress benchmark suite:

from proactive_cache import ProactiveCachePress

press = ProactiveCachePress(
    compression_ratio=0.75,      # keep 25% of tokens
    prototype_path="protos.pkl"
)

Architecture Support

Tested and working:

Model Family	Architecture	RoPE	Status
LLaMA 3.1 / 3 / 2	LlamaForCausalLM	✅	✅ Tested
Mistral / Mixtral	MistralForCausalLM	✅	✅ Tested
GPT-2	GPT2LMHeadModel	❌ (Absolute)	✅ Tested
Qwen 2.5	Qwen2ForCausalLM	✅	✅ Tested
Phi-3	Phi3ForCausalLM	✅	✅ Expected
Gemma 2	Gemma2ForCausalLM	✅	✅ Expected

Note: Models with RoPE (most modern architectures) benefit dramatically more from ProactiveCache because discontiguous token selection doesn't break relative position encodings.

Citation

If you use proactive-cache in your research, please cite:

@software{proactive_cache_2026,
  author    = {Khavin S},
  title     = {proactive-cache: O(n) KV Cache Eviction for Any HuggingFace Transformer},
  year      = {2026},
  url       = {https://github.com/skhavin/proactive-cache},
}

License

GNU Affero General Public License v3 (AGPLv3).

This library is copyleft and open source. Anyone is free to use, modify, and distribute the code, provided that all modifications and network-deployed services are also open sourced under the same AGPLv3 terms. See the LICENSE file for the full legal text.

Contributing

Bug reports and research contributions welcome. Open an issue or PR at github.com/skhavin/proactive-cache.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

S-Khavin

These details have not been verified by PyPI

Project links

Paper

Release history Release notifications | RSS feed

0.3.1

May 31, 2026

0.3.0

May 31, 2026

This version

0.1.0

May 31, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

proactive_cache-0.1.0.tar.gz (54.8 kB view details)

Uploaded May 31, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

proactive_cache-0.1.0-py3-none-any.whl (40.5 kB view details)

Uploaded May 31, 2026 Python 3

File details

Details for the file proactive_cache-0.1.0.tar.gz.

File metadata

Download URL: proactive_cache-0.1.0.tar.gz
Upload date: May 31, 2026
Size: 54.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for proactive_cache-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`78515d8abd15fe8d1b200a04a1a7d4c74c2fab1554b580c1439d9d6d9fba4e2a`
MD5	`eea5a7dcb2f8ed8d2466b768d5acf521`
BLAKE2b-256	`fc42cd73c85003b510f1d0157c5d500659f9e34ce372fa07e9e1ed21f8b0a054`

See more details on using hashes here.

Provenance

The following attestation bundles were made for proactive_cache-0.1.0.tar.gz:

Publisher: workflow.yml on skhavin/proactive-cache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: proactive_cache-0.1.0.tar.gz
- Subject digest: 78515d8abd15fe8d1b200a04a1a7d4c74c2fab1554b580c1439d9d6d9fba4e2a
- Sigstore transparency entry: 1683806387
- Sigstore integration time: May 31, 2026
Source repository:
- Permalink: skhavin/proactive-cache@6ae4ff807bdd11b3d7bf875abb77e0e4dbcfed26
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/skhavin
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@6ae4ff807bdd11b3d7bf875abb77e0e4dbcfed26
- Trigger Event: push

File details

Details for the file proactive_cache-0.1.0-py3-none-any.whl.

File metadata

Download URL: proactive_cache-0.1.0-py3-none-any.whl
Upload date: May 31, 2026
Size: 40.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for proactive_cache-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c16a93ed01212d4c4c91e580e8181c578443f54bbc6cb7ef7d617bfce679f1ce`
MD5	`26eabb4f7cc318f35edefb05f00cdb45`
BLAKE2b-256	`2314764782a8b0711968c804a83a8fa4d2056657fbf9e9559d079d75e2cafcfb`

See more details on using hashes here.

Provenance

The following attestation bundles were made for proactive_cache-0.1.0-py3-none-any.whl:

Publisher: workflow.yml on skhavin/proactive-cache

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: proactive_cache-0.1.0-py3-none-any.whl
- Subject digest: c16a93ed01212d4c4c91e580e8181c578443f54bbc6cb7ef7d617bfce679f1ce
- Sigstore transparency entry: 1683806503
- Sigstore integration time: May 31, 2026
Source repository:
- Permalink: skhavin/proactive-cache@6ae4ff807bdd11b3d7bf875abb77e0e4dbcfed26
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/skhavin
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@6ae4ff807bdd11b3d7bf875abb77e0e4dbcfed26
- Trigger Event: push

proactive-cache 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

title: O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction emoji: ⚡ colorFrom: blue colorTo: purple sdk: gradio python_version: "3.10" app_file: app.py pinned: false license: other

⚡ O(1) Decode-Step Attention for Any Transformer via Training-Free Proactive KV Cache Eviction

Why This Works

RoPE Compatibility & Robust Proportional Allocation

Empirical Results

O(1) Step Generation Scaling — The Core Result

LLaMA-3.1 8B — WikiText-103

LLaMA-3.1 8B — PG-19 Long-Context Books

GPT-2 — WikiText-103 (Short Documents)

GPT-2 — WikiText-103 (Long Documents, 1024-token)

GPT-2 — PG-19 Long-Context Books

How ProactiveCache Outperforms StreamingLLM

Installation

API Reference

ProactiveCache.apply(model, budget, prototype_path)

ProactiveCache.profile(model, tokenizer, corpus, num_docs, seq_len, save_path)

ProactiveCachePress (KVPress integration)

Architecture Support

Citation

License

Contributing

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

`ProactiveCache.apply(model, budget, prototype_path)`

`ProactiveCache.profile(model, tokenizer, corpus, num_docs, seq_len, save_path)`

`ProactiveCachePress` (KVPress integration)