kvpress

Efficiently compress the KV cache of any pretrained transformer

Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
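The 330GB figure can be checked with a back-of-the-envelope calculation. The sketch below assumes the published Llama 3.1-70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and 2 bytes per float16 value. The function name is illustrative and not part of kvpress.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per token, layer and KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Assumed Llama 3.1-70B configuration, float16 cache, 1M tokens
size = kv_cache_bytes(num_tokens=1_000_000, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # prints "327.7 GB"
```

The per-token cost is independent of the sequence length, which is why the cache grows linearly with the context.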

Installation

pip install kvpress

For a local installation, use uv:

git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync

To install with all optional dependencies, run:

git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --extra eval --extra flash-attn

Usage

KVPress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press has a compression_ratio attribute that measures the compression of the cache. The easiest way to use a press is through our custom KVPressTextGenerationPipeline. It is automatically registered as a transformers pipeline with the name "kv-press-text-generation" when kvpress is imported, and it handles chat templates and tokenization for you:

from transformers import pipeline
from kvpress import ExpectedAttentionPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"  # optional

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]

In the snippet above, the compression is only applied to the context tokens, so you can evaluate the compression for different questions. Check the Wikipedia notebook demo for a more detailed example (also available on Colab here).

Decoding Compression

By default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the `DecodingPress` wrapper. `DecodingPress` compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. `DecodingPress` supports the following parameters:

base_press: Any ScorerPress (e.g., KnormPress, CriticalKVPress)
compression_interval: Number of decoding steps between compressions (default: 10)
target_size: Target size of the cache after compression (default: 1024)
hidden_states_buffer_size: Number of hidden states to buffer before compression (default: 128). Some presses don't need buffered hidden states and can set this to 0.

Unlike prefilling presses, which are configured with a compression_ratio, DecodingPress uses a target_size. The cache is compressed every compression_interval steps, and the compression ratio is computed automatically so that the cache size after compression equals target_size.
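To make the relationship concrete, here is a minimal sketch of how a compression ratio could be derived from target_size. It assumes the ratio is the fraction of cached tokens to evict, so that cache_size * (1 - ratio) equals target_size. The helper name is hypothetical, and the actual DecodingPress logic may differ in detail.

```python
def ratio_for_target(cache_size: int, target_size: int) -> float:
    """Fraction of cached tokens to evict so that target_size tokens remain."""
    if cache_size <= target_size:
        return 0.0  # the cache already fits: nothing to evict
    return 1.0 - target_size / cache_size


# 10 tokens generated since the last compression: only a small fraction is evicted
print(ratio_for_target(cache_size=1034, target_size=1024))
# A cache twice the target size must drop half of its entries
print(ratio_for_target(cache_size=2048, target_size=1024))  # prints 0.5
```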

An example for decoding compression:

from transformers import pipeline
from kvpress import KnormPress
from kvpress import DecodingPress

# Initialize the pipeline
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

# Create a decoding press that compresses every 10 steps to 512 tokens
decoding_press = DecodingPress(
    base_press=KnormPress(),
    compression_interval=10,  # compress every 10 decoding steps
    target_size=512,  # cache size after each compression
)

# Use with pipeline
context = "A very long text you want to compress during generation"
question = "Tell me a long story about this context"
response = pipe(context, question=question, press=decoding_press)["answer"]

Not all existing presses are fully compatible with DecodingPress due to fundamental differences in how compression works during decoding versus prefilling. In particular, we only support ScorerPresses as base presses.

Available presses

All current presses are training-free and inherit from BasePress (source).

Several presses inherit from ScorerPress (source) and rely on a score to prune the KV pairs with lowest importance:

RandomPress (source): random score
KnormPress (source, paper): inverse norm of the key
SnapKVPress (source, paper): average attention weight of the last queries
ExpectedAttentionPress (source, notebook): expected attention weight during the generation phase
StreamingLLMPress (source, paper): keep only the initial and recent tokens
TOVAPress (source, paper): attention weight of the last query averaged across heads
ObservedAttentionPress (source, paper): average attention weight observed during the prefilling phase
QFilterPress (source, paper): project the Key representations on the main SVD component of the Query vectors to approximate the attention scores.
PyramidKVPress (source, paper): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers
LagKVPress (source, paper): leverages the KV lag-relative information to compress. It is query-free, attention-weight-free, and flash-attention compatible.
KeyDiffPress (source, paper): evicts tokens based solely on key similarity.
NonCausalAttnPress (source, paper): evicts tokens based on non-causal chunked attention scores.
LeverageScorePress (source, paper): evicts tokens based on approximate statistical leverage (i.e., we preserve outliers in the key space).
CompactorPress (source, paper): blends NonCausalAttnPress and LeverageScorePress based on the compression_ratio.
CURPress (source, paper): prune keys and values based on the CUR decomposition using approximate leverage scores.
KVzapPress (source, paper, training): approximate KVzip+ using a fast surrogate model. To be used in conjunction with the DMSPress.
FastKVzipPress (source, paper): approximate KVzip through a lightweight gating mechanism.
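The score-then-prune pattern shared by these presses can be illustrated without any deep-learning dependency. The sketch below is a toy for a single attention head: it assumes a press keeps the int(n * (1 - compression_ratio)) highest-scoring positions, and it uses the negative L2 norm of each key as the score, in the spirit of KnormPress. The function names are illustrative and are not the kvpress API, which operates on batched tensors.

```python
import math


def knorm_scores(keys):
    """Score each key by its negative L2 norm: small-norm keys get high scores."""
    return [-math.sqrt(sum(x * x for x in key)) for key in keys]


def prune(keys, values, scores, compression_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order."""
    n_kept = int(len(keys) * (1 - compression_ratio))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:n_kept]
    kept = sorted(top)  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = ["v0", "v1", "v2", "v3"]
_, kept_values, kept_idx = prune(keys, values, knorm_scores(keys), compression_ratio=0.5)
print(kept_idx, kept_values)  # prints [1, 2] ['v1', 'v2']
```

Swapping knorm_scores for another scoring function changes which pairs survive while the pruning step stays the same.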

Some presses rely on a different logic:

ThinKPress (source, paper): compress the dimensions of the keys based on the channel attention score on the last queries
SimLayerKVPress (source, paper): identify "lazy" layers, and apply the StreamingLLM approach to them
DuoAttentionPress (source, paper): split heads into retrieval heads (no compression) and streaming heads (StreamingLLM approach)
FinchPress (source, paper): similar to SnapKV with a dynamic window size and key value re-rotation
KVzipPress (source, paper): identifies redundant KV pairs through context reconstruction. Achieves near-lossless compression at the cost of multiple forward passes.
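Several presses above reuse the StreamingLLM approach of keeping only the initial and recent tokens. A minimal sketch of that selection rule is given below. The parameter names n_sink and n_recent are hypothetical and chosen for illustration, not taken from kvpress.

```python
def streaming_keep_indices(seq_len: int, n_sink: int, n_recent: int) -> list:
    """Indices of the tokens kept: the first n_sink and the last n_recent."""
    if seq_len <= n_sink + n_recent:
        return list(range(seq_len))  # sequence is short enough: keep everything
    return list(range(n_sink)) + list(range(seq_len - n_recent, seq_len))


print(streaming_keep_indices(seq_len=10, n_sink=2, n_recent=3))  # prints [0, 1, 7, 8, 9]
```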

Finally, we provide wrapper presses that can be combined with other presses:

AdaKVPress (source, paper): prune the lowest scores of any ScorerPress across all heads rather than within each head, yielding head-specific compression
PerLayerCompressionPress (source): compress each layer with a different compression ratio (experimental)
ComposedPress (source): compose multiple presses together by chaining their forward hooks
KeyRerotationPress (source): rerotate pruned keys to have continuous RoPE embeddings
ChunkKVPress (source, paper): compresses by selecting important chunks, preserving semantic coherence
ChunkPress (source, paper): compress the KV cache on each sequence chunk separately. This can yield more uniform compression across long sequences
CriticalKVPress and CriticalAdaKVPress (source, paper): refine the scores using the L1 norm of Wo @ values, coupled with a two-stage selection.
BlockPress (source, paper): segments input sequence into non-overlapping blocks and compresses iteratively.
DecodingPress (source): allows for compression during decoding, see decoding section in this README.
PrefillDecodingPress (source): allows compression during both prefilling and decoding.
DMSPress (source, paper): evict keys and values with scores below a given threshold of any ScorerPress instead of relying on top-k scores. Supports both prefilling and decoding (if decoding=True), but only dense prefill, not sparse prefill.
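To illustrate the idea behind AdaKVPress, the sketch below selects the top-scoring KV pairs across all heads of a layer instead of within each head, so heads can end up with different cache sizes. It is a simplified, dependency-free toy: the function name is hypothetical and refinements of the actual method are omitted.

```python
def prune_across_heads(scores_per_head, compression_ratio):
    """Keep the globally highest-scoring (head, position) pairs of one layer."""
    flat = [
        (score, head, pos)
        for head, scores in enumerate(scores_per_head)
        for pos, score in enumerate(scores)
    ]
    n_kept = int(len(flat) * (1 - compression_ratio))
    top = sorted(flat, reverse=True)[:n_kept]
    kept = {head: [] for head in range(len(scores_per_head))}
    for _, head, pos in top:
        kept[head].append(pos)
    return {head: sorted(positions) for head, positions in kept.items()}


scores = [
    [0.9, 0.8, 0.7, 0.1],   # head 0: most positions matter
    [0.6, 0.1, 0.05, 0.2],  # head 1: only the first position stands out
]
print(prune_across_heads(scores, compression_ratio=0.5))
# prints {0: [0, 1, 2], 1: [0]}
```

With per-head pruning at the same ratio, each head would keep exactly two positions. Here the overall budget is the same, but it is spent where the scores are highest.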

For a detailed list of existing KV cache compression methods, check Awesome-KV-Cache-Compression or Awesome-LLM-Compression

Evaluation

We provide a simple CLI to evaluate the performance of different presses on several long-context datasets.

Accuracy: Test your method on popular benchmarks directly using our CLI.
Speed and Memory: The speed_and_memory notebook can help you measure peak memory usage and total time gain.

Please refer to the evaluation directory in this repo for more details and results.

We report the average performance of different presses on the RULER dataset with a 4k context length; see the evaluation directory for the results.

Quantization

We support KV cache quantization through the transformers QuantizedCache class (see HF blog post). To use it, simply pass a cache object to your pipeline:

from transformers import QuantizedCache

cache = QuantizedCache(backend="quanto", nbits=4)

pipe(..., cache=cache)

By default, the DynamicCache is used (no quantization).

Important: to use the QuantizedCache, you need to install additional dependencies (e.g. pip install optimum-quanto).

Contributing

We welcome contributions! To add a new press, simply open an issue or submit a pull request. Check the new_press.ipynb notebook for a step-by-step guide.

Citation

If you use KVPress in your research, please cite our paper:

@article{devoto2025expectedattention,
  title={Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution},
  author={Devoto, Alessio and Jeblick, Maximilian and J{\'e}gou, Simon},
  journal={arXiv preprint arXiv:2510.00636},
  year={2025},
  url={https://arxiv.org/abs/2510.00636}
}

FAQ

Which models are supported?

Some presses depend on the model architecture (e.g. ExpectedAttentionPress or SnapKVPress), so they might not work with all models. We tested support for LlamaForCausalLM, MistralForCausalLM, Phi3ForCausalLM, Qwen2ForCausalLM, Qwen3ForCausalLM, and Gemma3ForCausalLM, but many other models might be supported out of the box because their implementations in transformers are often similar.

How to run inference on multiple GPUs?

kvpress supports multi-GPU inference through accelerate:

pipe = pipeline("kv-press-text-generation", model=model, device_map="auto")

What are the memory and throughput gains?

Memory usage should be reduced by around compression_ratio * kv_cache_size. As the KV cache is smaller, decoding should also be faster. You can measure peak memory usage gain and total time gain using this notebook.
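As a rough worked example of this estimate, the sketch below assumes a 16 GiB KV cache (about what Llama 3.1-8B needs for 128k tokens in float16) and ignores model weights and activations, which compression does not affect. The helper is illustrative only.

```python
GIB = 1024 ** 3


def memory_saved(kv_cache_size: int, compression_ratio: float) -> float:
    """Estimated bytes freed: compression_ratio * kv_cache_size."""
    return compression_ratio * kv_cache_size


kv_cache_size = 16 * GIB  # assumed uncompressed cache size
for ratio in (0.25, 0.5, 0.75):
    saved = memory_saved(kv_cache_size, ratio)
    kept = kv_cache_size - saved
    print(f"ratio={ratio}: saves {saved / GIB:.0f} GiB, keeps {kept / GIB:.0f} GiB")
```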

How does a press work?

A press registers a forward hook (press.forward_hook method) to each attention layer during the prefilling phase. Registration can be applied using the press as a context manager (press.__call__ method):

import torch
from transformers import AutoModelForCausalLM
from kvpress import KnormPress

device = "cuda:0"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(ckpt).to(device)
press = KnormPress(compression_ratio=0.4)

inputs = model.dummy_inputs["input_ids"].to(device)

with torch.no_grad():
    print(model(inputs).past_key_values[0][0].shape)
    # torch.Size([3, 8, 5, 128])
    
with torch.no_grad(), press(model):
    print(model(inputs).past_key_values[0][0].shape)
    # torch.Size([3, 8, 3, 128])
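The hook-plus-context-manager pattern can be sketched in plain Python. The toy below is not the kvpress implementation: ToyLayer and ToyPress are hypothetical stand-ins that mimic how a forward hook can post-process a layer's output, and how a context manager can guarantee that the hooks are removed afterwards.

```python
from contextlib import contextmanager


class ToyLayer:
    """Stand-in for an attention layer: returns its cache after running registered hooks."""

    def __init__(self):
        self.hooks = []

    def forward(self, cache):
        for hook in self.hooks:
            cache = hook(self, cache)
        return cache


class ToyPress:
    """Toy press that keeps the first (1 - compression_ratio) fraction of cached entries."""

    def __init__(self, compression_ratio):
        self.compression_ratio = compression_ratio

    def forward_hook(self, layer, cache):
        n_kept = int(len(cache) * (1 - self.compression_ratio))
        return cache[:n_kept]

    @contextmanager
    def __call__(self, layers):
        for layer in layers:
            layer.hooks.append(self.forward_hook)
        try:
            yield
        finally:
            # Always unregister the hooks, even if the forward pass raises
            for layer in layers:
                layer.hooks.remove(self.forward_hook)


layers = [ToyLayer(), ToyLayer()]
press = ToyPress(compression_ratio=0.4)
with press(layers):
    print(len(layers[0].forward(list(range(5)))))  # prints 3
print(len(layers[0].forward(list(range(5)))))  # prints 5
```

Outside the with block the layers behave as if the press had never been attached, which is what makes a press easy to switch on and off around a single forward pass.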

Why not use model.generate?

In fact, you can use model.generate with a press by using the press as a context manager:

with press(model):
    outputs = model.generate(inputs)

However, the generate method does not allow excluding the question from the compression, which would artificially favor methods such as SnapKV. Ideally, we want a compression method that works whatever comes after the context (e.g. for use cases such as chat or document question answering). Finally, the generate method cannot generate answers for multiple questions at once.

Can I combine compression during prefilling and decoding?

Yes. PrefillDecodingPress combines separate presses for the prefilling and decoding phases.

Parameters:

prefilling_press: Press used during prefill phase
decoding_press: Press used during decoding phase

Usage Examples

Basic Decoding Compression

from transformers import pipeline
from kvpress import KnormPress
from kvpress import DecodingPress

# Initialize the pipeline
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

# Create a decoding press that compresses every 10 steps to 512 tokens
decoding_press = DecodingPress(
    base_press=KnormPress(),
    compression_interval=10,  # compress every 10 decoding steps
    target_size=512,  # cache size after each compression
)

# Use with pipeline
context = "A very long text you want to compress during generation"
question = "Tell me a long story about this context"
response = pipe(context, question=question, press=decoding_press)["answer"]

Combined Prefill + Decoding Compression

from transformers import pipeline
from kvpress import CriticalKVPress, KnormPress
from kvpress import DecodingPress, PrefillDecodingPress

# Initialize the pipeline
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

# Different strategies for prefill vs decoding
prefill_press = CriticalKVPress(KnormPress())
decoding_press = DecodingPress(
    base_press=KnormPress(compression_ratio=0.2),
    compression_interval=5,  # compress every 5 decoding steps
    target_size=256,  # cache size after each compression
)

# Combine them
combined_press = PrefillDecodingPress(
    prefilling_press=prefill_press,
    decoding_press=decoding_press
)

context = "A very long context that will be compressed during prefill"
question = "Generate a detailed analysis that will be compressed during decoding"
response = pipe(context, question=question, press=combined_press)["answer"]

Project details

kvpress 0.5.1, released Feb 16, 2026
