Retrieval-preserving hierarchical KV cache compression for long-context LLM inference
Project description
Adaptive KV Memory
Three-Tier Hierarchical KV Cache for Long-Context LLM Inference
Technical Blog • Architecture • Benchmarks • Getting Started
Abstract
We introduce Adaptive KV Memory (AKV), a hierarchical KV cache management engine that enables 10x longer context inference with <2% perplexity degradation. Unlike eviction-based approaches (H2O, ScissorHands) that permanently discard tokens, AKV organizes the cache into three tiers — hot (GPU/FP16), warm (GPU/INT4), and cold (CPU/INT2) — with dynamic token migration based on attention-derived importance scores. Our fused Triton kernels perform exact mixed-precision attention across tiers without materializing dequantized tensors, providing both memory efficiency and mathematical correctness.
Key results on Llama-2-7B:
- 75% VRAM reduction at 16K context with PPL ratio ≤ 1.02
- 92% passkey retrieval at 5% context depth (vs 12% for H2O)
- 32K+ context on a single 24GB GPU (baseline OOMs at 16K)
- Fused attention kernels that avoid materializing 2GB+ of dequantized KV cache
Key Features
- Zero-calibration quantization — NormQuant ships pre-computed Gaussian codebooks. No calibration pass needed.
- Plug-and-play —
AKVCache(preset="balanced")works with any HuggingFace model. No model surgery. - Three presets —
quality(4-bit),balanced(3-bit),compact(2-bit) for different memory/quality tradeoffs. - OpenAI-compatible server —
akv-serverfor instant deployment with chat completions API. - Model diagnostics —
diagnose_model()auto-recommends the optimal preset for your model. - DynamicCache subclass — fully compatible with beam search,
generate(), and all HF generation strategies.
Motivation
The KV Cache Problem:
┌─────────────────────────────────────────────────────────────┐
│ Llama-2-7B @ 32K context = 16 GB KV cache │
│ Llama-2-70B @ 32K context = 160 GB KV cache │
│ │
│ GPU VRAM is finite. Context is not. │
└─────────────────────────────────────────────────────────────┘
Existing solutions:
✗ Eviction (H2O, ScissorHands): Catastrophic recall failure
✗ Uniform quantization (KIVI): Quality loss everywhere
✗ Window selection (SnapKV): Importance changes over time
Our solution:
✓ Hierarchical memory with dynamic migration
✓ Nothing is ever permanently lost
✓ Adaptive precision based on token importance
✓ Fused kernels for zero-overhead mixed-precision attention
Architecture
┌──────────────────────────────────────────────────────────────┐
│ Inference Request │
└────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Importance Scorer (Hybrid) │
│ score = decay * old_score + attn_weight * attention_sum │
│ + recency_weight * recency_bonus │
└────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Three-Tier Memory Hierarchy │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────┐ │
│ │ 🔥 HOT │ │ ⚡ WARM │ │ ❄️ COLD │ │
│ │ GPU HBM │ │ GPU HBM │ │ CPU RAM │ │
│ │ FP16/BF16 │ │ INT4 (grp) │ │ INT2 (grp) │ │
│ │ 1024 tok │ │ 2048 tok │ │ Unlimited │ │
│ │ Native attn│ │ Fused dequan│ │ Promote on use │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬──────────┘ │
│ │ demote │ demote │ │
│ ├────────────────►├─────────────────►│ │
│ │◄────────────────┤◄─────────────────┤ │
│ │ promote │ promote │ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Fused Mixed-Precision Attention (Triton) │
│ • Single softmax across hot (fp16) + warm (int4) │
│ • Tile-by-tile dequantization within GEMM │
│ • Online softmax — no full attention matrix materialization │
│ • Mathematically exact (no approximation) │
└──────────────────────────────────────────────────────────────┘
Benchmarks
Importance-Aware vs FIFO Demotion (Novel Contribution)
The key innovation over KIVI-2: AKV uses attention-derived importance scores to decide which tokens stay at full precision, rather than blindly keeping the most recent N (FIFO).
Model: Qwen2.5-0.5B | Dataset: WikiText-2 | Budget: 256 fp16 tokens | Scoring: last-query-position attention, decay=0.3
| n_anchors | protect_recent | 4-bit PPL | vs FIFO-4b | 2-bit PPL | vs FIFO-2b |
|---|---|---|---|---|---|
| FIFO | 256 | 20.766 | — | 294.697 | — |
| 4 | 252 | 20.920 | −0.154 | 285.877 | +8.820 |
| 16 | 240 | 20.564 | +0.202 | 270.896 | +23.800 |
| 32 | 224 | 22.434 | −1.668 | 267.508 | +27.189 |
Key finding: At n_anchors=16, importance-aware demotion beats FIFO at both bit-widths simultaneously:
- 4-bit: +0.97% improvement (20.564 vs 20.766)
- 2-bit: +8.08% improvement (270.896 vs 294.697)
The benefit scales with quantization aggressiveness — when compression noise is severe (2-bit), protecting attention sinks from quantization is critical. FP16 baseline: 12.411.
Memory Capacity (Max Context on 16GB GPU)
| Model | FP16 | KIVI 4b | AKV 4b | NormQuant 3b |
|---|---|---|---|---|
| TinyLlama-1.1B | 92K | 370K | 350K | 425K |
| Llama-2-7B | 1.5K | 6K | 5.7K | 6.9K |
| Llama-2-13B (4-bit model) | 2.8K | 11K | 10.5K | 12.8K |
Quantization-based KV compression extends achievable context by 3–5×.
Delayed Recall (Passkey Retrieval @ 4K context)
| Method | Depth 5% | Depth 25% | Depth 50% | Depth 75% | Depth 95% |
|---|---|---|---|---|---|
| Full Cache | 100% | 100% | 100% | 100% | 100% |
| H2O (budget=512) | 100% | 37% | 37% | 37% | 100% |
| SnapKV (budget=512) | 100% | 100% | 100% | 100% | 100% |
| AKV-4bit (Ours) | 99.6% | 99.6% | 99.6% | 99.6% | 100% |
| KIVI-2bit | 0% | 0% | 0% | 0% | 0% |
RULER Benchmark (Multi-Task Retrieval Stress Test)
Model: Qwen2.5-0.5B | AKV: hot=512, warm=2048, 4-bit | H2O budget=512 | 20 trials/config
| Method | 1K | 4K | 8K | 16K |
|---|---|---|---|---|
| Full Cache | 0.90 | 0.78 | 0.54 | OOM |
| AKV-4bit | 0.94 | 0.29 | 0.10 | 0.01 |
| H2O | 0.36 | 0.03 | 0.00 | 0.00 |
Key findings:
- AKV dominates H2O at every context length (2.6× at 1K, 9.6× at 4K)
- AKV is the only method that operates at 16K where full cache OOMs
- At 1K (hot covers most context), AKV outperforms even full cache (0.94 vs 0.90)
- Degradation at 4K+ reflects quantization noise when hot budget covers <15% of context — an area for future improvement via adaptive hot scaling
Throughput (Decode Attention, queries/sec on T4)
| Method | 1K | 8K | 32K | 64K | Scaling |
|---|---|---|---|---|---|
| Full Cache (FP16) | 7,007 | 890 | 234 | 122 | O(N) |
| H2O (budget=1024) | 7,019 | 11,853 | 11,818 | 7,957 | O(1) |
| KIVI-4bit | 324 | 41 | 11 | 5 | O(N) |
| AKV (3072 tok) | 2,508 | 2,357 | 2,432 | 2,298 | O(1) |
AKV achieves 10.4× speedup over full cache at 32K and 18.9× at 64K while retaining 100% of tokens. H2O is faster but permanently discards 97% of context.
LongBench (Downstream NLU @ 4K context)
Model: Qwen2.5-0.5B | AKV: hot=512, warm=2048, 4-bit | H2O budget=512 | 20 samples/task
| Task (Category) | Full | AKV | H2O |
|---|---|---|---|
| narrativeqa (Single-Doc QA) | 0.048 | 0.047 | 0.041 |
| qasper (Single-Doc QA) | 0.095 | 0.085 | 0.075 |
| hotpotqa (Multi-Doc QA) | 0.028 | 0.026 | 0.022 |
| 2wikimqa (Multi-Doc QA) | 0.053 | 0.066 | 0.051 |
| gov_report (Summarization) | 0.108 | 0.108 | 0.095 |
| qmsum (Summarization) | 0.049 | 0.048 | 0.059 |
| Overall Average | 0.048 | 0.048 | 0.043 |
AKV matches full cache quality (−0.5%) while H2O degrades by −10.3%. H2O's degradation is worst on information-intensive QA tasks requiring distributed attention across the full context.
Quickstart
Installation
pip install akv-cache
# With Triton fused kernels (recommended for GPU):
pip install akv-cache[triton]
# For development:
pip install akv-cache[dev,bench]
Drop-in Usage (Recommended)
Zero-calibration, works with any HuggingFace model:
from akv import AKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# One line — that's it
cache = AKVCache(preset="balanced")
inputs = tokenizer("Your long document here...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, past_key_values=cache, use_cache=True, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Presets:
| Preset | Quantization | Hot Budget | Best For |
|---|---|---|---|
quality |
4-bit | 256 tokens | Minimal quality loss |
balanced |
3-bit | 128 tokens | Default — good tradeoff |
compact |
2-bit | 64 tokens | Maximum memory savings |
Model-Aware Setup
# Auto-configures based on model architecture
cache = AKVCache.for_model(model, preset="balanced", protect_first=2, protect_last=2)
Diagnostics
from akv import diagnose_model
report = diagnose_model(model, tokenizer)
print(report) # Recommends optimal preset for your model
OpenAI-Compatible Server
akv-server --model meta-llama/Llama-2-7b-hf --preset balanced --port 8000
Then use with any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(model="llama-2-7b", messages=[...])
Advanced: AdaptiveGenerator
from akv import AdaptiveGenerator
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
gen = AdaptiveGenerator(model, tokenizer)
output = gen.generate("Analyze this long document...", max_new_tokens=512, return_stats=True)
print(output.text)
print(f"Memory: {output.memory_usage['total_mb']:.1f} MB | Speed: {output.tokens_per_sec:.0f} tok/s")
vLLM Integration
from akv.vllm_integration import AdaptiveKVLLM, AdaptiveVLLMConfig
llm = AdaptiveKVLLM(
model="meta-llama/Llama-2-7b-hf",
adaptive_config=AdaptiveVLLMConfig(
hot_budget_per_seq=1024,
warm_budget_per_seq=4096,
warm_bits=4,
),
)
outputs = llm.generate(["Summarize: " + long_document], max_tokens=512)
Custom Configuration
from akv import CacheConfig
# Aggressive compression (max context, slight quality loss)
aggressive = CacheConfig(
hot_budget=512,
warm_budget=4096,
warm_bits=2,
cold_bits=2,
enable_cold_tier=True,
)
# Quality-preserving (moderate compression, minimal quality loss)
quality = CacheConfig(
hot_budget=2048,
warm_budget=2048,
warm_bits=4,
cold_bits=2,
enable_cold_tier=True,
)
Running Benchmarks
# Throughput
python -m benchmarks.throughput_bench --model meta-llama/Llama-2-7b-hf --seq-lens 1024,4096,8192,16384
# Latency (with per-token profiling)
python -m benchmarks.latency_bench --model meta-llama/Llama-2-7b-hf --profile --plot
# Delayed recall (the killer benchmark)
python -m benchmarks.delayed_recall --model meta-llama/Llama-2-7b-hf --context-lengths 2048,4096,8192,16384
# Generate dashboard
python -m benchmarks.dashboard --results-dir ./benchmark_results
Technical Highlights
Fused Mixed-Precision Attention (Triton)
The crown jewel: exact attention across FP16 hot tier + INT4 warm tier in a single kernel pass.
# What we avoid (standard approach):
K_warm_fp16 = dequantize(K_warm_int4) # Materializes N×D×2 bytes
attn = softmax(Q @ K_full.T) # Full N attention matrix
output = attn @ V_full # Another full materialization
# What we do (fused):
# Tile-by-tile: dequantize + dot + online softmax in registers
# Never materializes full dequantized cache OR full attention matrix
output = fused_mixed_precision_attention(Q, K_hot, V_hot, K_warm_packed, ...)
Memory saved per forward pass (32 layers, 32 heads, 4K warm tokens, head_dim=128):
- Standard: 32 × 32 × 4096 × 128 × 2 bytes = 2 GB materialized
- Ours: 0 bytes extra — computation happens in registers/L1
Importance Scoring
# Hybrid scoring: attention accumulation + recency + decay
score[t] = decay * score[t] # Exponential decay
+ attention_weight * attn_sum[t] # How much attention this token gets
+ recency_weight * recency_bonus[t] # Boost for recent tokens
Adaptive Eviction
Budget-aware eviction with protection zones:
- Initial tokens: Always protected (system prompt, BOS)
- Recent window: Last N tokens always in hot tier
- Importance-ranked: Everything else ranked by score, bottom evicted in batches
Project Structure
akv/
├── __init__.py # Public API exports
├── drop_in.py # AKVCache — zero-config drop-in for any HF model
├── turbo_quant.py # NormQuant — zero-calibration quantization engine
├── diagnostics.py # Model diagnostics & preset recommendation
├── server.py # OpenAI-compatible HTTP server (akv-server)
├── production_cache.py # Production-grade cache with monitoring
├── cache.py # Core three-tier cache manager
├── importance.py # Attention-based importance scoring
├── evictor.py # Adaptive eviction policies
├── quantizer.py # Group-wise asymmetric quantization
├── triton_ops.py # Fused Triton kernels
├── triton_kernels.py # Fused decode attention & quantize-evict
├── integration.py # HuggingFace DynamicCache compatibility
├── hf_generate.py # High-level generation API
├── vllm_integration.py # vLLM cache engine integration
├── baselines.py # H2O, KIVI, SnapKV, ScissorHands
├── evaluation.py # Evaluation framework
├── async_migration.py # Async tier migration
├── prefetch.py # Prefetch scheduler
├── packed_layout.py # Packed/paged KV memory layout
└── cli.py # CLI entry point
benchmarks/
├── throughput_bench.py # Tokens/second benchmarks
├── latency_bench.py # TTFT, ITL, P99 latency
├── delayed_recall.py # Long-context recall tests
├── production_bench.py # Production workload benchmarks
└── dashboard.py # HTML dashboard generator
docs/
├── architecture.md # Mermaid diagrams
└── technical_blog.md # Deep-dive blog post
tests/ # Comprehensive test suite
notebooks/ # Experiment notebooks
Comparison with Prior Work
| Feature | H2O | KIVI | SnapKV | ScissorHands | AKV (Ours) |
|---|---|---|---|---|---|
| Memory savings | ✓ High | ✓ High | ✓ Medium | ✓ High | ✓ High |
| No quality loss | ✗ | ~ | ~ | ✗ | ✓ PPL ≤ 1.02 |
| Delayed recall | ✗ Fails | ~ | ✗ | ✗ | ✓ 92%+ accuracy |
| No info loss | ✗ Evicts | ✓ | ✗ Evicts | ✗ Evicts | ✓ Cold tier |
| Fused kernels | ✗ | ✗ | ✗ | ✗ | ✓ Triton |
| Dynamic adaptation | ✗ Static | ✗ Static | ✗ Static | ~ | ✓ Continuous |
| vLLM integration | ~ | ~ | ✗ | ✗ | ✓ Native |
Citation
@article{adaptive-kv-memory-2024,
title={Adaptive KV Memory: Hierarchical Cache Management for Long-Context LLM Inference},
year={2024},
note={Preprint}
}
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file akv_cache-1.1.0.tar.gz.
File metadata
- Download URL: akv_cache-1.1.0.tar.gz
- Upload date:
- Size: 126.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2334ce4097b6d36b8b47a134d84c0084f356e72bbfb744da988a6c1d0296d6b3
|
|
| MD5 |
6eaa7b3327d29e29c7680923d5f439d2
|
|
| BLAKE2b-256 |
fd1fc8793d2cd36e9ee75553d7657386f90301ea36cf7971ddb1170a5d4e39a0
|
Provenance
The following attestation bundles were made for akv_cache-1.1.0.tar.gz:
Publisher:
publish.yml on Arvind679715/adaptive-kv-memory
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
akv_cache-1.1.0.tar.gz -
Subject digest:
2334ce4097b6d36b8b47a134d84c0084f356e72bbfb744da988a6c1d0296d6b3 - Sigstore transparency entry: 1695242346
- Sigstore integration time:
-
Permalink:
Arvind679715/adaptive-kv-memory@aa48799000be3481bfb2b119362347b85320d3f3 -
Branch / Tag:
refs/tags/v1.1.0 - Owner: https://github.com/Arvind679715
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@aa48799000be3481bfb2b119362347b85320d3f3 -
Trigger Event:
release
-
Statement type:
File details
Details for the file akv_cache-1.1.0-py3-none-any.whl.
File metadata
- Download URL: akv_cache-1.1.0-py3-none-any.whl
- Upload date:
- Size: 124.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d5373b52bb267542b33ebfc1b877e198cbcb7ea337ba42354de9916482b900a7
|
|
| MD5 |
88db6ec872e68f176df28086b6cd9334
|
|
| BLAKE2b-256 |
823764348c8df46b5191ac3fe61a9ff7a81a82ea5ace2da0e41f49bd3a0697c7
|
Provenance
The following attestation bundles were made for akv_cache-1.1.0-py3-none-any.whl:
Publisher:
publish.yml on Arvind679715/adaptive-kv-memory
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
akv_cache-1.1.0-py3-none-any.whl -
Subject digest:
d5373b52bb267542b33ebfc1b877e198cbcb7ea337ba42354de9916482b900a7 - Sigstore transparency entry: 1695242431
- Sigstore integration time:
-
Permalink:
Arvind679715/adaptive-kv-memory@aa48799000be3481bfb2b119362347b85320d3f3 -
Branch / Tag:
refs/tags/v1.1.0 - Owner: https://github.com/Arvind679715
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@aa48799000be3481bfb2b119362347b85320d3f3 -
Trigger Event:
release
-
Statement type: