Retrieval-preserving hierarchical KV cache compression for long-context LLM inference

Maintainers

ArvindS6722

Adaptive KV Memory

Three-Tier Hierarchical KV Cache for Long-Context LLM Inference

Abstract

We introduce Adaptive KV Memory (AKV), a hierarchical KV cache management system that frames the KV cache as a virtual memory problem. Unlike eviction-based approaches (H2O, ScissorHands) that permanently discard tokens, AKV organizes the cache into three tiers — hot (GPU/FP16), warm (GPU/quantized), and cold (CPU/INT2) — with bidirectional migration based on attention-derived importance scores. The warm tier uses block-affine quantization (per-channel keys, per-token values) which preserves the channel structure critical for attention quality at 3–4 bits.

Key results on Qwen2.5-1.5B:

3-bit warm tier adds only +3.1% PPL over no-quantization baseline (8.83 vs 8.56)
99.6% or higher passkey retrieval at every tested depth at 4K context (vs 37% for H2O mid-context)
10.4× decode throughput at 32K context while retaining 100% of tokens
Importance-based demotion beats FIFO at both 4-bit (+0.97%) and 2-bit (+8.08%); this comparison was run on Qwen2.5-0.5B

Key Features

Block-affine quantization — KIVI-style per-channel K / per-token V scaling. Zero calibration, no rotation, no codebook.
Plug-and-play — AKVCache(preset="balanced") works with any HuggingFace model. No model surgery.
Three presets — quality (4-bit), balanced (3-bit), compact (2-bit) for different memory/quality tradeoffs.
Importance-based migration — attention-derived scoring replaces FIFO, with measured PPL gains.
DynamicCache subclass — fully compatible with beam search, generate(), and all HF generation strategies.
O(1) decode scaling — bounded working set gives constant throughput regardless of total context.
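
To make the per-channel K / per-token V distinction concrete, here is a minimal pure-Python sketch of asymmetric affine quantization. It is not the package's AffineQuantizer: every function name below is illustrative, and the helpers return dequantized values directly so the reconstruction error is easy to inspect. Keys get one scale and zero-point per channel (column); values get one per token (row).

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows = tokens, columns = channels


def affine_quantize(group: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric (affine) quantization of one group to `bits`-bit integer codes."""
    lo, hi = min(group), max(group)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in group]
    return codes, scale, lo


def affine_dequantize(codes: List[int], scale: float, zero: float) -> List[float]:
    return [c * scale + zero for c in codes]


def roundtrip_per_channel(K: Matrix, bits: int) -> Matrix:
    """Key-style grouping: each channel (column) has its own scale and zero-point."""
    columns = []
    for col in zip(*K):
        codes, scale, zero = affine_quantize(list(col), bits)
        columns.append(affine_dequantize(codes, scale, zero))
    return [list(row) for row in zip(*columns)]


def roundtrip_per_token(V: Matrix, bits: int) -> Matrix:
    """Value-style grouping: each token (row) has its own scale and zero-point."""
    out = []
    for row in V:
        codes, scale, zero = affine_quantize(row, bits)
        out.append(affine_dequantize(codes, scale, zero))
    return out


def max_error(A: Matrix, B: Matrix, channels: range) -> float:
    """Largest absolute reconstruction error over the selected channels."""
    return max(abs(a[c] - b[c]) for a, b in zip(A, B) for c in channels)


# Toy key matrix: channel 2 is an outlier channel with a large offset.
K_demo = [[0.1, 0.2, 50.0], [0.3, -0.1, 52.0], [-0.2, 0.4, 48.0], [0.0, 0.1, 51.0]]
err_per_channel = max_error(K_demo, roundtrip_per_channel(K_demo, 3), range(2))
err_per_token = max_error(K_demo, roundtrip_per_token(K_demo, 3), range(2))
print(err_per_channel, err_per_token)
```

On this toy matrix, grouping per channel keeps the two small channels accurate, while grouping the same matrix per token lets the outlier channel stretch every row's range and wipes out the small channels.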

What's New in v1.4.0

Bit-packed storage — Quantized codes are now stored at their native bit-width (4→2x, 3→2.67x, 2→4x compression of codes) via akv.bitpack. No more uint8 waste.
Asymmetric K/V bit-widths — AKVCache(warm_key_bits=4, warm_value_bits=2) or use the new k4v2/k4v3 presets. Keys are more sensitive to quantization than values.
Outlier channel protection — AKVCache(outlier_fraction=0.01) keeps the top 1% highest-variance channels at FP16, eliminating long-tail quantization errors at 2-3 bits.
memory_report() — User-friendly method returning compressed bytes, FP16 equivalent, and savings ratio.
New presets — k4v2 (asymmetric K4/V2) and k4v3 (K4/V3) for fine-grained memory/quality tradeoffs.
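
The compression factors quoted for bit-packed storage follow from storing each code in exactly `bits` bits instead of a full byte. The sketch below illustrates that idea only; it is not the akv.bitpack implementation. pack_codes and unpack_codes are hypothetical helpers that use a Python integer as the bit buffer.

```python
from typing import List


def pack_codes(codes: List[int], bits: int) -> bytes:
    """Pack integer codes (each in [0, 2**bits)) into a little-endian bitstream."""
    acc = 0
    for i, code in enumerate(codes):
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code {code} does not fit in {bits} bits")
        acc |= code << (i * bits)
    n_bytes = (len(codes) * bits + 7) // 8  # round up to whole bytes
    return acc.to_bytes(n_bytes, "little")


def unpack_codes(packed: bytes, bits: int, count: int) -> List[int]:
    """Recover `count` codes of width `bits` from a stream made by pack_codes."""
    acc = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


codes_3bit = [5, 0, 7, 3, 1, 6, 2, 4]
packed_3bit = pack_codes(codes_3bit, 3)
# Eight uint8 codes take 8 bytes; eight 3-bit codes take 3 bytes (8/3 = 2.67x).
print(len(codes_3bit), "->", len(packed_3bit), "bytes")
```

The same helper gives 2 bytes for eight 2-bit codes and 4 bytes for eight 4-bit codes, matching the 4x and 2x factors.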

What's New in v1.3.0

Block-affine quantizer (AffineQuantizer) — replaces rotation-based TurboQuantizer as default warm-tier quantizer. Per-channel asymmetric for keys, per-token asymmetric for values. Fixes catastrophic 3-bit PPL regression (17,348 → 8.83).
Negative result documented — Hadamard rotation + codebook quantization (TurboQuant-style) catastrophically fails on K/V tensors at 3-bit. Rotation spreads channel outliers → softmax exponentiates errors.
Component triage validated — D/E/F configs isolate each system component. Demotion logic adds +0.05% PPL; 3-bit quantization adds +3.1%. No cliff effects.
Production stability — chronological K/V order preserved through demote/promote, device safety for Kaggle T4, transformers ≥4.46 CacheLayerMixin support.
Honest memory accounting — packed_layout.measure_packed_bytes reports real bit-packed sizes, not theoretical formulas.

Motivation

The KV Cache Problem:
┌─────────────────────────────────────────────────────────────┐
│  Llama-2-7B @ 32K context = 16 GB KV cache                 │
│  Llama-2-70B @ 32K context = 10 GB KV cache (8 KV heads)   │
│                                                              │
│  GPU VRAM is finite. Context is not.                        │
└─────────────────────────────────────────────────────────────┘

Existing solutions:
  ✗ Eviction (H2O, ScissorHands): Catastrophic recall failure
  ✗ Uniform quantization (KIVI): Quality loss everywhere
  ✗ Window selection (SnapKV): Importance changes over time

Our solution:
  ✓ Hierarchical memory with dynamic migration
  ✓ Nothing is ever permanently lost
  ✓ Adaptive precision based on token importance
  ✓ Fused kernels for zero-overhead mixed-precision attention
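
The 16 GB figure for Llama-2-7B can be reproduced from the model shape (32 layers, 32 KV heads, head_dim 128, FP16). The helper below is a back-of-the-envelope sketch; kv_cache_bytes is an illustrative name, not part of the akv API.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


llama2_7b = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32 * 1024)
print(llama2_7b / 2**30)  # 16.0 (GiB)
```

Models that use grouped-query attention store fewer KV heads than query heads, so pass the KV head count rather than the attention head count.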

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Inference Request                           │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│              Importance Scorer (Hybrid)                        │
│  score = decay * old_score + attn_weight * attention_sum      │
│         + recency_weight * recency_bonus                      │
└────────────────────────────┬─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│              Three-Tier Memory Hierarchy                       │
│                                                               │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────┐     │
│  │  🔥 HOT     │  │  ⚡ WARM      │  │  ❄️  COLD        │     │
│  │  GPU HBM    │  │  GPU HBM     │  │  CPU RAM        │     │
│  │  FP16/BF16  │  │  INT4 (grp)  │  │  INT2 (grp)    │     │
│  │  1024 tok   │  │  2048 tok    │  │  Unlimited      │     │
│  │  Native attn│  │  Fused dequan│  │  Promote on use │     │
│  └──────┬──────┘  └──────┬───────┘  └──────┬──────────┘     │
│         │    demote       │     demote       │                │
│         ├────────────────►├─────────────────►│                │
│         │◄────────────────┤◄─────────────────┤                │
│         │    promote      │     promote      │                │
└──────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│         Fused Mixed-Precision Attention (Triton)              │
│  • Single softmax across hot (fp16) + warm (int4)            │
│  • Tile-by-tile dequantization within GEMM                   │
│  • Online softmax — no full attention matrix materialization  │
│  • Mathematically exact (no approximation)                   │
└──────────────────────────────────────────────────────────────┘
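
The demote/promote flow in the diagram can be sketched with a toy model that tracks only which tier each token position lives in. This is a simplified illustration, not the package's cache manager: ToyTieredCache and its methods are hypothetical, quantization is omitted, and the cold tier is unbounded as in the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyTieredCache:
    hot_budget: int
    warm_budget: int
    hot: Dict[int, float] = field(default_factory=dict)   # position -> importance score
    warm: Dict[int, float] = field(default_factory=dict)
    cold: Dict[int, float] = field(default_factory=dict)

    def add(self, pos: int, score: float) -> None:
        """New tokens always enter the hot tier."""
        self.hot[pos] = score
        self._rebalance()

    def promote(self, pos: int, score: float) -> None:
        """Move a token back to hot, e.g. when it starts receiving attention again."""
        self.warm.pop(pos, None)
        self.cold.pop(pos, None)
        self.hot[pos] = score
        self._rebalance()

    def _rebalance(self) -> None:
        # Demote the least important tokens until each bounded tier fits its budget.
        while len(self.hot) > self.hot_budget:
            victim = min(self.hot, key=self.hot.get)
            self.warm[victim] = self.hot.pop(victim)
        while len(self.warm) > self.warm_budget:
            victim = min(self.warm, key=self.warm.get)
            self.cold[victim] = self.warm.pop(victim)

    def total(self) -> int:
        return len(self.hot) + len(self.warm) + len(self.cold)


cache = ToyTieredCache(hot_budget=2, warm_budget=2)
for pos, score in enumerate([0.9, 0.1, 0.5, 0.2, 0.8, 0.3]):
    cache.add(pos, score)
cache.promote(1, 1.0)  # token 1 was in cold; it becomes important again
print(sorted(cache.hot), sorted(cache.warm), sorted(cache.cold), cache.total())
```

Every token stays in exactly one tier, so the total count never shrinks; only the precision at which a token would be stored changes.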

Benchmarks

Importance-Aware vs FIFO Demotion (Novel Contribution)

The key innovation over KIVI-2: AKV uses attention-derived importance scores to decide which tokens stay at full precision, rather than blindly keeping the most recent N (FIFO).

Model: Qwen2.5-0.5B | Dataset: WikiText-2 | Budget: 256 fp16 tokens | Scoring: last-query-position attention, decay=0.3

n_anchors	protect_recent	4-bit PPL	vs FIFO-4b	2-bit PPL	vs FIFO-2b
FIFO	256	20.766	—	294.697	—
4	252	20.920	−0.154	285.877	+8.820
16	240	20.564	+0.202	270.896	+23.800
32	224	22.434	−1.668	267.508	+27.189

Key finding: n_anchors=16 is the only tested setting where importance-aware demotion beats FIFO at both bit-widths simultaneously (positive deltas in the table mean lower PPL than FIFO):

4-bit: +0.97% improvement (20.564 vs 20.766)
2-bit: +8.08% improvement (270.896 vs 294.697)

The benefit scales with quantization aggressiveness — when compression noise is severe (2-bit), protecting attention sinks from quantization is critical. FP16 baseline: 12.411.

Quantizer Triage (Qwen2.5-1.5B, 5K context)

Config	Description	PPL	Δ vs no-quant
D	AKV, hot only (never demotes)	8.56	—
E	AKV, 8-bit warm + demote	8.56	+0.05%
F	AKV, 4-bit warm + demote	8.64	+0.9%
A	AKV, 3-bit warm + demote	8.83	+3.1%
—	Rotation + codebook 3-bit	17,348	+202,500%

Block-affine quantization at 3-bit adds only 3.1% PPL. Rotation-based quantizers destroy attention entirely.

Memory Capacity (Max Context on 16GB GPU)

Model	FP16	KIVI 4b	AKV 4b	AKV 3b
TinyLlama-1.1B	92K	370K	350K	425K
Llama-2-7B	1.5K	6K	5.7K	6.9K
Llama-2-13B (4-bit model)	2.8K	11K	10.5K	12.8K

Block-affine quantization at 3-bit extends achievable context by roughly 4.6× over FP16 for all three models.

Delayed Recall (Passkey Retrieval @ 4K context)

Method	Depth 5%	Depth 25%	Depth 50%	Depth 75%	Depth 95%
Full Cache	100%	100%	100%	100%	100%
H2O (budget=512)	100%	37%	37%	37%	100%
SnapKV (budget=512)	100%	100%	100%	100%	100%
AKV-4bit (Ours)	99.6%	99.6%	99.6%	99.6%	100%
KIVI-2bit	0%	0%	0%	0%	0%

RULER Benchmark (Multi-Task Retrieval Stress Test)

Model: Qwen2.5-0.5B | AKV: hot=512, warm=2048, 4-bit | H2O budget=512 | 20 trials/config

Method	1K	4K	8K	16K
Full Cache	0.90	0.78	0.54	OOM
AKV-4bit	0.94	0.29	0.10	0.01
H2O	0.36	0.03	0.00	0.00

Key findings:

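AKV outscores H2O at every context length (2.6× at 1K, 9.7× at 4K)
AKV is the only method with a non-zero score at 16K (0.01), where the full cache runs out of memory and H2O scores 0.00
At 1K (hot covers most context), AKV scores slightly above full cache (0.94 vs 0.90), a small gap given 20 trials per configuration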
Degradation at 4K+ reflects quantization noise when hot budget covers <15% of context — an area for future improvement via adaptive hot scaling

Throughput (Decode Attention, queries/sec on T4)

Method	1K	8K	32K	64K	Scaling
Full Cache (FP16)	7,007	890	234	122	O(N)
H2O (budget=1024)	7,019	11,853	11,818	7,957	O(1)
KIVI-4bit	324	41	11	5	O(N)
AKV (3072 tok)	2,508	2,357	2,432	2,298	O(1)

AKV achieves a 10.4× speedup over full cache at 32K and 18.8× at 64K while retaining 100% of tokens. H2O is faster but permanently discards about 97% of the context at 32K.

LongBench (Downstream NLU @ 4K context)

Model: Qwen2.5-0.5B | AKV: hot=512, warm=2048, 4-bit | H2O budget=512 | 20 samples/task

Task (Category)	Full	AKV	H2O
narrativeqa (Single-Doc QA)	0.048	0.047	0.041
qasper (Single-Doc QA)	0.095	0.085	0.075
hotpotqa (Multi-Doc QA)	0.028	0.026	0.022
2wikimqa (Multi-Doc QA)	0.053	0.066	0.051
gov_report (Summarization)	0.108	0.108	0.095
qmsum (Summarization)	0.049	0.048	0.059
Overall Average	0.048	0.048	0.043

AKV stays within 0.5% of full-cache quality while H2O degrades by 10.3%. H2O's degradation is worst on information-intensive QA tasks requiring distributed attention across the full context.

Quickstart

Installation

pip install akv-cache

# With Triton fused kernels (recommended for GPU); quotes keep shells such as zsh from expanding the brackets:
pip install "akv-cache[triton]"

# For development:
pip install "akv-cache[dev,bench]"

Drop-in Usage (Recommended)

Zero-calibration, works with any HuggingFace model:

from akv import AKVCache
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# One line — that's it
cache = AKVCache(preset="balanced")
inputs = tokenizer("Your long document here...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, past_key_values=cache, use_cache=True, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Presets:

Preset	Quantization	Hot Budget	Best For
`quality`	4-bit	256 tokens	Minimal quality loss
`balanced`	3-bit	128 tokens	Default — good tradeoff
`compact`	2-bit	64 tokens	Maximum memory savings
`k4v2`	K:4-bit V:2-bit	128 tokens	Asymmetric — keys preserved
`k4v3`	K:4-bit V:3-bit	128 tokens	Near-quality with value compression

Model-Aware Setup

# Auto-configures based on model architecture via the adapter registry
# (Llama, Mistral SWA, Mixtral, Qwen2, Gemma/2, Phi-3, GPT-2, OPT, BLOOM, ...)
cache = AKVCache.for_model(model, preset="balanced", protect_first=2, protect_last=2)

List supported families from the CLI:

akv adapters --verbose

Asymmetric K/V + Outlier Protection (New in v1.4.0)

# Asymmetric: keys at 4-bit (sensitive), values at 2-bit (tolerant)
cache = AKVCache(preset="k4v2")

# Or explicit control with outlier protection:
cache = AKVCache(warm_key_bits=4, warm_value_bits=3, outlier_fraction=0.01)

# Check memory savings at runtime:
report = cache.memory_report()
print(f"Compression: {report['savings_ratio']}x | K:{report['key_bits']}b V:{report['value_bits']}b")
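
One plausible reading of outlier channel protection is: rank channels by variance and keep the top fraction out of the quantizer. The sketch below shows that selection step only. select_outlier_channels is a hypothetical helper, and rounding to at least one channel is an assumption rather than documented behavior.

```python
from statistics import pvariance
from typing import List


def select_outlier_channels(K: List[List[float]], outlier_fraction: float) -> List[int]:
    """Return indices of the highest-variance channels (columns) of K.

    K is laid out as rows = tokens, columns = channels.
    """
    if outlier_fraction <= 0:
        return []
    n_channels = len(K[0])
    n_keep = max(1, round(n_channels * outlier_fraction))  # assumption: keep at least one
    variances = [pvariance(col) for col in zip(*K)]
    ranked = sorted(range(n_channels), key=lambda c: variances[c], reverse=True)
    return sorted(ranked[:n_keep])


K_toy = [
    [0.1, 1.0, -5.0, 0.2],
    [0.2, 1.1, 6.0, 0.1],
    [0.0, 0.9, -4.0, 0.3],
    [0.1, 1.0, 5.5, 0.2],
]
print(select_outlier_channels(K_toy, 0.25))  # [2]
```

With head_dim=128 and outlier_fraction=0.01, this rule would protect a single channel per head.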

Per-Head Calibration (New in v1.2.0)

Run a short calibration pass to assign each KV-head its own bit-width (2/3/4) based on quantization sensitivity:

akv calibrate --model meta-llama/Llama-2-7b-hf \
              --output llama2_calib.json \
              --target-bits 3.0

Then load the report directly into a cache:

from akv import AKVCache
cache = AKVCache.from_calibration("llama2_calib.json")
# Sensitive heads keep 4-bit; tolerant heads drop to 2-bit, hitting the average budget.
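
The allocation idea (sensitive heads at 4-bit, tolerant heads at 2-bit, average at the target) can be sketched with a simple greedy rule. This is an assumption about how such an allocator might work, not the algorithm behind akv calibrate; allocate_head_bits and the sensitivity values are hypothetical.

```python
from typing import List


def allocate_head_bits(sensitivity: List[float], target_bits: float,
                       min_bits: int = 2, max_bits: int = 4) -> List[int]:
    """Greedy per-head bit allocation that never exceeds the average target."""
    n = len(sensitivity)
    bits = [min_bits] * n
    # Extra bits available beyond the floor; int() floors so the mean stays <= target.
    budget = int(target_bits * n) - min_bits * n
    # Hand the extra bits to the most sensitive heads first.
    for head in sorted(range(n), key=lambda h: sensitivity[h], reverse=True):
        give = min(max_bits - min_bits, budget)
        if give <= 0:
            break
        bits[head] += give
        budget -= give
    return bits


head_bits = allocate_head_bits([0.9, 0.1, 0.5, 0.05], target_bits=3.0)
print(head_bits, sum(head_bits) / len(head_bits))  # [4, 2, 4, 2] 3.0
```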

Diagnostics

from akv import diagnose_model

report = diagnose_model(model, tokenizer)
print(report)  # Recommends optimal preset for your model

OpenAI-Compatible Server

akv-server --model meta-llama/Llama-2-7b-hf --preset balanced --port 8000

Then use with any OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(model="llama-2-7b", messages=[...])

Advanced: AdaptiveGenerator

from akv import AdaptiveGenerator
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

gen = AdaptiveGenerator(model, tokenizer)
output = gen.generate("Analyze this long document...", max_new_tokens=512, return_stats=True)
print(output.text)
print(f"Memory: {output.memory_usage['total_mb']:.1f} MB | Speed: {output.tokens_per_sec:.0f} tok/s")

vLLM Integration

Stable shim that matches the proposed upstream kv_cache_backend="akv" API (see docs/vllm_pr.md for the upstream PR design):

from akv.vllm_backend import create_akv_llm

llm = create_akv_llm(
    "meta-llama/Llama-2-7b-hf",
    hot_budget_per_seq=1024,
    warm_budget_per_seq=4096,
    warm_bits=3,
)
outputs = llm.generate(["Summarize: " + long_document], max_tokens=512)

Lower-level integration is also available:

from akv.vllm_integration import AdaptiveKVLLM, AdaptiveVLLMConfig

llm = AdaptiveKVLLM(
    model="meta-llama/Llama-2-7b-hf",
    adaptive_config=AdaptiveVLLMConfig(
        hot_budget_per_seq=1024,
        warm_budget_per_seq=4096,
        warm_bits=4,
    ),
)

Custom Configuration

from akv import CacheConfig

# Aggressive compression (max context, slight quality loss)
aggressive = CacheConfig(
    hot_budget=512,
    warm_budget=4096,
    warm_bits=2,
    cold_bits=2,
    enable_cold_tier=True,
)

# Quality-preserving (moderate compression, minimal quality loss)
quality = CacheConfig(
    hot_budget=2048,
    warm_budget=2048,
    warm_bits=4,
    cold_bits=2,
    enable_cold_tier=True,
)

Running Benchmarks

# Throughput
python -m benchmarks.throughput_bench --model meta-llama/Llama-2-7b-hf --seq-lens 1024,4096,8192,16384

# Latency (with per-token profiling)
python -m benchmarks.latency_bench --model meta-llama/Llama-2-7b-hf --profile --plot

# Delayed recall (the killer benchmark)
python -m benchmarks.delayed_recall --model meta-llama/Llama-2-7b-hf --context-lengths 2048,4096,8192,16384

# Generate dashboard
python -m benchmarks.dashboard --results-dir ./benchmark_results

Technical Highlights

Fused Mixed-Precision Attention (Triton)

The crown jewel: exact attention across FP16 hot tier + INT4 warm tier in a single kernel pass.

# What we avoid (standard approach):
K_warm_fp16 = dequantize(K_warm_int4)   # Materializes N×D×2 bytes
attn = softmax(Q @ K_full.T)             # Full N attention matrix
output = attn @ V_full                    # Another full materialization

# What we do (fused):
# Tile-by-tile: dequantize + dot + online softmax in registers
# Never materializes full dequantized cache OR full attention matrix
output = fused_mixed_precision_attention(Q, K_hot, V_hot, K_warm_packed, ...)

Memory saved per forward pass (32 layers, 32 heads, 4K warm tokens, head_dim=128):

Standard: 2 (K and V) × 32 × 32 × 4096 × 128 × 2 bytes = 2 GiB materialized
Ours: 0 bytes extra — computation happens in registers/L1
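
The "single softmax across tiers" property rests on the online-softmax identity: each segment can be reduced to a running max, a normalizer, and an unnormalized output, and two such summaries merge exactly. The pure-Python sketch below demonstrates that identity for one query. It is not the Triton kernel; the function names are illustrative, and the warm keys and values stand in for tiles that would be dequantized on the fly.

```python
import math
from typing import List, Tuple

Stats = Tuple[float, float, List[float]]  # (max logit, normalizer, unnormalized output)


def segment_stats(q: List[float], keys: List[List[float]], values: List[List[float]]) -> Stats:
    """Reduce one segment of the cache to its softmax summary."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), acc


def merge(a: Stats, b: Stats) -> Stats:
    """Combine two summaries as if softmax had been taken over both segments at once."""
    (m_a, l_a, acc_a), (m_b, l_b, acc_b) = a, b
    m = max(m_a, m_b)
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)  # rescale to the shared max
    return m, c_a * l_a + c_b * l_b, [c_a * x + c_b * y for x, y in zip(acc_a, acc_b)]


def finalize(stats: Stats) -> List[float]:
    _, normalizer, acc = stats
    return [x / normalizer for x in acc]


q = [0.5, -1.0]
K_hot, V_hot = [[1.0, 0.0], [0.2, 0.3]], [[1.0, 2.0], [3.0, 4.0]]
K_warm, V_warm = [[-0.5, 2.0], [4.0, 1.0], [0.0, 0.0]], [[5.0, 6.0], [7.0, 8.0], [9.0, 1.0]]

fused = finalize(merge(segment_stats(q, K_hot, V_hot), segment_stats(q, K_warm, V_warm)))
reference = finalize(segment_stats(q, K_hot + K_warm, V_hot + V_warm))
print(fused, reference)
```

The fused and reference outputs agree up to floating-point rounding, which is the sense in which the tiled computation involves no approximation beyond the quantization of the warm tier itself.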

Importance Scoring

# Hybrid scoring: attention accumulation + recency + decay
score[t] = (
    decay * score[t]                        # Exponential decay of the old score
    + attention_weight * attn_sum[t]        # How much attention this token gets
    + recency_weight * recency_bonus[t]     # Boost for recent tokens
)

Adaptive Eviction

Budget-aware eviction with protection zones:

Initial tokens: Always protected (system prompt, BOS)
Recent window: Last N tokens always in hot tier
Importance-ranked: Everything else is ranked by score, and the lowest-scoring tokens are demoted from the hot tier in batches (moved to a lower tier, not discarded)
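
The three rules above can be sketched as a single selection function. This is an illustrative reading, not the code in evictor.py: pick_demotions is a hypothetical helper, and the parameter names protect_first and protect_last are borrowed from AKVCache.for_model with assumed semantics.

```python
from typing import List


def pick_demotions(scores: List[float], hot_budget: int,
                   protect_first: int, protect_last: int) -> List[int]:
    """Choose which hot-tier positions to demote so that at most hot_budget remain."""
    n = len(scores)
    protected = set(range(min(protect_first, n))) | set(range(max(0, n - protect_last), n))
    # Only unprotected tokens compete; lowest scores are demoted first.
    candidates = sorted((i for i in range(n) if i not in protected), key=lambda i: scores[i])
    n_demote = max(0, n - hot_budget)
    return sorted(candidates[:n_demote])


scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.01]
demoted = pick_demotions(scores, hot_budget=5, protect_first=1, protect_last=2)
print(demoted)  # [2, 4, 5]
```

Positions 0 and 7 have the lowest scores overall but stay in the hot tier because they fall inside the protected zones.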

Project Structure

akv/
├── __init__.py           # Public API exports
├── drop_in.py            # AKVCache — zero-config drop-in for any HF model
├── affine_quantizer.py   # Block-affine quantizer (KIVI-style, default)
├── turbo_quant.py        # Rotation-based quantizer (legacy, kept as fallback)
├── diagnostics.py        # Model diagnostics & preset recommendation
├── server.py             # OpenAI-compatible HTTP server (akv-server)
├── production_cache.py   # Production-grade cache with monitoring
├── cache.py              # Core three-tier cache manager
├── importance.py         # Attention-based importance scoring
├── evictor.py            # Adaptive eviction policies
├── quantizer.py          # Group-wise asymmetric quantization
├── triton_ops.py         # Fused Triton kernels
├── triton_kernels.py     # Fused decode attention & quantize-evict
├── integration.py        # HuggingFace DynamicCache compatibility
├── hf_generate.py        # High-level generation API
├── vllm_integration.py   # vLLM cache engine integration
├── baselines.py          # H2O, KIVI, SnapKV, ScissorHands
├── evaluation.py         # Evaluation framework
├── async_migration.py    # Async tier migration
├── prefetch.py           # Prefetch scheduler
├── packed_layout.py      # Packed/paged KV memory layout
└── cli.py                # CLI entry point

benchmarks/
├── throughput_bench.py   # Tokens/second benchmarks
├── latency_bench.py      # TTFT, ITL, P99 latency
├── delayed_recall.py     # Long-context recall tests
├── production_bench.py   # Production workload benchmarks
└── dashboard.py          # HTML dashboard generator

docs/
├── architecture.md       # Mermaid diagrams
└── technical_blog.md     # Deep-dive blog post

tests/                    # Comprehensive test suite
notebooks/                # Experiment notebooks

Comparison with Prior Work

Feature	H2O	KIVI	SnapKV	ScissorHands	AKV (Ours)
Memory savings	✓ High	✓ High	✓ Medium	✓ High	✓ High
No quality loss	✗	~	~	✗	~ +0.9% PPL at 4-bit, +3.1% at 3-bit
Delayed recall	✗ Fails	~	✓ 100% passkey @ 4K	✗	✓ 99.6%+ passkey @ 4K
No info loss	✗ Evicts	✓	✗ Evicts	✗ Evicts	✓ Cold tier
Fused kernels	✗	✗	✗	✗	✓ Triton
Dynamic adaptation	✗ Static	✗ Static	✗ Static	~	✓ Continuous
vLLM integration	~	~	✗	✗	✓ Shim (upstream PR proposed)

Citation

@article{akv-cache-2026,
  title={AKV: A Virtual Memory System for LLM KV Cache with Retrieval-Preserving Hierarchical Compression},
  author={Arvind S.},
  year={2026},
  url={https://github.com/Arvind679715/adaptive-kv-memory}
}

License

Apache-2.0

Built for the frontier of efficient long-context inference.

Release history

1.4.0 (this version): Jun 9, 2026
1.3.0: Jun 9, 2026
1.2.0: Jun 3, 2026
1.1.0: Jun 1, 2026
1.0.0: Jun 1, 2026

Download files

Source Distribution

akv_cache-1.4.0.tar.gz (169.0 kB view details)

Uploaded Jun 9, 2026 Source

Built Distribution

akv_cache-1.4.0-py3-none-any.whl (154.8 kB view details)

Uploaded Jun 9, 2026 Python 3

File details

Details for the file akv_cache-1.4.0.tar.gz.

File metadata

Download URL: akv_cache-1.4.0.tar.gz
Upload date: Jun 9, 2026
Size: 169.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for akv_cache-1.4.0.tar.gz
Algorithm	Hash digest
SHA256	`a2f06446f8255f2914da159867580e0fa9351de4bfe891267870bf80f8a94e6e`
MD5	`fa0ccec26f63f12990dc50cd8b96015a`
BLAKE2b-256	`b39d866b008d0079a9f8417f3376f469d8480cbd4b9840a8dec58ed82397d58f`
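
The published digests can be checked locally before installing from a downloaded file. The following is a minimal sketch using only the standard library; sha256_of_file is an illustrative helper, and the commented usage assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED_SDIST_SHA256 = "a2f06446f8255f2914da159867580e0fa9351de4bfe891267870bf80f8a94e6e"

# Usage (after downloading the source distribution):
#   assert sha256_of_file("akv_cache-1.4.0.tar.gz") == EXPECTED_SDIST_SHA256
```

pip can enforce the same check at install time through hash-checking mode with a requirements file.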

Provenance

The following attestation bundles were made for akv_cache-1.4.0.tar.gz:

Publisher: publish.yml on Arvind679715/adaptive-kv-memory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: akv_cache-1.4.0.tar.gz
- Subject digest: a2f06446f8255f2914da159867580e0fa9351de4bfe891267870bf80f8a94e6e
- Sigstore transparency entry: 1768695636
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: Arvind679715/adaptive-kv-memory@2e9b788d8fba975cb4365afb64c0f34b313b96de
- Branch / Tag: refs/tags/v1.4.0
- Owner: https://github.com/Arvind679715
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2e9b788d8fba975cb4365afb64c0f34b313b96de
- Trigger Event: release

File details

Details for the file akv_cache-1.4.0-py3-none-any.whl.

File metadata

Download URL: akv_cache-1.4.0-py3-none-any.whl
Upload date: Jun 9, 2026
Size: 154.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for akv_cache-1.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c651fe794664e17e36fd67525eb77264234b2a2c6281da49df8c0f05c996d4c4`
MD5	`afb64b9c02fcd03620c3ecaf9dfd0899`
BLAKE2b-256	`2eefa1fd6620c08294e5f157d8ca04cba0fa47ab02bc85b391061daee7e4d430`

Provenance

The following attestation bundles were made for akv_cache-1.4.0-py3-none-any.whl:

Publisher: publish.yml on Arvind679715/adaptive-kv-memory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: akv_cache-1.4.0-py3-none-any.whl
- Subject digest: c651fe794664e17e36fd67525eb77264234b2a2c6281da49df8c0f05c996d4c4
- Sigstore transparency entry: 1768695812
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: Arvind679715/adaptive-kv-memory@2e9b788d8fba975cb4365afb64c0f34b313b96de
- Branch / Tag: refs/tags/v1.4.0
- Owner: https://github.com/Arvind679715
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2e9b788d8fba975cb4365afb64c0f34b313b96de
- Trigger Event: release
