FADE: Frequency-Adaptive Decay Encoding — attention-aware tiered KV cache compression for LLM inference.
FADE
Frequency-Adaptive Decay Encoding — drop-in KV cache compression for HuggingFace transformers. Shrinks the KV cache 3.5–12× with near-baseline quality (up to 23× in aggressive mode — validate on your workload).
```python
from fade import FadeConfig, create_tiered_cache

cache = create_tiered_cache(model, config=FadeConfig.safe())
out = model.generate(input_ids, past_key_values=cache, max_new_tokens=256)
```
Works with model.generate() — greedy, sampling, beam search. No manual decode loop needed.
How it works
Tokens live in tiers based on age and attention importance:
| Tier | What's stored | When |
|---|---|---|
| FP16 | Full precision | First N_SINK tokens + last RECENT_WINDOW tokens |
| INT4 | Bit-packed 4-bit | Middle-aged tokens (the bulk of the cache) |
| INT2 | Grouped 2-bit | Optional deeper compression (lossy) |
| PQ | Product-quantized codes | ~2 bits/element via trained codebook (Phase 3) |
| Evicted | Nothing | Dropped when INT4_BUDGET is finite |
When tokens are evicted, surviving K tensors are un-RoPE'd at old positions and re-RoPE'd with contiguous StreamingLLM positions.
```mermaid
flowchart LR
    A["New Token"] --> B["FP16 Tier\n(sinks + recent)"]
    B -->|reassign| C["INT4 Tier\n(middle)"]
    C -->|budget full| D["INT2 / PQ\n(deep compress)"]
    D -->|evict| E["Dropped\n(re-RoPE survivors)"]
    style A fill:#4CAF50,color:#fff
    style B fill:#2196F3,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#f44336,color:#fff
    style E fill:#9E9E9E,color:#fff
```
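On a single RoPE channel pair, the un-RoPE/re-RoPE step amounts to undoing one rotation and applying another. A minimal sketch (illustrative only; the real implementation in `rope.py` operates on full K tensors in float32):

```python
import math

def rope_rotate(x0, x1, pos, inv_freq):
    # rotate one (x0, x1) channel pair by pos * inv_freq radians
    c, s = math.cos(pos * inv_freq), math.sin(pos * inv_freq)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

def re_rope(x0, x1, old_pos, new_pos, inv_freq):
    # undo the rotation applied at old_pos, then re-apply at new_pos,
    # so surviving keys sit at contiguous StreamingLLM positions
    u0, u1 = rope_rotate(x0, x1, -old_pos, inv_freq)
    return rope_rotate(u0, u1, new_pos, inv_freq)
```

Re-RoPE-ing a key cached at position 10 down to position 3 yields exactly the key that would have been produced at position 3 in the first place.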
Compression at a glance
| Config | KV cache | Compression |
|---|---|---|
| Baseline FP16 | 112.00 MiB | 1× |
| Safe (INT4) | 31.24 MiB | 3.6× |
| Rotated 2-bit | 17.70 MiB | 6.3× |
| Balanced (eviction) | 9.30 MiB | 12.0× |
| Aggressive | 4.77 MiB | 23.5× |
Qwen2.5-7B-Instruct, 2048 tokens, DGX Spark. Needle: 4/4 PASS (512–4096). WikiText-2 PPL: 6.56.
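The baseline figure follows directly from the model shape. A quick sanity check, assuming the published Qwen2.5-7B-Instruct config (28 layers, 4 KV heads under GQA, head_dim 128) and FP16 at 2 bytes/element:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each hold [n_layers, n_kv_heads, seq_len, head_dim] elements
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

# Qwen2.5-7B-Instruct at 2048 tokens
mib = kv_cache_bytes(28, 4, 128, 2048) / 2**20  # -> 112.0 MiB, matching the table
```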
Fused kernel speed
| Path | Latency | vs FP16 |
|---|---|---|
| FP16 FlashAttention | 0.133 ms | 1.0× |
| Fused INT4 (FADE) | 0.189 ms | 1.4× slower |
| Dequant + SDPA (old) | 0.932 ms | 7.0× slower |
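The fused kernel reads nibble-packed INT4 codes directly: two unsigned 4-bit values per byte. A minimal sketch of that packing (illustrative only; FADE's actual layout lives in `quant.py` and may differ):

```python
def pack_int4(vals):
    # pack pairs of unsigned 4-bit codes (0..15) into single bytes
    assert len(vals) % 2 == 0
    return bytes((vals[i] << 4) | vals[i + 1] for i in range(0, len(vals), 2))

def unpack_int4(packed):
    # recover the 4-bit codes from each byte
    out = []
    for b in packed:
        out.append(b >> 4)    # high nibble
        out.append(b & 0x0F)  # low nibble
    return out
```

Packing halves storage relative to INT8 and quarters it relative to FP16; the fused kernel avoids ever materializing the unpacked FP16 tensors.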
How FADE compares (2026)
| | FADE | kvpress (NVIDIA) | TurboQuant (Google, ICLR 2026) | KVTC (NVIDIA, ICLR 2026) |
|---|---|---|---|---|
| Approach | Tiered quant + eviction + re-RoPE | Token eviction / scoring (30+ methods) | Rotation + Lloyd-Max codebook | PCA + DP bit allocation + entropy coding |
| Compression | 3.5–12× (23× aggressive) | 2–10× (eviction only) | 4–6× (3.5-bit zero-loss claimed) | 6–20× (up to 40× with entropy) |
| Quantization | INT4/INT2/PQ + rotated 2-bit | Via HF QuantizedCache | 3–4 bit | 1–6 bit adaptive |
| Eviction | H2O, EMA, position, adaptive, learned | 30+ methods (SnapKV, TOVA, KVzap, etc.) | None | None |
| Re-RoPE | ✅ StreamingLLM contiguous | Partial (KeyRerotationPress, FinchPress) | ❌ | ✅ (undo before PCA) |
| Fused kernel | ✅ Triton INT4 FlashAttn | ❌ | ✅ Triton fused | ✅ Triton |
| HF generate() | ✅ Drop-in | Pipeline + context manager | ✅ Drop-in | ❌ |
| Serving | ✅ fade-server (OpenAI API) | ❌ | ✅ vLLM / SGLang integration | ❌ |
| Hybrid models | ✅ Qwen 3.5 DeltaNet skip | ❌ | ❌ | ❌ |
| Per-sequence batching | ✅ Ragged tiers | ❌ | ❌ | ❌ |
| Stars | New | 1K+ | 1K+ (across implementations) | ~10 |
| Install | `pip install fade-kv` | `pip install kvpress` | `pip install turboquant-kv` | From source |
FADE's unique advantage: the only system that combines quantization + attention-aware eviction + correct re-RoPE in one drop-in cache.
Install
From PyPI:
```shell
pip install fade-kv
pip install fade-kv[server]  # adds fade-server CLI
```
From source (development):
```shell
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows PowerShell; on Linux/macOS: source .venv/bin/activate
pip install torch             # match your CUDA version: https://pytorch.org/get-started/locally/
pip install -e ".[dev]"
```
Optional extras: `pip install fade-kv[cuda]` (accelerate), `fade-kv[eval]` (datasets), `fade-kv[codebook]` (scikit-learn for PQ).
Quick start
Presets
```python
from fade import FadeConfig, create_tiered_cache

# Safe: ~3-4x compression, no eviction.
cache = create_tiered_cache(model, config=FadeConfig.safe())

# Balanced: ~5x compression with H2O eviction.
cache = create_tiered_cache(model, config=FadeConfig.balanced())

# Aggressive: ~7-8x compression. Validate on your workload first.
cache = create_tiered_cache(model, config=FadeConfig.aggressive())
```
Custom config
```python
cache = create_tiered_cache(model, config=FadeConfig(
    phase="2",
    n_sink=4,
    recent_window=64,
    int4_budget=400,
    eviction_policy="h2o",  # "h2o", "ema", "position", or "learned"
    middle_k_bits=4,        # K stays INT4 (outlier-sensitive)
    middle_v_bits=2,        # V at INT2 (~30% more compression)
))
```
Rotated 2-bit backend (~6× compression)
```python
from fade import FadeConfig, create_tiered_cache
from fade.backends import get_backend

cache = create_tiered_cache(model, config=FadeConfig.safe(),
                            quant_backend=get_backend("rotated", head_dim=64, bits=2))
```
Random orthogonal rotation spreads per-channel outliers before quantization, making 2-bit viable. Uses native PyTorch — no external dependencies.
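Why rotation helps, in miniature: an orthogonal rotation preserves dot products (so attention scores survive dequantization) while spreading an outlier's energy across channels, which shrinks the max-abs quantization scale. A 2-D sketch with a single Givens rotation standing in for FADE's random orthogonal matrix:

```python
import math

def rot2(x0, x1, theta):
    # 2-D orthogonal (Givens) rotation
    c, s = math.cos(theta), math.sin(theta)
    return (c * x0 - s * x1, s * x0 + c * x1)

# one outlier channel dominates the max-abs quantization scale...
k = (10.0, 0.1)
rk = rot2(k[0], k[1], math.pi / 4)  # ...until rotation spreads its energy

# dot products (hence attention logits) are preserved
# when q and k share the same rotation
q = (1.0, 2.0)
rq = rot2(q[0], q[1], math.pi / 4)
```

After rotation the largest magnitude drops from 10.0 to about 7.1, so the same 2-bit grid covers the data with roughly 30% finer steps.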
Manual decode with tier reassignment
```python
from fade import FadeConfig, create_tiered_cache
from fade.patch import forward_with_tracking, load_model
from fade.policy import reassign_tiers
from fade.tracker import AttentionTracker

model, tokenizer = load_model("Qwen/Qwen2.5-3B-Instruct", attn_impl="auto", need_attentions=True)
cache = create_tiered_cache(model, config=FadeConfig.balanced())
tracker = AttentionTracker(num_layers=model.config.num_hidden_layers)

out = forward_with_tracking(model, input_ids, cache, tracker=tracker)  # prefill
for step in range(max_tokens):
    next_token = out.logits[:, -1:].argmax(dim=-1)  # greedy pick
    out = forward_with_tracking(model, next_token, cache, tracker=tracker)
    if (step + 1) % 64 == 0:  # periodically demote/evict middle tokens
        reassign_tiers(cache, tracker, model.config.num_hidden_layers)
```
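The tracker accumulates per-token attention statistics across decode steps. A minimal sketch of a per-token EMA (hypothetical class; FADE's `AttentionTracker` keeps per-layer state):

```python
class EmaTracker:
    # exponential moving average of attention mass per cached token
    def __init__(self, decay=0.9):
        self.decay = decay
        self.scores = []

    def update(self, step_attn):
        # step_attn[i]: attention the newest query put on cached token i
        while len(self.scores) < len(step_attn):
            self.scores.append(0.0)  # newly cached tokens start at zero
        self.scores = [self.decay * s + (1 - self.decay) * a
                       for s, a in zip(self.scores, step_attn)]
```

Tokens whose EMA decays toward zero are the ones a policy like `ema` demotes to lower tiers or evicts.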
Eviction policies
| Policy | Quality | Speed | Needs attention? |
|---|---|---|---|
| `h2o` | Best | Normal | Yes (prefill only) |
| `ema` | Good | Normal | Yes (decode only) |
| `adaptive` | Good | Normal | Yes (decode EMA) |
| `position` | Fair | Fast | No |
| `learned` | Good* | Fast | No |
`adaptive` splits middle tokens by attention score: high → INT4, low → INT2, lowest → evict.

\*The `learned` policy requires a trained checkpoint: `python scripts/train_eviction_mlp.py`
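A minimal sketch of the `adaptive` split (hypothetical fractions and function name; FADE's actual thresholds live in `policy.py`):

```python
def assign_tiers(scores, hi_frac=0.5, lo_frac=0.25):
    # rank middle tokens by attention score:
    # top hi_frac -> INT4, bottom lo_frac -> evict, the rest -> INT2
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    tiers = [None] * n
    for rank, i in enumerate(order):
        if rank < int(n * hi_frac):
            tiers[i] = "INT4"
        elif rank < int(n * (1 - lo_frac)):
            tiers[i] = "INT2"
        else:
            tiers[i] = "evict"
    return tiers
```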
Supported models
FADE auto-detects the RoPE scheme from the model config:
- Qwen2 / Qwen3 — vanilla RoPE, GQA
- Llama / Llama-3.1 — vanilla + frequency-dependent scaling
- Mistral — vanilla RoPE, sliding-window
- Phi-3 — vanilla RoPE
- Gemma-2 — vanilla RoPE
- Gemma 4 — proportional RoPE with `partial_rotary_factor` + per-layer-type dispatch
- Falcon — ALiBi (non-RoPE; re-RoPE is a no-op)
- Qwen 3.5 / 3.6 — hybrid DeltaNet + softmax attention. FADE auto-detects `layer_types` and skips DeltaNet layers (only full-attention layers are tiered).

RoPE scaling types: `linear`, `llama3`, `ntk`, `dynamic`, `yarn`, `proportional`. Non-RoPE models (ALiBi, Bloom, MPT) work via the `NoRope` sentinel.
Batching
Two modes:
- Shared-tier (default): all rows share positions and tier decisions. For lockstep decoding.
- Per-sequence (`apply_tier_assignment_per_sequence`): each row gets independent `[B, S]` tiers. For continuous batching where sequences diverge.
Performance
- Fused INT4 FlashAttention kernel — single Triton kernel reads packed INT4 K/V, computes attention with online softmax, writes fp16 output. Never materializes fp16 K/V. 4.9× faster than the old dequant+SDPA path, within 1.4× of pure fp16 FlashAttention on RTX 3060.
- Pre-allocated FP16 buffer — a doubling buffer eliminates `torch.cat` on every decode step.
- torch.compile — `cache.enable_compile()` wraps `_materialize` between graph-break boundaries.
- Dequant-cache age eviction — `cache.max_dequant_age = N` periodically refreshes cached dequant buffers.
- Benchmarks — `python benchmarks/tps.py` (decode throughput), `python benchmarks/divergence.py` (quality).
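The doubling buffer is the same trick behind dynamic arrays: grow capacity geometrically so appends are amortized O(1) and no per-step concatenation is needed. A minimal Python sketch of the idea (hypothetical class, not FADE's internal buffer, which holds fp16 tensors):

```python
class DoublingBuffer:
    # amortized O(1) append without reallocating every step
    def __init__(self, capacity=4):
        self.data = [0.0] * capacity
        self.length = 0

    def append(self, x):
        if self.length == len(self.data):
            self.data.extend([0.0] * len(self.data))  # double capacity
        self.data[self.length] = x
        self.length += 1

    def view(self):
        # logical contents; the spare tail capacity stays hidden
        return self.data[:self.length]
```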
```python
# Use the fused kernel directly:
from fade.kernels.fused_int4_attn import fused_int4_sdpa

out = fused_int4_sdpa(q, k_packed, k_scale, v_packed, v_scale)
```
Inference server
OpenAI-compatible API with automatic tier management:
```shell
fade-server --model Qwen/Qwen2.5-0.5B-Instruct --preset balanced --port 8000

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":100}"
```
Endpoints: /v1/chat/completions (greedy + sampling), /v1/models, /health.
Checkpointing
```python
sd = cache.cache_state_dict()
torch.save(sd, "cache.pt")
cache.load_cache_state_dict(torch.load("cache.pt"))
```
Observability
```python
from fade.telemetry import JsonlExporter, attach_telemetry

attach_telemetry(cache, JsonlExporter("events.jsonl"))
```

Debug dump: `cache.dump_debug("snapshot.json")`
PQ codebook
```python
from fade.codebook import PQCodebook

cb = PQCodebook.train(calibration_vectors, sub_dim=32, num_centroids=256)
cache.set_codebooks(cb)  # enables TIER_PQ in tier assignment
```

Train codebooks from a real model: `python scripts/train_codebook.py`
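Product quantization in miniature: split each vector into sub-vectors and store only the index of the nearest centroid per slot. A toy sketch with hand-picked centroids (`PQCodebook` trains real centroids from calibration data via scikit-learn):

```python
def nearest(centroids, sub):
    # index of the closest centroid by squared L2 distance
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], sub)))

def pq_encode(vec, codebooks, sub_dim):
    # one centroid index per sub-vector slot
    subs = [vec[i:i + sub_dim] for i in range(0, len(vec), sub_dim)]
    return [nearest(cb, s) for cb, s in zip(codebooks, subs)]

def pq_decode(codes, codebooks):
    # concatenate the chosen centroids to reconstruct an approximation
    out = []
    for cb, c in zip(codebooks, codes):
        out.extend(cb[c])
    return out
```

With 256 centroids per slot, each sub-vector costs one byte; for `sub_dim=32` that is 8 bits per 32 elements on top of the stored codebooks, i.e. the ~2 bits/element the tier table quotes comes from this index-plus-codebook accounting.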
Results
DGX Spark — Qwen2.5-7B-Instruct (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 112.00 MiB | 1.0× | 13.3 tok/s |
| Safe (INT4) | 31.24 MiB | 3.6× | 13.3 tok/s |
| Rotated 2-bit | 17.70 MiB | 6.3× | 13.3 tok/s |
| Balanced (eviction) | 9.30 MiB | 12.0× | 13.3 tok/s |
| Aggressive | 4.77 MiB | 23.5× | 13.3 tok/s |
Needle: 4/4 PASS (512–4096 tokens). WikiText-2 baseline PPL: 6.56.
DGX Spark — Mistral-7B-Instruct-v0.3 (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 256.00 MiB | 1.0× | 15.3 tok/s |
| Safe (INT4) | 71.40 MiB | 3.6× | 15.2 tok/s |
| Rotated 2-bit | 40.47 MiB | 6.3× | 15.2 tok/s |
| Balanced (eviction) | 21.26 MiB | 12.0× | 15.2 tok/s |
| Aggressive | 10.91 MiB | 23.5× | 15.2 tok/s |
Needle: 4/4 PASS (512–4096 tokens). WikiText-2 baseline PPL: 4.98.
DGX Spark — Llama-3.1-8B-Instruct (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 256.00 MiB | 1.0× | 14.4 tok/s |
| Safe (INT4) | 71.40 MiB | 3.6× | 14.3 tok/s |
| Rotated 2-bit | 40.47 MiB | 6.3× | 14.3 tok/s |
| Balanced (eviction) | 21.26 MiB | 12.0× | 14.3 tok/s |
| Aggressive | 10.91 MiB | 23.5× | 14.3 tok/s |
Needle: 4/4 PASS (512–4096 tokens). WikiText-2 baseline PPL: 6.45. Uses Llama-3.1's frequency-dependent RoPE scaling.
All DGX Spark benchmarks: NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory).
RTX 3060 — Qwen2.5-0.5B-Instruct (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 24.00 MiB | 1.0× | 128.5 tok/s |
| Safe (INT4) | 6.78 MiB | 3.5× | 125.8 tok/s |
| Rotated 2-bit | 3.88 MiB | 6.2× | 125.9 tok/s |
| Balanced (eviction) | 2.01 MiB | 11.9× | 125.8 tok/s |
| Aggressive | 1.03 MiB | 23.3× | 125.9 tok/s |
Needle: 4/4 PASS (512–4096 tokens). Baseline FP16 PPL: 1.24. TPS overhead: ~2%.
Fused Triton kernel (RTX 3060)
| Path | Time | vs FP16 |
|---|---|---|
| FP16 SDPA | 0.133 ms | 1.0× |
| Dequant + SDPA (old) | 0.932 ms | 7.0× slower |
| Fused INT4 (new) | 0.189 ms | 1.4× slower |
Run the benchmarks yourself: `python benchmarks/production_suite.py`, `python benchmarks/full_suite.py`
Project layout
```
fade/
  cache.py           # TieredKVCache with 5 tiers (FP16/INT4/INT2/PQ/evict)
  config.py          # FadeConfig with presets
  quant.py           # INT4/INT2 quantization + bit-packing
  rope.py            # 7 RoPE schemes incl. Gemma 4 proportional
  policy.py          # Tier assignment: h2o, ema, position
  learned_policy.py  # Learned eviction MLP
  tracker.py         # AttentionTracker (per-layer EMA)
  patch.py           # load_model, create_tiered_cache, forward_with_tracking
  codebook.py        # PQ codebook train/encode/decode
  telemetry.py       # Structured telemetry + exporters
  kernels/           # Fused INT4 FlashAttention kernel + unpack kernel + fallback
  serving/           # vLLM / SGLang adapter stubs
  eval/              # Perplexity, needle, quality suite
examples/            # quickstart.py
experiments/         # run_baseline.py, run_tiered.py
benchmarks/          # tps.py, divergence.py
scripts/             # train_eviction_mlp.py, train_codebook.py
tests/               # 136 tests, all CPU, no downloads
```
Gotchas
- Attention impl: `eager` only needed for H2O prefill. Use `load_model(attn_impl="auto")`.
- Transformers version: verified on 4.45 and 5.3. Weekly canary CI runs against `transformers@main`.
- Memory: use `cache.compressed_storage_bytes()`, not `nvidia-smi`.
- RoPE precision: all math in float32, cast through model dtype to match rounding.
- Hybrid models: Qwen 3.5/3.6 DeltaNet layers are auto-skipped — only full-attention layers are tiered.
- Triton kernels: fused attention via `fused_int4_sdpa()`, unpack-only via `int4_sdpa(force_triton=True)`. Run `check_fused_parity()` to validate on your hardware.
Citations
FADE builds on ideas from these papers (all independently reimplemented — see NOTICE for details):
- H2O — Zhang et al., 2023. Heavy-Hitter Oracle for Efficient Generative Inference of LLMs. arXiv:2306.14048
- StreamingLLM — Xiao et al., 2023. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453
- KIVI — Liu et al., 2024. A Tuning-Free KV Cache Quantization Algorithm. arXiv:2402.02750
- TurboQuant — Zandieh et al., ICLR 2026. Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874
- KVTC — Staniszewski & Łańcucki, ICLR 2026. KV Cache Transform Coding for Compact Storage in LLM Inference. arXiv:2511.01815
- KnormPress — Devoto et al., 2024. A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression. arXiv:2406.11430
If you use FADE in your work:
```bibtex
@software{fade2026,
  title  = {FADE: Frequency-Adaptive Decay Encoding},
  author = {Branislav Đalić},
  url    = {https://github.com/Omodaka9375/fade},
  year   = {2026},
}
```
License