# FADE

Frequency-Adaptive Decay Encoding — attention-aware, tiered, drop-in KV cache compression for HuggingFace transformers. Shrinks the KV cache 3–23× depending on config, with near-baseline quality.
```python
from fade import FadeConfig, create_tiered_cache

cache = create_tiered_cache(model, config=FadeConfig.safe())
out = model.generate(input_ids, past_key_values=cache, max_new_tokens=256)
```
Works with `model.generate()` — greedy, sampling, and beam search. No manual decode loop needed.
## How it works
Tokens live in tiers based on age and attention importance:
| Tier | What's stored | Which tokens |
|---|---|---|
| FP16 | Full precision | First `N_SINK` tokens + last `RECENT_WINDOW` tokens |
| INT4 | Bit-packed 4-bit | Middle-aged tokens (the bulk of the cache) |
| INT2 | Grouped 2-bit | Optional deeper compression (lossy) |
| PQ | Product-quantized codes | ~2 bits/element via trained codebook (Phase 3) |
| Evicted | Nothing | Dropped when `INT4_BUDGET` is finite |
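For intuition, here is roughly what the INT4 tier's quantize-and-pack step can look like (a minimal sketch with symmetric per-token scales; `fade/quant.py` may choose scale granularity and packing layout differently):

```python
import torch

def quantize_int4(x: torch.Tensor):
    # x: [tokens, head_dim] fp16. Symmetric per-token scale over head_dim.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
    q = torch.clamp(torch.round(x.float() / scale) + 8, 0, 15).to(torch.uint8)
    packed = q[..., 0::2] | (q[..., 1::2] << 4)  # two 4-bit codes per byte
    return packed, scale

def dequantize_int4(packed: torch.Tensor, scale: torch.Tensor):
    lo, hi = packed & 0x0F, packed >> 4
    q = torch.stack((lo, hi), dim=-1).flatten(-2)  # undo the interleave
    return (q.to(torch.int16) - 8).to(scale.dtype) * scale
```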
When tokens are evicted, surviving K tensors are un-RoPE'd at old positions and re-RoPE'd with contiguous StreamingLLM positions.
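Un-RoPE followed by re-RoPE is rotation by the negated old position and then by the new one, since RoPE rotations compose additively. A minimal sketch for vanilla RoPE (`fade/rope.py` handles the full scheme dispatch; `apply_rope` below is illustrative):

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(k, positions, inv_freq):
    # k: [tokens, head_dim]; positions: [tokens]; inv_freq: [head_dim // 2].
    # All math in float32 (see the RoPE-precision gotcha below).
    angles = positions[:, None].float() * inv_freq[None, :].float()
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return (k.float() * cos + rotate_half(k.float()) * sin).to(k.dtype)

def rerope_keys(k, old_pos, new_pos, inv_freq):
    # Rotating by -old_pos undoes the original RoPE; rotating by new_pos
    # re-applies it at the contiguous StreamingLLM position.
    return apply_rope(apply_rope(k, -old_pos, inv_freq), new_pos, inv_freq)
```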
## Install

From PyPI:

```bash
pip install fade-kv
pip install fade-kv[server]  # adds the fade-server CLI
```
From source (development):

```bash
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows PowerShell; on Linux/macOS: source .venv/bin/activate
pip install torch             # match your CUDA version: https://pytorch.org/get-started/locally/
pip install -e ".[dev]"
```
Optional extras: `pip install fade-kv[cuda]` (accelerate), `fade-kv[eval]` (datasets), `fade-kv[codebook]` (scikit-learn for PQ).
## Quick start

### Presets

```python
from fade import FadeConfig, create_tiered_cache

# Safe: ~3-4x compression, 100% greedy match. No eviction.
cache = create_tiered_cache(model, config=FadeConfig.safe())

# Balanced: ~5x compression with H2O eviction.
cache = create_tiered_cache(model, config=FadeConfig.balanced())

# Aggressive: ~7-8x compression. Validate on your workload first.
cache = create_tiered_cache(model, config=FadeConfig.aggressive())
```
### Custom config

```python
cache = create_tiered_cache(model, config=FadeConfig(
    phase="2",
    n_sink=4,
    recent_window=64,
    int4_budget=400,
    eviction_policy="h2o",  # "h2o", "ema", "position", or "learned"
    middle_k_bits=4,        # K stays INT4 (outlier-sensitive)
    middle_v_bits=2,        # V at INT2 (~30% more compression)
))
```
### Rotated 2-bit backend (~6× compression)

```python
from fade.backends import get_backend

cache = create_tiered_cache(
    model,
    config=FadeConfig.safe(),
    quant_backend=get_backend("rotated", head_dim=64, bits=2),
)
```
Random orthogonal rotation spreads per-channel outliers before quantization, making 2-bit viable. Uses native PyTorch — no external dependencies.
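The trick itself fits in a few lines of plain PyTorch. A sketch (illustrative only; the `rotated` backend's actual matrix construction may differ):

```python
import torch

head_dim = 64
gen = torch.Generator().manual_seed(0)  # fixed seed: the same Q must be reused at dequant time
Q, _ = torch.linalg.qr(torch.randn(head_dim, head_dim, generator=gen))

def rotate(x):
    # Mixing channels spreads outliers so no single channel dominates
    # the quantization range. Quantize rotate(x) instead of x.
    return x.to(Q.dtype) @ Q

def unrotate(x_hat):
    # Q is orthogonal, so Q.T is its exact inverse.
    return x_hat @ Q.T
```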
### Manual decode with tier reassignment

```python
from fade.patch import forward_with_tracking, load_model
from fade.policy import reassign_tiers
from fade.tracker import AttentionTracker

model, tokenizer = load_model("Qwen/Qwen2.5-3B-Instruct", attn_impl="auto", need_attentions=True)
cache = create_tiered_cache(model, config=FadeConfig.balanced())
tracker = AttentionTracker(num_layers=model.config.num_hidden_layers)

out = forward_with_tracking(model, input_ids, cache, tracker=tracker)
for step in range(max_tokens):
    # next_token: the token chosen from the previous step's output
    out = forward_with_tracking(model, next_token, cache, tracker=tracker)
    if (step + 1) % 64 == 0:
        reassign_tiers(cache, tracker, model.config.num_hidden_layers)
```
## Eviction policies

| Policy | Quality | Speed | Needs attention? |
|---|---|---|---|
| `h2o` | Best | Normal | Yes (prefill only) |
| `ema` | Good | Normal | Yes (decode only) |
| `adaptive` | Good | Normal | Yes (decode EMA) |
| `position` | Fair | Fast | No |
| `learned` | Good\* | Fast | No |
`adaptive` splits middle tokens by attention score: high → INT4, low → INT2, lowest → evict (sketched below).

\*The learned policy requires a trained checkpoint: `python scripts/train_eviction_mlp.py`
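In sketch form, the `adaptive` split is a pair of thresholds over per-token attention mass (the quantile cut-points below are invented for illustration; the real thresholds live in `fade/policy.py`):

```python
import torch

def split_middle(scores: torch.Tensor):
    # scores: [tokens] accumulated attention mass for middle-region tokens.
    hi, lo = torch.quantile(scores, torch.tensor([0.75, 0.25]))
    keep_int4 = scores >= hi         # most-attended tokens stay at 4 bits
    evict = scores < lo              # least-attended tokens are dropped
    keep_int2 = ~keep_int4 & ~evict  # the band in between gets 2 bits
    return keep_int4, keep_int2, evict
```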
## Supported models

FADE auto-detects the RoPE scheme from the model config:

- Qwen2 / Qwen3 — vanilla RoPE, GQA
- Llama / Llama-3.1 — vanilla + frequency-dependent scaling
- Mistral — vanilla RoPE, sliding-window
- Phi-3 — vanilla RoPE
- Gemma-2 — vanilla RoPE
- Gemma 4 — proportional RoPE with `partial_rotary_factor` + per-layer-type dispatch
- Falcon — ALiBi (non-RoPE; re-RoPE is a no-op)
- Qwen 3.5 / 3.6 — hybrid DeltaNet + softmax attention. FADE auto-detects `layer_types` and skips DeltaNet layers (only full-attention layers are tiered).

RoPE scaling types: linear, llama3, ntk, dynamic, yarn, proportional. Non-RoPE models (ALiBi, Bloom, MPT) work via the `NoRope` sentinel.
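In practice this detection reduces to reading standard HF config fields. A rough sketch (illustrative only; `fade/rope.py` implements the real dispatch, including the per-layer-type cases):

```python
def detect_rope_scheme(config) -> str:
    # ALiBi-style models (Falcon, Bloom, MPT) carry no rope_theta.
    if getattr(config, "rope_theta", None) is None:
        return "no_rope"
    scaling = getattr(config, "rope_scaling", None) or {}
    # Newer HF configs key the scheme as "rope_type"; older ones used "type".
    return scaling.get("rope_type", scaling.get("type", "default"))
```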
## Batching

Two modes:

- Shared-tier (default): all rows share positions and tier decisions. For lockstep decoding.
- Per-sequence (`apply_tier_assignment_per_sequence`): each row gets independent `[B, S]` tiers. For continuous batching where sequences diverge.
## Performance

- Fused INT4 FlashAttention kernel — a single Triton kernel reads packed INT4 K/V, computes attention with online softmax, and writes fp16 output. It never materializes fp16 K/V: 4.9× faster than the old dequant+SDPA path, within 1.4× of pure fp16 FlashAttention on an RTX 3060.
- Pre-allocated FP16 buffer — a doubling buffer eliminates `torch.cat` on every decode step.
- torch.compile — `cache.enable_compile()` wraps `_materialize` between graph-break boundaries.
- Dequant-cache age eviction — `cache.max_dequant_age = N` periodically refreshes cached dequant buffers.
- Benchmarks — `python benchmarks/tps.py` (decode throughput), `python benchmarks/divergence.py` (quality).

```python
# Use the fused kernel directly:
from fade.kernels.fused_int4_attn import fused_int4_sdpa

out = fused_int4_sdpa(q, k_packed, k_scale, v_packed, v_scale)
```
## Inference server

OpenAI-compatible API with automatic tier management:

```bash
fade-server --model Qwen/Qwen2.5-0.5B-Instruct --preset balanced --port 8000

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":100}"
```

Endpoints: `/v1/chat/completions` (greedy + sampling), `/v1/models`, `/health`.
## Checkpointing

```python
sd = cache.cache_state_dict()
torch.save(sd, "cache.pt")

cache.load_cache_state_dict(torch.load("cache.pt"))
```
## Observability

```python
from fade.telemetry import JsonlExporter, attach_telemetry

attach_telemetry(cache, JsonlExporter("events.jsonl"))
```

Debug dump: `cache.dump_debug("snapshot.json")`
## PQ codebook

```python
from fade.codebook import PQCodebook

cb = PQCodebook.train(calibration_vectors, sub_dim=32, num_centroids=256)
cache.set_codebooks(cb)  # enables TIER_PQ in tier assignment
```

Train codebooks from a real model: `python scripts/train_codebook.py`
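Conceptually, PQ splits each head-dim vector into sub-vectors and stores one centroid index per sub-vector. A round-trip sketch, assuming a `[num_sub, num_centroids, sub_dim]` codebook tensor (not necessarily the library's internal layout):

```python
import torch

def pq_encode(x, codebook):
    # x: [tokens, head_dim] -> codes: [tokens, num_sub] (uint8 for <=256 centroids)
    num_sub, _, sub_dim = codebook.shape
    subs = x.view(x.shape[0], num_sub, sub_dim).transpose(0, 1)  # [num_sub, tokens, sub_dim]
    dists = torch.cdist(subs, codebook)                          # [num_sub, tokens, K]
    return dists.argmin(dim=-1).transpose(0, 1).to(torch.uint8)

def pq_decode(codes, codebook):
    # Look up each sub-vector's centroid and concatenate (lossy reconstruction).
    parts = [codebook[s][codes[:, s].long()] for s in range(codebook.shape[0])]
    return torch.cat(parts, dim=-1)  # [tokens, head_dim]
```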
## Results

Benchmarked on Qwen2.5-0.5B-Instruct, 2048 tokens, RTX 3060 12GB.

### Compression
| Config | KV cache | Compression | Notes |
|---|---|---|---|
| Baseline FP16 | 24.00 MiB | 1.0× | |
| Safe (INT4, no eviction) | 6.78 MiB | 3.5× | 100% greedy match |
| Rotated 2-bit | 3.88 MiB | 6.2× | Rotation + 2-bit packing |
| Balanced (INT4 + eviction) | 2.01 MiB | 11.9× | Position-based eviction |
| Aggressive | 1.03 MiB | 23.3× | Smaller budget |
### Quality
| Test | Result |
|---|---|
| Needle @512 tokens | ✅ PASS |
| Needle @1024 tokens | ✅ PASS |
| Needle @2048 tokens | ✅ PASS |
| Baseline PPL | 1.24 |
### Performance (fused Triton kernel)

| Path | Time | vs FP16 |
|---|---|---|
| FP16 SDPA | 0.133 ms | 1.0× (baseline) |
| Dequant + SDPA (old) | 0.932 ms | 7.0× slower |
| Fused INT4 (new) | 0.189 ms | 1.4× slower |
Run the benchmarks yourself: `python benchmarks/full_suite.py`, `python benchmarks/pareto.py --csv pareto.csv`
## Project layout

```text
fade/
  cache.py           # TieredKVCache with 5 tiers (FP16/INT4/INT2/PQ/evict)
  config.py          # FadeConfig with presets
  quant.py           # INT4/INT2 quantization + bit-packing
  rope.py            # 7 RoPE schemes incl. Gemma 4 proportional
  policy.py          # Tier assignment: h2o, ema, position
  learned_policy.py  # Learned eviction MLP
  tracker.py         # AttentionTracker (per-layer EMA)
  patch.py           # load_model, create_tiered_cache, forward_with_tracking
  codebook.py        # PQ codebook train/encode/decode
  telemetry.py       # Structured telemetry + exporters
  kernels/           # Fused INT4 FlashAttention kernel + unpack kernel + fallback
  serving/           # vLLM / SGLang adapter stubs
  eval/              # Perplexity, needle, quality suite
examples/            # quickstart.py
experiments/         # run_baseline.py, run_tiered.py
benchmarks/          # tps.py, divergence.py
scripts/             # train_eviction_mlp.py, train_codebook.py
tests/               # 136 tests, all CPU, no downloads
```
## Gotchas

- Attention impl: `eager` is only needed for H2O prefill. Use `load_model(attn_impl="auto")`.
- Transformers version: verified on 4.45 and 5.3. A weekly canary CI runs against `transformers@main`.
- Memory: use `cache.compressed_storage_bytes()`, not `nvidia-smi`.
- RoPE precision: all math is done in float32, then cast through the model dtype to match rounding.
- Hybrid models: Qwen 3.5/3.6 DeltaNet layers are auto-skipped — only full-attention layers are tiered.
- Triton kernels: fused attention via `fused_int4_sdpa()`, unpack-only via `int4_sdpa(force_triton=True)`. Run `check_fused_parity()` to validate on your hardware.
## License

Apache-2.0. See `LICENSE`.