FADE

Frequency-Adaptive Decay Encoding — drop-in KV cache compression for HuggingFace transformers. Shrinks the KV cache by up to 23× depending on configuration, with near-baseline quality.

from fade import FadeConfig, create_tiered_cache

cache = create_tiered_cache(model, config=FadeConfig.safe())
out = model.generate(input_ids, past_key_values=cache, max_new_tokens=256)

Works with model.generate() — greedy, sampling, beam search. No manual decode loop needed.

How it works

Tokens live in tiers based on age and attention importance:

| Tier | What's stored | When |
| --- | --- | --- |
| FP16 | Full precision | First N_SINK tokens + last RECENT_WINDOW tokens |
| INT4 | Bit-packed 4-bit | Middle-aged tokens (the bulk of the cache) |
| INT2 | Grouped 2-bit | Optional deeper compression (lossy) |
| PQ | Product-quantized codes | ~2 bits/element via trained codebook (Phase 3) |
| Evicted | Nothing | Dropped when INT4_BUDGET is finite |

When tokens are evicted, surviving K tensors are un-RoPE'd at old positions and re-RoPE'd with contiguous StreamingLLM positions.
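
Since RoPE is a pure per-position rotation, survivors only need one extra rotation by the position delta. A minimal sketch in the HF-style rotate_half convention (hypothetical helper; FADE's real implementation lives in fade/rope.py):

import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def shift_rope(k, old_pos, new_pos, base=10000.0):
    # Undoing RoPE at old_pos and reapplying it at new_pos composes into a
    # single rotation by (new_pos - old_pos). Math in float32 (see Gotchas).
    head_dim = k.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = (new_pos - old_pos).float()[:, None] * inv_freq[None, :]  # [S, D/2]
    emb = torch.cat((angles, angles), dim=-1)                          # [S, D]
    return (k.float() * emb.cos() + rotate_half(k.float()) * emb.sin()).to(k.dtype)

# Survivors at old positions [4, 9, 10] become contiguous [4, 5, 6]:
k = torch.randn(3, 64)  # already-RoPE'd keys, [S, head_dim]
k_new = shift_rope(k, torch.tensor([4, 9, 10]), torch.tensor([4, 5, 6]))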

flowchart LR
    A["New Token"] --> B["FP16 Tier\n(sinks + recent)"]
    B -->|reassign| C["INT4 Tier\n(middle)"]
    C -->|budget full| D["INT2 / PQ\n(deep compress)"]
    D -->|evict| E["Dropped\n(re-RoPE survivors)"]
    style A fill:#4CAF50,color:#fff
    style B fill:#2196F3,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#f44336,color:#fff
    style E fill:#9E9E9E,color:#fff

Compression at a glance

| Config | KV cache | Compression |
| --- | --- | --- |
| Baseline FP16 | 24.00 MiB | 1.0× |
| Safe (INT4) | 6.78 MiB | 3.5× |
| Rotated 2-bit | 3.88 MiB | 6.2× |
| Balanced (eviction) | 2.01 MiB | 11.9× |
| Aggressive | 1.03 MiB | 23.3× |

Qwen2.5-0.5B-Instruct, 2048 tokens, RTX 3060. Needle: 3/3 PASS. PPL: 1.24.

Fused kernel speed

| Path | Latency | vs FP16 |
| --- | --- | --- |
| FP16 FlashAttention | 0.133 ms | 1.0× |
| Fused INT4 (FADE) | 0.189 ms | 1.4× |
| Dequant + SDPA (old) | 0.932 ms | 7.0× |

How FADE compares (2026)

| | FADE | kvpress (NVIDIA) | TurboQuant | KVTC |
| --- | --- | --- | --- | --- |
| Approach | Tiered quant + eviction + re-RoPE | Token eviction (20+ scoring methods) | Rotation + optimal codebook | PCA + DP bit allocation + entropy coding |
| Compression | 3.5–23× | 2–10× (eviction only) | 4–6× | 6–9× |
| Quantization | INT4/INT2/PQ + rotated 2-bit | None (drops tokens) | 3–4 bit | 1–6 bit |
| Eviction | H2O, EMA, position, adaptive, learned | 20+ methods (SnapKV, TOVA, etc.) | None | None |
| Re-RoPE | ✅ StreamingLLM contiguous | | | ✅ (undo before PCA) |
| Fused kernel | ✅ Triton INT4 FlashAttn | | ✅ Triton fused | ✅ Triton |
| HF generate() | ✅ Drop-in | Pipeline only | ✅ Drop-in | |
| Serving | ✅ fade-server (OpenAI API) | | ✅ turboquant-server | |
| Hybrid models | ✅ Qwen 3.5 DeltaNet skip | | | |
| Per-sequence batching | ✅ Ragged tiers | | | |
| Stars | New | 1K+ | 30–40 | 11 |
| Install | pip install fade-kv | pip install kvpress | pip install turboquant | From source |

FADE's unique advantage: it is the only system that combines quantization, attention-aware eviction, and correct re-RoPE in one drop-in cache.

Install

From PyPI:

pip install fade-kv
pip install "fade-kv[server]"  # adds fade-server CLI

From source (development):

python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows PowerShell; on Linux/macOS: source .venv/bin/activate
pip install torch  # match your CUDA version: https://pytorch.org/get-started/locally/
pip install -e ".[dev]"

Optional extras: pip install "fade-kv[cuda]" (accelerate), "fade-kv[eval]" (datasets), "fade-kv[codebook]" (scikit-learn for PQ). The extras are quoted so shells like zsh don't expand the brackets.

Quick start

Presets

from fade import FadeConfig, create_tiered_cache

# Safe: ~3-4x compression, 100% greedy match. No eviction.
cache = create_tiered_cache(model, config=FadeConfig.safe())

# Balanced: ~5x compression with H2O eviction.
cache = create_tiered_cache(model, config=FadeConfig.balanced())

# Aggressive: ~7-8x compression. Validate on your workload first.
cache = create_tiered_cache(model, config=FadeConfig.aggressive())

Custom config

cache = create_tiered_cache(model, config=FadeConfig(
    phase="2",
    n_sink=4,
    recent_window=64,
    int4_budget=400,
    eviction_policy="h2o",       # "h2o", "ema", "position", or "learned"
    middle_k_bits=4,             # K stays INT4 (outlier-sensitive)
    middle_v_bits=2,             # V at INT2 (~30% more compression)
))

Rotated 2-bit backend (~6× compression)

from fade.backends import get_backend

cache = create_tiered_cache(model, config=FadeConfig.safe(),
    quant_backend=get_backend("rotated", head_dim=64, bits=2))

Random orthogonal rotation spreads per-channel outliers before quantization, making 2-bit viable. Uses native PyTorch — no external dependencies.
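
For intuition, a standalone demo of why rotation helps (illustrative only, not the backend's code): with a persistent per-channel outlier, per-token 2-bit quantization wastes its range on one channel; a fixed orthogonal rotation spreads that energy before quantizing, and its transpose rotates back after dequantization.

import torch

torch.manual_seed(0)
d = 64
Q, _ = torch.linalg.qr(torch.randn(d, d))  # fixed random orthogonal matrix

def q2(x, bits=2):
    # Per-token asymmetric quantization: 2**bits uniform levels in [min, max].
    lo, hi = x.amin(dim=-1, keepdim=True), x.amax(dim=-1, keepdim=True)
    scale = (hi - lo) / (2**bits - 1)
    return torch.round((x - lo) / scale) * scale + lo

k = torch.randn(512, d)
k[:, 7] *= 20.0  # persistent per-channel outlier

err_plain = (k - q2(k)).norm() / k.norm()
err_rot = (k - q2(k @ Q) @ Q.T).norm() / k.norm()
print(f"plain:   {err_plain:.3f}")  # outlier dominates the quantization range
print(f"rotated: {err_rot:.3f}")    # noticeably lower error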

Manual decode with tier reassignment

from fade.patch import forward_with_tracking, load_model
from fade.policy import reassign_tiers
from fade.tracker import AttentionTracker

model, tokenizer = load_model("Qwen/Qwen2.5-3B-Instruct", attn_impl="auto", need_attentions=True)
cache = create_tiered_cache(model, config=FadeConfig.balanced())
tracker = AttentionTracker(num_layers=model.config.num_hidden_layers)

out = forward_with_tracking(model, input_ids, cache, tracker=tracker)
next_token = out.logits[:, -1:].argmax(dim=-1)  # greedy pick (assumes the usual .logits output)
max_tokens = 256
for step in range(max_tokens):
    out = forward_with_tracking(model, next_token, cache, tracker=tracker)
    next_token = out.logits[:, -1:].argmax(dim=-1)
    if (step + 1) % 64 == 0:
        reassign_tiers(cache, tracker, model.config.num_hidden_layers)

Eviction policies

| Policy | Quality | Speed | Needs attention? |
| --- | --- | --- | --- |
| h2o | Best | Normal | Yes (prefill only) |
| ema | Good | Normal | Yes (decode only) |
| adaptive | Good | Normal | Yes (decode EMA) |
| position | Fair | Fast | No |
| learned | Good* | Fast | No |

adaptive splits middle tokens by attention score: high→INT4, low→INT2, lowest→evict.

*Learned policy requires a trained checkpoint: python scripts/train_eviction_mlp.py
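
The adaptive split above, as a sketch (hypothetical fractions, not FADE's tuned thresholds): rank middle tokens by tracked attention score, keep the top band in INT4, demote the next band to INT2, and evict the tail.

import torch

def adaptive_split(scores, int4_frac=0.5, int2_frac=0.3):
    order = scores.argsort(descending=True)  # highest attention first
    n = scores.numel()
    n4, n2 = int(n * int4_frac), int(n * int2_frac)
    tiers = torch.empty(n, dtype=torch.long)
    tiers[order[:n4]] = 0             # stay INT4
    tiers[order[n4:n4 + n2]] = 1      # demote to INT2
    tiers[order[n4 + n2:]] = 2        # evict
    return tiers

print(adaptive_split(torch.rand(10)))  # per-token tier ids for 10 middle tokens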

Supported models

FADE auto-detects the RoPE scheme from the model config:

  • Qwen2 / Qwen3 — vanilla RoPE, GQA
  • Llama / Llama-3.1 — vanilla + frequency-dependent scaling
  • Mistral — vanilla RoPE, sliding-window
  • Phi-3 — vanilla RoPE
  • Gemma-2 — vanilla RoPE
  • Gemma 4 — proportional RoPE with partial_rotary_factor + per-layer-type dispatch
  • Falcon — ALiBi (non-RoPE; re-RoPE is a no-op)
  • Qwen 3.5 / 3.6 — hybrid DeltaNet + softmax attention. FADE auto-detects layer_types and skips DeltaNet layers (only full-attention layers are tiered).

RoPE scaling types: linear, llama3, ntk, dynamic, yarn, proportional. Non-RoPE models (ALiBi, Bloom, MPT) work via the NoRope sentinel.
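
As a rough illustration of config-driven dispatch (hypothetical helper; FADE's actual logic lives in fade/rope.py):

from transformers import AutoConfig

def detect_rope_scheme(cfg):
    # ALiBi-style models carry no rotary state, so re-RoPE is a no-op.
    if getattr(cfg, "alibi", False) or cfg.model_type in ("bloom", "mpt"):
        return "no-rope"
    scaling = getattr(cfg, "rope_scaling", None)
    if scaling:  # newer configs use "rope_type", older ones "type"
        return scaling.get("rope_type", scaling.get("type", "linear"))
    if getattr(cfg, "partial_rotary_factor", 1.0) < 1.0:
        return "proportional"  # partial rotary dims
    return "vanilla"

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
print(detect_rope_scheme(cfg))  # -> "vanilla"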

Batching

Two modes:

  • Shared-tier (default): all rows share positions and tier decisions. For lockstep decoding.
  • Per-sequence (apply_tier_assignment_per_sequence): each row gets independent [B, S] tiers. For continuous-batching where sequences diverge.
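
In shape terms, the difference looks like this (illustrative tensors, not the library's internal layout):

import torch

B, S = 4, 128
# Shared-tier: one [S] tier vector applies to every row, so all sequences
# promote/demote the same token positions in lockstep.
shared_tiers = torch.zeros(S, dtype=torch.long)

# Per-sequence: a [B, S] tier map lets diverged rows differ; here row 2
# demotes part of its middle while the other rows keep those tokens.
per_seq_tiers = torch.zeros(B, S, dtype=torch.long)
per_seq_tiers[2, 64:96] = 1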

Performance

  • Fused INT4 FlashAttention kernel — single Triton kernel reads packed INT4 K/V, computes attention with online softmax, writes fp16 output. Never materializes fp16 K/V. 4.9× faster than the old dequant+SDPA path, within 1.4× of pure fp16 FlashAttention on RTX 3060.
  • Pre-allocated FP16 buffer — a doubling buffer eliminates torch.cat on every decode step (sketched after the code below).
  • torch.compile — cache.enable_compile() wraps _materialize between graph-break boundaries.
  • Dequant-cache age eviction — cache.max_dequant_age = N periodically refreshes cached dequant buffers.
  • Benchmarks — python benchmarks/tps.py (decode throughput), python benchmarks/divergence.py (quality).
# Use the fused kernel directly:
from fade.kernels.fused_int4_attn import fused_int4_sdpa
out = fused_int4_sdpa(q, k_packed, k_scale, v_packed, v_scale)
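
The doubling-buffer idea behind the pre-allocated FP16 tier, as a generic sketch (illustrative class, not FADE's internals): capacity grows geometrically and appends write in place, so each decode step pays an amortized O(1) copy instead of a full torch.cat.

import torch

class GrowBuffer:
    def __init__(self, heads, head_dim, cap=128, dtype=torch.float16):
        self.buf = torch.empty(heads, cap, head_dim, dtype=dtype)
        self.len = 0

    def append(self, x):  # x: [heads, T, head_dim]
        need = self.len + x.shape[1]
        if need > self.buf.shape[1]:  # out of room: double the capacity
            grown = torch.empty(self.buf.shape[0], max(need, 2 * self.buf.shape[1]),
                                self.buf.shape[2], dtype=self.buf.dtype)
            grown[:, :self.len] = self.buf[:, :self.len]
            self.buf = grown
        self.buf[:, self.len:need] = x  # in-place write, no torch.cat
        self.len = need

    def view(self):  # the live [heads, len, head_dim] slice
        return self.buf[:, :self.len]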

Inference server

OpenAI-compatible API with automatic tier management:

fade-server --model Qwen/Qwen2.5-0.5B-Instruct --preset balanced --port 8000
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":100}"

Endpoints: /v1/chat/completions (greedy + sampling), /v1/models, /health.
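
Because the server speaks the OpenAI protocol, the official Python client works too (base URL and model name assumed from the command above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # assumed: the model the server was started with
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)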

Checkpointing

sd = cache.cache_state_dict()
torch.save(sd, "cache.pt")
cache.load_cache_state_dict(torch.load("cache.pt"))

Observability

from fade.telemetry import JsonlExporter, attach_telemetry
attach_telemetry(cache, JsonlExporter("events.jsonl"))

Debug dump: cache.dump_debug("snapshot.json")

PQ codebook

from fade.codebook import PQCodebook
cb = PQCodebook.train(calibration_vectors, sub_dim=32, num_centroids=256)
cache.set_codebooks(cb)  # enables TIER_PQ in tier assignment

Train codebooks from a real model: python scripts/train_codebook.py
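
For intuition, product quantization splits each vector into sub-vectors and stores one centroid index per sub-vector. A generic sketch with scikit-learn (which the [codebook] extra installs); this illustrates the technique, not PQCodebook's internals:

import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, sub_dim=32, num_centroids=256):
    # One KMeans codebook per sub_dim-wide slice of the vector.
    d = vectors.shape[1]
    return [KMeans(n_clusters=num_centroids, n_init=10).fit(vectors[:, i:i + sub_dim])
            for i in range(0, d, sub_dim)]

def pq_encode(codebooks, x, sub_dim=32):
    # Each sub-vector is replaced by a single uint8 centroid index.
    cols = [cb.predict(x[:, i * sub_dim:(i + 1) * sub_dim]) for i, cb in enumerate(codebooks)]
    return np.stack(cols, axis=1).astype(np.uint8)

def pq_decode(codebooks, codes):
    # Look the centroids back up and concatenate the sub-vectors.
    return np.concatenate([cb.cluster_centers_[codes[:, i]] for i, cb in enumerate(codebooks)], axis=1)

x = np.random.randn(4096, 64).astype(np.float32)
cbs = train_pq(x)
recon = pq_decode(cbs, pq_encode(cbs, x))  # 8-bit index per 32-dim sub-vector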

Results

Benchmarked on Qwen2.5-0.5B-Instruct, 2048 tokens, RTX 3060 12GB.

Compression

| Config | KV cache | Compression | Notes |
| --- | --- | --- | --- |
| Baseline FP16 | 24.00 MiB | 1.0× | |
| Safe (INT4, no eviction) | 6.78 MiB | 3.5× | 100% greedy match |
| Rotated 2-bit | 3.88 MiB | 6.2× | Rotation + 2-bit packing |
| Balanced (INT4 + eviction) | 2.01 MiB | 11.9× | Position-based eviction |
| Aggressive | 1.03 MiB | 23.3× | Smaller budget |

Quality

| Test | Result |
| --- | --- |
| Needle @512 tokens | ✅ PASS |
| Needle @1024 tokens | ✅ PASS |
| Needle @2048 tokens | ✅ PASS |
| Baseline PPL | 1.24 |

Performance (fused Triton kernel)

| Path | Time | vs FP16 |
| --- | --- | --- |
| FP16 SDPA | 0.133 ms | 1.0× |
| Dequant + SDPA (old) | 0.932 ms | 7.0× slower |
| Fused INT4 (new) | 0.189 ms | 1.4× |

Run benchmarks yourself: python benchmarks/full_suite.py, python benchmarks/pareto.py --csv pareto.csv

Project layout

fade/
  cache.py           # TieredKVCache with 5 tiers (FP16/INT4/INT2/PQ/evict)
  config.py          # FadeConfig with presets
  quant.py           # INT4/INT2 quantization + bit-packing
  rope.py            # 7 RoPE schemes incl. Gemma 4 proportional
  policy.py          # Tier assignment: h2o, ema, position
  learned_policy.py  # Learned eviction MLP
  tracker.py         # AttentionTracker (per-layer EMA)
  patch.py           # load_model, create_tiered_cache, forward_with_tracking
  codebook.py        # PQ codebook train/encode/decode
  telemetry.py       # Structured telemetry + exporters
  kernels/           # Fused INT4 FlashAttention kernel + unpack kernel + fallback
  serving/           # vLLM / SGLang adapter stubs
  eval/              # Perplexity, needle, quality suite
examples/            # quickstart.py
experiments/         # run_baseline.py, run_tiered.py
benchmarks/          # tps.py, divergence.py
scripts/             # train_eviction_mlp.py, train_codebook.py
tests/               # 136 tests, all CPU, no downloads

Gotchas

  1. Attention impl: eager only needed for H2O prefill. Use load_model(attn_impl="auto").
  2. Transformers version: verified on 4.45 and 5.3. Weekly canary CI runs against transformers@main.
  3. Memory: use cache.compressed_storage_bytes(), not nvidia-smi.
  4. RoPE precision: all math in float32, cast through model dtype to match rounding.
  5. Hybrid models: Qwen 3.5/3.6 DeltaNet layers are auto-skipped — only full-attention layers are tiered.
  6. Triton kernels: fused attention via fused_int4_sdpa(), unpack-only via int4_sdpa(force_triton=True). Run check_fused_parity() to validate on your hardware.

Citations

FADE builds on ideas from these papers (all independently reimplemented — see NOTICE for details):

  • H2O — Zhang et al., 2023. Heavy-Hitter Oracle for Efficient Generative Inference of LLMs. arXiv:2306.14048
  • StreamingLLM — Xiao et al., 2023. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453
  • KIVI — Liu et al., 2024. A Tuning-Free KV Cache Quantization Algorithm. arXiv:2402.02750
  • TurboQuant — Zandieh et al., ICLR 2026. Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874
  • KnormPress — Devoto et al., 2024. A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression. arXiv:2406.11430

If you use FADE in your work:

@software{fade2026,
  title  = {FADE: Frequency-Adaptive Decay Encoding},
  author = {Branislav Đalić},
  url    = {https://github.com/Omodaka9375/fade},
  year   = {2026},
}

License

Apache-2.0. See LICENSE and NOTICE.
