FADE: Frequency-Adaptive Decay Encoding — attention-aware tiered KV cache compression for LLM inference.
FADE
Frequency-Adaptive Decay Encoding — drop-in KV cache compression for HuggingFace transformers. Shrinks the KV cache 3.5–12× with near-baseline quality (up to 23× in aggressive mode — validate on your workload).
```python
from fade import FadeConfig, create_tiered_cache

cache = create_tiered_cache(model, config=FadeConfig.safe())
out = model.generate(input_ids, past_key_values=cache, max_new_tokens=256)
```
Works with model.generate() — greedy, sampling, beam search. No manual decode loop needed.
How it works
Tokens live in tiers based on age and attention importance:
| Tier | What's stored | When |
|---|---|---|
| FP16 | Full precision | First N_SINK tokens + last RECENT_WINDOW tokens |
| INT4 | Bit-packed 4-bit | Middle-aged tokens (the bulk of the cache) |
| INT2 | Grouped 2-bit | Optional deeper compression (lossy) |
| PQ | Product-quantized codes | ~2 bits/element via trained codebook (Phase 3) |
| Evicted | Nothing | Dropped when INT4_BUDGET is finite |
When tokens are evicted, surviving K tensors are un-RoPE'd at old positions and re-RoPE'd with contiguous StreamingLLM positions.
```mermaid
flowchart LR
    A["New Token"] --> B["FP16 Tier\n(sinks + recent)"]
    B -->|reassign| C["INT4 Tier\n(middle)"]
    C -->|budget full| D["INT2 / PQ\n(deep compress)"]
    D -->|evict| E["Dropped\n(re-RoPE survivors)"]
    style A fill:#4CAF50,color:#fff
    style B fill:#2196F3,color:#fff
    style C fill:#FF9800,color:#fff
    style D fill:#f44336,color:#fff
    style E fill:#9E9E9E,color:#fff
```
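On a single RoPE channel pair, the un-RoPE/re-RoPE step amounts to undoing one rotation and applying another. A minimal sketch (illustrative only; the real implementation in `rope.py` operates on full K tensors in float32):

```python
import math

def rope_rotate(x0, x1, pos, inv_freq):
    # rotate one (x0, x1) channel pair by pos * inv_freq radians
    c, s = math.cos(pos * inv_freq), math.sin(pos * inv_freq)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

def re_rope(x0, x1, old_pos, new_pos, inv_freq):
    # undo the rotation applied at old_pos, then re-apply at new_pos,
    # so surviving keys sit at contiguous StreamingLLM positions
    u0, u1 = rope_rotate(x0, x1, -old_pos, inv_freq)
    return rope_rotate(u0, u1, new_pos, inv_freq)
```

Re-RoPE-ing a key cached at position 10 down to position 3 yields exactly the key that would have been produced at position 3 in the first place.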
Compression at a glance
| Config | KV cache | Compression |
|---|---|---|
| Baseline FP16 | 112.00 MiB | 1× |
| Safe (INT4) | 31.24 MiB | 3.6× |
| Rotated 2-bit | 17.70 MiB | 6.3× |
| Balanced (eviction) | 9.30 MiB | 12.0× |
| Aggressive | 4.77 MiB | 23.5× |
Qwen2.5-7B-Instruct, 2048 tokens, DGX Spark. Needle: 4/4 PASS (512–4096). WikiText-2 PPL: 6.56.
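The baseline figure follows directly from the model shape. A quick sanity check, assuming the published Qwen2.5-7B-Instruct config (28 layers, 4 KV heads under GQA, head_dim 128) and FP16 at 2 bytes/element:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each hold [n_layers, n_kv_heads, seq_len, head_dim] elements
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

# Qwen2.5-7B-Instruct at 2048 tokens
mib = kv_cache_bytes(28, 4, 128, 2048) / 2**20  # -> 112.0 MiB, matching the table
```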
Fused kernel speed
| Path | Latency | vs FP16 |
|---|---|---|
| FP16 FlashAttention | 0.133 ms | 1.0× |
| Fused INT4 (FADE) | 0.189 ms | 1.4× slower |
| Dequant + SDPA (old) | 0.932 ms | 7.0× slower |
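The fused kernel reads nibble-packed INT4 codes directly: two unsigned 4-bit values per byte. A minimal sketch of that packing (illustrative only; FADE's actual layout lives in `quant.py` and may differ):

```python
def pack_int4(vals):
    # pack pairs of unsigned 4-bit codes (0..15) into single bytes
    assert len(vals) % 2 == 0
    return bytes((vals[i] << 4) | vals[i + 1] for i in range(0, len(vals), 2))

def unpack_int4(packed):
    # recover the 4-bit codes from each byte
    out = []
    for b in packed:
        out.append(b >> 4)    # high nibble
        out.append(b & 0x0F)  # low nibble
    return out
```

Packing halves storage relative to INT8 and quarters it relative to FP16; the fused kernel avoids ever materializing the unpacked FP16 tensors.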
How FADE compares (2026)
| | FADE | kvpress (NVIDIA) | TurboQuant (Google, ICLR 2026) | KVTC (NVIDIA, ICLR 2026) |
|---|---|---|---|---|
| Approach | Tiered quant + eviction + re-RoPE | Token eviction / scoring (30+ methods) | Rotation + Lloyd-Max codebook | PCA + DP bit allocation + entropy coding |
| Compression | 3.5–12× (23× aggressive) | 2–10× (eviction only) | 4–6× (3.5-bit zero-loss claimed) | 6–20× (up to 40× with entropy) |
| Quantization | INT4/INT2/PQ + rotated 2-bit | Via HF QuantizedCache | 3–4 bit | 1–6 bit adaptive |
| Eviction | H2O, EMA, position, adaptive, learned | 30+ methods (SnapKV, TOVA, KVzap, etc.) | None | None |
| Re-RoPE | ✅ StreamingLLM contiguous | Partial (KeyRerotationPress, FinchPress) | ❌ | ✅ (undo before PCA) |
| Fused kernel | ✅ Triton INT4 FlashAttn | ❌ | ✅ Triton fused | ✅ Triton |
| HF generate() | ✅ Drop-in | Pipeline + context manager | ✅ Drop-in | ❌ |
| Serving | ✅ fade-server (OpenAI API) | ❌ | ✅ vLLM / SGLang integration | ❌ |
| Hybrid models | ✅ Qwen 3.5 DeltaNet skip | ❌ | ❌ | ❌ |
| Per-sequence batching | ✅ Ragged tiers | ❌ | ❌ | ❌ |
| Stars | New | 1K+ | 1K+ (across implementations) | ~10 |
| Install | `pip install fade-kv` | `pip install kvpress` | `pip install turboquant-kv` | From source |
FADE's unique advantage: the only system that combines quantization + attention-aware eviction + correct re-RoPE in one drop-in cache.
Install
From PyPI:
```shell
pip install fade-kv
pip install fade-kv[server]  # adds fade-server CLI
```
From source (development):
```shell
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows PowerShell; on Linux/macOS: source .venv/bin/activate
pip install torch             # match your CUDA version: https://pytorch.org/get-started/locally/
pip install -e ".[dev]"
```
Optional extras: `pip install fade-kv[cuda]` (accelerate), `fade-kv[eval]` (datasets), `fade-kv[codebook]` (scikit-learn for PQ).
Quick start
Presets
```python
from fade import FadeConfig, create_tiered_cache

# Safe: ~3-4x compression, no eviction.
cache = create_tiered_cache(model, config=FadeConfig.safe())

# Balanced: ~5x compression with H2O eviction.
cache = create_tiered_cache(model, config=FadeConfig.balanced())

# Aggressive: ~7-8x compression. Validate on your workload first.
cache = create_tiered_cache(model, config=FadeConfig.aggressive())
```
Custom config
```python
cache = create_tiered_cache(model, config=FadeConfig(
    phase="2",
    n_sink=4,
    recent_window=64,
    int4_budget=400,
    eviction_policy="h2o",  # "h2o", "ema", "position", or "learned"
    middle_k_bits=4,        # K stays INT4 (outlier-sensitive)
    middle_v_bits=2,        # V at INT2 (~30% more compression)
))
```
Rotated 2-bit backend (~6× compression)
```python
from fade import FadeConfig, create_tiered_cache
from fade.backends import get_backend

cache = create_tiered_cache(model, config=FadeConfig.safe(),
                            quant_backend=get_backend("rotated", head_dim=64, bits=2))
```
Random orthogonal rotation spreads per-channel outliers before quantization, making 2-bit viable. Uses native PyTorch — no external dependencies.
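Why rotation helps, in miniature: an orthogonal rotation preserves dot products (so attention scores survive dequantization) while spreading an outlier's energy across channels, which shrinks the max-abs quantization scale. A 2-D sketch with a single Givens rotation standing in for FADE's random orthogonal matrix:

```python
import math

def rot2(x0, x1, theta):
    # 2-D orthogonal (Givens) rotation
    c, s = math.cos(theta), math.sin(theta)
    return (c * x0 - s * x1, s * x0 + c * x1)

# one outlier channel dominates the max-abs quantization scale...
k = (10.0, 0.1)
rk = rot2(k[0], k[1], math.pi / 4)  # ...until rotation spreads its energy

# dot products (hence attention logits) are preserved
# when q and k share the same rotation
q = (1.0, 2.0)
rq = rot2(q[0], q[1], math.pi / 4)
```

After rotation the largest magnitude drops from 10.0 to about 7.1, so the same 2-bit grid covers the data with roughly 30% finer steps.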
Manual decode with tier reassignment
```python
from fade import FadeConfig, create_tiered_cache
from fade.patch import forward_with_tracking, load_model
from fade.policy import reassign_tiers
from fade.tracker import AttentionTracker

model, tokenizer = load_model("Qwen/Qwen2.5-3B-Instruct", attn_impl="auto", need_attentions=True)
cache = create_tiered_cache(model, config=FadeConfig.balanced())
tracker = AttentionTracker(num_layers=model.config.num_hidden_layers)

out = forward_with_tracking(model, input_ids, cache, tracker=tracker)  # prefill
for step in range(max_tokens):
    next_token = out.logits[:, -1:].argmax(dim=-1)  # greedy pick
    out = forward_with_tracking(model, next_token, cache, tracker=tracker)
    if (step + 1) % 64 == 0:  # periodically demote/evict middle tokens
        reassign_tiers(cache, tracker, model.config.num_hidden_layers)
```
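The tracker accumulates per-token attention statistics across decode steps. A minimal sketch of a per-token EMA (hypothetical class; FADE's `AttentionTracker` keeps per-layer state):

```python
class EmaTracker:
    # exponential moving average of attention mass per cached token
    def __init__(self, decay=0.9):
        self.decay = decay
        self.scores = []

    def update(self, step_attn):
        # step_attn[i]: attention the newest query put on cached token i
        while len(self.scores) < len(step_attn):
            self.scores.append(0.0)  # newly cached tokens start at zero
        self.scores = [self.decay * s + (1 - self.decay) * a
                       for s, a in zip(self.scores, step_attn)]
```

Tokens whose EMA decays toward zero are the ones a policy like `ema` demotes to lower tiers or evicts.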
Eviction policies
| Policy | Quality | Speed | Needs attention? |
|---|---|---|---|
| `h2o` | Best | Normal | Yes (prefill only) |
| `ema` | Good | Normal | Yes (decode only) |
| `adaptive` | Good | Normal | Yes (decode EMA) |
| `position` | Fair | Fast | No |
| `learned` | Good* | Fast | No |
`adaptive` splits middle tokens by attention score: high → INT4, low → INT2, lowest → evict.

\*The `learned` policy requires a trained checkpoint: `python scripts/train_eviction_mlp.py`
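A minimal sketch of the `adaptive` split (hypothetical fractions and function name; FADE's actual thresholds live in `policy.py`):

```python
def assign_tiers(scores, hi_frac=0.5, lo_frac=0.25):
    # rank middle tokens by attention score:
    # top hi_frac -> INT4, bottom lo_frac -> evict, the rest -> INT2
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    tiers = [None] * n
    for rank, i in enumerate(order):
        if rank < int(n * hi_frac):
            tiers[i] = "INT4"
        elif rank < int(n * (1 - lo_frac)):
            tiers[i] = "INT2"
        else:
            tiers[i] = "evict"
    return tiers
```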
Supported models
FADE auto-detects the RoPE scheme from the model config:
- Qwen2 / Qwen3 — vanilla RoPE, GQA
- Llama / Llama-3.1 — vanilla + frequency-dependent scaling
- Mistral — vanilla RoPE, sliding-window
- Phi-3 — vanilla RoPE
- Gemma-2 — vanilla RoPE
- Gemma 4 — proportional RoPE with `partial_rotary_factor` + per-layer-type dispatch
- Falcon — ALiBi (non-RoPE; re-RoPE is a no-op)
- Qwen 3.5 / 3.6 — hybrid DeltaNet + softmax attention. FADE auto-detects `layer_types` and skips DeltaNet layers (only full-attention layers are tiered).

RoPE scaling types: `linear`, `llama3`, `ntk`, `dynamic`, `yarn`, `proportional`. Non-RoPE models (ALiBi, Bloom, MPT) work via the `NoRope` sentinel.
Batching
Two modes:
- Shared-tier (default): all rows share positions and tier decisions. For lockstep decoding.
- Per-sequence (`apply_tier_assignment_per_sequence`): each row gets independent `[B, S]` tiers. For continuous batching where sequences diverge.
Performance
- Fused INT4 FlashAttention kernel — single Triton kernel reads packed INT4 K/V, computes attention with online softmax, writes fp16 output. Never materializes fp16 K/V. 4.9× faster than the old dequant+SDPA path, within 1.4× of pure fp16 FlashAttention on RTX 3060.
- Pre-allocated FP16 buffer — a doubling buffer eliminates `torch.cat` on every decode step.
- torch.compile — `cache.enable_compile()` wraps `_materialize` between graph-break boundaries.
- Dequant-cache age eviction — `cache.max_dequant_age = N` periodically refreshes cached dequant buffers.
- Benchmarks — `python benchmarks/tps.py` (decode throughput), `python benchmarks/divergence.py` (quality).
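The doubling buffer is the same trick behind dynamic arrays: grow capacity geometrically so appends are amortized O(1) and no per-step concatenation is needed. A minimal Python sketch of the idea (hypothetical class, not FADE's internal buffer, which holds fp16 tensors):

```python
class DoublingBuffer:
    # amortized O(1) append without reallocating every step
    def __init__(self, capacity=4):
        self.data = [0.0] * capacity
        self.length = 0

    def append(self, x):
        if self.length == len(self.data):
            self.data.extend([0.0] * len(self.data))  # double capacity
        self.data[self.length] = x
        self.length += 1

    def view(self):
        # logical contents; the spare tail capacity stays hidden
        return self.data[:self.length]
```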
```python
# Use the fused kernel directly:
from fade.kernels.fused_int4_attn import fused_int4_sdpa

out = fused_int4_sdpa(q, k_packed, k_scale, v_packed, v_scale)
```
Inference server
OpenAI-compatible API with automatic tier management:
```shell
fade-server --model Qwen/Qwen2.5-0.5B-Instruct --preset balanced --port 8000

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":100}"
```
Endpoints: /v1/chat/completions (greedy + sampling), /v1/models, /health.
Checkpointing
```python
sd = cache.cache_state_dict()
torch.save(sd, "cache.pt")
cache.load_cache_state_dict(torch.load("cache.pt"))
```
Observability
```python
from fade.telemetry import JsonlExporter, attach_telemetry

attach_telemetry(cache, JsonlExporter("events.jsonl"))
```

Debug dump: `cache.dump_debug("snapshot.json")`
PQ codebook
```python
from fade.codebook import PQCodebook

cb = PQCodebook.train(calibration_vectors, sub_dim=32, num_centroids=256)
cache.set_codebooks(cb)  # enables TIER_PQ in tier assignment
```

Train codebooks from a real model: `python scripts/train_codebook.py`
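Product quantization in miniature: split each vector into sub-vectors and store only the index of the nearest centroid per slot. A toy sketch with hand-picked centroids (`PQCodebook` trains real centroids from calibration data via scikit-learn):

```python
def nearest(centroids, sub):
    # index of the closest centroid by squared L2 distance
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], sub)))

def pq_encode(vec, codebooks, sub_dim):
    # one centroid index per sub-vector slot
    subs = [vec[i:i + sub_dim] for i in range(0, len(vec), sub_dim)]
    return [nearest(cb, s) for cb, s in zip(codebooks, subs)]

def pq_decode(codes, codebooks):
    # concatenate the chosen centroids to reconstruct an approximation
    out = []
    for cb, c in zip(codebooks, codes):
        out.extend(cb[c])
    return out
```

With 256 centroids per slot, each sub-vector costs one byte; for `sub_dim=32` that is 8 bits per 32 elements on top of the stored codebooks, i.e. the ~2 bits/element the tier table quotes comes from this index-plus-codebook accounting.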
Results
DGX Spark — Qwen2.5-7B-Instruct (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 112.00 MiB | 1.0× | 13.3 tok/s |
| Safe (INT4) | 31.24 MiB | 3.6× | 13.3 tok/s |
| Rotated 2-bit | 17.70 MiB | 6.3× | 13.3 tok/s |
| Balanced (eviction) | 9.30 MiB | 12.0× | 13.3 tok/s |
| Aggressive | 4.77 MiB | 23.5× | 13.3 tok/s |
Needle: 4/4 PASS (512–4096 tokens). WikiText-2 baseline PPL: 6.56.
DGX Spark — Mistral-7B-Instruct-v0.3 (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 256.00 MiB | 1.0× | 15.3 tok/s |
| Safe (INT4) | 71.40 MiB | 3.6× | 15.2 tok/s |
| Rotated 2-bit | 40.47 MiB | 6.3× | 15.2 tok/s |
| Balanced (eviction) | 21.26 MiB | 12.0× | 15.2 tok/s |
| Aggressive | 10.91 MiB | 23.5× | 15.2 tok/s |
Needle: 4/4 PASS (512–4096 tokens). WikiText-2 baseline PPL: 4.98.
DGX Spark — Llama-3.1-8B-Instruct (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 256.00 MiB | 1.0× | 14.4 tok/s |
| Safe (INT4) | 71.40 MiB | 3.6× | 14.3 tok/s |
| Rotated 2-bit | 40.47 MiB | 6.3× | 14.3 tok/s |
| Balanced (eviction) | 21.26 MiB | 12.0× | 14.3 tok/s |
| Aggressive | 10.91 MiB | 23.5× | 14.3 tok/s |
Needle: 4/4 PASS (512–4096 tokens). WikiText-2 baseline PPL: 6.45. Uses Llama-3.1's frequency-dependent RoPE scaling.
All DGX Spark benchmarks: NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory).
RTX 3060 — Qwen2.5-0.5B-Instruct (2048 tokens)
| Config | KV cache | Compression | Decode TPS |
|---|---|---|---|
| Baseline FP16 | 24.00 MiB | 1.0× | 128.5 tok/s |
| Safe (INT4) | 6.78 MiB | 3.5× | 125.8 tok/s |
| Rotated 2-bit | 3.88 MiB | 6.2× | 125.9 tok/s |
| Balanced (eviction) | 2.01 MiB | 11.9× | 125.8 tok/s |
| Aggressive | 1.03 MiB | 23.3× | 125.9 tok/s |
Needle: 4/4 PASS (512–4096 tokens). Baseline FP16 PPL: 1.24. TPS overhead: ~2%.
Fused Triton kernel (RTX 3060)
| Path | Time | vs FP16 |
|---|---|---|
| FP16 SDPA | 0.133 ms | 1.0× |
| Dequant + SDPA (old) | 0.932 ms | 7.0× slower |
| Fused INT4 (new) | 0.189 ms | 1.4× slower |
Run the benchmarks yourself: `python benchmarks/production_suite.py`, `python benchmarks/full_suite.py`
Project layout
```
fade/
  cache.py           # TieredKVCache with 5 tiers (FP16/INT4/INT2/PQ/evict)
  config.py          # FadeConfig with presets
  quant.py           # INT4/INT2 quantization + bit-packing
  rope.py            # 7 RoPE schemes incl. Gemma 4 proportional
  policy.py          # Tier assignment: h2o, ema, position
  learned_policy.py  # Learned eviction MLP
  tracker.py         # AttentionTracker (per-layer EMA)
  patch.py           # load_model, create_tiered_cache, forward_with_tracking
  codebook.py        # PQ codebook train/encode/decode
  telemetry.py       # Structured telemetry + exporters
  kernels/           # Fused INT4 FlashAttention kernel + unpack kernel + fallback
  serving/           # vLLM / SGLang adapter stubs
  eval/              # Perplexity, needle, quality suite
examples/            # quickstart.py
experiments/         # run_baseline.py, run_tiered.py
benchmarks/          # tps.py, divergence.py
scripts/             # train_eviction_mlp.py, train_codebook.py
tests/               # 136 tests, all CPU, no downloads
```
Gotchas
- Attention impl: `eager` only needed for H2O prefill. Use `load_model(attn_impl="auto")`.
- Transformers version: verified on 4.45 and 5.3. Weekly canary CI runs against `transformers@main`.
- Memory: use `cache.compressed_storage_bytes()`, not `nvidia-smi`.
- RoPE precision: all math in float32, cast through model dtype to match rounding.
- Hybrid models: Qwen 3.5/3.6 DeltaNet layers are auto-skipped — only full-attention layers are tiered.
- Triton kernels: fused attention via `fused_int4_sdpa()`, unpack-only via `int4_sdpa(force_triton=True)`. Run `check_fused_parity()` to validate on your hardware.
Citations
FADE builds on ideas from these papers (all independently reimplemented — see NOTICE for details):
- H2O — Zhang et al., 2023. Heavy-Hitter Oracle for Efficient Generative Inference of LLMs. arXiv:2306.14048
- StreamingLLM — Xiao et al., 2023. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453
- KIVI — Liu et al., 2024. A Tuning-Free KV Cache Quantization Algorithm. arXiv:2402.02750
- TurboQuant — Zandieh et al., ICLR 2026. Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874
- KVTC — Staniszewski & Łańcucki, ICLR 2026. KV Cache Transform Coding for Compact Storage in LLM Inference. arXiv:2511.01815
- KnormPress — Devoto et al., 2024. A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression. arXiv:2406.11430
If you use FADE in your work:
```bibtex
@software{fade2026,
  title  = {FADE: Frequency-Adaptive Decay Encoding},
  author = {Branislav Đalić},
  url    = {https://github.com/Omodaka9375/fade},
  year   = {2026},
}
```
License