
tierkv

3-tier distributed KV cache for LLM inference.

When your GPU evicts a KV cache entry, tierkv ships it to another machine over gRPC instead of dropping it. On the next request with the same prompt, the KV is fetched back in a single batch call — skipping the expensive prefill entirely.

Tested on Qwen3.6-35B-A3B across a DGX GB10 + Mac Pro + Mac Air cluster:

EXO integration (BF16, 3.7k–8k token prompts):

Scenario                         TTFT     vs Cold
Cold start, 8,000-token prompt   30.83s   baseline
Restored from cold tier          4.11s    7.3× faster
Cold start, 3,707-token prompt   23.78s   baseline
Restored from cold tier          4.59s    5.2× faster

vLLM integration (Apple FY2025 10-K, GB10 GPU, real-world document Q&A):

Prompt size              Cold prefill   GPU cache hit   Cold restore   Speedup
30k tokens (measured)    10.75s         1.19s           0.52s          20×
60k tokens (projected)   ~26s           ~1.2s           ~1.0s          ~26×
128k tokens (projected)  ~70s           ~1.5s           ~2.0s          ~35×

Cold vault restore beats GPU cache hit — blocks land directly in the KV cache, skipping attention recomputation entirely. The speedup grows with context length because prefill scales super-linearly while restore is near-linear (network transfer). Answer quality is bit-for-bit identical across all three paths.


How It Works

tierkv supports two inference backends. The cold-storage layer (vault servers, gRPC, TurboQuant) is identical in both cases.

EXO backend (monkey-patch):

  DGX GB10 — inference only
  ┌─────────────────────────────────┐
  │  EXO + Qwen3.6-35B-A3B (BF16)  │  ← EXO runs HERE only
  │  KVPrefixCache (GPU hot tier)   │
  │         │ evict (60% RAM)       │
  │         ▼                       │
  │   tierkv hook (monkey-patch)    │
  └────┬──────────────┬─────────────┘
       │ KVCache       │ ArraysCache
       │ (10 layers)   │ (30 layers)
       ▼               ▼
  Mac Pro LAN      Mac Air LAN        ← cold storage only, no EXO
  0.5ms RTT        1ms RTT
  tierkv vault     tierkv vault
  (in-memory)      (in-memory)

vLLM backend (KVConnectorBase_V1 plugin):

  DGX GB10 — inference only
  ┌──────────────────────────────────────────┐
  │  vLLM + Qwen3.6-35B-A3B                 │
  │  Paged KV cache (GPU hot tier, 40 blocks)│
  │         │ block evicted                  │
  │         ▼                                │
  │   TierKVConnector (KVConnectorBase_V1)   │
  │   ├─ request_finished  → store to vault  │
  │   ├─ get_num_new_matched_tokens → plan   │
  │   └─ start_load_kv / wait_for_layer_load │
  └────┬──────────────────┬──────────────────┘
       │ full-attention KV │ SSM / linear-attn
       │ (10 layers)       │ (30 layers)
       ▼                   ▼
  Mac Pro LAN          Mac Air LAN     ← cold storage only, no vLLM
  0.5ms RTT            1ms RTT
  tierkv vault         tierkv vault
  (in-memory)          (in-memory)

Three tiers:

  • Hot — GPU KV cache on the inference node (EXO's KVPrefixCache or vLLM's paged KV cache). Fast, limited by GPU/HBM capacity.
  • Cold KV — Full-attention layer tensors shipped to a LAN node via gRPC, compressed with TurboQuant INT8 (~3.9× ratio, ≥52 dB SNR).
  • Cold SSM — Linear-attention / SSM layer states shipped to a second node. Qwen3.6-35B-A3B is a hybrid MoE — 10/40 layers use full attention, 30/40 use linear attention.

On a cache miss, two parallel BatchPromote RPCs — one per vault — fetch all blocks in two overlapping network round trips, and the blocks are dequantized in parallel across a thread pool (decode releases the GIL, so N CPU cores work simultaneously); see the sketch below.
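
A minimal sketch of that restore path, with hypothetical vault-client and quantizer handles (the real logic lives in the tierkv hook and the Rust core):

# Sketch only: one BatchPromote RPC per vault, issued concurrently, then
# per-block dequantization fanned out across a thread pool. Because the
# decode releases the GIL, N worker threads use N CPU cores.
from concurrent.futures import ThreadPoolExecutor

def restore_blocks(kv_vault, ssm_vault, block_ids, quant):
    with ThreadPoolExecutor() as pool:
        kv_fut  = pool.submit(kv_vault.batch_promote, block_ids)   # round trip 1
        ssm_fut = pool.submit(ssm_vault.batch_promote, block_ids)  # round trip 2, overlapped
        compressed = kv_fut.result() + ssm_fut.result()
        return list(pool.map(quant.decode, compressed))            # parallel decode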

For vLLM, the layer_type_map in tierkv.toml routes each layer group to the correct vault. For EXO, layer types are auto-detected via isinstance checks — no manual configuration needed.
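
A sketch of the EXO-side auto-detection idea — class-name matching stands in for the hook's isinstance checks, since EXO's import paths are omitted here and the vault handles are hypothetical:

# Sketch of the per-layer routing described above. KVCache and ArraysCache
# are EXO's cache classes (see the diagram); everything else is illustrative.
def pick_vault(layer_cache, kv_vault, ssm_vault):
    kind = type(layer_cache).__name__
    if kind == "KVCache":       # full-attention layer → KV cold vault
        return kv_vault
    if kind == "ArraysCache":   # SSM / linear-attention layer → SSM cold vault
        return ssm_vault
    raise TypeError(f"unrecognized cache type: {kind}")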


Hardware Requirements

You need at least 2 machines: one running inference, one as cold storage. A third machine lets you split the KV and SSM tiers across separate nodes for better throughput.

Role        What runs on it                     Example
inference   EXO or vLLM + your model + tierkv   DGX GB10
kv_cold     tierkv vault only                   Mac Pro (32 GB)
ssm_cold    tierkv vault only                   Mac Air (16 GB)

EXO only runs on the inference node. The cold-tier machines (Mac Pro, Mac Air) only run the tierkv vault server — a lightweight gRPC process that holds KV data in RAM.
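
Conceptually, a vault is just an in-memory map from block key to quantized payload behind a gRPC service. A sketch of that model (the gRPC plumbing is omitted and the names are illustrative stand-ins for the real ColdVault server):

# Conceptual model of a vault — held entirely in RAM, no eviction policy
# yet (see Roadmap). Not the actual server implementation.
class ColdVaultStore:
    def __init__(self):
        self._blocks: dict[str, bytes] = {}

    def store(self, key: str, payload: bytes) -> None:
        self._blocks[key] = payload

    def batch_promote(self, keys: list[str]) -> list[bytes]:
        # One RPC returns every requested block the vault holds.
        return [self._blocks[k] for k in keys if k in self._blocks]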


Installation

EXO compatibility: tierkv patches EXO's cache.py and builder.py in-place. Tested with EXO as of May 2026. EXO moves fast — if tierkv install errors, check that the patch targets in cache.py and builder.py still match. EXO version auto-detection is on the roadmap.

Download the prebuilt wheel for your platform from the latest release:

pip install tierkv

Or build from source (requires Rust toolchain):

git clone https://github.com/tierkv/tierkv.git
cd tierkv
cd tierkv-core && maturin develop --release && cd ..
pip install -e .

Prebuilt wheels are available for:

  • Linux aarch64 (DGX Spark, Jetson, ARM servers)
  • Linux x86_64
  • macOS arm64 (Apple Silicon — Mac Pro, Mac Air)

Setup — Step by Step

tierkv runs on all three machines, but each machine has a different role and a different config. Install tierkv on every node first, then configure each one.

Step 1 — Configure each machine

Each machine gets its own tierkv.toml with its role and the addresses of the other nodes. Copy the example and edit it:

cp tierkv.toml.example tierkv.toml

On the inference node (DGX Spark) — set role = "inference" and point to the cold nodes:

[cluster]
role = "inference"

[cluster.kv_cold]
host = "192.168.50.11"      # Mac Pro LAN IP
port = 50051

[cluster.ssm_cold]
host = "192.168.50.12"      # Mac Air LAN IP (5GbE)
port = 50051

[cluster.recompute]
host = "127.0.0.1"
port = 50052

[inference]
exo_path = "/home/user/exo/src/exo"   # path to your EXO installation
log_file  = "/tmp/tierkv.log"
memory_threshold = 0.60
kv_dim   = 256

On the KV cold node (Mac Pro) — set role = "kv_cold"; the cluster addresses aren't needed here:

[cluster]
role = "kv_cold"

[vault]
port = 50051

On the SSM cold node (Mac Air) — set role = "ssm_cold":

[cluster]
role = "ssm_cold"

[vault]
port = 50051

tierkv.toml is gitignored — it contains your private IPs. Only tierkv.toml.example is committed.

Step 2 — Start vault servers on cold nodes

Warning — unbounded RAM growth: The vault holds all received KV data in RAM and currently has no eviction policy. On a Mac Air (16 GB) running a long session, vault RAM will grow until the process is killed. Monitor with tierkv status and restart vault servers between sessions if needed. LRU eviction is on the roadmap.

On Mac Pro and Mac Air (not on DGX):

tierkv vault

This starts the ColdVault gRPC server that the inference node will send KV data to. Keep it running as a background service (macOS launchd / Linux systemd).

Step 3 — Install the EXO hook on the inference node

On DGX only:

tierkv install --exo-path /path/to/exo/src/exo

That's it. This command:

  1. Copies the tierkv hook into EXO's engine directory
  2. Patches EXO's cache.py to set the memory eviction threshold
  3. Patches EXO's builder.py to auto-load the hook on startup

Restart EXO. The hook reads tierkv.toml from the working directory and connects to the cold nodes automatically.

Step 4 — Verify

From the inference node, check all nodes are reachable:

tierkv status
[tierkv status] Cluster role: inference

  kv_cold      192.168.50.11:50051   ✓  0.4ms
  ssm_cold     192.168.10.174:50051  ✓  5.9ms
  recompute    127.0.0.1:50052       ✓  0.1ms

[tierkv status] All nodes reachable.

Step 5 — Benchmark

tierkv bench --exo-api http://192.168.50.11:52415

Expected output:

[tierkv bench] EXO API: http://192.168.50.11:52415

  Request 1 — cold start (TARGET): 8.41s   response: 'The key advantage of mixture-of-experts…'
  Waiting 12s for async eviction to complete...
  Request 2 — evict step (different prompt): 1.12s   response: 'Sure, here is a short poem…'
  Waiting 12s for eviction gRPC to settle...
  Request 3 — restore (TARGET from cold): 1.62s   response: 'The key advantage of mixture-of-experts…'

  Speedup (cold → restore): 5.2×
  Time saved per request:   6.79s

A speedup below 1.5× means the cold tier isn't being hit — check tierkv status to confirm the vault servers are running and reachable.
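
To spot-check TTFT by hand rather than through tierkv bench, a minimal probe against the OpenAI-compatible endpoint looks like this (URL, model, and prompt are placeholders):

# Streams a chat completion and reports time-to-first-token.
import time, requests

def ttft(base_url: str, model: str, prompt: str) -> float:
    t0 = time.perf_counter()
    with requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model, "stream": True,
              "messages": [{"role": "user", "content": prompt}]},
        stream=True, timeout=600,
    ) as resp:
        for line in resp.iter_lines():
            if line.startswith(b"data: ") and line != b"data: [DONE]":
                return time.perf_counter() - t0   # first streamed chunk
    raise RuntimeError("no tokens streamed")

Run it once cold and once after eviction settles, then compare the two numbers.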


vLLM Integration

tierkv ships a native vLLM KV Connector that plugs into vLLM's KVConnectorBase_V1 API. It uses the same cold vault infrastructure as the EXO hook — the same tierkv_core Rust backend, the same tierkv.toml config, and the same gRPC vault servers on Mac Pro / Mac Air.

Tested with: vLLM 0.20.1, torch 2.11.0+cu130, CUDA 13.0 on DGX GB10 (aarch64).

Install vLLM

# Linux aarch64 (DGX GB10/Spark) — requires Python dev headers for fastsafetensors
sudo apt-get install -y python3.12-dev
pip install vllm tierkv
# Linux x86_64 / macOS arm64
pip install vllm tierkv

Start vault servers

Same as for EXO — start tierkv vault on Mac Pro and Mac Air before launching vLLM.

Launch vLLM with TierKV

vllm serve Qwen/Qwen3.6-35B-A3B \
  --kv-transfer-config '{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
  }' \
  --enable-prefix-caching \
  --block-size 16 \
  --no-disable-hybrid-kv-cache-manager \
  --max-model-len 20000 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32

--no-disable-hybrid-kv-cache-manager is required for hybrid models like Qwen3.6 MoE that mix full-attention and SSM/linear-attention layers. vLLM auto-disables the hybrid KV cache manager (HMA) when a KV connector is set; this flag re-enables it. TierKVConnector implements SupportsHMA, so the override is safe.

Or pass config inline without a TOML file:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --kv-transfer-config '{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "kv_cold_host": "192.168.50.11",
      "kv_cold_port": 50051,
      "ssm_cold_host": "192.168.10.174",
      "ssm_cold_port": 50051,
      "kv_dim": 256,
      "turbo_quant": true,
      "block_size": 16
    }
  }' \
  --enable-prefix-caching \
  --block-size 16

Note: vLLM 0.20+ uses --kv-transfer-config (not --kv-connector / --kv-connector-extra-config). The connector must be specified as kv_connector (class name) + kv_connector_module_path (module path) — passing the full dotted path as kv_connector will fail.

vLLM Performance

Measured on DGX Spark (GB10, aarch64) with Qwen3.6-35B-A3B (35B MoE, 40 layers: 10 full-attention + 30 linear-attention), Apple FY2025 10-K (30,561-token real document), cold vaults on Mac Pro + Mac Air (5GbE LAN, 1ms RTT):

Scenario                    TTFT     vs Full Prefill   Notes
Full prefill (30k tokens)   10.75s   1× baseline       cold GPU cache, no vault
GPU cache hit               1.19s    9× faster         same prompt, blocks in GPU
Cold vault restore          0.52s    20× faster        blocks from LAN vault, skip attention

Cold vault restore beats GPU cache hit — vault blocks are inserted directly into the KV cache without running attention, so TTFT is pure network + insertion latency. GPU cache hit still runs partial attention over the matched prefix. The gap widens at longer contexts because prefill scales super-linearly while restore is near-linear (network transfer + KV insertion).
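
A loose back-of-envelope model reproduces the shape of this scaling — the constants below are fitted to the measured 30k-token point and are purely illustrative, not a tierkv formula:

# Prefill has a quadratic attention term; restore is ~linear in tokens.
def prefill_s(n: int) -> float:
    return 3.0e-4 * n + 1.9e-9 * n * n

def restore_s(n: int) -> float:
    return 1.7e-5 * n

for n in (30_000, 60_000, 128_000):
    print(f"{n:>7} tokens: prefill ~{prefill_s(n):5.1f}s, "
          f"restore ~{restore_s(n):4.2f}s, ~{prefill_s(n)/restore_s(n):.0f}x")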

Projected scaling (Qwen3.6-35B-A3B, 5GbE LAN vault):

Prompt size              Cold prefill   GPU cache hit   Cold restore   Speedup
30k tokens (measured)    10.75s         1.19s           0.52s          20×
60k tokens (projected)   ~26s           ~1.2s           ~1.0s          ~26×
128k tokens (projected)  ~70s           ~1.5s           ~2.0s          ~35×

Answer quality: cold restore produces bit-for-bit identical output to full prefill. TurboQuant INT8 is lossy but per-group quantization preserves KV distributions well enough that the model's output is indistinguishable. The tensor_hash field in each BlockRecord detects any in-flight corruption.
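
The hash algorithm isn't specified here, but the integrity check reduces to something like the sketch below (SHA-256 over the wire bytes is an assumption):

# Illustration of the tensor_hash idea: hash the exact bytes that went
# on the wire, re-hash on receipt, refuse to restore on mismatch.
import hashlib

def tensor_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def checked_restore(record_hash: str, payload: bytes) -> bytes:
    if tensor_hash(payload) != record_hash:
        raise ValueError("KV block corrupted in flight")
    return payload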

Cold Prefill TTFT:   10.75s  (30k-token Apple 10-K, no cache)
GPU Cache Hit TTFT:   1.19s  (9× faster — same document, blocks in GPU)
Cold Restore TTFT:    0.52s  (20× faster — blocks in vault, skip attention)
Answer quality:       identical output across all three paths
Vault:                Mac Pro + Mac Air, 5GbE LAN, 1ms RTT

Pre-launch smoke test

Run this before any benchmark to catch issues early (context overflow, vault unreachable, vLLM misconfiguration):

python -m tierkv.connectors.vllm.smoke_test \
  --base http://localhost:8000 \
  --model Qwen/Qwen3.6-35B-A3B \
  --toml /path/to/tierkv.toml \
  --bench /path/to/bench.py

Expected output:

[1] vLLM health
  [PASS] vLLM /health: HTTP 200
[2] Model
  [PASS] Model loaded: Qwen/Qwen3.6-35B-A3B
  [PASS] max_model_len >= 20000: 20000
[3] Context fit check
  [PASS] Context fits (with longest Q): 19724 tokens, 276 headroom
  [PASS] Headroom > 100 tokens: 276 tokens to spare
[4] Vault connectivity
  [PASS] TCP kv_cold (Mac Pro): 192.168.50.11:50051
  [PASS] TCP ssm_cold (Mac Air): 192.168.10.174:50051
[5] Quick inference
  [PASS] Inference responds: 8.97s
[PASS] Smoke test: 8/8 checks passed

How it works

The vLLM connector uses a reactive eviction model — it intercepts vLLM's block eviction path, not a periodic snapshot:

  1. Eviction (request_finished): vLLM signals that GPU blocks are about to be freed. TierKV reads the KV tensors, quantizes them with TurboQuant INT8, and ships them to the cold vault over gRPC. GPU blocks are freed after the store completes.
  2. Restore (get_num_new_matched_tokens + start_load_kv): On the next request with the same prompt prefix, TierKV finds the blocks in the cold registry, fires a BatchPromote RPC, dequantizes, and writes directly into vLLM's paged KV buffer.
  3. No-op save (save_kv_layer): vLLM's eager save path is a no-op — eviction is the only trigger.

The connector integrates as a standard vLLM KV Transfer plugin — no vLLM source changes needed.
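
Schematically, the connector reduces to the shape below. The hook names mirror the KVConnectorBase_V1 methods listed above, but the signatures are simplified and the vault client is a hypothetical stand-in — see tierkv.connectors.vllm.connector for the real implementation:

# Schematic only — not the actual connector source.
class TierKVConnectorSketch:
    def __init__(self, vault):
        self.vault = vault  # gRPC client for the cold vaults

    def request_finished(self, request, block_ids):
        # Eviction: quantize and ship blocks before vLLM frees them.
        self.vault.store(request.prompt_hash, block_ids)

    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        # Restore planning: how many prefix tokens does the vault hold
        # beyond what's already resident in the GPU cache?
        return self.vault.matched_tokens(request.prompt_hash) - num_computed_tokens

    def start_load_kv(self, forward_context):
        # Fire the BatchPromote RPCs; blocks land in the paged KV buffer.
        self.vault.batch_promote_async(forward_context)

    def save_kv_layer(self, layer_name, kv_layer, attn_metadata):
        pass  # eager save is a no-op — eviction is the only store trigger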

Configuration reference

All fields can be set in tierkv.toml under [tierkv] or passed inline via kv_connector_extra_config in --kv-transfer-config:

Field                   Default     Description
kv_cold_host            127.0.0.1   Cold vault host for full-attention KV layers
kv_cold_port            50051       Cold vault port
ssm_cold_host           None        Cold vault host for SSM/linear-attention layers (uses kv_cold if unset)
ssm_cold_port           50052       SSM vault port
block_size              16          Must match vLLM --block-size
kv_dim                  128         Must match model head_dim — see Troubleshooting
turbo_quant             true        INT8 compression (~3.9× ratio)
max_inflight_stores     8           Concurrent eviction-to-vault gRPC calls
max_inflight_promotes   4           Concurrent restore-from-vault threads

kv_dim is critical — a wrong value causes silently incorrect compression. Find the right value with:

from transformers import AutoConfig

# kv_dim is the attention head dimension. Some configs expose head_dim
# directly; otherwise derive it from hidden_size / num_attention_heads.
cfg = AutoConfig.from_pretrained("your/model")
print(getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads)

TurboQuant

tierkv includes a per-group INT8 quantizer for KV tensor compression before sending over the network.

  • Group size: kv_dim floats — must match your model's attention head dimension (default 256 for Qwen3.6-35B-A3B; use 128 for Llama-3, Qwen2.5, Mistral; see tierkv.toml.example for how to find the right value for other models)
  • Each group gets its own absmax scale: scale = max(|x|) / 127
  • Wire format: [scale: f32 LE][i8 × kv_dim] per group (kv_dim = 256 here)
  • Compression ratio: ~3.9× (BF16 input → INT8 output)
  • SNR: ≥52 dB on real KV distributions (per-group isolates outliers)

from tierkv_core import TurboQuant
q = TurboQuant(dim=256)
compressed = q.encode(f32_bytes)   # ~3.9× smaller
recovered  = q.decode(compressed)  # ≥52 dB SNR
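
The per-group absmax scheme is simple enough to restate in NumPy — this is an illustration of the wire format above, not the Rust codec:

# Per-group absmax INT8: [scale: f32 LE][i8 × dim] per group.
# Assumes len(x) is a multiple of dim.
import struct
import numpy as np

def tq_encode(x: np.ndarray, dim: int = 256) -> bytes:
    groups = x.astype(np.float32).reshape(-1, dim)
    scales = np.abs(groups).max(axis=1) / 127.0
    scales[scales == 0] = 1.0                        # guard all-zero groups
    q = np.clip(np.round(groups / scales[:, None]), -127, 127).astype(np.int8)
    out = bytearray()
    for s, g in zip(scales, q):
        out += struct.pack("<f", float(s)) + g.tobytes()
    return bytes(out)

def tq_decode(buf: bytes, dim: int = 256) -> np.ndarray:
    rec = np.dtype([("scale", "<f4"), ("q", np.int8, (dim,))])
    recs = np.frombuffer(buf, dtype=rec)
    return (recs["q"].astype(np.float32) * recs["scale"][:, None]).ravel()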

Architecture Notes

Why not standard KV offloading? Most KV offload systems evict to local SSD or CPU RAM on the same machine. tierkv evicts across the network to separate machines, letting idle hardware on your LAN participate in serving long-context requests.

Why EXO? EXO provides an OpenAI-compatible API layer and handles model loading across Apple Silicon and CUDA devices. tierkv monkey-patches EXO's KVPrefixCache eviction and retrieval paths without modifying EXO's core. EXO runs only on the inference node — cold nodes run only the tierkv vault.

What about multi-node inference? EXO supports pipeline-parallel inference (splitting layers across machines). tierkv is currently designed for single-node inference with distributed cold storage. The two can coexist but require separate configuration.


Cluster Tested

Node                        Role                                        Memory                   Network
DGX Spark (GB10, aarch64)   Inference — EXO or vLLM + Qwen3.6-35B-A3B   128 GB RAM + 96 GB HBM   5GbE LAN
Mac Pro (M2 Pro)            KV cold tier — tierkv vault only            32 GB                    5GbE LAN (0.5ms to DGX)
Mac Air (M2)                SSM cold tier — tierkv vault only           16 GB                    5GbE LAN (1ms to DGX)

Over one session (EXO): 227 evictions, 6 cold restores, ~26s saved per restore. Over one session (vLLM, Apple 10-K 30k tokens): cold restore 0.52s vs 10.75s cold prefill (20×); GPU cache hit 1.19s (9×).


Troubleshooting

See TROUBLESHOOTING.md for documented failures and fixes, including:

  • gRPC 4 MB message size limit (silent empty responses)
  • kv_dim mismatch causing silent incorrect compression
  • KVCache.offset semantics and garbage output after restore
  • Stale semaphores after kill -9
  • EXO Nack loop and election storm after hard reset
  • SSH lockout during model load
  • Wrong platform wheel installed on Linux
  • vLLM fastsafetensors build failure on aarch64

Roadmap

  • Persistent cold storage (SQLite / memory-mapped file — survive reboots)
  • TurboQuant codebook training on real KV activations (push SNR higher)
  • Benchmarks on longer contexts and slower GPUs, where prefill is the bottleneck
  • Quantization quality validation with _kv_offsets fix in place
  • EXO version detection for hook compatibility
  • LRU eviction inside the cold vault (configurable max capacity — currently vaults grow unbounded in RAM)
