tierkv
3-tier distributed KV cache for LLM inference — preserve evicted KV across cluster nodes.
When your GPU evicts a KV cache entry, tierkv ships it to another machine over gRPC instead of dropping it. On the next request with the same prompt, the KV is fetched back in a single batch call — skipping the expensive prefill entirely.
Tested on Qwen3.6-35B-A3B across a DGX GB10 + Mac Pro + Mac Air cluster:
EXO integration (BF16, 8k–15k token prompts):
| Scenario | TTFT | vs Cold |
|---|---|---|
| Cold start, 8,000-token prompt | 30.83s | baseline |
| Restored from cold tier | 4.11s | 7.3× faster |
| Cold start, 3,707-token prompt | 23.78s | baseline |
| Restored from cold tier | 4.59s | 5.2× faster |
vLLM integration (Apple FY2025 10-K, GB10 GPU, real-world document Q&A):
| Prompt size | Cold prefill | GPU cache hit | Cold restore | Speedup |
|---|---|---|---|---|
| 30k tokens (measured) | 10.75s | 1.19s | 0.52s | 20× |
| 60k tokens (projected) | ~26s | ~1.2s | ~1.0s | ~26× |
| 128k tokens (projected) | ~70s | ~1.5s | ~2.0s | ~35× |
Cold vault restore beats a GPU cache hit — blocks land directly in the KV cache, skipping attention recomputation entirely. The speedup grows with context length because prefill scales super-linearly while restore is near-linear (network transfer). Answer quality is bit-for-bit identical across all three paths.
How It Works
tierkv supports two inference backends. The cold-storage layer (vault servers, gRPC, TurboQuant) is identical in both cases.
EXO backend (monkey-patch):
DGX GB10 — inference only
┌─────────────────────────────────┐
│ EXO + Qwen3.6-35B-A3B (BF16) │ ← EXO runs HERE only
│ KVPrefixCache (GPU hot tier) │
│ │ evict (60% RAM) │
│ ▼ │
│ tierkv hook (monkey-patch) │
└────┬──────────────┬─────────────┘
│ KVCache │ ArraysCache
│ (10 layers) │ (30 layers)
▼ ▼
Mac Pro LAN Mac Air LAN ← cold storage only, no EXO
0.5ms RTT 1ms RTT
tierkv vault tierkv vault
(in-memory) (in-memory)
vLLM backend (KVConnectorBase_V1 plugin):
DGX GB10 — inference only
┌──────────────────────────────────────────┐
│ vLLM + Qwen3.6-35B-A3B │
│ Paged KV cache (GPU hot tier, 40 blocks)│
│ │ block evicted │
│ ▼ │
│ TierKVConnector (KVConnectorBase_V1) │
│ ├─ request_finished → store to vault │
│ ├─ get_num_new_matched_tokens → plan │
│ └─ start_load_kv / wait_for_layer_load │
└────┬──────────────────┬──────────────────┘
│ full-attention KV │ SSM / linear-attn
│ (10 layers) │ (30 layers)
▼ ▼
Mac Pro LAN Mac Air LAN ← cold storage only, no vLLM
0.5ms RTT 1ms RTT
tierkv vault tierkv vault
(in-memory) (in-memory)
Three tiers:
- Hot — GPU KV cache on the inference node (EXO's KVPrefixCache or vLLM's paged KV cache). Fast, limited by GPU/HBM capacity.
- Cold KV — full-attention layer tensors shipped to a LAN node via gRPC, compressed with TurboQuant INT8 (~3.9× ratio, ≥52 dB SNR).
- Cold SSM — linear-attention / SSM layer states shipped to a second node. Qwen3.6-35B-A3B is a hybrid MoE — 10/40 layers use full attention, 30/40 use linear attention.
On a cache miss, two parallel BatchPromote RPCs fetch all blocks in 2 network round-trips, with parallel decode across a thread pool (decode releases the GIL, so N CPU cores work simultaneously).
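The restore path above can be sketched in plain Python (the client API here is hypothetical — the real promote/decode path lives in the Rust core):

```python
from concurrent.futures import ThreadPoolExecutor

def restore_blocks(kv_client, ssm_client, block_ids, decode, workers=8):
    # Hypothetical client API for illustration. Both BatchPromote RPCs
    # are issued concurrently (the "2 network round-trips"), then every
    # returned block is dequantized across a thread pool; decode releases
    # the GIL in the Rust core, so N CPU cores work simultaneously.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        kv_fut = pool.submit(kv_client.batch_promote, block_ids)
        ssm_fut = pool.submit(ssm_client.batch_promote, block_ids)
        payloads = kv_fut.result() + ssm_fut.result()
        return list(pool.map(decode, payloads))
```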
For vLLM, the layer_type_map in tierkv.toml routes each layer group to the correct vault. For EXO, layer types are auto-detected via isinstance checks — no manual configuration needed.
Hardware Requirements
You need at least 2 machines: one running inference, one as cold storage. Three machines lets you split KV and SSM tiers across separate nodes for better throughput.
| Role | What runs on it | Example |
|---|---|---|
| inference | EXO or vLLM + your model + tierkv | DGX GB10 |
| kv_cold | tierkv vault only | Mac Pro (32 GB) |
| ssm_cold | tierkv vault only | Mac Air (16 GB) |
EXO only runs on the inference node. The cold-tier machines (Mac Pro, Mac Air) only run the tierkv vault server — a lightweight gRPC process that holds KV data in RAM.
Installation
EXO compatibility: tierkv patches EXO's cache.py and builder.py in-place. Tested with EXO as of May 2026. EXO moves fast — if tierkv install errors, check that the patch targets in cache.py and builder.py still match. EXO version auto-detection is on the roadmap.
Download the prebuilt wheel for your platform from the latest release:
pip install tierkv
Or build from source (requires Rust toolchain):
git clone https://github.com/tierkv/tierkv.git
cd tierkv
cd tierkv-core && maturin develop --release && cd ..
pip install -e .
Prebuilt wheels are available for:
- Linux aarch64 (DGX Spark, Jetson, ARM servers)
- Linux x86_64
- macOS arm64 (Apple Silicon — Mac Pro, Mac Air)
Setup — Step by Step
tierkv runs on all three machines, but each machine has a different role and a different config. Install tierkv on every node first, then configure each one.
Step 1 — Configure each machine
Each machine gets its own tierkv.toml with its role and the addresses of the other nodes. Copy the example and edit it:
cp tierkv.toml.example tierkv.toml
On the inference node (DGX Spark) — set role = "inference" and point to the cold nodes:
[cluster]
role = "inference"
[cluster.kv_cold]
host = "192.168.50.11" # Mac Pro LAN IP
port = 50051
[cluster.ssm_cold]
host = "192.168.50.12" # Mac Air LAN IP (5GbE)
port = 50051
[cluster.recompute]
host = "127.0.0.1"
port = 50052
[inference]
exo_path = "/home/user/exo/src/exo" # path to your EXO installation
log_file = "/tmp/tierkv.log"
memory_threshold = 0.60
kv_dim = 256
On the KV cold node (Mac Pro) — set role = "kv_cold", addresses don't matter here:
[cluster]
role = "kv_cold"
[vault]
port = 50051
On the SSM cold node (Mac Air) — set role = "ssm_cold":
[cluster]
role = "ssm_cold"
[vault]
port = 50051
tierkv.toml is gitignored — it contains your private IPs. Only tierkv.toml.example is committed.
Step 2 — Start vault servers on cold nodes
Warning — unbounded RAM growth: The vault holds all received KV data in RAM and currently has no eviction policy. On a Mac Air (16 GB) running a long session, vault RAM will grow until the process is killed. Monitor with tierkv status and restart vault servers between sessions if needed. LRU eviction is on the roadmap.
On Mac Pro and Mac Air (not on DGX):
tierkv vault
This starts the ColdVault gRPC server that the inference node will send KV data to. Keep it running as a background service (macOS launchd / Linux systemd).
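For a Linux cold node, a minimal systemd unit might look like the sketch below (paths, user, and install location are placeholders; on the Macs an equivalent launchd plist with KeepAlive does the same job):

```ini
# /etc/systemd/system/tierkv-vault.service — illustrative sketch
[Unit]
Description=tierkv cold vault (in-memory gRPC KV store)
After=network-online.target

[Service]
# Run from the directory containing this node's tierkv.toml
WorkingDirectory=/home/user/tierkv
ExecStart=/usr/local/bin/tierkv vault
Restart=on-failure
User=user

[Install]
WantedBy=multi-user.target
```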
Step 3 — Install the EXO hook on the inference node
On DGX only:
tierkv install --exo-path /path/to/exo/src/exo
That's it. This command:
- Copies the tierkv hook into EXO's engine directory
- Patches EXO's cache.py to set the memory eviction threshold
- Patches EXO's builder.py to auto-load the hook on startup
Restart EXO. The hook reads tierkv.toml from the working directory and connects to the cold nodes automatically.
Step 4 — Verify
From the inference node, check all nodes are reachable:
tierkv status
[tierkv status] Cluster role: inference
kv_cold 192.168.50.11:50051 ✓ 0.4ms
ssm_cold 192.168.10.174:50051 ✓ 5.9ms
recompute 127.0.0.1:50052 ✓ 0.1ms
[tierkv status] All nodes reachable.
Step 5 — Benchmark
tierkv bench --exo-api http://192.168.50.11:52415
Expected output:
[tierkv bench] EXO API: http://192.168.50.11:52415
Request 1 — cold start (TARGET): 8.41s response: 'The key advantage of mixture-of-experts…'
Waiting 12s for async eviction to complete...
Request 2 — evict step (different prompt): 1.12s response: 'Sure, here is a short poem…'
Waiting 12s for eviction gRPC to settle...
Request 3 — restore (TARGET from cold): 1.62s response: 'The key advantage of mixture-of-experts…'
Speedup (cold → restore): 5.2×
Time saved per request: 6.79s
A speedup below 1.5× means the cold tier isn't being hit — check tierkv status to confirm the vault servers are running and reachable.
vLLM Integration
tierkv ships a native vLLM KV Connector that plugs into vLLM's KVConnectorBase_V1 API. It uses the same cold vault infrastructure as the EXO hook — the same tierkv_core Rust backend, the same tierkv.toml config, and the same gRPC vault servers on Mac Pro / Mac Air.
Tested with: vLLM 0.20.1, torch 2.11.0+cu130, CUDA 13.0 on DGX GB10 (aarch64).
Install vLLM
# Linux aarch64 (DGX GB10/Spark) — requires Python dev headers for fastsafetensors
sudo apt-get install -y python3.12-dev
pip install vllm tierkv
# Linux x86_64 / macOS arm64
pip install vllm tierkv
Start vault servers
Same as for EXO — start tierkv vault on Mac Pro and Mac Air before launching vLLM.
Launch vLLM with TierKV
vllm serve Qwen/Qwen3.6-35B-A3B \
--kv-transfer-config '{
"kv_connector": "TierKVConnector",
"kv_connector_module_path": "tierkv.connectors.vllm.connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
}' \
--enable-prefix-caching \
--block-size 16 \
--no-disable-hybrid-kv-cache-manager \
--max-model-len 20000 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 32
--no-disable-hybrid-kv-cache-manager is required for hybrid models like Qwen3.6 MoE that mix full-attention and SSM/linear-attention layers. vLLM auto-disables HMA when a KV connector is set; this flag re-enables it. TierKVConnector implements SupportsHMA, so the override is safe.
Or pass config inline without a TOML file:
vllm serve Qwen/Qwen3.6-35B-A3B \
--kv-transfer-config '{
"kv_connector": "TierKVConnector",
"kv_connector_module_path": "tierkv.connectors.vllm.connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"kv_cold_host": "192.168.50.11",
"kv_cold_port": 50051,
"ssm_cold_host": "192.168.10.174",
"ssm_cold_port": 50051,
"kv_dim": 256,
"turbo_quant": true,
"block_size": 16
}
}' \
--enable-prefix-caching \
--block-size 16
Note: vLLM 0.20+ uses --kv-transfer-config (not --kv-connector / --kv-connector-extra-config). The connector must be specified as kv_connector (class name) plus kv_connector_module_path (module path) — passing the full dotted path as kv_connector will fail.
vLLM Performance
Measured on DGX Spark (GB10, aarch64) with Qwen3.6-35B-A3B (35B MoE, 40 layers: 10 full-attention + 30 linear-attention), Apple FY2025 10-K (30,561-token real document), cold vaults on Mac Pro + Mac Air (5GbE LAN, 1ms RTT):
| Scenario | TTFT | vs Full Prefill | Notes |
|---|---|---|---|
| Full prefill (30k tokens) | 10.75s | 1× baseline | cold GPU cache, no vault |
| GPU cache hit | 1.19s | 9× faster | same prompt, blocks in GPU |
| Cold vault restore | 0.52s | 20× faster | blocks from LAN vault, skip attention |
Cold vault restore beats GPU cache hit — vault blocks are inserted directly into the KV cache without running attention, so TTFT is pure network + insertion latency. GPU cache hit still runs partial attention over the matched prefix. The gap widens at longer contexts because prefill scales super-linearly while restore is near-linear (network transfer + KV insertion).
Projected scaling (Qwen3.6-35B-A3B, 5GbE LAN vault):
| Prompt size | Cold prefill | GPU cache hit | Cold restore | Speedup |
|---|---|---|---|---|
| 30k tokens (measured) | 10.75s | 1.19s | 0.52s | 20× |
| 60k tokens (projected) | ~26s | ~1.2s | ~1.0s | ~26× |
| 128k tokens (projected) | ~70s | ~1.5s | ~2.0s | ~35× |
Answer quality: cold restore produces bit-for-bit identical output to full prefill. TurboQuant INT8 is lossy but per-group quantization preserves KV distributions well enough that the model's output is indistinguishable. The tensor_hash field in each BlockRecord detects any in-flight corruption.
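The integrity check can be sketched as follows (the hashing scheme is illustrative — the actual algorithm used in BlockRecord may differ):

```python
import hashlib

def tensor_hash(payload: bytes) -> str:
    # Hash the quantized wire bytes; stored alongside the block so the
    # receiver can detect in-flight corruption after BatchPromote,
    # before dequantization. (Illustrative choice of algorithm.)
    return hashlib.blake2b(payload, digest_size=16).hexdigest()

def verify_block(record_hash: str, payload: bytes) -> bool:
    # Reject the block if the bytes no longer match the recorded hash.
    return tensor_hash(payload) == record_hash
```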
Cold Prefill TTFT: 10.75s (30k-token Apple 10-K, no cache)
GPU Cache Hit TTFT: 1.19s (9× faster — same document, blocks in GPU)
Cold Restore TTFT: 0.52s (20× faster — blocks in vault, skip attention)
Answer quality: identical output across all three paths
Vault: Mac Pro + Mac Air, 5GbE LAN, 1ms RTT
Pre-launch smoke test
Run this before any benchmark to catch issues early (context overflow, vault unreachable, vLLM misconfiguration):
python -m tierkv.connectors.vllm.smoke_test \
--base http://localhost:8000 \
--model Qwen/Qwen3.6-35B-A3B \
--toml /path/to/tierkv.toml \
--bench /path/to/bench.py
Expected output:
[1] vLLM health
[PASS] vLLM /health: HTTP 200
[2] Model
[PASS] Model loaded: Qwen/Qwen3.6-35B-A3B
[PASS] max_model_len >= 20000: 20000
[3] Context fit check
[PASS] Context fits (with longest Q): 19724 tokens, 276 headroom
[PASS] Headroom > 100 tokens: 276 tokens to spare
[4] Vault connectivity
[PASS] TCP kv_cold (Mac Pro): 192.168.50.11:50051
[PASS] TCP ssm_cold (Mac Air): 192.168.10.174:50051
[5] Quick inference
[PASS] Inference responds: 8.97s
[PASS] Smoke test: 8/8 checks passed
How it works
The vLLM connector uses a reactive eviction model — it intercepts vLLM's block eviction path, not a periodic snapshot:
- Eviction (request_finished): vLLM signals that GPU blocks are about to be freed. TierKV reads the KV tensors, quantizes them with TurboQuant INT8, and ships them to the cold vault over gRPC. GPU blocks are freed after the store completes.
- Restore (get_num_new_matched_tokens + start_load_kv): on the next request with the same prompt prefix, TierKV finds the blocks in the cold registry, fires a BatchPromote RPC, dequantizes, and writes directly into vLLM's paged KV buffer.
- No-op save (save_kv_layer): vLLM's eager save path is a no-op — eviction is the only trigger.
The connector integrates as a standard vLLM KV Transfer plugin — no vLLM source changes needed.
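The three hooks above can be reduced to a plain-Python skeleton (illustrative only — the real connector subclasses vLLM's KVConnectorBase_V1, whose signatures are richer, and the dict here stands in for the gRPC vault):

```python
class TierKVConnectorSketch:
    """Illustrative lifecycle sketch, not the real connector."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.vault = {}  # stand-in for the remote cold vault

    def request_finished(self, prompt_blocks):
        # Eviction: quantize + ship each block before vLLM frees it
        # (here, just store the bytes; gRPC + TurboQuant in reality).
        for block_hash, kv_bytes in prompt_blocks.items():
            self.vault[block_hash] = kv_bytes

    def get_num_new_matched_tokens(self, block_hashes):
        # Plan: how many leading blocks of the prefix are restorable?
        n = 0
        for h in block_hashes:
            if h not in self.vault:
                break
            n += 1
        return n * self.block_size

    def start_load_kv(self, block_hashes):
        # Restore: fetch the matched blocks; the real connector then
        # dequantizes and writes into the paged KV buffer.
        return [self.vault[h] for h in block_hashes if h in self.vault]

    def save_kv_layer(self, *args):
        # Eager save path is a no-op — eviction is the only trigger.
        pass
```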
Configuration reference
All fields can be set in tierkv.toml under [tierkv] or passed via kv_connector_extra_config inside --kv-transfer-config:
| Field | Default | Description |
|---|---|---|
| kv_cold_host | 127.0.0.1 | Cold vault host for full-attention KV layers |
| kv_cold_port | 50051 | Cold vault port |
| ssm_cold_host | None | Cold vault host for SSM/linear-attention layers (uses kv_cold if unset) |
| ssm_cold_port | 50052 | SSM vault port |
| block_size | 16 | Must match vLLM --block-size |
| kv_dim | 128 | Must match model head_dim — see Troubleshooting |
| turbo_quant | true | INT8 compression (~3.9× ratio) |
| max_inflight_stores | 8 | Concurrent eviction-to-vault gRPC calls |
| max_inflight_promotes | 4 | Concurrent restore-from-vault threads |
kv_dim is critical — a wrong value causes silently incorrect compression. Find the right value with:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("your/model")
print(cfg.hidden_size // cfg.num_attention_heads)
TurboQuant
tierkv includes a per-group INT8 quantizer for KV tensor compression before sending over the network.
- Group size: kv_dim floats — must match your model's attention head dimension (default 256 for Qwen3.6-35B-A3B; use 128 for Llama-3, Qwen2.5, Mistral; see tierkv.toml.example for how to find the right value for other models)
- Each group gets its own absmax scale: scale = max(|x|) / 127
- Wire format: [scale: f32 LE][i8 × 256] per group
- Compression ratio: ~3.9× (f32 input → INT8 output)
- SNR: ≥52 dB on real KV distributions (per-group scaling isolates outliers)
from tierkv_core import TurboQuant
q = TurboQuant(dim=256)
compressed = q.encode(f32_bytes) # ~3.9× smaller
recovered = q.decode(compressed) # ≥52 dB SNR
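For reference, the per-group scheme can be reproduced in NumPy — a sketch of the algorithm, not the Rust implementation (group size hard-coded to the 256 default):

```python
import struct
import numpy as np

GROUP = 256  # must match kv_dim

def tq_encode(x: np.ndarray) -> bytes:
    # Per-group absmax INT8: [scale f32 LE][int8 x GROUP] per group.
    out = bytearray()
    for g in x.reshape(-1, GROUP):
        scale = float(np.max(np.abs(g))) / 127.0 or 1.0  # avoid div-by-zero
        q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
        out += struct.pack("<f", scale) + q.tobytes()
    return bytes(out)

def tq_decode(buf: bytes) -> np.ndarray:
    # Invert the wire format: read each scale, rescale its int8 group.
    step = 4 + GROUP
    groups = []
    for off in range(0, len(buf), step):
        scale = struct.unpack_from("<f", buf, off)[0]
        q = np.frombuffer(buf, dtype=np.int8, count=GROUP, offset=off + 4)
        groups.append(q.astype(np.float32) * scale)
    return np.concatenate(groups)
```

Each group of 256 f32 values (1024 bytes) becomes 4 + 256 = 260 bytes on the wire, which is where the ~3.9× ratio comes from.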
Architecture Notes
Why not standard KV offloading? Most KV offload systems evict to local SSD or CPU RAM on the same machine. tierkv evicts across the network to separate machines, letting idle hardware on your LAN participate in serving long-context requests.
Why EXO? EXO provides an OpenAI-compatible API layer and handles model loading across Apple Silicon and CUDA devices. tierkv monkey-patches EXO's KVPrefixCache eviction and retrieval paths without modifying EXO's core. EXO runs only on the inference node — cold nodes run only the tierkv vault.
What about multi-node inference? EXO supports pipeline-parallel inference (splitting layers across machines). tierkv is currently designed for single-node inference with distributed cold storage. The two can coexist but require separate configuration.
Cluster Tested
| Node | Role | Memory | Network |
|---|---|---|---|
| DGX Spark (GB10, aarch64) | Inference — EXO or vLLM + Qwen3.6-35B-A3B | 128 GB RAM + 96 GB HBM | 5GbE LAN |
| Mac Pro (M2 Pro) | KV cold tier — tierkv vault only | 32 GB | 5GbE LAN (0.5ms to DGX) |
| Mac Air (M2) | SSM cold tier — tierkv vault only | 16 GB | 5GbE LAN (1ms to DGX) |
Over one session (EXO): 227 evictions, 6 cold restores, ~26s saved per restore. Over one session (vLLM, Apple 10-K 30k tokens): cold restore 0.52s vs 10.75s cold prefill (20×); GPU cache hit 1.19s (9×).
Troubleshooting
See TROUBLESHOOTING.md for documented failures and fixes, including:
- gRPC 4 MB message size limit (silent empty responses)
- kv_dim mismatch causing silent incorrect compression
- KVCache.offset semantics and garbage output after restore
- Stale semaphores after kill -9
- EXO Nack loop and election storm after hard reset
- SSH lockout during model load
- Wrong platform wheel installed on Linux
- vLLM fastsafetensors build failure on aarch64
Roadmap
- Persistent cold storage (SQLite / memory-mapped file — survive reboots)
- TurboQuant codebook training on real KV activations (push SNR higher)
- Longer contexts or slower GPUs where prefill is the bottleneck
- Quantization quality validation with the _kv_offsets fix in place
- EXO version detection for hook compatibility
- LRU eviction inside the cold vault (configurable max capacity — currently vaults grow unbounded in RAM)