
tierkv

3-tier distributed KV cache for LLM inference.

When your GPU evicts a KV cache entry, tierkv ships it to another machine over gRPC instead of dropping it. On the next request with the same prompt, the KV is fetched back in a single batch call — skipping the expensive prefill entirely.

Tested on Qwen3.6-35B-A3B across a DGX GB10 + Mac Pro + Mac Air cluster:

EXO integration (BF16, 3.7k–8k token prompts):

Scenario                         TTFT     vs Cold
Cold start, 8,000-token prompt   30.83s   baseline
Restored from cold tier          4.11s    7.3× faster
Cold start, 3,707-token prompt   23.78s   baseline
Restored from cold tier          4.59s    5.2× faster

vLLM integration (Apple FY2025 10-K, GB10 GPU, real-world document Q&A):

Prompt size              Cold prefill   GPU cache hit   Cold restore   Speedup
30k tokens (measured)    10.75s         1.19s           0.52s          20×
60k tokens (projected)   ~26s           ~1.2s           ~1.0s          ~26×
128k tokens (projected)  ~70s           ~1.5s           ~2.0s          ~35×

Cold vault restore beats GPU cache hit — blocks land directly in the KV cache, skipping attention recomputation entirely. The speedup grows with context length because prefill scales super-linearly while restore is near-linear (network transfer). Answer quality is bit-for-bit identical across all three paths.


How It Works

tierkv supports two inference backends. The cold-storage layer (vault servers, gRPC, TurboQuant) is identical in both cases.

EXO backend (monkey-patch):

  DGX GB10 — inference only
  ┌─────────────────────────────────┐
  │  EXO + Qwen3.6-35B-A3B (BF16)  │  ← EXO runs HERE only
  │  KVPrefixCache (GPU hot tier)   │
  │         │ evict (60% RAM)       │
  │         ▼                       │
  │   tierkv hook (monkey-patch)    │
  └────┬──────────────┬─────────────┘
       │ KVCache       │ ArraysCache
       │ (10 layers)   │ (30 layers)
       ▼               ▼
  Mac Pro LAN      Mac Air LAN        ← cold storage only, no EXO
  0.5ms RTT        1ms RTT
  tierkv vault     tierkv vault
  (in-memory)      (in-memory)

vLLM backend (KVConnectorBase_V1 plugin):

  DGX GB10 — inference only
  ┌──────────────────────────────────────────┐
  │  vLLM + Qwen3.6-35B-A3B                 │
  │  Paged KV cache (GPU hot tier, 40 blocks)│
  │         │ block evicted                  │
  │         ▼                                │
  │   TierKVConnector (KVConnectorBase_V1)   │
  │   ├─ request_finished  → store to vault  │
  │   ├─ get_num_new_matched_tokens → plan   │
  │   └─ start_load_kv / wait_for_layer_load │
  └────┬──────────────────┬──────────────────┘
       │ full-attention KV │ SSM / linear-attn
       │ (10 layers)       │ (30 layers)
       ▼                   ▼
  Mac Pro LAN          Mac Air LAN     ← cold storage only, no vLLM
  0.5ms RTT            1ms RTT
  tierkv vault         tierkv vault
  (in-memory)          (in-memory)

Three tiers:

  • Hot — GPU KV cache on the inference node (EXO's KVPrefixCache or vLLM's paged KV cache). Fast, limited by GPU/HBM capacity.
  • Cold KV — Full-attention layer tensors shipped to a LAN node via gRPC, compressed with TurboQuant INT8 (~3.9× ratio, ≥52 dB SNR).
  • Cold SSM — Linear-attention / SSM layer states shipped to a second node. Qwen3.6-35B-A3B is a hybrid MoE — 10/40 layers use full attention, 30/40 use linear attention.

On a cache miss, two parallel BatchPromote RPCs — one per vault — fetch all blocks in two overlapping network round trips, and the blocks are dequantized in parallel across a thread pool (decode releases the GIL, so N CPU cores work simultaneously); see the sketch below.
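
A minimal sketch of that restore path, with hypothetical vault-client and quantizer handles (the real logic lives in the tierkv hook and the Rust core):

# Sketch only: one BatchPromote RPC per vault, issued concurrently, then
# per-block dequantization fanned out across a thread pool. Because the
# decode releases the GIL, N worker threads use N CPU cores.
from concurrent.futures import ThreadPoolExecutor

def restore_blocks(kv_vault, ssm_vault, block_ids, quant):
    with ThreadPoolExecutor() as pool:
        kv_fut  = pool.submit(kv_vault.batch_promote, block_ids)   # round trip 1
        ssm_fut = pool.submit(ssm_vault.batch_promote, block_ids)  # round trip 2, overlapped
        compressed = kv_fut.result() + ssm_fut.result()
        return list(pool.map(quant.decode, compressed))            # parallel decode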

For vLLM, the layer_type_map in tierkv.toml routes each layer group to the correct vault. For EXO, layer types are auto-detected via isinstance checks — no manual configuration needed.
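
A sketch of the EXO-side auto-detection idea — class-name matching stands in for the hook's isinstance checks, since EXO's import paths are omitted here and the vault handles are hypothetical:

# Sketch of the per-layer routing described above. KVCache and ArraysCache
# are EXO's cache classes (see the diagram); everything else is illustrative.
def pick_vault(layer_cache, kv_vault, ssm_vault):
    kind = type(layer_cache).__name__
    if kind == "KVCache":       # full-attention layer → KV cold vault
        return kv_vault
    if kind == "ArraysCache":   # SSM / linear-attention layer → SSM cold vault
        return ssm_vault
    raise TypeError(f"unrecognized cache type: {kind}")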


Hardware Requirements

You need at least 2 machines: one running inference, one as cold storage. A third machine lets you split the KV and SSM tiers across separate nodes for better throughput.

Role        What runs on it                     Example
inference   EXO or vLLM + your model + tierkv   DGX GB10
kv_cold     tierkv vault only                   Mac Pro (32 GB)
ssm_cold    tierkv vault only                   Mac Air (16 GB)

EXO only runs on the inference node. The cold-tier machines (Mac Pro, Mac Air) only run the tierkv vault server — a lightweight gRPC process that holds KV data in RAM.
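
Conceptually, a vault is just an in-memory map from block key to quantized payload behind a gRPC service. A sketch of that model (the gRPC plumbing is omitted and the names are illustrative stand-ins for the real ColdVault server):

# Conceptual model of a vault — held entirely in RAM, no eviction policy
# yet (see Roadmap). Not the actual server implementation.
class ColdVaultStore:
    def __init__(self):
        self._blocks: dict[str, bytes] = {}

    def store(self, key: str, payload: bytes) -> None:
        self._blocks[key] = payload

    def batch_promote(self, keys: list[str]) -> list[bytes]:
        # One RPC returns every requested block the vault holds.
        return [self._blocks[k] for k in keys if k in self._blocks]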


Installation

EXO compatibility: tierkv patches EXO's cache.py and builder.py in-place. Tested with EXO as of May 2026. EXO moves fast — if tierkv install errors, check that the patch targets in cache.py and builder.py still match. EXO version auto-detection is on the roadmap.

Download the prebuilt wheel for your platform from the latest release:

pip install tierkv

Or build from source (requires Rust toolchain):

git clone https://github.com/tierkv/tierkv.git
cd tierkv
cd tierkv-core && maturin develop --release && cd ..
pip install -e .

Prebuilt wheels are available for:

  • Linux aarch64 (DGX Spark, Jetson, ARM servers)
  • Linux x86_64
  • macOS arm64 (Apple Silicon — Mac Pro, Mac Air)

Setup — Step by Step

tierkv runs on all three machines, but each machine has a different role and a different config. Install tierkv on every node first, then configure each one.

Step 1 — Configure each machine

Each machine gets its own tierkv.toml with its role and the addresses of the other nodes. Copy the example and edit it:

cp tierkv.toml.example tierkv.toml

On the inference node (DGX Spark) — set role = "inference" and point to the cold nodes:

[cluster]
role = "inference"

[cluster.kv_cold]
host = "192.168.50.11"      # Mac Pro LAN IP
port = 50051

[cluster.ssm_cold]
host = "192.168.50.12"      # Mac Air LAN IP (5GbE)
port = 50051

[cluster.recompute]
host = "127.0.0.1"
port = 50052

[inference]
exo_path = "/home/user/exo/src/exo"   # path to your EXO installation
log_file  = "/tmp/tierkv.log"
memory_threshold = 0.60
kv_dim   = 256

On the KV cold node (Mac Pro) — set role = "kv_cold"; the cluster addresses aren't needed here:

[cluster]
role = "kv_cold"

[vault]
port = 50051

On the SSM cold node (Mac Air) — set role = "ssm_cold":

[cluster]
role = "ssm_cold"

[vault]
port = 50051

tierkv.toml is gitignored — it contains your private IPs. Only tierkv.toml.example is committed.

Step 2 — Start vault servers on cold nodes

Warning — unbounded RAM growth: The vault holds all received KV data in RAM and currently has no eviction policy. On a Mac Air (16 GB) running a long session, vault RAM will grow until the process is killed. Monitor with tierkv status and restart vault servers between sessions if needed. LRU eviction is on the roadmap.

On Mac Pro and Mac Air (not on DGX):

tierkv vault

This starts the ColdVault gRPC server that the inference node will send KV data to. Keep it running as a background service (macOS launchd / Linux systemd).

Step 3 — Install the EXO hook on the inference node

On DGX only:

tierkv install --exo-path /path/to/exo/src/exo

That's it. This command:

  1. Copies the tierkv hook into EXO's engine directory
  2. Patches EXO's cache.py to set the memory eviction threshold
  3. Patches EXO's builder.py to auto-load the hook on startup

Restart EXO. The hook reads tierkv.toml from the working directory and connects to the cold nodes automatically.

Step 4 — Verify

From the inference node, check all nodes are reachable:

tierkv status
[tierkv status] Cluster role: inference

  kv_cold      192.168.50.11:50051   ✓  0.4ms
  ssm_cold     192.168.10.174:50051  ✓  5.9ms
  recompute    127.0.0.1:50052       ✓  0.1ms

[tierkv status] All nodes reachable.

Step 5 — Benchmark

tierkv bench --exo-api http://192.168.50.11:52415

Expected output:

[tierkv bench] EXO API: http://192.168.50.11:52415

  Request 1 — cold start (TARGET): 8.41s   response: 'The key advantage of mixture-of-experts…'
  Waiting 12s for async eviction to complete...
  Request 2 — evict step (different prompt): 1.12s   response: 'Sure, here is a short poem…'
  Waiting 12s for eviction gRPC to settle...
  Request 3 — restore (TARGET from cold): 1.62s   response: 'The key advantage of mixture-of-experts…'

  Speedup (cold → restore): 5.2×
  Time saved per request:   6.79s

A speedup below 1.5× means the cold tier isn't being hit — check tierkv status to confirm the vault servers are running and reachable.
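
To spot-check TTFT by hand rather than through tierkv bench, a minimal probe against the OpenAI-compatible endpoint looks like this (URL, model, and prompt are placeholders):

# Streams a chat completion and reports time-to-first-token.
import time, requests

def ttft(base_url: str, model: str, prompt: str) -> float:
    t0 = time.perf_counter()
    with requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model, "stream": True,
              "messages": [{"role": "user", "content": prompt}]},
        stream=True, timeout=600,
    ) as resp:
        for line in resp.iter_lines():
            if line.startswith(b"data: ") and line != b"data: [DONE]":
                return time.perf_counter() - t0   # first streamed chunk
    raise RuntimeError("no tokens streamed")

Run it once cold and once after eviction settles, then compare the two numbers.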


vLLM Integration

tierkv ships a native vLLM KV Connector that plugs into vLLM's KVConnectorBase_V1 API. It uses the same cold vault infrastructure as the EXO hook — the same tierkv_core Rust backend, the same tierkv.toml config, and the same gRPC vault servers on Mac Pro / Mac Air.

Tested with: vLLM 0.20.1, torch 2.11.0+cu130, CUDA 13.0 on DGX GB10 (aarch64).

Install vLLM

# Linux aarch64 (DGX GB10/Spark) — requires Python dev headers for fastsafetensors
sudo apt-get install -y python3.12-dev
pip install vllm tierkv
# Linux x86_64 / macOS arm64
pip install vllm tierkv

Start vault servers

Same as for EXO — start tierkv vault on Mac Pro and Mac Air before launching vLLM.

Launch vLLM with TierKV

vllm serve Qwen/Qwen3.6-35B-A3B \
  --kv-transfer-config '{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
  }' \
  --enable-prefix-caching \
  --block-size 16 \
  --no-disable-hybrid-kv-cache-manager \
  --max-model-len 20000 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 32

--no-disable-hybrid-kv-cache-manager is required for hybrid models like Qwen3.6 MoE that mix full-attention and SSM/linear-attention layers. vLLM auto-disables the hybrid KV cache manager (HMA) when a KV connector is set; this flag re-enables it. TierKVConnector implements SupportsHMA, so the override is safe.

Or pass config inline without a TOML file:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --kv-transfer-config '{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "kv_cold_host": "192.168.50.11",
      "kv_cold_port": 50051,
      "ssm_cold_host": "192.168.10.174",
      "ssm_cold_port": 50051,
      "kv_dim": 256,
      "turbo_quant": true,
      "block_size": 16
    }
  }' \
  --enable-prefix-caching \
  --block-size 16

Note: vLLM 0.20+ uses --kv-transfer-config (not --kv-connector / --kv-connector-extra-config). The connector must be specified as kv_connector (class name) + kv_connector_module_path (module path) — passing the full dotted path as kv_connector will fail.

vLLM Performance

Measured on DGX Spark (GB10, aarch64) with Qwen3.6-35B-A3B (35B MoE, 40 layers: 10 full-attention + 30 linear-attention), Apple FY2025 10-K (30,561-token real document), cold vaults on Mac Pro + Mac Air (5GbE LAN, 1ms RTT):

Scenario                    TTFT     vs Full Prefill   Notes
Full prefill (30k tokens)   10.75s   1× baseline       cold GPU cache, no vault
GPU cache hit               1.19s    9× faster         same prompt, blocks in GPU
Cold vault restore          0.52s    20× faster        blocks from LAN vault, skip attention

Cold vault restore beats GPU cache hit — vault blocks are inserted directly into the KV cache without running attention, so TTFT is pure network + insertion latency. GPU cache hit still runs partial attention over the matched prefix. The gap widens at longer contexts because prefill scales super-linearly while restore is near-linear (network transfer + KV insertion).
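
A loose back-of-envelope model reproduces the shape of this scaling — the constants below are fitted to the measured 30k-token point and are purely illustrative, not a tierkv formula:

# Prefill has a quadratic attention term; restore is ~linear in tokens.
def prefill_s(n: int) -> float:
    return 3.0e-4 * n + 1.9e-9 * n * n

def restore_s(n: int) -> float:
    return 1.7e-5 * n

for n in (30_000, 60_000, 128_000):
    print(f"{n:>7} tokens: prefill ~{prefill_s(n):5.1f}s, "
          f"restore ~{restore_s(n):4.2f}s, ~{prefill_s(n)/restore_s(n):.0f}x")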

Projected scaling (Qwen3.6-35B-A3B, 5GbE LAN vault):

Prompt size              Cold prefill   GPU cache hit   Cold restore   Speedup
30k tokens (measured)    10.75s         1.19s           0.52s          20×
60k tokens (projected)   ~26s           ~1.2s           ~1.0s          ~26×
128k tokens (projected)  ~70s           ~1.5s           ~2.0s          ~35×

Answer quality: cold restore produces bit-for-bit identical output to full prefill. TurboQuant INT8 is lossy but per-group quantization preserves KV distributions well enough that the model's output is indistinguishable. The tensor_hash field in each BlockRecord detects any in-flight corruption.
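
The hash algorithm isn't specified here, but the integrity check reduces to something like the sketch below (SHA-256 over the wire bytes is an assumption):

# Illustration of the tensor_hash idea: hash the exact bytes that went
# on the wire, re-hash on receipt, refuse to restore on mismatch.
import hashlib

def tensor_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def checked_restore(record_hash: str, payload: bytes) -> bytes:
    if tensor_hash(payload) != record_hash:
        raise ValueError("KV block corrupted in flight")
    return payload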

Cold Prefill TTFT:   10.75s  (30k-token Apple 10-K, no cache)
GPU Cache Hit TTFT:   1.19s  (9× faster — same document, blocks in GPU)
Cold Restore TTFT:    0.52s  (20× faster — blocks in vault, skip attention)
Answer quality:       identical output across all three paths
Vault:                Mac Pro + Mac Air, 5GbE LAN, 1ms RTT

Pre-launch smoke test

Run this before any benchmark to catch issues early (context overflow, vault unreachable, vLLM misconfiguration):

python -m tierkv.connectors.vllm.smoke_test \
  --base http://localhost:8000 \
  --model Qwen/Qwen3.6-35B-A3B \
  --toml /path/to/tierkv.toml \
  --bench /path/to/bench.py

Expected output:

[1] vLLM health
  [PASS] vLLM /health: HTTP 200
[2] Model
  [PASS] Model loaded: Qwen/Qwen3.6-35B-A3B
  [PASS] max_model_len >= 20000: 20000
[3] Context fit check
  [PASS] Context fits (with longest Q): 19724 tokens, 276 headroom
  [PASS] Headroom > 100 tokens: 276 tokens to spare
[4] Vault connectivity
  [PASS] TCP kv_cold (Mac Pro): 192.168.50.11:50051
  [PASS] TCP ssm_cold (Mac Air): 192.168.10.174:50051
[5] Quick inference
  [PASS] Inference responds: 8.97s
[PASS] Smoke test: 8/8 checks passed

How it works

The vLLM connector uses a reactive eviction model — it intercepts vLLM's block eviction path, not a periodic snapshot:

  1. Eviction (request_finished): vLLM signals that GPU blocks are about to be freed. TierKV reads the KV tensors, quantizes them with TurboQuant INT8, and ships them to the cold vault over gRPC. GPU blocks are freed after the store completes.
  2. Restore (get_num_new_matched_tokens + start_load_kv): On the next request with the same prompt prefix, TierKV finds the blocks in the cold registry, fires a BatchPromote RPC, dequantizes, and writes directly into vLLM's paged KV buffer.
  3. No-op save (save_kv_layer): vLLM's eager save path is a no-op — eviction is the only trigger.

The connector integrates as a standard vLLM KV Transfer plugin — no vLLM source changes needed.
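
Schematically, the connector reduces to the shape below. The hook names mirror the KVConnectorBase_V1 methods listed above, but the signatures are simplified and the vault client is a hypothetical stand-in — see tierkv.connectors.vllm.connector for the real implementation:

# Schematic only — not the actual connector source.
class TierKVConnectorSketch:
    def __init__(self, vault):
        self.vault = vault  # gRPC client for the cold vaults

    def request_finished(self, request, block_ids):
        # Eviction: quantize and ship blocks before vLLM frees them.
        self.vault.store(request.prompt_hash, block_ids)

    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        # Restore planning: how many prefix tokens does the vault hold
        # beyond what's already resident in the GPU cache?
        return self.vault.matched_tokens(request.prompt_hash) - num_computed_tokens

    def start_load_kv(self, forward_context):
        # Fire the BatchPromote RPCs; blocks land in the paged KV buffer.
        self.vault.batch_promote_async(forward_context)

    def save_kv_layer(self, layer_name, kv_layer, attn_metadata):
        pass  # eager save is a no-op — eviction is the only store trigger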

Configuration reference

All fields can be set in tierkv.toml under [tierkv] or passed inline via kv_connector_extra_config in --kv-transfer-config:

Field                   Default     Description
kv_cold_host            127.0.0.1   Cold vault host for full-attention KV layers
kv_cold_port            50051       Cold vault port
ssm_cold_host           None        Cold vault host for SSM/linear-attention layers (uses kv_cold if unset)
ssm_cold_port           50052       SSM vault port
block_size              16          Must match vLLM --block-size
kv_dim                  128         Must match model head_dim — see Troubleshooting
turbo_quant             true        INT8 compression (~3.9× ratio)
max_inflight_stores     8           Concurrent eviction-to-vault gRPC calls
max_inflight_promotes   4           Concurrent restore-from-vault threads

kv_dim is critical — a wrong value causes silently incorrect compression. Find the right value with:

from transformers import AutoConfig

# kv_dim is the attention head dimension. Some configs expose head_dim
# directly; otherwise derive it from hidden_size / num_attention_heads.
cfg = AutoConfig.from_pretrained("your/model")
print(getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads)

TurboQuant

tierkv includes a per-group INT8 quantizer for KV tensor compression before sending over the network.

  • Group size: kv_dim floats — must match your model's attention head dimension (default 256 for Qwen3.6-35B-A3B; use 128 for Llama-3, Qwen2.5, Mistral; see tierkv.toml.example for how to find the right value for other models)
  • Each group gets its own absmax scale: scale = max(|x|) / 127
  • Wire format: [scale: f32 LE][i8 × kv_dim] per group (kv_dim = 256 here)
  • Compression ratio: ~3.9× (BF16 input → INT8 output)
  • SNR: ≥52 dB on real KV distributions (per-group isolates outliers)

from tierkv_core import TurboQuant
q = TurboQuant(dim=256)
compressed = q.encode(f32_bytes)   # ~3.9× smaller
recovered  = q.decode(compressed)  # ≥52 dB SNR
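
The per-group absmax scheme is simple enough to restate in NumPy — this is an illustration of the wire format above, not the Rust codec:

# Per-group absmax INT8: [scale: f32 LE][i8 × dim] per group.
# Assumes len(x) is a multiple of dim.
import struct
import numpy as np

def tq_encode(x: np.ndarray, dim: int = 256) -> bytes:
    groups = x.astype(np.float32).reshape(-1, dim)
    scales = np.abs(groups).max(axis=1) / 127.0
    scales[scales == 0] = 1.0                        # guard all-zero groups
    q = np.clip(np.round(groups / scales[:, None]), -127, 127).astype(np.int8)
    out = bytearray()
    for s, g in zip(scales, q):
        out += struct.pack("<f", float(s)) + g.tobytes()
    return bytes(out)

def tq_decode(buf: bytes, dim: int = 256) -> np.ndarray:
    rec = np.dtype([("scale", "<f4"), ("q", np.int8, (dim,))])
    recs = np.frombuffer(buf, dtype=rec)
    return (recs["q"].astype(np.float32) * recs["scale"][:, None]).ravel()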

Architecture Notes

Why not standard KV offloading? Most KV offload systems evict to local SSD or CPU RAM on the same machine. tierkv evicts across the network to separate machines, letting idle hardware on your LAN participate in serving long-context requests.

Why EXO? EXO provides an OpenAI-compatible API layer and handles model loading across Apple Silicon and CUDA devices. tierkv monkey-patches EXO's KVPrefixCache eviction and retrieval paths without modifying EXO's core. EXO runs only on the inference node — cold nodes run only the tierkv vault.

What about multi-node inference? EXO supports pipeline-parallel inference (splitting layers across machines). tierkv is currently designed for single-node inference with distributed cold storage. The two can coexist but require separate configuration.


Cluster Tested

Node                        Role                                        Memory                   Network
DGX Spark (GB10, aarch64)   Inference — EXO or vLLM + Qwen3.6-35B-A3B   128 GB RAM + 96 GB HBM   5GbE LAN
Mac Pro (M2 Pro)            KV cold tier — tierkv vault only            32 GB                    5GbE LAN (0.5ms to DGX)
Mac Air (M2)                SSM cold tier — tierkv vault only           16 GB                    5GbE LAN (1ms to DGX)

Over one session (EXO): 227 evictions, 6 cold restores, ~26s saved per restore. Over one session (vLLM, Apple 10-K 30k tokens): cold restore 0.52s vs 10.75s cold prefill (20×); GPU cache hit 1.19s (9×).


Troubleshooting

See TROUBLESHOOTING.md for documented failures and fixes, including:

  • gRPC 4 MB message size limit (silent empty responses)
  • kv_dim mismatch causing silent incorrect compression
  • KVCache.offset semantics and garbage output after restore
  • Stale semaphores after kill -9
  • EXO Nack loop and election storm after hard reset
  • SSH lockout during model load
  • Wrong platform wheel installed on Linux
  • vLLM fastsafetensors build failure on aarch64

Roadmap

  • Persistent cold storage (SQLite / memory-mapped file — survive reboots)
  • TurboQuant codebook training on real KV activations (push SNR higher)
  • Benchmarks on longer contexts and slower GPUs, where prefill is the bottleneck
  • Quantization quality validation with _kv_offsets fix in place
  • EXO version detection for hook compatibility
  • LRU eviction inside the cold vault (configurable max capacity — currently vaults grow unbounded in RAM)
