Chunk-level KV cache reuse for faster HuggingFace inference
KVBoost
Chunk-level KV cache reuse for HuggingFace inference.
Reuse KV tensors across requests that share long prefixes. Drop-in on any HF causal LM.
Quick start
pip install kvboost
from kvboost import KVBoost
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")
# Warm the shared prefix once
engine.warm("You are a helpful coding assistant. Always be concise...")
# Subsequent generates reuse cached chunks automatically
result = engine.generate(
    "You are a helpful coding assistant. Always be concise...\n\n"
    "User: How do I reverse a linked list?\nAssistant:",
    max_new_tokens=128,
)
print(result.output_text)
print(f"TTFT: {result.ttft_ms:.1f} ms | reuse: {result.kv_reuse_ratio:.0%}")
From source:
git clone https://github.com/pythongiant/kvboost.git
cd kvboost
pip install -e .
Requirements: Python ≥ 3.9, PyTorch ≥ 2.1, Transformers ≥ 4.38.
How it works
The core idea is one sentence: split the prompt into fixed-size chunks, hash them, and on the next request load the K/V tensors for chunks you have already computed instead of recomputing them. Everything else is making that produce correct outputs.
1. Chunking
chunk_registry.py splits the token
stream into fixed-size blocks (default 128). A 1000-token prompt becomes
7 full chunks plus a 104-token tail. With --chunk-boundary-window=16
the cut point slides up to ±16 tokens to avoid splitting mid-sentence,
which reduces seam error on natural-language prompts.
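The fixed-size split can be sketched in a few lines. This is a minimal illustration, not the code in chunk_registry.py: the function name `split_into_chunks` is hypothetical, and the sliding boundary window is deliberately left out.

```python
from typing import List, Tuple


def split_into_chunks(
    tokens: List[int], chunk_size: int = 128
) -> Tuple[List[List[int]], List[int]]:
    """Split a token stream into full fixed-size chunks plus a shorter tail."""
    n_full = len(tokens) // chunk_size
    full = [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]
    tail = tokens[n_full * chunk_size:]
    return full, tail


# A 1000-token prompt with the default chunk size of 128.
full_chunks, tail = split_into_chunks(list(range(1000)))
print(len(full_chunks), len(tail))  # 7 104
```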
2. Two-level hashing
Each chunk gets two keys (see models.py):
prefix_hash = SHA256(previous_chunk.prefix_hash || this_chunk.tokens)
content_hash = SHA256(this_chunk.tokens)
The prefix hash only matches when the tokens and every preceding chunk are identical — this is the case where stored K/V is directly usable. The content hash is a fallback: the tokens match but the history doesn't, so the stored K/V is approximately right but needs heavier correction.
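A small sketch of the two keys, assuming tokens are serialized as fixed-width 4-byte integers before hashing (the actual byte encoding in models.py is not stated here, and `chunk_keys` is a hypothetical helper name):

```python
import hashlib
from typing import List, Tuple


def _token_bytes(tokens: List[int]) -> bytes:
    # Fixed-width encoding so that [1, 23] and [12, 3] never serialize alike.
    return b"".join(t.to_bytes(4, "little") for t in tokens)


def chunk_keys(chunks: List[List[int]]) -> List[Tuple[str, str]]:
    """Return (prefix_hash, content_hash) for each chunk, in order."""
    keys = []
    prev_prefix = b""  # the first chunk has no predecessor
    for tokens in chunks:
        data = _token_bytes(tokens)
        prefix = hashlib.sha256(prev_prefix + data).digest()
        content = hashlib.sha256(data).hexdigest()
        keys.append((prefix.hex(), content))
        prev_prefix = prefix
    return keys


# Same second chunk, different first chunk.
a = chunk_keys([[1, 2, 3], [4, 5, 6]])
b = chunk_keys([[9, 9, 9], [4, 5, 6]])
print(a[1][1] == b[1][1])  # True: content hashes agree
print(a[1][0] == b[1][0])  # False: prefix hashes differ because the history differs
```

Because each prefix hash folds in the previous one, a single changed token anywhere earlier in the prompt changes every later prefix hash, while content hashes stay put.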
3. Lookup and assembly
KVCacheManager.find_matching_chunks()
tries prefix hash, then falls back to content hash, and flags approximate
matches. PromptAssembler then splits
the prompt into a cached prefix (K/V loaded from memory) and a live
suffix (tokens the model still has to process).
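The lookup order and the prefix/suffix split might look roughly like this. Everything here is illustrative: `match_chunks`, `split_prompt`, and the two dictionaries are hypothetical names, not the signature of KVCacheManager.find_matching_chunks(), and the sketch assumes matching stops at the first miss so that the cached part stays a contiguous prefix.

```python
from typing import Any, Dict, List, Tuple


def match_chunks(
    keys: List[Tuple[str, str]],
    by_prefix: Dict[str, Any],
    by_content: Dict[str, Any],
) -> List[Tuple[Any, bool]]:
    """Return (kv, is_approximate) for the leading run of cache hits."""
    matches = []
    for prefix_hash, content_hash in keys:
        if prefix_hash in by_prefix:
            matches.append((by_prefix[prefix_hash], False))  # exact hit
        elif content_hash in by_content:
            matches.append((by_content[content_hash], True))  # approximate hit
        else:
            break  # first miss ends the cached prefix (assumption)
    return matches


def split_prompt(tokens: List[int], n_cached_chunks: int, chunk_size: int):
    """Split tokens into the part covered by cached chunks and the live suffix."""
    cut = n_cached_chunks * chunk_size
    return tokens[:cut], tokens[cut:]


keys = [("p0", "c0"), ("pX", "c1"), ("pY", "cZ")]
matches = match_chunks(keys, by_prefix={"p0": "kv0"}, by_content={"c1": "kv1"})
cached, live = split_prompt(list(range(10)), len(matches), chunk_size=3)
print(matches)  # [('kv0', False), ('kv1', True)]
print(live)     # [6, 7, 8, 9]
```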
Cache storage is an OrderedDict in CPU RAM with frequency-based
eviction; frequently-reused chunks (your system prompt) stay resident,
one-off chunks get evicted first. Overflow spills to a pre-allocated
binary file via disk_tier.py.
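Frequency-based eviction on top of an OrderedDict can be sketched as follows. The class name `FrequencyCache` and the tie-breaking rule (least recently used among equally cold entries) are assumptions for illustration; the real cache manager may differ.

```python
from collections import OrderedDict


class FrequencyCache:
    """Toy cache that evicts the least-frequently-hit entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # key -> [value, hit_count]

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        entry[1] += 1
        self._items.move_to_end(key)  # remember recency for tie-breaking
        return entry[0]

    def put(self, key, value):
        """Insert a value; return the evicted (key, value) pair, if any."""
        if key in self._items:
            self._items[key][0] = value
            self._items.move_to_end(key)
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            # min() keeps the first minimal key in iteration order, and the
            # OrderedDict iterates oldest first, so ties go to the oldest entry.
            victim = min(self._items, key=lambda k: self._items[k][1])
            evicted = (victim, self._items.pop(victim)[0])
        self._items[key] = [value, 0]
        return evicted


cache = FrequencyCache(capacity=2)
cache.put("system_prompt", "kv_s")
cache.put("one_off", "kv_o")
for _ in range(3):
    cache.get("system_prompt")
evicted = cache.put("new_chunk", "kv_n")
print(evicted)  # ('one_off', 'kv_o')
```

The value returned by `put` is the natural place to hand an evicted chunk to a slower tier instead of dropping it.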
4. Seam repair
This is the part that makes stitching correct. Each cached chunk was originally computed without seeing the chunks now preceding it in the new prompt, so its K/V values are slightly wrong at the boundaries.
KVBoost has two strategies (recompute_strategy=):
- selective (default) re-runs the model on the last R tokens at each seam with the preceding cached context visible, and overwrites the stale K/V. Cheap, but it only fixes the boundary. (selective_recompute.py)
- cacheblend does one forward pass, measures per-token cosine deviation vs. what the K/V would be with full context, and recomputes only the ~15% most-deviated tokens. Catches mid-chunk errors that selective misses. (cacheblend.py)
Approximate (content-hash) matches force CacheBlend regardless of the chosen strategy — position encodings are wrong in that case and boundary-only repair is not enough.
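The token-selection step of the cacheblend strategy can be illustrated with plain lists. This is a sketch under assumptions: `cached` and `reference` stand in for per-token K/V vectors (stored vs. full-context), the helper names are hypothetical, and the 15% ratio is taken from the description above rather than from cacheblend.py.

```python
import math
from typing import List


def cosine_deviation(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; 0.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally deviated
    return 1.0 - dot / (norm_a * norm_b)


def tokens_to_recompute(cached, reference, ratio: float = 0.15) -> List[int]:
    """Pick the positions whose cached vectors deviate most from the reference."""
    devs = [cosine_deviation(c, r) for c, r in zip(cached, reference)]
    k = max(1, round(ratio * len(devs)))
    ranked = sorted(range(len(devs)), key=lambda i: devs[i], reverse=True)
    return sorted(ranked[:k])


# 20 tokens; positions 3, 11 and 17 are stale in the cache.
reference = [[1.0, 0.0] for _ in range(20)]
cached = [row[:] for row in reference]
for pos in (3, 11, 17):
    cached[pos] = [0.0, 1.0]
print(tokens_to_recompute(cached, reference))  # [3, 11, 17]
```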
Two optional continuity features stack on top of either strategy:
- --overlap-k=16: each chunk re-encodes the last K tokens of the previous chunk, so seam tokens always see K tokens of real preceding context at store time.
- --sink-tokens=32: always keep the first N tokens (the "attention sink") fully fresh, since many attention heads anchor on them.
5. Forward pass
The corrected cached K/V and the live suffix go into a single
model.forward(past_key_values=...) call in
engine.py. Autoregressive decoding then
proceeds normally. After generation, any newly-seen chunks are written
back to the cache so the next request with overlapping text hits without
an explicit warm().
6. Correctness guarantees
Under greedy decoding, the cached-and-corrected path is designed to
produce the argmax-equivalent token at every step, which matches what
the benchmark's cosine = 1.000 columns show on the KV-side logits.
Even so, task accuracy can still drift at high reuse, because "argmax
matches at step 1" does not guarantee "full generation matches": small
K/V perturbations can tilt later tokens onto a different branch. Treat
the accuracy-by-reuse results as the ground truth and the logit-cosine
metric as a necessary but not sufficient check.
Under sampling (temperature > 0), outputs differ run-to-run by construction; the meaningful check is distributional (KL between logit distributions), not token-identity.
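A distributional check of this kind can be sketched with the standard library alone. The logit values below are made up for illustration, and `kl_divergence` is a hypothetical helper, not part of the KVBoost API.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(p_logits: List[float], q_logits: List[float]) -> float:
    """KL(P || Q) between the softmax distributions of two logit vectors."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


baseline_logits = [2.0, 1.0, 0.1]   # illustrative full-recompute logits
cached_logits = [1.9, 1.1, 0.1]     # illustrative cached-path logits

same_argmax = (
    baseline_logits.index(max(baseline_logits))
    == cached_logits.index(max(cached_logits))
)
kl = kl_divergence(baseline_logits, cached_logits)
print(same_argmax, kl > 0.0)  # True True
```

The example shows the gap the paragraph above describes: the top token is identical, yet the two distributions are not, so sampled outputs can diverge.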
Optional: KV quantization
kv_cache_bits=8 quantizes cached tensors (per-channel for K,
per-token for V — the KIVI-paper asymmetry) for ~2× RAM savings with
minimal accuracy loss. kv_cache_bits=4 is available for 4× but you
should validate it with verify_correctness() on your workload before
trusting it.
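The per-channel vs. per-token grouping can be illustrated with a toy min-max quantizer. This is a sketch, not the library's quantizer: the helper names are hypothetical, the matrix is laid out as tokens × channels, and the values are invented so that one channel has a much larger magnitude than the others, which is the kind of outlier pattern that motivates per-channel grouping for K.

```python
def quantize_groups(groups, bits=8):
    """Asymmetric min-max quantization; each group gets its own scale and offset."""
    levels = (1 << bits) - 1
    out = []
    for g in groups:
        lo, hi = min(g), max(g)
        scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
        codes = [round((x - lo) / scale) for x in g]
        out.append((codes, scale, lo))
    return out


def dequantize_groups(qgroups):
    return [[lo + c * scale for c in codes] for codes, scale, lo in qgroups]


def transpose(m):
    return [list(col) for col in zip(*m)]


def roundtrip_per_token(m, bits=8):
    # One group per row (token).
    return dequantize_groups(quantize_groups(m, bits))


def roundtrip_per_channel(m, bits=8):
    # One group per column (channel).
    return transpose(dequantize_groups(quantize_groups(transpose(m), bits)))


def max_abs_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy K matrix: channel 1 is an outlier channel.
K = [[0.10, 50.0, -1.0],
     [0.20, 52.0, -0.5],
     [0.30, 51.0, 0.0]]

err_channel = max_abs_error(K, roundtrip_per_channel(K))
err_token = max_abs_error(K, roundtrip_per_token(K))
print(err_channel < err_token)  # True
```

On this toy input, grouping by channel keeps the outlier channel from stretching the quantization range of the small-valued channels.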
API reference
Minimum surface:
KVBoost.from_pretrained(
    model_name_or_path: str,
    recompute_strategy: Literal["selective", "cacheblend", "none"] = "selective",
    chunk_size: int = 128,
    kv_cache_bits: Optional[Literal[4, 8]] = None,
    device: Optional[str] = None,  # "cuda" | "mps" | "cpu"
    ...
) -> KVBoost
engine.warm(text: str) -> WarmResult
engine.generate(prompt: str, max_new_tokens: int = ..., **kwargs) -> GenerationResult
engine.verify_correctness(prompts: list[str], ...) -> CorrectnessReport
GenerationResult exposes output_text, ttft_ms, total_ms,
kv_reuse_ratio, and the token-level traces used by the benchmarks.
Full docs: kvboost.readthedocs.io
Benchmarks
Results on Qwen/Qwen2.5-3B, 500 bug-localization samples (LongBench, max 6 000 context tokens). Each backend ran in an isolated process for a clean GPU state. Accuracy measured as exact-match on 4-choice multiple-choice questions.
KVBoost config: cacheblend strategy, 1.5 GB cache, recency window 8, boundary window 16, overlap-k 16, sink tokens 32.
Latency — Time to First Token
| Backend | TTFT mean | TTFT p95 | COLD mean | WARM mean | Throughput | vs Baseline |
|---|---|---|---|---|---|---|
| KVBoost | 142 ms | 506 ms | 222 ms | 63 ms | 11.7 tok/s | 4.49× |
| vLLM (prefix cache) | 166 ms | 653 ms | 269 ms | 62 ms | 13.2 tok/s | 3.86× |
| Baseline (HF) | 639 ms | 1 705 ms | 639 ms | 640 ms | 4.7 tok/s | 1.00× |
COLD = first query in a pair (no cached KVs). WARM = second query after the diff prefix is cached from the first.
KVBoost WARM TTFT is 3.5× faster than its own COLD and 10.1× faster than Baseline. Both caching backends reach nearly identical WARM latency (~62–63 ms); KVBoost has a lower overall mean because its COLD path (222 ms) is faster than vLLM's (269 ms) due to chunk-level partial cache hits on first access.
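The two ratios quoted above can be reproduced from the rounded table values. The sketch assumes the baseline reference is its overall mean TTFT (639 ms); using the baseline WARM mean (640 ms) instead would round to 10.2×.

```python
# Mean TTFT values in milliseconds, copied from the latency table above.
kvboost_cold, kvboost_warm = 222.0, 63.0
baseline_mean = 639.0

warm_vs_cold = kvboost_cold / kvboost_warm
warm_vs_baseline = baseline_mean / kvboost_warm
print(f"{warm_vs_cold:.1f}x, {warm_vs_baseline:.1f}x")  # 3.5x, 10.1x
```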
The CDF shows that KVBoost's advantage is consistent across percentiles, not just at the mean — even the p95 warm latency (101 ms) is far below the baseline median (440 ms).
KVBoost's chunk-level partial cache hits let it outperform vLLM on COLD queries at every context-length bucket, because even a first-time request can hit cached chunks from earlier requests with overlapping text.
Accuracy
| Backend | Overall | COLD | WARM | Avg KV reuse (warm) |
|---|---|---|---|---|
| KVBoost | 99.2% | 99.2% | 99.2% | 72.9% |
| vLLM (prefix cache) | 99.1% | 99.4% | 98.8% | — |
| Baseline (HF) | 99.1% | 99.2% | 99.0% | — |
Cold accuracy spread across backends is 0.2 pp, which is consistent with all three backends processing identical inputs. KVBoost WARM accuracy matches COLD exactly (99.2%) despite 72.9% average KV reuse, so the CacheBlend seam repair produces no measurable quality degradation on this benchmark. The accuracy-by-reuse chart confirms this holds even at the 80–100% reuse bucket.
KV Reuse Distribution (KVBoost, warm queries only)
| Reuse bucket | Share of warm queries |
|---|---|
| 80–100% | 49% |
| 60–80% | 25% |
| 40–60% | 16% |
| 20–40% | 10% |
| 0–20% | 0% |
49% of warm queries reuse more than 80% of their diff prefix from cache. Average: 72.9%.
GPU Memory
| Backend | Peak mean | Peak p95 | COLD mean | WARM mean |
|---|---|---|---|---|
| KVBoost | 6 126 MB | 6 495 MB | 6 140 MB | 6 111 MB |
| Baseline (HF) | 6 141 MB | 6 517 MB | 6 140 MB | 6 141 MB |
KVBoost warm queries use ~29 MB less peak memory than cold queries, as cached chunks skip the full prefill activation spike.
vLLM peak memory is managed internally by its engine and is not tracked via torch.cuda.max_memory_allocated.