# SemBlend

Semantic KV cache reuse for LLM inference engines (vLLM, SGLang, TRT-LLM).
SemBlend extends exact-prefix KV caching (vLLM, LMCache, SGLang) with semantic donor discovery. When a prompt is semantically similar to a cached one but lexically different — different instruction phrasing, sentence order, or template fields — SemBlend finds and reuses the cached KV tensors, replacing a multi-second prefill with sub-second KV retrieval.
```
vLLM + LMCache alone:       semantically similar prompt → 0% hit     → full prefill
vLLM + LMCache + SemBlend:  semantically similar prompt → 30–88% hit → reuse donor KV
```
## Performance

Measured on an A10G GPU running Qwen2.5-7B-AWQ with vLLM 0.14.1 + LMCache.

### TTFT speedup vs cold prefill
| Context | Cold TTFT | Hit TTFT | Speedup | Break-even P_hit |
|---|---|---|---|---|
| 4K | 1,859 ms | 801 ms | 2.3x | <1% |
| 8K | 3,193 ms | 817 ms | 3.9x | 4.9% |
| 16K | 5,852 ms | 871 ms | 6.7x | 4.1% |
| 32K | 15,418 ms | 1,288 ms | 12.0x | — |
Hit TTFT stays around 800 ms regardless of context length because it is bounded by KV retrieval, not prefill compute. Miss overhead is 5–212 ms, which is negligible, so SemBlend is net-positive at virtually any nonzero hit rate for contexts ≥ 4K.
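The break-even column can be reproduced from a simple expected-cost model (our assumption about how it was derived, not stated above): with hit rate p, expected TTFT is p·hit + (1−p)·(cold + overhead), which beats a cold prefill once p exceeds overhead / (cold − hit + overhead). A sketch:

```python
def break_even_hit_rate(cold_ttft_ms, hit_ttft_ms, miss_overhead_ms):
    """Minimum hit rate at which semantic reuse beats always cold-prefilling.

    Expected TTFT with reuse = p*hit + (1-p)*(cold + overhead);
    solving "expected TTFT < cold" for p gives the threshold below.
    """
    return miss_overhead_ms / (cold_ttft_ms - hit_ttft_ms + miss_overhead_ms)

# 4K row from the table, assuming a ~10 ms miss overhead on that run
p = break_even_hit_rate(1859, 801, 10)
```

Plugging in the 8K row with an overhead near the top of the measured 5–212 ms range yields roughly the 4.9% shown in the table, so the model is at least consistent with the published numbers.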
### Hit rates on real workloads
| Workload | Hit Rate | Hit-only Speedup |
|---|---|---|
| WildChat-1M short prompts (≥4K) | 29.2% | 1.63x |
| WildChat-1M long prompts (≥8K) | 30.0% | 1.88x |
| Summarization (CNN/DM, SAMSum) | 50–88% | 2.3–2.4x |
| Multi-turn dialogue (turn 2+) | 99.5% | 5.1x |
| Cross-instruction RAG (8K) | 100% | 3.3–3.7x |
| Code generation (dissimilar) | 0% | 0.96x |
Hit rate scales with semantic similarity: 17% at cos≥0.50 → 60% at cos≥0.90.
## Quality
RoPE position correction keeps output quality near baseline:
| Dataset | PPL ratio (SemBlend / cold) |
|---|---|
| CNN/DailyMail | 1.006 |
| WikiHow | 1.012 |
| XSum | 1.025 |
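The correction itself is the standard RoPE identity: a key cached at donor position p can be moved to position p′ by rotating each (even, odd) channel pair by the angle difference. A minimal per-pair sketch (base 10000 and the pairwise layout follow the usual RoPE convention; this is our illustration, not SemBlend's in-place tensor kernel):

```python
import math

def rope_correct(k_pair, old_pos, new_pos, dim_index, head_dim):
    # RoPE rotates pair i at position p by angle p * theta_i, where
    # theta_i = 10000 ** (-2*i / head_dim). Moving a cached key from
    # old_pos to new_pos therefore needs one extra rotation by the delta.
    theta = 10000.0 ** (-2.0 * dim_index / head_dim)
    delta = (new_pos - old_pos) * theta
    c, s = math.cos(delta), math.sin(delta)
    x, y = k_pair
    return (x * c - y * s, x * s + y * c)
```

Because the correction is a pure rotation, moving a key forward and then back recovers the original values exactly, which is what keeps the perplexity ratios above so close to 1.0.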
See the paper for full benchmark details.
## Installation

```shell
pip install semblend              # CPU-only core (numpy + rapidfuzz)
pip install "semblend[vllm]"      # + vLLM/LMCache integration
pip install "semblend[sglang]"    # + SGLang integration
pip install "semblend[embedder]"  # + sentence-transformers (MiniLM GPU)
```
## Quick Start: vLLM + LMCache

Integrates via LMCache's `KVConnectorBase_V1` — no patching required.

```shell
pip install "semblend[vllm]" vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'
```
## Quick Start: SGLang

```shell
pip install "semblend[sglang]" sglang

# CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
```

Or programmatically — call before SGLang initializes:

```python
from semblend.integration.sglang.radix_patcher import patch_radix_cache

patch_radix_cache()
# ... start SGLang server ...
```
A first-class SemanticPrefixProvider interface (no patching) is in progress upstream.
## Configuration

| Variable | Default | Description |
|---|---|---|
| `SEMBLEND_ENABLED` | `1` | Enable semantic donor search |
| `SEMBLEND_MIN_SIMILARITY` | `0.60` | Cosine similarity threshold |
| `SEMBLEND_EMBEDDER` | `minilm` | `minilm` · `jaccard` · `onnx_gpu` |
| `SEMBLEND_FUZZY_CHUNKS` | `0` | Fuzzy chunk matching for shifted prefixes |
## How It Works

```
Request → Embed (5 ms) → Search (1 ms) → Align (1 ms) → Inject KV
             ↓               ↓               ↓
         MiniLM-L6-v2    cosine search    MD5 chunk hash
         384-dim         donor store      256-token boundary
```
- **Embed** — 384-dim MiniLM-L6-v2 embedding; sliding-window sampling for long prompts
- **Search** — brute-force cosine similarity against the donor store (<1 ms at 1K donors)
- **Align** — MD5 chunk hashing finds reusable 256-token KV chunks; optional fuzzy matching handles shifted boundaries
- **Inject** — donor token IDs are substituted into the request; LMCache/RadixCache retrieves the cached KV; RoPE correction is applied in-place to the K tensors
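The embed/search/align steps above can be sketched end-to-end. This is a toy illustration of the mechanics: the hash-based "embedding" stands in for MiniLM, and real KV injection happens inside the serving engine, but the brute-force cosine search and MD5 chunk alignment mirror the steps described.

```python
import hashlib
import math

def embed(text, dim=16):
    # Toy stand-in for the 384-dim MiniLM embedding: hash character
    # trigrams into a fixed-size bag-of-ngrams vector, then L2-normalize.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def find_donor(query, donor_store, min_sim=0.60):
    # Brute-force cosine search over the donor store (fine at ~1K donors).
    q = embed(query)
    best, best_sim = None, min_sim
    for donor in donor_store:
        sim = sum(a * b for a, b in zip(q, donor["vec"]))
        if sim > best_sim:
            best, best_sim = donor, sim
    return best

def reusable_chunks(query_tokens, donor_tokens, chunk=256):
    # MD5 hash per fixed-size token chunk; equal hashes mark KV chunks
    # that can be reused from the donor verbatim.
    def hashes(tokens):
        return [hashlib.md5(repr(tokens[i:i + chunk]).encode()).hexdigest()
                for i in range(0, len(tokens), chunk)]
    return [i for i, (a, b) in enumerate(zip(hashes(query_tokens),
                                             hashes(donor_tokens))) if a == b]

# Register a donor, then match a lexically different but similar prompt.
donor_text = "Summarize the following article about KV caching in detail."
store = [{"vec": embed(donor_text), "tokens": list(range(512))}]
hit = find_donor("Please summarize the following article about KV caching.", store)
```

With a hit in hand, the matched chunk indices tell the engine which donor KV blocks to fetch instead of prefilling; everything past the last matching chunk is prefilled normally.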
## When SemBlend Helps
Most effective when prompts share a large common context:
- Document Q&A / RAG — same retrieved passages, different questions
- Summarization — same article, different instruction phrasing
- Multi-turn dialogue — conversation history prefix reused across turns
- Code completion — shared repository context across requests
Dissimilar workloads (code generation from scratch, fully novel queries) see roughly 4% overhead at a 0% hit rate — negligible in practice.
## Contributing

See CONTRIBUTING.md.
## License

Built at WorldFlow AI. For enterprise support, contact research@worldflowai.com.