
SemBlend


Semantic KV cache reuse for LLM inference engines.

SemBlend extends exact-prefix KV caching (vLLM, LMCache, SGLang) with semantic donor discovery. When a prompt is semantically similar to a cached one but lexically different — different instruction phrasing, sentence order, or template fields — SemBlend finds and reuses the cached KV tensors, replacing a multi-second prefill with sub-second KV retrieval.

vLLM + LMCache alone:        semantically similar prompt  →  0% hit       →  full prefill
vLLM + LMCache + SemBlend:   semantically similar prompt  →  83–100% hit  →  reuse donor KV

Performance

Measured on A10G GPU, Qwen2.5-7B-AWQ, vLLM 0.14.1 + LMCache.

TTFT speedup vs cold prefill

Context   Cold TTFT   Hit TTFT   Speedup   Break-even P_hit
4K        1,859 ms    801 ms     2.3x      <1%
8K        3,193 ms    817 ms     3.9x      4.9%
16K       5,852 ms    871 ms     6.7x      4.1%
32K       15,418 ms   1,288 ms   12.0x

Hit TTFT is ~800ms regardless of context length — bounded by KV retrieval, not prefill. Miss overhead is 5–212ms (negligible). SemBlend is net-positive at virtually any nonzero hit rate for contexts ≥ 4K.
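The break-even figures above follow from a simple expected-TTFT model: hits pay only KV retrieval, misses pay the full prefill plus the semantic-lookup overhead. A minimal sketch of that arithmetic (illustrative only, not SemBlend code; numbers taken from the 8K row of the table, with miss overhead at its 212 ms upper bound):

```python
def expected_ttft(p_hit, cold_ms, hit_ms, miss_overhead_ms):
    """Expected TTFT at hit probability p_hit: hits cost the KV-retrieval
    TTFT, misses cost the cold prefill plus the lookup overhead."""
    return p_hit * hit_ms + (1 - p_hit) * (cold_ms + miss_overhead_ms)

def break_even_p(cold_ms, hit_ms, miss_overhead_ms):
    """Smallest hit rate at which expected TTFT drops below cold prefill."""
    return miss_overhead_ms / (cold_ms + miss_overhead_ms - hit_ms)

# 8K-context numbers from the table above, worst-case miss overhead
p = break_even_p(cold_ms=3193, hit_ms=817, miss_overhead_ms=212)
```

The exact break-even value depends on the per-context miss overhead actually measured, which is why the table's figures differ across context lengths.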

Hit rates on real workloads

Workload                            Hit Rate   Hit-only Speedup
WildChat-1M conversations (≥4K)     82.7%      1.69x
Summarization (CNN/DM, SAMSum)      50–88%     2.3–2.4x
Multi-turn dialogue (turn 2+)       99.5%      5.1x
Cross-instruction RAG (8K)          100%       3.3x
Cross-instruction RAG (16K)         100%       5.3x
Code generation (dissimilar)        0%         0.96x

Full-document segmented GPU embedding (v0.2.0) achieves 100% coverage of the prompt regardless of length, enabling 82.7% hit rate on real WildChat conversations (up from 29% with sparse sampling).
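The segmented-embedding idea can be sketched as follows. This is a simplified illustration, not SemBlend's implementation: `embed_window` stands in for the real MiniLM encoder, and the window/stride sizes are assumptions chosen to mirror the 256-token windows described under "How It Works":

```python
import numpy as np

def segment_and_pool(token_ids, embed_window, window=256, stride=128):
    """Split a long prompt into overlapping token windows, embed each,
    and mean-pool into one document vector covering the full prompt."""
    if len(token_ids) <= window:
        windows = [token_ids]
    else:
        windows = [token_ids[i:i + window]
                   for i in range(0, len(token_ids) - window + stride, stride)]
    vecs = np.stack([embed_window(w) for w in windows])  # (n_windows, dim)
    pooled = vecs.mean(axis=0)                           # mean-pool
    return pooled / np.linalg.norm(pooled)               # unit-norm for cosine search
```

Because every token lands in at least one window, the pooled vector reflects the whole prompt, unlike sparse sampling, which can miss the part of the prompt that actually matches a donor.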

Quality

RoPE position correction keeps output quality near baseline:

Dataset         PPL ratio (SemBlend / cold)
CNN/DailyMail   1.006
WikiHow         1.012
XSum            1.025

See the paper for full benchmark details.

Installation

pip install semblend            # CPU-only core (numpy + rapidfuzz)
pip install semblend[vllm]      # + vLLM/LMCache integration
pip install semblend[sglang]    # + SGLang integration
pip install semblend[embedder]  # + sentence-transformers (MiniLM GPU)

Quick Start: vLLM + LMCache

Integrates via LMCache's KVConnectorBase_V1 — no patching required.

pip install semblend[vllm] vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'

CacheBlend support: For selective layer recomputation (CacheBlend), vLLM must expose the loaded model to KV connectors via initialize_worker_connector(). This is available in vLLM builds that include PR #37339. Without it, SemBlend's semantic matching and KV injection still work — only CacheBlend's per-layer recomputation is unavailable.

Quick Start: SGLang

pip install semblend[sglang] sglang

# CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000

Or programmatically — call before SGLang initializes:

from semblend.integration.sglang.radix_patcher import patch_radix_cache
patch_radix_cache()
# ... start SGLang server ...

A first-class SemanticPrefixProvider interface (no patching) is in progress upstream.

Configuration

Variable                  Default   Description
SEMBLEND_ENABLED          1         Enable semantic donor search
SEMBLEND_MIN_SIMILARITY   0.60      Cosine similarity threshold
SEMBLEND_EMBEDDER         minilm    minilm (auto GPU) · onnx_gpu
SEMBLEND_FUZZY_CHUNKS     0         Fuzzy chunk matching for shifted prefixes
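These are ordinary environment variables, set in the serving environment before launch. The values below are illustrative, not recommended defaults:

```shell
# Tune semantic matching before starting the server (illustrative values)
export SEMBLEND_ENABLED=1
export SEMBLEND_MIN_SIMILARITY=0.70   # stricter than the 0.60 default
export SEMBLEND_EMBEDDER=minilm       # MiniLM with automatic GPU use
export SEMBLEND_FUZZY_CHUNKS=1        # tolerate shifted chunk boundaries
```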

How It Works

Request → Embed (2–15ms) → Search (1ms) → Align (1ms) → Inject KV
              ↓                 ↓              ↓
         MiniLM-L6-v2    cosine search   MD5 chunk hash
         GPU (ONNX RT)   donor store     256-token boundary
         segmented pool
  1. Embed — full-document segmented embedding on GPU via ONNX Runtime. Long prompts are split into overlapping 256-token windows, embedded in parallel, and mean-pooled into a single vector, giving 100% content coverage at any prompt length (~2ms short, ~10ms at 8K, ~15ms at 32K).
  2. Search — brute-force cosine similarity against the donor store (<1ms at 1K donors; CAGRA GPU ANN for larger pools)
  3. Align — MD5 chunk hashing finds reusable 256-token KV chunks; optional fuzzy matching handles shifted boundaries
  4. Inject — donor token IDs substituted into the request; LMCache/RadixCache retrieves cached KV; RoPE correction applied in-place on K tensors
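The Align step (exact matching, without the optional fuzzy mode) can be sketched like this. A simplified illustration under stated assumptions: `chunk_hashes` and `reusable_chunks` are hypothetical names, and the real implementation's serialization and boundary handling may differ:

```python
import hashlib

CHUNK = 256  # tokens per chunk, matching the 256-token KV boundary above

def chunk_hashes(token_ids):
    """MD5-hash each full 256-token chunk so identical spans can be
    matched between a new request and a cached donor."""
    return [hashlib.md5(str(token_ids[i:i + CHUNK]).encode()).hexdigest()
            for i in range(0, len(token_ids) - CHUNK + 1, CHUNK)]

def reusable_chunks(request_ids, donor_ids):
    """Indices of request chunks whose hash matches the donor chunk at
    the same position; the KV for these chunks can be reused as-is."""
    req, don = chunk_hashes(request_ids), chunk_hashes(donor_ids)
    return [i for i, (a, b) in enumerate(zip(req, don)) if a == b]
```

Position-wise exact matching is why fuzzy chunk matching exists as an option: a single inserted token shifts every later chunk boundary and breaks all downstream hashes.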

When SemBlend Helps

Most effective when prompts share a large common context:

  • Document Q&A / RAG — same retrieved passages, different questions
  • Summarization — same article, different instruction phrasing
  • Multi-turn dialogue — conversation history prefix reused across turns
  • Code completion — shared repository context across requests

Dissimilar workloads (code generation from scratch, fully novel queries) see ~4% overhead with 0% hit — negligible in practice.

Contributing

See CONTRIBUTING.md.

License

Apache License 2.0.

Built at WorldFlow AI. For enterprise support contact research@worldflowai.com.
