
Semantic KV cache reuse for LLM inference engines (vLLM, SGLang, TRT-LLM)

Project description

SemBlend


Semantic KV cache reuse for LLM inference engines.

SemBlend extends exact-prefix KV caching (vLLM, LMCache, SGLang) with semantic donor discovery. When a prompt is semantically similar to a cached one but lexically different — different instruction phrasing, sentence order, or template fields — SemBlend finds and reuses the cached KV tensors, replacing a multi-second prefill with sub-second KV retrieval.

vLLM + LMCache alone:       semantically similar prompt  →    0% hit     →  full prefill
vLLM + LMCache + SemBlend:  semantically similar prompt  →  30–88% hit   →  reuse donor KV

Performance

Measured on A10G GPU, Qwen2.5-7B-AWQ, vLLM 0.14.1 + LMCache.

TTFT speedup vs cold prefill

Context   Cold TTFT   Hit TTFT   Speedup   Break-even P_hit
4K         1,859 ms     801 ms      2.3x    <1%
8K         3,193 ms     817 ms      3.9x    4.9%
16K        5,852 ms     871 ms      6.7x    4.1%
32K       15,418 ms   1,288 ms     12.0x    —

Hit TTFT is bounded by KV retrieval rather than prefill, so it stays nearly flat as context grows (~800 ms through 16K, ~1.3 s at 32K) while cold TTFT grows with context length. Miss overhead is only 5–212 ms. SemBlend is therefore net-positive whenever the hit rate clears the break-even P_hit, which is below 5% at every tested context length and below 1% at 4K.
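The break-even column follows from a simple expected-TTFT model. A sketch, where the per-request miss overhead is an assumed value chosen from the 5–212 ms range above, not a measured constant:

```python
def break_even_hit_rate(cold_ttft_ms: float, hit_ttft_ms: float,
                        miss_overhead_ms: float) -> float:
    """Hit rate p* at which expected TTFT with semantic lookup equals cold prefill.

    Expected TTFT = p * hit + (1 - p) * (cold + miss_overhead).
    Setting this equal to cold and solving for p gives:
        p* = miss_overhead / (cold - hit + miss_overhead)
    """
    return miss_overhead_ms / (cold_ttft_ms - hit_ttft_ms + miss_overhead_ms)

# 8K row from the table, assuming ~120 ms miss overhead (hypothetical value):
p_star = break_even_hit_rate(3193, 817, 120)  # ≈ 0.048, close to the 4.9% reported
```

Because the hit/cold gap widens with context length, the break-even rate shrinks as contexts grow.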

Hit rates on real workloads

Workload                           Hit Rate   Hit-only Speedup
WildChat-1M short prompts (≥4K)    29.2%      1.63x
WildChat-1M long prompts (≥8K)     30.0%      1.88x
Summarization (CNN/DM, SAMSum)     50–88%     2.3–2.4x
Multi-turn dialogue (turn 2+)      99.5%      5.1x
Cross-instruction RAG (8K)         100%       3.3–3.7x
Code generation (dissimilar)       0%         0.96x

Hit rate scales with the semantic similarity of the prompt pair: 17% for pairs at cosine similarity ≥ 0.50, rising to 60% at ≥ 0.90.

Quality

RoPE position correction keeps output quality near baseline:

Dataset          PPL ratio (SemBlend / cold)
CNN/DailyMail    1.006
WikiHow          1.012
XSum             1.025
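The correction re-rotates cached keys for their new positions: because RoPE is a pure per-position rotation, a donor key computed at position p can be moved to position p + delta with one extra rotation, without redoing prefill. A minimal pure-Python sketch of the idea (assuming the interleaved-pair RoPE layout; SemBlend's actual in-place correction on real K tensors will differ):

```python
import math

def rope_shift_keys(keys, delta, base=10000.0):
    """Re-rotate cached RoPE keys by a position offset `delta` (sketch).

    keys: list of key vectors, each of even length d. RoPE rotates each
    (even, odd) dimension pair by angle pos * theta_i, so shifting a cached
    key from position p to p + delta is a rotation by delta * theta_i.
    """
    out = []
    for vec in keys:
        d = len(vec)
        new = list(vec)
        for i in range(0, d, 2):
            theta = base ** (-i / d)          # per-pair frequency
            ang = delta * theta
            c, s = math.cos(ang), math.sin(ang)
            x, y = vec[i], vec[i + 1]
            new[i] = x * c - y * s            # 2D rotation of the pair
            new[i + 1] = x * s + y * c
        out.append(new)
    return out
```

Since rotations compose additively, shifting by +3 and then -3 recovers the original keys exactly, which is what makes the correction safe to apply in place.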

See the paper for full benchmark details.

Installation

pip install semblend            # CPU-only core (numpy + rapidfuzz)
pip install semblend[vllm]      # + vLLM/LMCache integration
pip install semblend[sglang]    # + SGLang integration
pip install semblend[embedder]  # + sentence-transformers (MiniLM GPU)

Quick Start: vLLM + LMCache

Integrates via LMCache's KVConnectorBase_V1 — no patching required.

pip install semblend[vllm] vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'

Quick Start: SGLang

pip install semblend[sglang] sglang

# CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000

Or programmatically — call before SGLang initializes:

from semblend.integration.sglang.radix_patcher import patch_radix_cache
patch_radix_cache()
# ... start SGLang server ...

A first-class SemanticPrefixProvider interface (no patching) is in progress upstream.

Configuration

Variable                  Default   Description
SEMBLEND_ENABLED          1         Enable semantic donor search
SEMBLEND_MIN_SIMILARITY   0.60      Cosine similarity threshold for accepting a donor
SEMBLEND_EMBEDDER         minilm    Embedding backend: minilm, jaccard, or onnx_gpu
SEMBLEND_FUZZY_CHUNKS     0         Enable fuzzy chunk matching for shifted prefixes
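All settings are plain environment variables read at startup. For example, to tighten the donor threshold and enable fuzzy matching before launching (the 0.75 value is illustrative, not a recommendation):

```shell
# Stricter donor acceptance plus fuzzy chunk alignment,
# then launch as in the SGLang Quick Start above.
export SEMBLEND_MIN_SIMILARITY=0.75
export SEMBLEND_FUZZY_CHUNKS=1
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
```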

How It Works

Request → Embed (5ms) → Search (1ms) → Align (1ms) → Inject KV
              ↓              ↓              ↓
         MiniLM-L6-v2   cosine search   MD5 chunk hash
         384-dim        donor store     256-token boundary
  1. Embed — 384-dim MiniLM-L6-v2 embedding; sliding-window sampling for long prompts
  2. Search — brute-force cosine similarity against the donor store (<1ms at 1K donors)
  3. Align — MD5 chunk hashing finds reusable 256-token KV chunks; optional fuzzy matching handles shifted boundaries
  4. Inject — donor token IDs substituted into the request; LMCache/RadixCache retrieves cached KV; RoPE correction applied in-place on K tensors
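Steps 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration of brute-force cosine search and MD5 chunk alignment, not SemBlend's actual code; the 0.60 default threshold and 256-token chunk size mirror the description above:

```python
import hashlib

CHUNK = 256  # KV chunk size in tokens, as in the alignment step above

def chunk_hashes(token_ids):
    """MD5 hash per 256-token chunk; equal hashes mean reusable KV chunks."""
    return [hashlib.md5(repr(token_ids[i:i + CHUNK]).encode()).hexdigest()
            for i in range(0, len(token_ids), CHUNK)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def find_donor(query_emb, donors, threshold=0.60):
    """Brute-force cosine search over the donor store (step 2)."""
    best_id, best_sim = None, -1.0
    for donor_id, emb in donors.items():
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_id, best_sim = donor_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)

def reusable_prefix_chunks(query_tokens, donor_tokens):
    """Count leading chunks whose hashes match (step 3, exact alignment)."""
    count = 0
    for qh, dh in zip(chunk_hashes(query_tokens), chunk_hashes(donor_tokens)):
        if qh != dh:
            break
        count += 1
    return count
```

A linear scan is enough here because the store holds on the order of a thousand donors; at larger scales an ANN index would replace the loop.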

When SemBlend Helps

Most effective when prompts share a large common context:

  • Document Q&A / RAG — same retrieved passages, different questions
  • Summarization — same article, different instruction phrasing
  • Multi-turn dialogue — conversation history prefix reused across turns
  • Code completion — shared repository context across requests

Dissimilar workloads (code generation from scratch, fully novel queries) see ~4% overhead with 0% hit — negligible in practice.

Contributing

See CONTRIBUTING.md.

License

Apache License 2.0.

Built at WorldFlow AI. For enterprise support contact research@worldflowai.com.

Download files

Download the file for your platform.

Source Distribution

semblend-0.2.0.tar.gz (113.4 kB, source)

Built Distribution


semblend-0.2.0-py3-none-any.whl (128.5 kB, Python 3 wheel)

File details

Details for the file semblend-0.2.0.tar.gz.

File metadata

  • Size: 113.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for semblend-0.2.0.tar.gz
Algorithm Hash digest
SHA256 b8ce32f5add5a4601082846b960762d4f9afdedb50de96ff2d83041889457f45
MD5 1bc8c8cf9324b2335dd5eccd7ce34622
BLAKE2b-256 010821e86dd50d5585941d315f37032f8d232987e21d5410cb57d73317c74854


Provenance

The following attestation bundles were made for semblend-0.2.0.tar.gz:

Publisher: publish.yml on WorldFlowAI/semblend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file semblend-0.2.0-py3-none-any.whl.

File metadata

  • Size: 128.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for semblend-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fba0edd41b243915e409f5b69edcd768a0f409ec0f4fdaf5e04b7adadffb6f5e
MD5 64b811ea11c29ed7f84385c54f1237be
BLAKE2b-256 cab475d80fe3d733c0b4f67426876a216f3b23395622ef248a7d7a3f432c594e


Provenance

The following attestation bundles were made for semblend-0.2.0-py3-none-any.whl:

Publisher: publish.yml on WorldFlowAI/semblend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
