
Semantic KV cache reuse for LLM inference engines (vLLM, SGLang, TRT-LLM)

Project description

SemBlend


Semantic KV cache reuse for LLM inference engines.

SemBlend extends exact-prefix KV caching (vLLM, LMCache, SGLang) with semantic donor discovery. When a prompt is semantically similar to a cached one but lexically different — different instruction phrasing, sentence order, or template fields — SemBlend finds and reuses the cached KV tensors, replacing a multi-second prefill with sub-second KV retrieval.

vLLM + LMCache alone:       semantically similar prompt  →    0% hit     →  full prefill
vLLM + LMCache + SemBlend:  semantically similar prompt  →  30–88% hit   →  reuse donor KV

Performance

Measured on A10G GPU, Qwen2.5-7B-AWQ, vLLM 0.14.1 + LMCache.

TTFT speedup vs cold prefill

Context   Cold TTFT   Hit TTFT   Speedup   Break-even P_hit
4K         1,859 ms     801 ms      2.3x    <1%
8K         3,193 ms     817 ms      3.9x    4.9%
16K        5,852 ms     871 ms      6.7x    4.1%
32K       15,418 ms   1,288 ms     12.0x    —

Hit TTFT is bounded by KV retrieval rather than prefill, so it stays nearly flat as context grows (~800 ms through 16K, ~1.3 s at 32K) while cold TTFT grows with context length. Miss overhead is only 5–212 ms. SemBlend is therefore net-positive whenever the hit rate clears the break-even P_hit, which is below 5% at every tested context length and below 1% at 4K.
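The break-even column follows from a simple expected-TTFT model. A sketch, where the per-request miss overhead is an assumed value chosen from the 5–212 ms range above, not a measured constant:

```python
def break_even_hit_rate(cold_ttft_ms: float, hit_ttft_ms: float,
                        miss_overhead_ms: float) -> float:
    """Hit rate p* at which expected TTFT with semantic lookup equals cold prefill.

    Expected TTFT = p * hit + (1 - p) * (cold + miss_overhead).
    Setting this equal to cold and solving for p gives:
        p* = miss_overhead / (cold - hit + miss_overhead)
    """
    return miss_overhead_ms / (cold_ttft_ms - hit_ttft_ms + miss_overhead_ms)

# 8K row from the table, assuming ~120 ms miss overhead (hypothetical value):
p_star = break_even_hit_rate(3193, 817, 120)  # ≈ 0.048, close to the 4.9% reported
```

Because the hit/cold gap widens with context length, the break-even rate shrinks as contexts grow.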

Hit rates on real workloads

Workload                           Hit Rate   Hit-only Speedup
WildChat-1M short prompts (≥4K)    29.2%      1.63x
WildChat-1M long prompts (≥8K)     30.0%      1.88x
Summarization (CNN/DM, SAMSum)     50–88%     2.3–2.4x
Multi-turn dialogue (turn 2+)      99.5%      5.1x
Cross-instruction RAG (8K)         100%       3.3–3.7x
Code generation (dissimilar)       0%         0.96x

Hit rate scales with the semantic similarity of the prompt pair: 17% for pairs at cosine similarity ≥ 0.50, rising to 60% at ≥ 0.90.

Quality

RoPE position correction keeps output quality near baseline:

Dataset          PPL ratio (SemBlend / cold)
CNN/DailyMail    1.006
WikiHow          1.012
XSum             1.025
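The correction re-rotates cached keys for their new positions: because RoPE is a pure per-position rotation, a donor key computed at position p can be moved to position p + delta with one extra rotation, without redoing prefill. A minimal pure-Python sketch of the idea (assuming the interleaved-pair RoPE layout; SemBlend's actual in-place correction on real K tensors will differ):

```python
import math

def rope_shift_keys(keys, delta, base=10000.0):
    """Re-rotate cached RoPE keys by a position offset `delta` (sketch).

    keys: list of key vectors, each of even length d. RoPE rotates each
    (even, odd) dimension pair by angle pos * theta_i, so shifting a cached
    key from position p to p + delta is a rotation by delta * theta_i.
    """
    out = []
    for vec in keys:
        d = len(vec)
        new = list(vec)
        for i in range(0, d, 2):
            theta = base ** (-i / d)          # per-pair frequency
            ang = delta * theta
            c, s = math.cos(ang), math.sin(ang)
            x, y = vec[i], vec[i + 1]
            new[i] = x * c - y * s            # 2D rotation of the pair
            new[i + 1] = x * s + y * c
        out.append(new)
    return out
```

Since rotations compose additively, shifting by +3 and then -3 recovers the original keys exactly, which is what makes the correction safe to apply in place.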

See the paper for full benchmark details.

Installation

pip install semblend            # CPU-only core (numpy + rapidfuzz)
pip install semblend[vllm]      # + vLLM/LMCache integration
pip install semblend[sglang]    # + SGLang integration
pip install semblend[embedder]  # + sentence-transformers (MiniLM GPU)

Quick Start: vLLM + LMCache

Integrates via LMCache's KVConnectorBase_V1 — no patching required.

pip install semblend[vllm] vllm lmcache

vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
  --kv-transfer-config '{
    "kv_connector": "SemBlendConnectorV1",
    "kv_connector_module_path": "semblend.integration.vllm.connector_v1",
    "kv_role": "kv_both"
  }'

Quick Start: SGLang

pip install semblend[sglang] sglang

# CLI launcher — applies the RadixCache patch automatically
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000

Or programmatically — call before SGLang initializes:

from semblend.integration.sglang.radix_patcher import patch_radix_cache
patch_radix_cache()
# ... start SGLang server ...

A first-class SemanticPrefixProvider interface (no patching) is in progress upstream.

Configuration

Variable                  Default   Description
SEMBLEND_ENABLED          1         Enable semantic donor search
SEMBLEND_MIN_SIMILARITY   0.60      Cosine similarity threshold for accepting a donor
SEMBLEND_EMBEDDER         minilm    Embedding backend: minilm, jaccard, or onnx_gpu
SEMBLEND_FUZZY_CHUNKS     0         Enable fuzzy chunk matching for shifted prefixes
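All settings are plain environment variables read at startup. For example, to tighten the donor threshold and enable fuzzy matching before launching (the 0.75 value is illustrative, not a recommendation):

```shell
# Stricter donor acceptance plus fuzzy chunk alignment,
# then launch as in the SGLang Quick Start above.
export SEMBLEND_MIN_SIMILARITY=0.75
export SEMBLEND_FUZZY_CHUNKS=1
semblend-sglang --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
```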

How It Works

Request → Embed (5ms) → Search (1ms) → Align (1ms) → Inject KV
              ↓              ↓              ↓
         MiniLM-L6-v2   cosine search   MD5 chunk hash
         384-dim        donor store     256-token boundary
  1. Embed — 384-dim MiniLM-L6-v2 embedding; sliding-window sampling for long prompts
  2. Search — brute-force cosine similarity against the donor store (<1ms at 1K donors)
  3. Align — MD5 chunk hashing finds reusable 256-token KV chunks; optional fuzzy matching handles shifted boundaries
  4. Inject — donor token IDs substituted into the request; LMCache/RadixCache retrieves cached KV; RoPE correction applied in-place on K tensors
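Steps 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration of brute-force cosine search and MD5 chunk alignment, not SemBlend's actual code; the 0.60 default threshold and 256-token chunk size mirror the description above:

```python
import hashlib

CHUNK = 256  # KV chunk size in tokens, as in the alignment step above

def chunk_hashes(token_ids):
    """MD5 hash per 256-token chunk; equal hashes mean reusable KV chunks."""
    return [hashlib.md5(repr(token_ids[i:i + CHUNK]).encode()).hexdigest()
            for i in range(0, len(token_ids), CHUNK)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def find_donor(query_emb, donors, threshold=0.60):
    """Brute-force cosine search over the donor store (step 2)."""
    best_id, best_sim = None, -1.0
    for donor_id, emb in donors.items():
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_id, best_sim = donor_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)

def reusable_prefix_chunks(query_tokens, donor_tokens):
    """Count leading chunks whose hashes match (step 3, exact alignment)."""
    count = 0
    for qh, dh in zip(chunk_hashes(query_tokens), chunk_hashes(donor_tokens)):
        if qh != dh:
            break
        count += 1
    return count
```

A linear scan is enough here because the store holds on the order of a thousand donors; at larger scales an ANN index would replace the loop.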

When SemBlend Helps

Most effective when prompts share a large common context:

  • Document Q&A / RAG — same retrieved passages, different questions
  • Summarization — same article, different instruction phrasing
  • Multi-turn dialogue — conversation history prefix reused across turns
  • Code completion — shared repository context across requests

Dissimilar workloads (code generation from scratch, fully novel queries) see ~4% overhead with 0% hit — negligible in practice.

Contributing

See CONTRIBUTING.md.

License

Apache License 2.0.

Built at WorldFlow AI. For enterprise support contact research@worldflowai.com.

Download files

Download the file for your platform.

Source Distribution

semblend-0.2.0.tar.gz (113.4 kB, source)

Built Distribution


semblend-0.2.0-py3-none-any.whl (128.5 kB, Python 3 wheel)

File details

Details for the file semblend-0.2.0.tar.gz.

File metadata

  • Size: 113.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for semblend-0.2.0.tar.gz
Algorithm Hash digest
SHA256 b8ce32f5add5a4601082846b960762d4f9afdedb50de96ff2d83041889457f45
MD5 1bc8c8cf9324b2335dd5eccd7ce34622
BLAKE2b-256 010821e86dd50d5585941d315f37032f8d232987e21d5410cb57d73317c74854


Provenance

The following attestation bundles were made for semblend-0.2.0.tar.gz:

Publisher: publish.yml on WorldFlowAI/semblend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file semblend-0.2.0-py3-none-any.whl.

File metadata

  • Size: 128.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for semblend-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fba0edd41b243915e409f5b69edcd768a0f409ec0f4fdaf5e04b7adadffb6f5e
MD5 64b811ea11c29ed7f84385c54f1237be
BLAKE2b-256 cab475d80fe3d733c0b4f67426876a216f3b23395622ef248a7d7a3f432c594e


Provenance

The following attestation bundles were made for semblend-0.2.0-py3-none-any.whl:

Publisher: publish.yml on WorldFlowAI/semblend

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
