
kvwarden


Tenant-fair LLM inference on one GPU. Sits in front of vLLM/SGLang, rate-limits per tenant at admission, and keeps a quiet user fast while a noisy neighbor floods the same engine.


Hero number. A100-SXM4, Llama-3.1-8B, vLLM 0.19.1, two tenants sharing one engine, 300 s sustained:

| Scenario | Quiet user TTFT p99 |
|---|---|
| Solo (no contention) | 53.9 ms |
| FIFO under flooder (no rate limit) | 1,585 ms (29x starvation) |
| kvwarden token bucket under flooder | 61.5 ms (1.14x solo) |

Ten lines of YAML. No application code change. Raw artifacts: results/gate2_preprint_v3/.

```bash
pip install kvwarden
```

Quickstart

kvwarden does not bundle vLLM. Install an engine separately, then let kvwarden spawn and proxy it.

```bash
# 1. Install kvwarden + the engine.
pip install kvwarden
pip install vllm  # needs a GPU box; see vLLM docs for your CUDA stack

# 2. Start kvwarden. It launches vLLM as a subprocess per the model list
#    in the config and exposes one OpenAI-compatible endpoint.
kvwarden serve --config configs/quickstart_fairness.yaml

# 3. In another shell, wait for /health and send two tenants at the same model.
until curl -fs localhost:8000/health > /dev/null; do sleep 2; done

curl -N localhost:8000/v1/completions \
  -H "Content-Type: application/json" -H "X-Tenant-ID: noisy" \
  -d '{"model":"llama31-8b","prompt":"Hello","max_tokens":64,"stream":true}'
curl -N localhost:8000/v1/completions \
  -H "Content-Type: application/json" -H "X-Tenant-ID: quiet" \
  -d '{"model":"llama31-8b","prompt":"Hello","max_tokens":64,"stream":true}'

# 4. Watch the token bucket fire and the engine queue stay composed.
curl -s localhost:8000/metrics | grep -E "tenant_rejected|admission_queue_depth"
```

The first call returns 503 until vLLM finishes loading (30-90 s on an A100 for an 8B model). The config at configs/quickstart_fairness.yaml is heavily commented; every knob traces back to a specific experiment.
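For a feel of the shape of such a config, here is a hypothetical sketch. The key names below are invented for illustration and are not kvwarden's actual schema; the shipped configs/quickstart_fairness.yaml is the source of truth:

```yaml
# Illustrative only — these keys are NOT kvwarden's real schema.
models:
  - name: llama31-8b
    engine: vllm
    model_path: meta-llama/Llama-3.1-8B-Instruct
fairness:
  policy: token_bucket
  default_tenant:
    rate_per_s: 2.0   # sustained admits per second per tenant
    burst: 4          # short spikes above the rate are absorbed
  scheduler: drr      # deficit round-robin across admitted tenants
```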

A Docker Compose bundle is not shipping yet — if you want one, #109 is open as a good-first-issue.

Is this for me?

| Tool | Orchestration | Tenant fairness | Engines | Target scale |
|---|---|---|---|---|
| NVIDIA Dynamo | Kubernetes | No | Multi | Datacenter |
| llm-d | Kubernetes | No | vLLM | Datacenter |
| Ollama | None | No | llama.cpp | Single-user |
| kvwarden | None | Yes | vLLM + SGLang | Single-node, 2-8 tenants |

kvwarden is the "shared GPU, a handful of tenants, no cluster" cell. If you already run Kubernetes or you're a single user on your own box, you probably want one of the others.

How it works

A thin orchestration layer (~3,500 LOC src) sits between your app and vLLM/SGLang. On request arrival, a per-tenant token bucket decides admit-or-429 before the request reaches the engine queue. Admitted requests flow through a length-bucketed admission controller and a DRR priority scheduler, then out to the engine subprocess kvwarden manages. Engines never see tenant identity; kvwarden does, and that's the entire trick. Multi-model lifecycle (freq+recency eviction, hot-swap) lives at the same layer. The HTTP API is OpenAI-compatible so your client code doesn't change.
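The admit-or-429 step can be sketched as a classic token bucket. This is a minimal illustration of the mechanism, not kvwarden's actual classes or parameter names:

```python
import time

class TenantBucket:
    """Per-tenant token bucket: admit if a token is available, else reject (429).
    Illustrative sketch only — not kvwarden's real API."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens replenished per second
        self.burst = burst          # bucket capacity
        self.tokens = burst
        self.last = time.monotonic()

    def try_admit(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True             # admit: request proceeds to the engine queue
        return False                # reject: caller returns HTTP 429

buckets = {"noisy": TenantBucket(rate=2.0, burst=4.0),
           "quiet": TenantBucket(rate=2.0, burst=4.0)}

# A burst of 20 back-to-back requests from "noisy" drains its own bucket
# after 4 admits, while "quiet" still sails through on its first request.
noisy_admits = sum(buckets["noisy"].try_admit() for _ in range(20))
quiet_admit = buckets["quiet"].try_admit()
print(noisy_admits, quiet_admit)  # 4 True
```

The point the paragraph makes is visible here: the flooder only exhausts its own budget, so the quiet tenant's admission decision is unaffected.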

Deeper read and the component diagram live in docs/architecture/overview.md.
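The DRR stage mentioned above can be sketched the same way. Again this is an illustrative deficit-round-robin over per-tenant queues, with invented names, not kvwarden's scheduler code:

```python
from collections import deque

def drr_schedule(queues: dict, quantum: int, rounds: int) -> list:
    """Deficit round-robin: each round a non-empty tenant's deficit grows by
    `quantum`, and it dequeues requests whose estimated cost fits the deficit."""
    deficit = {t: 0 for t in queues}
    served = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0       # idle tenants accrue no credit
                continue
            deficit[tenant] += quantum
            while q and q[0][1] <= deficit[tenant]:
                req, cost = q.popleft()
                deficit[tenant] -= cost
                served.append(req)
    return served

queues = {
    "noisy": deque([(f"n{i}", 100) for i in range(8)]),  # (request id, est. cost)
    "quiet": deque([("q0", 100)]),
}
print(drr_schedule(queues, quantum=100, rounds=2))  # ['n0', 'q0', 'n1']
```

Despite eight queued noisy requests, the quiet tenant's lone request is served after a single noisy one: per-round deficits interleave tenants instead of draining one queue first.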

Reproduce the hero number

```bash
# Terminal A — start kvwarden on the hero config (vLLM 0.19.1, Llama-3.1-8B).
kvwarden serve --config configs/gate2_fairness_token_bucket.yaml --port 8000

# Terminal B — wait for /health, then run the 300-second bench.
until curl -fs localhost:8000/health > /dev/null; do sleep 5; done
kvwarden bench reproduce-hero --flavor=2tenant
```

Needs an A100-SXM4 80GB (or equivalent — 1xH100 works) with vLLM 0.19.1 installed. Runtime is about 300 s for the bench plus ~30 s preflight. Output goes to ./kvwarden-reproduce-<timestamp>/report.json with your numbers side-by-side against the published reference so you can file an issue with concrete data if they diverge. Other flavors: --flavor=n6, --flavor=n8. Full doc: docs/reproduce_hero.md.

Frontier coverage

The same admission mechanism holds on larger models. Gate 2.3 (70B dense, TP=4 on 4x H100) and Gate 2.4 (Mixtral-8x7B MoE, TP=2 on 2x H100) both land quiet-to-solo p50 ratios between 1.07x and 1.94x — the mechanism lives before the engine boundary, so sharding topology, expert routing, and attention shape don't change the fairness picture. Matrix, caveats, and raw bench pointers in docs/launch/frontier_coverage.md. The 8B A100 run is the only one with a single-command reproduce path today; 70B and Mixtral wrappers are a roadmap item.

About the name

kvwarden ships tenant-fair admission today. Tenant-aware KV cache eviction — the feature the name implies — is a scaffold in src/kvwarden/cache/manager.py, not a shipping feature. The path to making the name true is tracked in #103 and targets the 0.2 release. If you pip install 0.1.3 expecting KV-cache isolation today, you will not get it; you get admission-gate fairness, which is what the hero number measures.

What's next

Telemetry

kvwarden ships opt-in, anonymous install/usage telemetry. First interactive run prompts once; default is no; answer n or hit Enter and nothing is ever transmitted. Opt in and each command sends seven fields: a locally-minted uuid4 install ID, kvwarden version, Python major.minor, OS, bucketed GPU class, command name, and a unix timestamp. No prompts, model names, tenant IDs, or receiver-side IP capture. Toggle with kvwarden telemetry off/on/status; hard-disable with export KVWARDEN_TELEMETRY=0. Non-interactive sessions auto-opt-out. Worker source: telemetry-worker/. Full policy: docs/privacy/telemetry.md.
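The seven-field payload is small enough to show whole. This is a sketch of what such a payload could look like; the key names and wire format are illustrative, not kvwarden's actual telemetry code:

```python
import platform
import sys
import time
import uuid

def build_telemetry_payload(command: str, gpu_class: str) -> dict:
    """Assemble the seven opt-in fields. Key names are illustrative;
    see docs/privacy/telemetry.md for the real schema."""
    return {
        "install_id": str(uuid.uuid4()),   # minted locally, never tied to hardware
        "version": "0.1.3",
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "os": platform.system(),
        "gpu_class": gpu_class,            # bucketed, e.g. "a100-80g"
        "command": command,
        "ts": int(time.time()),
    }

payload = build_telemetry_payload("serve", "a100-80g")
assert len(payload) == 7  # exactly seven fields, nothing request-derived
```

Note what is absent: no prompts, no model names, no tenant IDs, nothing derived from request contents.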

Tests

```bash
pytest tests/unit/        # 153 tests, no GPU needed, ~10 s
ruff check src/ tests/
ruff format --check src/ tests/
```

CI runs this matrix on Python 3.11 and 3.12; a red PR cannot merge.

Honesty log

Every metric we under-counted, and its fix, is logged in results/CORRECTIONS.md. TTFT measurement was rebuilt mid-project after a shadow review caught the original harness timing SSE first-frame RTT instead of first non-empty token (C2/C5). The 8B hero numbers exclude a 10 s JIT warmup window per C7; all 29 post-warmup windows sit between 36 ms and 65 ms.
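The C2/C5 distinction is easy to get wrong, so here is a minimal sketch of the corrected measurement, assuming OpenAI-style `data:` SSE lines (the function and stream shape are illustrative, not the actual harness):

```python
import json

def ttft_from_sse(lines, t0):
    """Time-to-first-token measured at the first *non-empty* token,
    not the first SSE frame. `lines` is an iterable of
    (arrival_time, raw SSE line) pairs."""
    for t, line in lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        text = chunk["choices"][0].get("text", "")
        if text:                 # skip empty keep-alive / metadata-only frames
            return t - t0
    return None

# A stream whose first frame carries no text: frame-RTT timing would report
# 5 ms here; token timing correctly reports 48 ms.
stream = [
    (0.005, 'data: {"choices":[{"text":""}]}'),
    (0.048, 'data: {"choices":[{"text":"Hello"}]}'),
    (0.060, "data: [DONE]"),
]
print(ttft_from_sse(stream, t0=0.0))  # 0.048
```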

Getting help

License

MIT. See LICENSE.

Cite as

```bibtex
@software{kvwarden_2026,
  title   = {kvwarden: tenant-fair LLM inference on a single GPU},
  author  = {Patel, Shrey and {Coconut Labs contributors}},
  year    = {2026},
  version = {0.1.3},
  url     = {https://github.com/coconut-labs/kvwarden}
}
```
