
kvwarden


Tenant-fair LLM inference on one GPU. Sits in front of vLLM/SGLang, rate-limits per tenant at admission, and keeps a quiet user fast while a noisy neighbor floods the same engine.


Hero number. A100-SXM4, Llama-3.1-8B, vLLM 0.19.1, two tenants sharing one engine, 300 s sustained:

| Scenario | Quiet user TTFT p99 |
|---|---|
| Solo (no contention) | 53.9 ms |
| FIFO under flooder (no rate limit) | 1,585 ms (29x starvation) |
| kvwarden token bucket under flooder | 61.5 ms (1.14x solo) |

Ten lines of YAML. No application code change. Raw artifacts: results/gate2_preprint_v3/.

```bash
pip install kvwarden
```

Quickstart

kvwarden does not bundle vLLM. Install an engine separately, then let kvwarden spawn and proxy it.

```bash
# 1. Install kvwarden + the engine.
pip install kvwarden
pip install vllm  # needs a GPU box; see vLLM docs for your CUDA stack

# 2. Start kvwarden. It launches vLLM as a subprocess per the model list
#    in the config and exposes one OpenAI-compatible endpoint.
kvwarden serve --config configs/quickstart_fairness.yaml

# 3. In another shell, wait for /health and send two tenants at the same model.
until curl -fs localhost:8000/health > /dev/null; do sleep 2; done

curl -N localhost:8000/v1/completions \
  -H "Content-Type: application/json" -H "X-Tenant-ID: noisy" \
  -d '{"model":"llama31-8b","prompt":"Hello","max_tokens":64,"stream":true}'
curl -N localhost:8000/v1/completions \
  -H "Content-Type: application/json" -H "X-Tenant-ID: quiet" \
  -d '{"model":"llama31-8b","prompt":"Hello","max_tokens":64,"stream":true}'

# 4. Watch the token bucket fire and the engine queue stay composed.
curl -s localhost:8000/metrics | grep -E "tenant_rejected|admission_queue_depth"
```

The first call returns 503 until vLLM finishes loading (30-90 s on an A100 for an 8B model). The config at configs/quickstart_fairness.yaml is heavily commented; every knob traces back to a specific experiment.
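For a feel of the shape of such a config, here is a hypothetical sketch. The key names below are invented for illustration and are not kvwarden's actual schema; the shipped configs/quickstart_fairness.yaml is the source of truth:

```yaml
# Illustrative only — these keys are NOT kvwarden's real schema.
models:
  - name: llama31-8b
    engine: vllm
    model_path: meta-llama/Llama-3.1-8B-Instruct
fairness:
  policy: token_bucket
  default_tenant:
    rate_per_s: 2.0   # sustained admits per second per tenant
    burst: 4          # short spikes above the rate are absorbed
  scheduler: drr      # deficit round-robin across admitted tenants
```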

A Docker Compose bundle is not shipping yet — if you want one, #109 is open as a good-first-issue.

Is this for me?

| Tool | Orchestration | Tenant fairness | Engines | Target scale |
|---|---|---|---|---|
| NVIDIA Dynamo | Kubernetes | No | Multi | Datacenter |
| llm-d | Kubernetes | No | vLLM | Datacenter |
| Ollama | None | No | llama.cpp | Single-user |
| kvwarden | None | Yes | vLLM + SGLang | Single-node, 2-8 tenants |

kvwarden is the "shared GPU, a handful of tenants, no cluster" cell. If you already run Kubernetes or you're a single user on your own box, you probably want one of the others.

How it works

A thin orchestration layer (~3,500 LOC src) sits between your app and vLLM/SGLang. On request arrival, a per-tenant token bucket decides admit-or-429 before the request reaches the engine queue. Admitted requests flow through a length-bucketed admission controller and a DRR priority scheduler, then out to the engine subprocess kvwarden manages. Engines never see tenant identity; kvwarden does, and that's the entire trick. Multi-model lifecycle (freq+recency eviction, hot-swap) lives at the same layer. The HTTP API is OpenAI-compatible so your client code doesn't change.
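The admit-or-429 step can be sketched as a classic token bucket. This is a minimal illustration of the mechanism, not kvwarden's actual classes or parameter names:

```python
import time

class TenantBucket:
    """Per-tenant token bucket: admit if a token is available, else reject (429).
    Illustrative sketch only — not kvwarden's real API."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # tokens replenished per second
        self.burst = burst          # bucket capacity
        self.tokens = burst
        self.last = time.monotonic()

    def try_admit(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True             # admit: request proceeds to the engine queue
        return False                # reject: caller returns HTTP 429

buckets = {"noisy": TenantBucket(rate=2.0, burst=4.0),
           "quiet": TenantBucket(rate=2.0, burst=4.0)}

# A burst of 20 back-to-back requests from "noisy" drains its own bucket
# after 4 admits, while "quiet" still sails through on its first request.
noisy_admits = sum(buckets["noisy"].try_admit() for _ in range(20))
quiet_admit = buckets["quiet"].try_admit()
print(noisy_admits, quiet_admit)  # 4 True
```

The point the paragraph makes is visible here: the flooder only exhausts its own budget, so the quiet tenant's admission decision is unaffected.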

Deeper read and the component diagram live in docs/architecture/overview.md.
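The DRR stage mentioned above can be sketched the same way. Again this is an illustrative deficit-round-robin over per-tenant queues, with invented names, not kvwarden's scheduler code:

```python
from collections import deque

def drr_schedule(queues: dict, quantum: int, rounds: int) -> list:
    """Deficit round-robin: each round a non-empty tenant's deficit grows by
    `quantum`, and it dequeues requests whose estimated cost fits the deficit."""
    deficit = {t: 0 for t in queues}
    served = []
    for _ in range(rounds):
        for tenant, q in queues.items():
            if not q:
                deficit[tenant] = 0       # idle tenants accrue no credit
                continue
            deficit[tenant] += quantum
            while q and q[0][1] <= deficit[tenant]:
                req, cost = q.popleft()
                deficit[tenant] -= cost
                served.append(req)
    return served

queues = {
    "noisy": deque([(f"n{i}", 100) for i in range(8)]),  # (request id, est. cost)
    "quiet": deque([("q0", 100)]),
}
print(drr_schedule(queues, quantum=100, rounds=2))  # ['n0', 'q0', 'n1']
```

Despite eight queued noisy requests, the quiet tenant's lone request is served after a single noisy one: per-round deficits interleave tenants instead of draining one queue first.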

Reproduce the hero number

```bash
# Terminal A — start kvwarden on the hero config (vLLM 0.19.1, Llama-3.1-8B).
kvwarden serve --config configs/gate2_fairness_token_bucket.yaml --port 8000

# Terminal B — wait for /health, then run the 300-second bench.
until curl -fs localhost:8000/health > /dev/null; do sleep 5; done
kvwarden bench reproduce-hero --flavor=2tenant
```

Needs an A100-SXM4 80GB (or equivalent — 1xH100 works) with vLLM 0.19.1 installed. Runtime is about 300 s for the bench plus ~30 s preflight. Output goes to ./kvwarden-reproduce-<timestamp>/report.json with your numbers side-by-side against the published reference so you can file an issue with concrete data if they diverge. Other flavors: --flavor=n6, --flavor=n8. Full doc: docs/reproduce_hero.md.

Frontier coverage

The same admission mechanism holds on larger models. Gate 2.3 (70B dense, TP=4 on 4x H100) and Gate 2.4 (Mixtral-8x7B MoE, TP=2 on 2x H100) both land quiet-to-solo p50 ratios between 1.07x and 1.94x — the mechanism lives before the engine boundary, so sharding topology, expert routing, and attention shape don't change the fairness picture. Matrix, caveats, and raw bench pointers in docs/launch/frontier_coverage.md. The 8B A100 run is the only one with a single-command reproduce path today; 70B and Mixtral wrappers are a roadmap item.

About the name

kvwarden ships tenant-fair admission today. Tenant-aware KV cache eviction — the feature the name implies — is a scaffold in src/kvwarden/cache/manager.py, not a shipping feature. The path to making the name true is tracked in #103 and targets the 0.2 release. If you pip install 0.1.3 expecting KV-cache isolation today, you will not get it; you get admission-gate fairness, which is what the hero number measures.

What's next

Telemetry

kvwarden ships opt-in, anonymous install/usage telemetry. First interactive run prompts once; default is no; answer n or hit Enter and nothing is ever transmitted. Opt in and each command sends seven fields: a locally-minted uuid4 install ID, kvwarden version, Python major.minor, OS, bucketed GPU class, command name, and a unix timestamp. No prompts, model names, tenant IDs, or receiver-side IP capture. Toggle with kvwarden telemetry off/on/status; hard-disable with export KVWARDEN_TELEMETRY=0. Non-interactive sessions auto-opt-out. Worker source: telemetry-worker/. Full policy: docs/privacy/telemetry.md.
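The seven-field payload is small enough to show whole. This is a sketch of what such a payload could look like; the key names and wire format are illustrative, not kvwarden's actual telemetry code:

```python
import platform
import sys
import time
import uuid

def build_telemetry_payload(command: str, gpu_class: str) -> dict:
    """Assemble the seven opt-in fields. Key names are illustrative;
    see docs/privacy/telemetry.md for the real schema."""
    return {
        "install_id": str(uuid.uuid4()),   # minted locally, never tied to hardware
        "version": "0.1.3",
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "os": platform.system(),
        "gpu_class": gpu_class,            # bucketed, e.g. "a100-80g"
        "command": command,
        "ts": int(time.time()),
    }

payload = build_telemetry_payload("serve", "a100-80g")
assert len(payload) == 7  # exactly seven fields, nothing request-derived
```

Note what is absent: no prompts, no model names, no tenant IDs, nothing derived from request contents.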

Tests

```bash
pytest tests/unit/        # 153 tests, no GPU needed, ~10 s
ruff check src/ tests/
ruff format --check src/ tests/
```

CI runs this matrix on Python 3.11 and 3.12; a red PR cannot merge.

Honesty log

Every metric we under-counted, and its fix, is logged in results/CORRECTIONS.md. TTFT measurement was rebuilt mid-project after a shadow review caught the original harness timing SSE first-frame RTT instead of first non-empty token (C2/C5). The 8B hero numbers exclude a 10 s JIT warmup window per C7; all 29 post-warmup windows sit between 36 ms and 65 ms.
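The C2/C5 distinction is easy to get wrong, so here is a minimal sketch of the corrected measurement, assuming OpenAI-style `data:` SSE lines (the function and stream shape are illustrative, not the actual harness):

```python
import json

def ttft_from_sse(lines, t0):
    """Time-to-first-token measured at the first *non-empty* token,
    not the first SSE frame. `lines` is an iterable of
    (arrival_time, raw SSE line) pairs."""
    for t, line in lines:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        text = chunk["choices"][0].get("text", "")
        if text:                 # skip empty keep-alive / metadata-only frames
            return t - t0
    return None

# A stream whose first frame carries no text: frame-RTT timing would report
# 5 ms here; token timing correctly reports 48 ms.
stream = [
    (0.005, 'data: {"choices":[{"text":""}]}'),
    (0.048, 'data: {"choices":[{"text":"Hello"}]}'),
    (0.060, "data: [DONE]"),
]
print(ttft_from_sse(stream, t0=0.0))  # 0.048
```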

Getting help

License

MIT. See LICENSE.

Cite as

```bibtex
@software{kvwarden_2026,
  title   = {kvwarden: tenant-fair LLM inference on a single GPU},
  author  = {Patel, Shrey and {Coconut Labs contributors}},
  year    = {2026},
  version = {0.1.3},
  url     = {https://github.com/coconut-labs/kvwarden}
}
```
