Optimization layer for LLM inference with KV-cache management
KV-OptKit
KV-OptKit optimizes KV-cache memory for LLM inference to meet latency SLOs while staying within memory budgets. It provides an advisor for recommendations, a safe autopilot with rollback, and a simple UI (QuickView) to observe KPIs and apply plans.
Why KV-OptKit
- Keep P95 latency within SLOs while controlling HBM/VRAM usage
- Safe, revertible optimization via plan-based apply and rollback
- Works out-of-the-box on CPU (SIM + vLLM demo sequences); easy GPU upgrade path
- Clear, observable UI (QuickView) and Prometheus metrics for production
Features
- Advisor Mode: Read-only recommendations for KV-cache optimization
- SIM Adapter: No-GPU testing environment included
- Policy Engine: Configurable policies for eviction and memory management
- REST API: Easy integration with existing systems
- Docker Support: Containerized deployment
- Autopilot Mode: Automated optimization with safety guards and shadow testing
Run demos and learn more:
- See the Demo Guide: docs/README-demos.md
Get Started
Quick start in a new terminal:
# (optional) create/activate venv
python -m pip install --upgrade pip
pip install -e .
# Run server on :9001 (PowerShell syntax shown; in bash use: export KVOPT_PORT=9001)
$env:KVOPT_PORT = "9001"
python -m kvopt.server.main
# Open QuickView
# http://localhost:9001/
Demos
For a quick, hands-on walkthrough of the Phase 1 and Phase 2 demos (combined and per-action), see the demo guide: docs/README-demos.md
Quick Start
Prerequisites
- Python 3.10+
- Docker and Docker Compose (for containerized deployment)
Local Installation
- Clone the repository:
git clone <repository-url>
cd kv-optkit
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Run the server (SIM adapter by default):
# Option A: use the sample config (cmd.exe syntax)
set KVOPT_CONFIG=config\sample_config.yaml
# PowerShell equivalent: $env:KVOPT_CONFIG="config/sample_config.yaml"
# Serve on :9001 (QuickView)
set KVOPT_PORT=9001
# PowerShell equivalent: $env:KVOPT_PORT="9001"
# Start the server
kvopt-server
# or: python -m kvopt.server.main
- Verify health:
curl http://localhost:9001/healthz
- In another terminal, run the demo (generates activity for SIM):
python examples/demo_trace.py
Or use the new CLI and one-click PowerShell demo:
- CLI demo (Python):
# Inspect live telemetry and advisor
python examples/demo_cli.py telemetry
python examples/demo_cli.py report
# Reset and create a couple of sequences
python examples/demo_cli.py reset
python examples/demo_cli.py submit --seq seq_1 --tokens 2000
python examples/demo_cli.py submit --seq seq_2 --tokens 1200
# Apply optimizations
python examples/demo_cli.py quantize --seq seq_1 --start 0 --end 999 --factor 0.5
python examples/demo_cli.py offload --seq seq_2 --start 0 --end 799
# Verify effects
python examples/demo_cli.py telemetry
python examples/demo_cli.py report
- One-click PowerShell demo (Windows):
# From the repository root
powershell -ExecutionPolicy Bypass -File .\examples\demo.ps1
Docker Compose
- Build and start the service:
docker compose -f docker/compose.yml up --build -d
- Check the service status:
docker compose -f docker/compose.yml ps
- Run the demo:
python examples/demo_trace.py
API Documentation
Health Check
GET /healthz
Get Advisor Report
GET /advisor/report
SIM Adapter Endpoints (for testing)
The SIM adapter is used internally for local testing. Interact with the public endpoints below:
- GET /telemetry: current adapter telemetry
- GET /metrics: Prometheus metrics
Autopilot Mode
The Autopilot feature automates KV cache optimization with safety guarantees:
Key Components
- Policy Engine: Generates optimization plans based on system state
- Guard System: Validates and monitors plan execution with rollback capability
- Action Executor: Safely applies optimization actions
- REST API: Control and monitor the optimization process
Using Autopilot
1. Create an Optimization Plan
curl -X POST "http://localhost:9001/autopilot/plan" \
-H "Content-Type: application/json" \
-d '{
"target_hbm_util": 0.7,
"max_actions": 5,
"priority": "high"
}'
2. Check Plan Status
curl "http://localhost:9001/autopilot/plan/{plan_id}"
3. Monitor Metrics
curl "http://localhost:9001/autopilot/metrics"
Python example (requests)
import requests
base = "http://localhost:9001"
# Create a plan
r = requests.post(
    f"{base}/autopilot/plan",
    json={"target_hbm_util": 0.7, "max_actions": 5, "priority": "high"},
)
r.raise_for_status()
plan_id = r.json()["plan_id"]
# Check status (single request)
s = requests.get(f"{base}/autopilot/plan/{plan_id}").json()
print("status:", s["status"], "completed:", s["actions_completed"], "/", s["actions_total"])
# Metrics
m = requests.get(f"{base}/autopilot/metrics").json()
print("guard metrics:", m)
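A plan may take some time to apply, so a single status request is often not enough. The following is a minimal polling sketch. The terminal status names in `ASSUMED_TERMINAL` are assumptions for illustration and are not confirmed by the API, and `fetch_status` is a hypothetical callable that returns the plan's JSON, which lets the loop run without a live server.

```python
import time
from typing import Callable, Dict, Iterable

# Assumed terminal states; check the server's real "status" values before relying on these.
ASSUMED_TERMINAL = frozenset({"completed", "failed", "rolled_back"})


def wait_for_plan(
    fetch_status: Callable[[], Dict],
    terminal: Iterable[str] = ASSUMED_TERMINAL,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll fetch_status() until its "status" is terminal or timeout_s elapses."""
    terminal_set = set(terminal)
    deadline = clock() + timeout_s
    while True:
        state = fetch_status()
        if state.get("status") in terminal_set:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"plan still {state.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

With the `requests` example above, the callable could be `lambda: requests.get(f"{base}/autopilot/plan/{plan_id}").json()`. Injecting `sleep` and `clock` keeps the loop easy to test.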
Configuration
Edit config/sample_config.yaml to customize the behavior:
slo:
  latency_p95_ms: 2000.0
  max_accuracy_delta_pct: 0.5
budgets:
  hbm_util_target: 0.85
  offload_bw_gbps: 120.0
policy:
  keep_recent_tokens: 4096
  eviction:
    - age_decay
  tiers:
    - HBM
    - DDR
    - CXL
    - NVMe
plugins: {}
guardrails:
  ab_shadow_fraction: 0.05
  rollback_on_acc_delta: true
autopilot:
  enabled: true
  default_target_hbm_util: 0.8
  default_max_actions: 10
  default_priority: "medium"
  guard:
    enabled: true
    shadow_fraction: 0.1 # Fraction of requests to execute in shadow mode
    max_accuracy_delta: 0.05 # Maximum allowed accuracy impact
    rollback_on_high_impact: true
    accuracy_weights:
      evict: 1.0
      offload: 0.8
      quantize: 0.5
  policy_engine:
    min_sequence_length: 100
    min_sequence_utilization: 0.3
    quantization_scales: [0.5, 0.25]
    action_priority: ["EVICT", "OFFLOAD", "QUANTIZE"]
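Before starting the server with an edited config, it can help to sanity-check a few values. The sketch below is not KV-OptKit's own validation; the function `check_config` and the range rules it applies (fractions in (0, 1], positive latency) are assumptions chosen for illustration. It works on an already parsed dictionary and only looks at keys from the `slo`, `budgets`, and `guardrails` sections shown above.

```python
from typing import Any, Dict, List


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a parsed config (empty if none)."""
    problems: List[str] = []

    def fraction(section: str, key: str) -> None:
        value = (cfg.get(section) or {}).get(key)
        if not _is_number(value) or not 0.0 < value <= 1.0:
            problems.append(f"{section}.{key} should be a number in (0, 1], got {value!r}")

    fraction("budgets", "hbm_util_target")
    fraction("guardrails", "ab_shadow_fraction")

    latency = (cfg.get("slo") or {}).get("latency_p95_ms")
    if not _is_number(latency) or latency <= 0:
        problems.append(f"slo.latency_p95_ms should be a positive number, got {latency!r}")
    return problems


sample = {
    "slo": {"latency_p95_ms": 2000.0, "max_accuracy_delta_pct": 0.5},
    "budgets": {"hbm_util_target": 0.85, "offload_bw_gbps": 120.0},
    "guardrails": {"ab_shadow_fraction": 0.05, "rollback_on_acc_delta": True},
}
print(check_config(sample))  # prints []
```

If PyYAML is installed, the dictionary could come from `yaml.safe_load(open("config/sample_config.yaml"))` instead of the inline `sample`.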
Development
Running Tests
pytest tests/
Code Style
This project uses black for code formatting and flake8 for linting.
black .
flake8
Installation and Quickstart
- pip (PyPI):
pip install kv-optkit
# Optional: point to a custom config (PowerShell)
$env:KVOPT_CONFIG = "config/sample_config.yaml"
# Start the server on :9000
kvopt-server
- pip with extras:
# vLLM adapter support and NVML telemetry
pip install "kv-optkit[vllm]"
# TensorRT-LLM route (Linux-only for TensorRT)
pip install "kv-optkit[trtllm]"
# Text Generation Inference route
pip install "kv-optkit[tgi]"
# DeepSpeed-MII route
pip install "kv-optkit[deepspeed]"
# Dev tools for contributing (pytest, ruff, mypy, etc.)
pip install -e ".[dev]"
Extras summary:
- vllm: vLLM engine adapter and NVML GPU telemetry.
- trtllm: TensorRT-LLM and Triton client tooling (Linux preferred).
- tgi: Text Generation Inference client/adapter.
- deepspeed: DeepSpeed-MII adapter route.
- dev: local development and CI tooling.
- Docker (GHCR):
docker run -p 9001:9001 -e KVOPT_PORT=9001 ghcr.io/archokshi/kv-optkit:latest
- Docker Compose profiles:
# Agent only
docker compose -f docker/compose.yml --profile sim up -d
# Observability (Prometheus + Grafana)
docker compose -f docker/compose.yml --profile obs up -d
- Helm (local chart):
helm upgrade --install kv-optkit ./deploy/helm/kv-optkit \
--set image.repository=ghcr.io/archokshi/kv-optkit \
--set image.tag=latest
Environment configuration (Sidecar/Auto-attach)
- KVOPT_ENGINE_ENDPOINT: target engine endpoint (e.g., http://localhost:8000).
- KVOPT_ENGINE_SETTINGS: optional JSON settings blob consumed at server startup.
These variables are read at server start; sidecar mode uses them to auto-attach.
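As a rough illustration of how a launcher script might check these two variables before starting the server, the sketch below reads them from an environment mapping. The helper name `read_engine_env` is hypothetical, and the requirement that the settings blob be a JSON object (rather than a list or scalar) is an assumption; the server's own parsing may differ.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional, Tuple


def read_engine_env(
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return (endpoint, settings) from the two KVOPT_ENGINE_* variables."""
    endpoint = env.get("KVOPT_ENGINE_ENDPOINT") or None
    raw = env.get("KVOPT_ENGINE_SETTINGS", "").strip()
    if not raw:
        return endpoint, {}
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    settings = json.loads(raw)
    if not isinstance(settings, dict):
        raise ValueError("KVOPT_ENGINE_SETTINGS must be a JSON object")
    return endpoint, settings


# "timeout_s" is a made-up settings key used only for this demonstration.
endpoint, settings = read_engine_env(
    {
        "KVOPT_ENGINE_ENDPOINT": "http://localhost:8000",
        "KVOPT_ENGINE_SETTINGS": '{"timeout_s": 5}',
    }
)
print(endpoint, settings)
```

Passing the mapping explicitly, instead of reading `os.environ` directly inside the function, makes the check easy to exercise in tests.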
Observability & Reporting
Provides Prometheus metrics, a Grafana dashboard, and a report generator for go/no-go decisions.
Metrics exposed at /metrics
- Gauges: kvopt_hbm_utilization, kvopt_hbm_used_gb, kvopt_p95_latency_ms, kvopt_ttft_ms, kvopt_ddr_utilization, kvopt_ddr_used_gb
- Counters: kvopt_tokens_evicted_total, kvopt_tokens_quantized_total, kvopt_reuse_hits_total, kvopt_reuse_misses_total, kvopt_autopilot_applies_total, kvopt_autopilot_rollbacks_total
Prometheus exposition is compliant (text/plain; version=0.0.4) with HELP/TYPE headers.
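Because the endpoint serves the standard text format, the metrics can also be read without a Prometheus server. The sketch below parses unlabeled samples into a dictionary. It is a deliberately small parser, not a full implementation of the exposition format: labeled series are skipped, and the sample values in `example` are made up for illustration.

```python
from typing import Dict


def parse_metrics(text: str, prefix: str = "kvopt_") -> Dict[str, float]:
    """Parse unlabeled samples from Prometheus text exposition into {name: value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        parts = line.split()
        name = parts[0]
        if "{" in name or len(parts) < 2 or not name.startswith(prefix):
            continue  # labeled or unrelated series are out of scope for this sketch
        samples[name] = float(parts[1])
    return samples


# Illustrative payload; the numbers are invented, not real measurements.
example = """\
# HELP kvopt_hbm_utilization HBM utilization
# TYPE kvopt_hbm_utilization gauge
kvopt_hbm_utilization 0.72
# TYPE kvopt_tokens_evicted_total counter
kvopt_tokens_evicted_total 128
process_cpu_seconds_total 1.5
"""
print(parse_metrics(example))
```

For production use, a maintained client library or a Prometheus scrape is the more robust choice; this only shows the shape of the data.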
Docker Compose stack
From the repository root (the compose file lives in docker/):
docker compose -f docker/compose.yml up -d prometheus grafana
- Prometheus UI: http://localhost:9090
- Example queries: kvopt_hbm_utilization, kvopt_p95_latency_ms, kvopt_ttft_ms
- Grafana UI: http://localhost:3001
- Dashboard: "KV-OptKit" (provisioned via docker/grafana-dashboard.json)
Report generator
Generate a live report by sampling metrics for ~30s:
python tools/make_report.py --from live --base http://localhost:8000 --samples 6 --interval 5 --out outputs/run_report.md
Or from a CSV (CI-friendly):
python tools/make_report.py --from file --csv tests/fixtures/metrics_sample.csv --out outputs/run_report.md
Artifacts:
- Markdown: outputs/run_report.md
- Charts in outputs/charts/: hbm.png, latency.png, ttft.png, ddr.png
The report includes before/after summaries, action counter deltas, and a Go/No-Go decision vs the P95 SLO.
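The exact decision rule used by tools/make_report.py is not described here. As one plausible reading, the sketch below compares the worst sampled value of `kvopt_p95_latency_ms` against the SLO. The function name `go_no_go` and the "worst sample" rule are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def go_no_go(p95_samples_ms: Sequence[float], slo_p95_ms: float) -> Tuple[str, float]:
    """Return ("GO" | "NO-GO", worst_sample) for sampled P95 latency values."""
    if not p95_samples_ms:
        raise ValueError("need at least one sample")
    worst = max(p95_samples_ms)
    decision = "GO" if worst <= slo_p95_ms else "NO-GO"
    return decision, worst


# Six invented samples, matching the --samples 6 shape of the live report command.
samples = [1800.0, 1850.0, 1950.0, 1900.0, 1875.0, 1820.0]
print(go_no_go(samples, slo_p95_ms=2000.0))  # prints ('GO', 1950.0)
```

Using the worst sample is conservative: a single excursion above the SLO during the sampling window turns the result into NO-GO.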
QuickView Screenshots
Below are example views from the built-in QuickView at /.
- Overview dashboard with HBM utilization, adapter capabilities, and sequence counts
- Sequences table with per-sequence utilization and qscale
- Autopilot controls and current plan status
Place your screenshots in docs/ with the file names above or adjust links as needed.
Containerized LMCache demo
The demo runs entirely in Docker and writes results to a host-mounted folder.
Build the demo image
docker build -f Dockerfile.demo -t kvopt-demo .
Start with Docker Compose
docker compose -f docker/compose.yml up --force-recreate
This launches:
- redis:7 as kvopt-redis
- kvopt-demo running examples/demo_reuse.py against redis://redis:6379
Where results are saved
- CSV output is persisted on the host at outputs/kv_reuse.csv.
- On Windows, you can inspect it with:
type outputs\kv_reuse.csv
Stopping containers
In the same directory:
docker compose -f docker/compose.yml down
Compatibility matrix
The following combinations have been smoke-tested with the SIM adapter (no GPU) and with vLLM where noted.
| Component | Version(s) |
|---|---|
| Python | 3.10, 3.11 |
| vLLM (optional) | 0.5.x (basic adapter compatibility) |
| CUDA | N/A for SIM; 12.x recommended for GPU deployments |
| GPU SKUs (indicative) | A10, A100 40/80GB, L4 (adapter-level tests) |
Notes:
- SIM adapter requires no GPU and is the default for quickstart.
- For GPU deployments with vLLM, ensure CUDA drivers match container/base image.
Adapters & capability levels
| Adapter | Levels | Notes |
|---|---|---|
| vLLM | L0, L2 | L0 observe-only; L2 safe EVICT-only apply |
| SIM | L0–L3 | Full feature surface for development/testing |
| TGI | L0 | Early support; subject to change |
| DeepSpeed-MII | L0 | Early support; subject to change |
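Client code that targets more than one adapter may want to check a capability level before attempting an action. The sketch below simply transcribes the table into a lookup. It treats each level as an independent flag, which is an assumption (the table lists vLLM as L0 and L2 without L1), and the names `ADAPTER_LEVELS` and `supports` are hypothetical rather than part of the package.

```python
from typing import Dict, FrozenSet

# Transcribed from the table above; levels are stored as plain integers.
ADAPTER_LEVELS: Dict[str, FrozenSet[int]] = {
    "vLLM": frozenset({0, 2}),
    "SIM": frozenset({0, 1, 2, 3}),
    "TGI": frozenset({0}),
    "DeepSpeed-MII": frozenset({0}),
}


def supports(adapter: str, level: int) -> bool:
    """Return True if the table lists the given level for the adapter."""
    return level in ADAPTER_LEVELS.get(adapter, frozenset())


print(supports("vLLM", 2))  # prints True
print(supports("TGI", 2))   # prints False
```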
Demos
For all demo flows (Phase 1, Phase 2, and the Phase 5 quickstarts, including sidecar), see the dedicated demo guide: docs/README-demos.md
Releases
- Semantic Versioning (vX.Y.Z).
- Release notes and change history: see CHANGELOG.md.
- Pin a specific version:
- PyPI: pip install kv-optkit==X.Y.Z
- GHCR: ghcr.io/archokshi/kv-optkit:X.Y.Z
License
Apache 2.0 - See LICENSE for more information.