Skip to main content

Optimization layer for LLM inference with KV-cache management

Project description

KV-OptKit

KV-OptKit is a drop-in optimization layer for LLM inference that observes runtime behavior and continuously recommends—or automatically enforces—settings to keep KV-cache memory, batching, and tiered storage within budget while meeting P95 latency SLOs.

Features

  • Advisor Mode: Read-only recommendations for KV-cache optimization
  • SIM Adapter: No-GPU testing environment included
  • Policy Engine: Configurable policies for eviction and memory management
  • REST API: Easy integration with existing systems
  • Docker Support: Containerized deployment
  • Autopilot Mode: Automated optimization with safety guards and shadow testing

Quick Start

Prerequisites

  • Python 3.10+
  • Docker and Docker Compose (for containerized deployment)

Local Installation

  1. Clone the repository:

    git clone <repository-url>
    cd kv-optkit
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Run the server (SIM adapter by default):

    # Option A: use the sample config
    set KVOPT_CONFIG=config\sample_config.yaml   # Windows PowerShell: $env:KVOPT_CONFIG="config/sample_config.yaml"
    kvopt-server  # serves on :9000
    # or
    python -m kvopt.server.main
    
  5. Verify health:

    curl http://localhost:9000/healthz
    
  6. In another terminal, run the demo (generates activity for SIM):

    python examples/demo_trace.py
    

Docker Compose

  1. Build and start the service:

    docker compose -f docker/compose.yml up --build -d
    
  2. Check the service status:

    docker compose -f docker/compose.yml ps
    
  3. Run the demo:

    python examples/demo_trace.py
    

API Documentation

Health Check

GET /healthz

Get Advisor Report

GET /advisor/report

SIM Adapter Endpoints (for testing)

The SIM adapter is used internally for local testing. Interact with the public endpoints below:

  • GET /telemetry — current adapter telemetry
  • GET /metrics — Prometheus metrics

Autopilot Mode

The Autopilot feature automates KV cache optimization with safety guarantees:

Key Components

  1. Policy Engine: Generates optimization plans based on system state
  2. Guard System: Validates and monitors plan execution with rollback capability
  3. Action Executor: Safely applies optimization actions
  4. REST API: Control and monitor the optimization process

Using Autopilot

1. Create an Optimization Plan

curl -X POST "http://localhost:9000/autopilot/plan" \
  -H "Content-Type: application/json" \
  -d '{
    "target_hbm_util": 0.7,
    "max_actions": 5,
    "priority": "high"
  }'

2. Check Plan Status

curl "http://localhost:9000/autopilot/plan/{plan_id}"

3. Monitor Metrics

curl "http://localhost:9000/autopilot/metrics"

Python example (requests)

import requests

base = "http://localhost:9000"

# Create a plan
r = requests.post(
    f"{base}/autopilot/plan",
    json={"target_hbm_util": 0.7, "max_actions": 5, "priority": "high"},
)
r.raise_for_status()
plan_id = r.json()["plan_id"]

# Poll status
s = requests.get(f"{base}/autopilot/plan/{plan_id}").json()
print("status:", s["status"], "completed:", s["actions_completed"], "/", s["actions_total"]) 

# Metrics
m = requests.get(f"{base}/autopilot/metrics").json()
print("guard metrics:", m)

Configuration

Edit config/sample_config.yaml to customize the behavior:

slo:
  latency_p95_ms: 2000.0
  max_accuracy_delta_pct: 0.5

budgets:
  hbm_util_target: 0.85
  offload_bw_gbps: 120.0

policy:
  keep_recent_tokens: 4096
  eviction:
    - age_decay
  tiers:
    - HBM
    - DDR
    - CXL
    - NVMe

plugins: {}

guardrails:
  ab_shadow_fraction: 0.05
  rollback_on_acc_delta: true

autopilot:
  enabled: true
  default_target_hbm_util: 0.8
  default_max_actions: 10
  default_priority: "medium"

guard:
  enabled: true
  shadow_fraction: 0.1  # Fraction of requests to execute in shadow mode
  max_accuracy_delta: 0.05  # Maximum allowed accuracy impact
  rollback_on_high_impact: true
  accuracy_weights:
    evict: 1.0
    offload: 0.8
    quantize: 0.5

policy_engine:
  min_sequence_length: 100
  min_sequence_utilization: 0.3
  quantization_scales: [0.5, 0.25]
  action_priority: ["EVICT", "OFFLOAD", "QUANTIZE"]

Development

Running Tests

pytest tests/

Code Style

This project uses black for code formatting and flake8 for linting.

black .
flake8

Installation and Quickstart

  • pip (PyPI):
pip install kv-optkit

# Optional: point to a custom config (PowerShell)
$env:KVOPT_CONFIG = "config/sample_config.yaml"

# Start the server on :9000
kvopt-server
  • pip with extras:
# vLLM adapter support and NVML telemetry
pip install "kv-optkit[vllm]"

# TensorRT-LLM route (Linux-only for TensorRT)
pip install "kv-optkit[trtllm]"

# Text Generation Inference route
pip install "kv-optkit[tgi]"

# DeepSpeed-MII route
pip install "kv-optkit[deepspeed]"

# Dev tools for contributing (pytest, ruff, mypy, etc.)
pip install -e ".[dev]"

Extras summary:

  • vllm: vLLM engine adapter and NVML GPU telemetry.

  • trtllm: TensorRT-LLM and Triton client tooling (Linux preferred).

  • tgi: Text Generation Inference client/adapter.

  • deepspeed: DeepSpeed-MII adapter route.

  • dev: local development and CI tooling.

  • Docker (GHCR):

docker run -p 9000:9000 ghcr.io/archokshi/kv-optkit:latest
  • Docker Compose profiles:
# Agent only
docker compose -f docker/compose.yml --profile sim up -d

# Observability (Prometheus + Grafana)
docker compose -f docker/compose.yml --profile obs up -d
  • Helm (local chart):
helm upgrade --install kv-optkit ./deploy/helm/kv-optkit \
  --set image.repository=ghcr.io/archokshi/kv-optkit \
  --set image.tag=latest

Environment configuration (Sidecar/Auto-attach)

  • KVOPT_ENGINE_ENDPOINT: target engine endpoint (e.g., http://localhost:8000).
  • KVOPT_ENGINE_SETTINGS: optional JSON settings blob consumed at server startup.

These variables are read at server start; sidecar mode uses them to auto-attach.

Observability & Reporting

Provides Prometheus metrics, a Grafana dashboard, and a report generator for go/no-go decisions.

Metrics exposed at /metrics

  • Gauges
    • kvopt_hbm_utilization
    • kvopt_hbm_used_gb
    • kvopt_p95_latency_ms
    • kvopt_ttft_ms
    • kvopt_ddr_utilization
    • kvopt_ddr_used_gb
  • Counters
    • kvopt_tokens_evicted_total
    • kvopt_tokens_quantized_total
    • kvopt_reuse_hits_total, kvopt_reuse_misses_total
    • kvopt_autopilot_applies_total, kvopt_autopilot_rollbacks_total

Prometheus exposition is compliant (text/plain; version=0.0.4) with HELP/TYPE headers.

Docker Compose stack

From docker/ directory:

docker compose -f docker/compose.yml up -d prometheus grafana
  • Prometheus UI: http://localhost:9090
    • Example queries: kvopt_hbm_utilization, kvopt_p95_latency_ms, kvopt_ttft_ms
  • Grafana UI: http://localhost:3000
    • Dashboard: "KV-OptKit" (provisioned via docker/grafana-dashboard.json)

Report generator

Generate a live report by sampling metrics for ~30s:

python tools/make_report.py --from live --base http://localhost:8000 --samples 6 --interval 5 --out outputs/run_report.md

Or from a CSV (CI-friendly):

python tools/make_report.py --from file --csv tests/fixtures/metrics_sample.csv --out outputs/run_report.md

Artifacts:

  • Markdown: outputs/run_report.md
  • Charts in outputs/charts/: hbm.png, latency.png, ttft.png, ddr.png

The report includes before/after summaries, action counter deltas, and a Go/No-Go decision vs the P95 SLO.

QuickView Screenshots

Below are example views from the built-in QuickView at /.

  • Overview dashboard with HBM utilization, adapter capabilities, and sequence counts QuickView Overview

  • Sequences table with per-sequence utilization and qscale QuickView Sequences

  • Autopilot controls and current plan status QuickView Autopilot

Place your screenshots in docs/ with the file names above or adjust links as needed.

Containerized LMCache demo

The demo runs entirely in Docker and writes results to a host-mounted folder.

Build the demo image

docker build -f Dockerfile.demo -t kvopt-demo .

Start with Docker Compose

docker compose -f docker/compose.yml up --force-recreate

This launches:

  • redis:7 as kvopt-redis
  • kvopt-demo running examples/demo_reuse.py against redis://redis:6379

Where results are saved

  • CSV output is persisted on the host at outputs/kv_reuse.csv.
  • On Windows, you can inspect it with:
    type outputs\kv_reuse.csv
    

Stopping containers

In the same directory:

docker compose -f docker/compose.yml down

Compatibility matrix

The following combinations have been smoke-tested with the SIM adapter (no GPU) and with vLLM where noted.

Component Version(s)
Python 3.10, 3.11
vLLM (optional) 0.5.x (basic adapter compatibility)
CUDA N/A for SIM; 12.x recommended for GPU deployments
GPU SKUs (indicative) A10, A100 40/80GB, L4 (adapter-level tests)

Notes:

  • SIM adapter requires no GPU and is the default for quickstart.
  • For GPU deployments with vLLM, ensure CUDA drivers match container/base image.

Adapters & capability levels

Adapter Levels Notes
vLLM L0, L2 L0 observe-only; L2 safe EVICT-only apply
SIM L0–L3 Full feature surface for development/testing
TGI L0 Early support; subject to change
DeepSpeed-MII L0 Early support; subject to change

Releases

  • Semantic Versioning (vX.Y.Z).
  • Release notes and change history: see CHANGELOG.md.
  • Pin a specific version:
    • PyPI: pip install kv-optkit==X.Y.Z
    • GHCR: ghcr.io/archokshi/kv-optkit:X.Y.Z

License

Apache 2.0 - See LICENSE for more information.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kv_optkit-0.5.3.tar.gz (51.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kv_optkit-0.5.3-py3-none-any.whl (51.9 kB view details)

Uploaded Python 3

File details

Details for the file kv_optkit-0.5.3.tar.gz.

File metadata

  • Download URL: kv_optkit-0.5.3.tar.gz
  • Upload date:
  • Size: 51.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for kv_optkit-0.5.3.tar.gz
Algorithm Hash digest
SHA256 217382a63fc202ba5023cae5faab59636784107cf1be5817cb28b6f9834fb277
MD5 5f682ffe4fb78b311dbcc043bb7b7a22
BLAKE2b-256 28838f35fe517b74dadf65c8b4dd2ddede0eec6b58e1576b1455340ab04f6c40

See more details on using hashes here.

File details

Details for the file kv_optkit-0.5.3-py3-none-any.whl.

File metadata

  • Download URL: kv_optkit-0.5.3-py3-none-any.whl
  • Upload date:
  • Size: 51.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for kv_optkit-0.5.3-py3-none-any.whl
Algorithm Hash digest
SHA256 eb56c010918caee803517250c30a71efaea7dbb5b2f2da7e18e015d73d780842
MD5 651fc0ede3bc25daa4149ba74f12877e
BLAKE2b-256 eeaa8a7e98d730227b761f8644d852ebab0e87a25514993f6c6ee9688f5f221d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page