Optimization layer for LLM inference with KV-cache management

KV-OptKit

KV-OptKit optimizes KV-cache memory for LLM inference to meet latency SLOs while staying within memory budgets. It provides an advisor for recommendations, a safe autopilot with rollback, and a simple UI (QuickView) to observe KPIs and apply plans.

Why KV-OptKit

  • Keep P95 latency within SLOs while controlling HBM/VRAM usage
  • Safe, revertible optimization via plan-based apply and rollback
  • Works out of the box on CPU (SIM + vLLM demo sequences), with an easy upgrade path to GPU
  • Clear, observable UI (QuickView) and Prometheus metrics for production
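
To give a sense of why KV-cache memory needs a budget at all, here is a minimal sizing sketch. It uses the standard estimate for a transformer KV cache (one key and one value tensor per layer); the model shape is a hypothetical example, and none of these names come from KV-OptKit itself.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Bytes needed to hold the keys and values for num_tokens tokens."""
    # The leading factor 2 counts one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape, for illustration only:
# 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, num_tokens=1)
seq_4k = kv_cache_bytes(32, 32, 128, num_tokens=4096)

print(per_token / 2**20, "MiB per token")      # 0.5 MiB per token
print(seq_4k / 2**30, "GiB for 4096 tokens")   # 2.0 GiB for 4096 tokens
```

At these assumed sizes a handful of long sequences already consumes a large share of HBM, which is the pressure that eviction, offload, and quantization are meant to relieve.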

Features

  • Advisor Mode: Read-only recommendations for KV-cache optimization
  • SIM Adapter: No-GPU testing environment included
  • Policy Engine: Configurable policies for eviction and memory management
  • REST API: Easy integration with existing systems
  • Docker Support: Containerized deployment
  • Autopilot Mode: Automated optimization with safety guards and shadow testing

Run demos and learn more:

Get Started

Quick start in a new terminal:

# (optional) create/activate venv
python -m pip install --upgrade pip
pip install -e .

# Run server on :9001 (PowerShell syntax; in bash use: export KVOPT_PORT=9001)
$env:KVOPT_PORT = "9001"
python -m kvopt.server.main

# Open QuickView
# http://localhost:9001/

Demos

For a quick, hands-on walkthrough of Phase 1 and Phase 2 demos (combined and per-action), see the demo guide:

Quick Start

Prerequisites

  • Python 3.10+
  • Docker and Docker Compose (for containerized deployment)

Local Installation

  1. Clone the repository:

    git clone <repository-url>
    cd kv-optkit
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Run the server (SIM adapter by default):

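    # Option A: use the sample config and serve on :9001 (QuickView)
    # Windows cmd (keep comments off the "set" lines; cmd would store them in the value):
    set KVOPT_CONFIG=config\sample_config.yaml
    set KVOPT_PORT=9001
    # PowerShell:
    #   $env:KVOPT_CONFIG="config/sample_config.yaml"; $env:KVOPT_PORT="9001"
    # bash/zsh:
    #   export KVOPT_CONFIG=config/sample_config.yaml KVOPT_PORT=9001
    kvopt-server
    # or
    python -m kvopt.server.main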
    
  5. Verify health:

    curl http://localhost:9001/healthz
    
  6. In another terminal, run the demo (generates activity for SIM):

    python examples/demo_trace.py
    

    Or use the new CLI and one-click PowerShell demo:

    • CLI demo (Python):

      # Inspect live telemetry and advisor
      python examples/demo_cli.py telemetry
      python examples/demo_cli.py report
      
      # Reset and create a couple sequences
      python examples/demo_cli.py reset
      python examples/demo_cli.py submit --seq seq_1 --tokens 2000
      python examples/demo_cli.py submit --seq seq_2 --tokens 1200
      
      # Apply optimizations
      python examples/demo_cli.py quantize --seq seq_1 --start 0 --end 999 --factor 0.5
      python examples/demo_cli.py offload --seq seq_2 --start 0 --end 799
      
      # Verify effects
      python examples/demo_cli.py telemetry
      python examples/demo_cli.py report
      
    • One-click PowerShell demo (Windows):

      # From the repository root
      powershell -ExecutionPolicy Bypass -File .\examples\demo.ps1
      

Docker Compose

  1. Build and start the service:

    docker compose -f docker/compose.yml up --build -d
    
  2. Check the service status:

    docker compose -f docker/compose.yml ps
    
  3. Run the demo:

    python examples/demo_trace.py
    

API Documentation

Health Check

GET /healthz

Get Advisor Report

GET /advisor/report

SIM Adapter Endpoints (for testing)

The SIM adapter is used internally for local testing. Interact with the public endpoints below:

  • GET /telemetry — current adapter telemetry
  • GET /metrics — Prometheus metrics
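
The endpoints above can also be called from Python using only the standard library. This is a minimal sketch: the helper names `endpoint_url` and `fetch` are hypothetical, and the response bodies are not documented here, so the code makes no assumption beyond "JSON if it parses, otherwise plain text".

```python
import json
from urllib.request import urlopen


def endpoint_url(base: str, path: str) -> str:
    """Join a base URL and an endpoint path with exactly one slash."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def fetch(base: str, path: str, opener=urlopen, timeout: float = 5.0):
    """GET an endpoint; return parsed JSON when possible, else the raw text."""
    with opener(endpoint_url(base, path), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # /metrics returns Prometheus text rather than JSON.
        return body
```

With a server running on :9001, `fetch("http://localhost:9001", "/healthz")` and `fetch("http://localhost:9001", "/advisor/report")` return whatever those endpoints serve. The `opener` parameter exists only so the helper can be exercised without a live server.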

Autopilot Mode

The Autopilot feature automates KV cache optimization with safety guarantees:

Key Components

  1. Policy Engine: Generates optimization plans based on system state
  2. Guard System: Validates and monitors plan execution with rollback capability
  3. Action Executor: Safely applies optimization actions
  4. REST API: Control and monitor the optimization process

Using Autopilot

1. Create an Optimization Plan

curl -X POST "http://localhost:9001/autopilot/plan" \
  -H "Content-Type: application/json" \
  -d '{
    "target_hbm_util": 0.7,
    "max_actions": 5,
    "priority": "high"
  }'

2. Check Plan Status

curl "http://localhost:9001/autopilot/plan/{plan_id}"

3. Monitor Metrics

curl "http://localhost:9001/autopilot/metrics"

Python example (requests)

import requests

base = "http://localhost:9001"

# Create a plan
r = requests.post(
    f"{base}/autopilot/plan",
    json={"target_hbm_util": 0.7, "max_actions": 5, "priority": "high"},
)
r.raise_for_status()
plan_id = r.json()["plan_id"]

# Poll status
s = requests.get(f"{base}/autopilot/plan/{plan_id}").json()
print("status:", s["status"], "completed:", s["actions_completed"], "/", s["actions_total"]) 

# Metrics
m = requests.get(f"{base}/autopilot/metrics").json()
print("guard metrics:", m)
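
The example above reads the plan status once. To wait for a plan to finish, a polling loop with a timeout is one option. The sketch below is hedged: the terminal status names are assumptions (the API's actual values are not listed in this document), and `wait_for_plan` is a hypothetical helper, not part of KV-OptKit.

```python
import time

# Assumption: these terminal status names are hypothetical placeholders.
# Check the values the /autopilot/plan/{plan_id} endpoint actually returns.
TERMINAL_STATUSES = {"completed", "failed", "rolled_back"}


def wait_for_plan(get_status, plan_id, timeout_s=60.0, interval_s=2.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call get_status(plan_id) until the plan reaches a terminal status.

    get_status is any callable returning the status payload as a dict.
    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status(plan_id)
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"plan {plan_id} still {status.get('status')!r}")
        sleep(interval_s)
```

With `requests`, the callable could be `lambda pid: requests.get(f"{base}/autopilot/plan/{pid}").json()`. Injecting `sleep` and `clock` keeps the loop testable without real delays.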

Configuration

Edit config/sample_config.yaml to customize the behavior:

slo:
  latency_p95_ms: 2000.0
  max_accuracy_delta_pct: 0.5

budgets:
  hbm_util_target: 0.85
  offload_bw_gbps: 120.0

policy:
  keep_recent_tokens: 4096
  eviction:
    - age_decay
  tiers:
    - HBM
    - DDR
    - CXL
    - NVMe

plugins: {}

guardrails:
  ab_shadow_fraction: 0.05
  rollback_on_acc_delta: true

autopilot:
  enabled: true
  default_target_hbm_util: 0.8
  default_max_actions: 10
  default_priority: "medium"

guard:
  enabled: true
  shadow_fraction: 0.1  # Fraction of requests to execute in shadow mode
  max_accuracy_delta: 0.05  # Maximum allowed accuracy impact
  rollback_on_high_impact: true
  accuracy_weights:
    evict: 1.0
    offload: 0.8
    quantize: 0.5

policy_engine:
  min_sequence_length: 100
  min_sequence_utilization: 0.3
  quantization_scales: [0.5, 0.25]
  action_priority: ["EVICT", "OFFLOAD", "QUANTIZE"]
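
Before starting the server with an edited config, a quick sanity check can catch out-of-range values. The sketch below works on the config as a plain dict (for example, the result of `yaml.safe_load` from PyYAML). It assumes the utilization and shadow-fraction keys are fractions in [0, 1], which the sample values suggest but this document does not state; `validate_config` is a hypothetical helper, not the server's own validation.

```python
def validate_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []

    # Assumption: these keys hold fractions between 0 and 1.
    fraction_keys = [
        ("budgets", "hbm_util_target"),
        ("guardrails", "ab_shadow_fraction"),
        ("autopilot", "default_target_hbm_util"),
        ("guard", "shadow_fraction"),
    ]
    for section, key in fraction_keys:
        value = (cfg.get(section) or {}).get(key)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{section}.{key}={value} is outside [0, 1]")

    # A latency SLO only makes sense as a positive number of milliseconds.
    p95 = (cfg.get("slo") or {}).get("latency_p95_ms")
    if p95 is not None and p95 <= 0:
        problems.append(f"slo.latency_p95_ms={p95} must be positive")

    return problems
```

Missing sections and keys are skipped rather than reported, so the check stays usable with partial configs.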

Development

Running Tests

pytest tests/

Code Style

This project uses black for code formatting and flake8 for linting.

black .
flake8

Installation and Quickstart

  • pip (PyPI):

    pip install kv-optkit

    # Optional: point to a custom config (PowerShell)
    $env:KVOPT_CONFIG = "config/sample_config.yaml"

    # Start the server on :9000
    kvopt-server

  • pip with extras:

    # vLLM adapter support and NVML telemetry
    pip install "kv-optkit[vllm]"

    # TensorRT-LLM route (Linux-only for TensorRT)
    pip install "kv-optkit[trtllm]"

    # Text Generation Inference route
    pip install "kv-optkit[tgi]"

    # DeepSpeed-MII route
    pip install "kv-optkit[deepspeed]"

    # Dev tools for contributing (pytest, ruff, mypy, etc.)
    pip install -e ".[dev]"

Extras summary:

  • vllm: vLLM engine adapter and NVML GPU telemetry.

  • trtllm: TensorRT-LLM and Triton client tooling (Linux preferred).

  • tgi: Text Generation Inference client/adapter.

  • deepspeed: DeepSpeed-MII adapter route.

  • dev: local development and CI tooling.

  • Docker (GHCR):

    docker run -p 9001:9001 -e KVOPT_PORT=9001 ghcr.io/archokshi/kv-optkit:latest

  • Docker Compose profiles:

    # Agent only
    docker compose -f docker/compose.yml --profile sim up -d

    # Observability (Prometheus + Grafana)
    docker compose -f docker/compose.yml --profile obs up -d

  • Helm (local chart):

    helm upgrade --install kv-optkit ./deploy/helm/kv-optkit \
      --set image.repository=ghcr.io/archokshi/kv-optkit \
      --set image.tag=latest

Environment configuration (Sidecar/Auto-attach)

  • KVOPT_ENGINE_ENDPOINT: target engine endpoint (e.g., http://localhost:8000).
  • KVOPT_ENGINE_SETTINGS: optional JSON settings blob consumed at server startup.

These variables are read at server start; sidecar mode uses them to auto-attach.
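
As an illustration of how these two variables might be consumed, here is a minimal sketch that reads them from the environment and decodes the JSON blob. It is not the server's actual startup code; `read_engine_env` is a hypothetical name, and the settings keys are not documented here.

```python
import json
import os


def read_engine_env(env=os.environ):
    """Return (endpoint, settings) from the KVOPT_ENGINE_* variables.

    endpoint is None when KVOPT_ENGINE_ENDPOINT is unset; settings is an
    empty dict when KVOPT_ENGINE_SETTINGS is unset or empty.
    """
    endpoint = env.get("KVOPT_ENGINE_ENDPOINT")
    raw = env.get("KVOPT_ENGINE_SETTINGS")
    settings = json.loads(raw) if raw else {}
    if not isinstance(settings, dict):
        raise ValueError("KVOPT_ENGINE_SETTINGS must be a JSON object")
    return endpoint, settings
```

Passing a plain dict as `env` makes it easy to check a settings blob before exporting it, since malformed JSON raises `json.JSONDecodeError` immediately.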

Observability & Reporting

Provides Prometheus metrics, a Grafana dashboard, and a report generator for go/no-go decisions.

Metrics exposed at /metrics

  • Gauges
    • kvopt_hbm_utilization
    • kvopt_hbm_used_gb
    • kvopt_p95_latency_ms
    • kvopt_ttft_ms
    • kvopt_ddr_utilization
    • kvopt_ddr_used_gb
  • Counters
    • kvopt_tokens_evicted_total
    • kvopt_tokens_quantized_total
    • kvopt_reuse_hits_total, kvopt_reuse_misses_total
    • kvopt_autopilot_applies_total, kvopt_autopilot_rollbacks_total

Prometheus exposition is compliant (text/plain; version=0.0.4) with HELP/TYPE headers.
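
Because the endpoint serves the plain-text exposition format, the metrics can be read without a Prometheus server. The sketch below extracts the `kvopt_` samples from a scraped body; `parse_metrics` is a hypothetical helper, and the sample text and its values are made up for illustration.

```python
def parse_metrics(text: str, prefix: str = "kvopt_") -> dict[str, float]:
    """Map each series (metric name plus any labels) to its sample value."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled sample: keep the labels as part of the series key.
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()  # value, then an optional timestamp
        if fields and series.startswith(prefix):
            values[series] = float(fields[0])
    return values


# Made-up sample body, for illustration only.
sample = """\
# HELP kvopt_hbm_utilization HBM utilization
# TYPE kvopt_hbm_utilization gauge
kvopt_hbm_utilization 0.72
kvopt_tokens_evicted_total 1500
process_cpu_seconds_total 3.2
"""
print(parse_metrics(sample))
# {'kvopt_hbm_utilization': 0.72, 'kvopt_tokens_evicted_total': 1500.0}
```

For production scraping, a real Prometheus client or server remains the better choice; this parser is only meant for quick checks and scripts.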

Docker Compose stack

From the repository root:

docker compose -f docker/compose.yml up -d prometheus grafana

  • Prometheus UI: http://localhost:9090
    • Example queries: kvopt_hbm_utilization, kvopt_p95_latency_ms, kvopt_ttft_ms
  • Grafana UI: http://localhost:3001
    • Dashboard: "KV-OptKit" (provisioned via docker/grafana-dashboard.json)

Report generator

Generate a live report by sampling metrics for ~30s:

python tools/make_report.py --from live --base http://localhost:8000 --samples 6 --interval 5 --out outputs/run_report.md

Or from a CSV (CI-friendly):

python tools/make_report.py --from file --csv tests/fixtures/metrics_sample.csv --out outputs/run_report.md

Artifacts:

  • Markdown: outputs/run_report.md
  • Charts in outputs/charts/: hbm.png, latency.png, ttft.png, ddr.png

The report includes before/after summaries, action counter deltas, and a Go/No-Go decision vs the P95 SLO.
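
To make the Go/No-Go idea concrete, here is a minimal sketch of one possible decision rule: compare the worst sampled `kvopt_p95_latency_ms` value against the SLO, and compute counter deltas between the first and last sample. The rule and both function names are assumptions for illustration; the logic that tools/make_report.py actually uses is not described in this document.

```python
def go_no_go(p95_samples_ms, slo_p95_ms):
    """Return ("GO" or "NO-GO", worst sample) for a series of P95 readings.

    Assumption: the run passes only if every sampled P95 stays within the SLO.
    """
    if not p95_samples_ms:
        raise ValueError("need at least one sample")
    worst = max(p95_samples_ms)
    return ("GO" if worst <= slo_p95_ms else "NO-GO"), worst


def counter_deltas(first: dict, last: dict) -> dict:
    """Change in each counter between the first and last sample."""
    return {name: last[name] - first.get(name, 0.0) for name in last}
```

For example, six samples that peak at 1950 ms against the 2000 ms SLO from the sample config would yield `("GO", 1950.0)` under this rule.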

QuickView Screenshots

Below are example views from the built-in QuickView at /.

  • Overview dashboard with HBM utilization, adapter capabilities, and sequence counts

  • Sequences table with per-sequence utilization and qscale

  • Autopilot controls and current plan status

Place your screenshots in docs/ with the file names above or adjust links as needed.

Containerized LMCache demo

The demo runs entirely in Docker and writes results to a host-mounted folder.

Build the demo image

docker build -f Dockerfile.demo -t kvopt-demo .

Start with Docker Compose

docker compose -f docker/compose.yml up --force-recreate

This launches:

  • redis:7 as kvopt-redis
  • kvopt-demo running examples/demo_reuse.py against redis://redis:6379

Where results are saved

  • CSV output is persisted on the host at outputs/kv_reuse.csv.
  • On Windows, you can inspect it with:
    type outputs\kv_reuse.csv
    

Stopping containers

In the same directory:

docker compose -f docker/compose.yml down

Compatibility matrix

The following combinations have been smoke-tested with the SIM adapter (no GPU) and with vLLM where noted.

Component              Version(s)
Python                 3.10, 3.11
vLLM (optional)        0.5.x (basic adapter compatibility)
CUDA                   N/A for SIM; 12.x recommended for GPU deployments
GPU SKUs (indicative)  A10, A100 40/80GB, L4 (adapter-level tests)

Notes:

  • SIM adapter requires no GPU and is the default for quickstart.
  • For GPU deployments with vLLM, ensure CUDA drivers match container/base image.

Adapters & capability levels

Adapter        Levels  Notes
vLLM           L0, L2  L0 observe-only; L2 safe EVICT-only apply
SIM            L0–L3   Full feature surface for development/testing
TGI            L0      Early support; subject to change
DeepSpeed-MII  L0      Early support; subject to change
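
The table above can be encoded as data so that tooling can check an adapter's level before attempting an action. This is a minimal sketch: it reads "L0–L3" as levels 0 through 3 inclusive, and the helper names are hypothetical rather than part of the KV-OptKit API.

```python
# Capability levels per adapter, transcribed from the table above.
ADAPTER_LEVELS = {
    "vLLM": {0, 2},
    "SIM": {0, 1, 2, 3},  # assumption: "L0–L3" means every level from 0 to 3
    "TGI": {0},
    "DeepSpeed-MII": {0},
}


def supports(adapter: str, level: int) -> bool:
    """True if the adapter is listed with the given capability level."""
    return level in ADAPTER_LEVELS.get(adapter, set())


def is_observe_only(adapter: str) -> bool:
    """True if the adapter is listed with L0 (observe-only) and nothing else."""
    return ADAPTER_LEVELS.get(adapter) == {0}


print(supports("vLLM", 2))        # True
print(is_observe_only("TGI"))     # True
```

Unknown adapter names return False from both helpers, which keeps callers on the safe side by default.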

Demos

For all demo flows (Phase 1, 2, and Phase 5 quickstarts including sidecar), see the dedicated demo guide:

Releases

  • Semantic Versioning (vX.Y.Z).
  • Release notes and change history: see CHANGELOG.md.
  • Pin a specific version:
    • PyPI: pip install kv-optkit==X.Y.Z
    • GHCR: ghcr.io/archokshi/kv-optkit:X.Y.Z

License

Apache 2.0 - See LICENSE for more information.
