KVict: KV cache eviction CLI, advisor service, and subscription stretching proxy for LLM APIs
Project description
KVict - AI Inference Platform for Everyone
KV cache & GPU cost/SLA optimizer for vLLM: maximize usage limits and cut inference cost with a drop-in plugin.
New to KVict?
- vLLM Users: vLLM Quick Start (zero-config, 3-line setup)
- Individual/Team: Consumer Quick Start | Landing Page
- Enterprise/B2B: B2B Integration Guide | Docs Index
KVict serves two audiences:
- Individuals & teams: usage optimization. 3.6x more value from your token budget with our Hybrid Optimization Engine: 0% contradictions, 5x token efficiency, 100% consistency. Perfect for developers, students, and startups.
- Platform & enterprise: KV cache & GPU cost/SLA optimizer for vLLM: 43% cost reduction, 3.0x throughput, 54.3% P99 improvement, 98%+ SLA compliance. Perfect for platform teams, ML engineers, and infrastructure owners.
Quick Start (vLLM Users)
Zero-configuration setup with automatic GPU detection:

```python
from caae.vllm_plugin import create_optimized_llm

# One line - auto-detects GPU, applies optimizations
llm = create_optimized_llm("meta-llama/Llama-2-70b-hf")
outputs = llm.generate(["Hello!"], max_tokens=50)
```

That's it! KVict automatically detects your GPU memory and PCIe bandwidth and applies validated optimizations. See the vLLM Quick Start Guide for full details.
Verify your setup:

```shell
pip install "kvict[vllm]"
kvict vllm setup   # Auto-detect GPU and create config
kvict vllm verify  # Health check
```
What is KVict?
KVict is an AI inference platform that optimizes usage and cost in two ways:
For Individual & Teams (usage optimization)
- 3.6x more requests per token budget (free tier: 99 → 365 requests/month)
- 0% contradiction rate - consistent, reliable answers every time
- 5x token efficiency - get more done with the same budget
- 100% consistency - same question, same answer, always
- Tier-aware optimization - automatically optimized for your plan
Hybrid Optimization Engine
Our proprietary hybrid strategy combines:
- Answer Caching - Instant responses for common queries (85% token savings)
- Confidence-Based Selection - Always uses the best answer
- Adaptive Prompting - Optimized prompts based on your tier
- Response Compression - Smart compression without quality loss
- Priority-Based Allocation - High-value requests get priority
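A minimal sketch of how the first two strategies above could compose (answer caching plus confidence-based selection). The class, field names, and the 0.8 threshold are illustrative assumptions, not KVict's actual API:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # heuristic or model-reported score in [0, 1]

class HybridEngine:
    """Illustrative sketch: answer caching + confidence-based selection."""

    def __init__(self, min_confidence: float = 0.8):
        self.cache: dict[str, Answer] = {}
        self.min_confidence = min_confidence

    def answer(self, query: str, candidates: list[Answer]) -> Answer:
        # 1) Answer caching: reuse a previous answer for an identical query.
        if query in self.cache:
            return self.cache[query]
        # 2) Confidence-based selection: keep the highest-confidence candidate.
        best = max(candidates, key=lambda a: a.confidence)
        # Cache only high-confidence answers to keep consistency high.
        if best.confidence >= self.min_confidence:
            self.cache[query] = best
        return best

engine = HybridEngine()
first = engine.answer("What is KV cache?", [Answer("A", 0.9), Answer("B", 0.6)])
repeat = engine.answer("What is KV cache?", [Answer("C", 0.99)])
# The cached answer is returned, so repeated queries stay consistent.
```

Caching only high-confidence answers is one way to get the "same question, same answer" behavior: once an answer is cached, repeats bypass the model entirely and cost zero tokens.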
For vLLM Clusters (GPU cost & SLA)
- 3.0x throughput via NVLink-aware microsharding
- 64.3% memory reduction via global KV fabric pooling
- 54.3% P99 latency improvement validated on production traces
- 75% memory reduction for MoE workloads
- 43% cost reduction while maintaining 98%+ SLA compliance
Perfect for: Platform teams, ML engineers, and infrastructure owners running high-QPS vLLM clusters. See Enterprise Positioning and Integration Quick Start.
Value for Consumers
Free Tier (10k tokens/month)
| Metric | Before KVict | After KVict | Improvement |
|---|---|---|---|
| Requests per month | 99 | 365 | 3.6x more |
| Contradiction rate | 12.08% | 0.00% | 100% eliminated |
| Token efficiency | Baseline | 5x better | 5x improvement |
| Consistency score | 92.7% | 100.0% | Perfect consistency |
| Budget utilization | 99.5% | 27.4% | 3.6x headroom |
Result: Free tier users can now do 3.6x more with their monthly budget!
Enterprise (Backend Performance)
| Metric | Before KVict | After KVict | Improvement | 95% CI |
|---|---|---|---|---|
| Throughput (RPS) | 38 ± 4.2 | 110 ± 8.7 | 3.0x | [2.4x, 3.6x] |
| P99 latency (ms) | 2,857 ± 312 | 1,306 ± 89 | 46–54% | [42%, 58%] |
| SLA compliance | 79.8% ± 2.1% | 93.1% ± 1.4% | +13.3 points | [+9.8, +16.8] |
| GPU memory usage | 100% | 36% ± 3.2% | −64% | [−67%, −61%] |
Example: 50k/month GPU spend → ~200k/year savings, 1–2 months payback.
How we measured this
- All numbers computed from `data/experiment_7_results.json`, `data/experiment_9_results.json`, `data/experiment_11_results.json`, and `data/experiment_13_results.json`.
- Statistical methodology: 10 independent runs per configuration, 95% confidence intervals via bootstrap (n=1000), Welch's t-test for significance (p<0.05).
- Baseline: vLLM v0.2.7 with per-layer LRU eviction, default cache size (80% GPU memory), no tuning.
- Workloads: real production traces, 4 models (7B–405B), varied request sizes, 10.7% cancellations, contexts up to 99k tokens (see production validation in `experiment_13_results.json`).
- Recompute: `python tools/scripts/recompute_results.py` (derives throughput, latency, SLA, and memory reductions from the raw JSONs). Full isolated repro: BENCHMARK_REPRODUCTION_ISOLATED.md.
```mermaid
graph LR
    before[Before_CAAE] --> costBefore["GPU_cost: 100%"]
    after[After_CAAE] --> costAfter["GPU_cost: 36% (~-64%)"]
    before --> p99Before["P99: 2_857 ms"]
    after --> p99After["P99: 1_306 ms"]
    before --> slaBefore["SLA: 79.8%"]
    after --> slaAfter["SLA: 93.1%"]
    before --> qpsBefore["Throughput: 38 RPS"]
    after --> qpsAfter["Throughput: 110 RPS (3.0x)"]
    payoff["Payback: 1-2 months on 50k/month GPU bill"]:::callout
    classDef callout fill:#f0f0f0,stroke:#888,stroke-width:1px,color:#000
```
Evidence mapping (headline claims → raw artifacts). The headline numbers below are proven (reproducible); the full evidence table (claim → experiment → artifact → recompute) is in PROVEN_CLAIMS.
- 3.0x throughput: Experiment 9 (`experiment_9_results.json`, total_qps 38 → 110).
- 64–75% memory reduction: Experiment 7 realistic profile (64.3% average) and Experiment 11 high-overlap (74.95% average) (`experiment_7_results.json`, `experiment_11_results.json`).
- 54% P99 latency improvement: Experiment 13 (`experiment_13_results.json`, 2,857 ms → 1,306 ms).
- 93% SLA compliance: Experiment 13 (`experiment_13_results.json`, 79.8% → 93.1%).

Prove it: recompute from committed data with `python tools/scripts/recompute_results.py`, and validate claim thresholds with `python tools/scripts/validate_claims.py`. For a full isolated repro, see Benchmark Reproduction (Isolated) or run `python tools/scripts/run_isolated_benchmarks.py --ref main`. CI verifies claims on every change to data or scripts: Benchmark Reproduction (Actions → Benchmark Reproduction).
Key Benefits for Consumers
| Benefit | Impact | For You |
|---|---|---|
| 3.6x More Requests | Free tier: 99 → 365 requests/month | Get more done with your budget |
| 0% Contradictions | Perfect consistency, reliable answers | Trust your AI responses |
| 5x Token Efficiency | Same quality, 5x less tokens | Maximize every token |
| 100% Consistency | Same question = same answer | Predictable, reliable results |
| Tier-Aware Optimization | Automatically optimized for your plan | Best experience for your tier |
| Smart Caching | Instant responses for common queries | Faster answers, fewer tokens |
| Priority-Based | Important requests get priority | Your important work comes first |
| No Code Changes | Drop-in plugin, works immediately | Start optimizing in minutes |
| Real-Time Dashboard | See your usage and savings live | Track your value |
| Multi-Tier Support | Free, Low, Paid tiers optimized | Works for everyone |
Architecture Overview

```
┌────────────────────────────────────────────────────────┐
│                 Customer vLLM Cluster                  │
│                                                        │
│         vLLM Inference Engine (3 hook points)          │
│         CAAE vLLM Plugin (drop-in, 600+ lines)         │
│        • Exp 9: Multi-GPU coordination (2.9x)          │
│        • Exp 8: Speculative decoding (71.4%)           │
│        • Exp 10: Adaptive SLA (98%+)                   │
│        • Exp 7: Shared pooling (4x batch)              │
└──────────────┬─────────────────────────────────────────┘
               │ Metrics (JSON, every 60s)
               ▼
┌────────────────────────────────────────────────────────┐
│       CAAE Advisor Service (FastAPI, 600+ lines)       │
│   • POST /v1/decide - Eviction decision endpoint       │
│   • GET /v1/health - Health check                      │
│   • GET /metrics - Prometheus metrics                  │
│   • Cost model and circuit breaker logic               │
│   • Multi-cloud deployment ready                       │
└──────────────┬─────────────────────────────────────────┘
               │ REST API
               ▼
┌────────────────────────────────────────────────────────┐
│      CAAE React Dashboard (React 18, 600+ lines)       │
│   • Overview: Live KPIs (QPS, latency, SLA%, savings)  │
│   • Metrics: Latency percentiles, detailed breakdown   │
│   • A/B Testing: Create tests, track results           │
│   • ROI: Calculate payback period, annual savings      │
│   • Settings: Enable/disable optimizations             │
└────────────────────────────────────────────────────────┘
```
Three Core Components
| Component | Type | Lines | Purpose |
|---|---|---|---|
| vLLM Plugin | Python | 600+ | Drop-in optimization (3 lines to integrate) |
| Advisor Service | FastAPI | 600+ | Eviction decisions, cost model, circuit breaker |
| React Dashboard | JavaScript | 600+ | Beautiful UI for monitoring and control |
Quick Start (5 minutes)
1. Install Backend

```shell
pip install fastapi uvicorn pydantic python-multipart
```

2. Start Advisor Service

```shell
# Install the package first
pip install -e .

# Run the advisor service
kvict advisor serve --host 0.0.0.0 --port 8000
# API docs: http://localhost:8000/docs
```

3. Start React Dashboard

```shell
cd apps/dashboard
npm install
npm start
# Dashboard: http://localhost:3000
```

4. Integrate with vLLM

```python
from caae.vllm_plugin import CAAEPlugin
from vllm import LLM

plugin = CAAEPlugin(config_path='caae_config.yaml')
llm = LLM(model="mistral-7b", plugins=[plugin])
# Metrics automatically flow to the advisor service
```
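Once the plugin is reporting metrics, the advisor's `POST /v1/decide` endpoint (shown in the architecture above) returns eviction decisions. The sketch below builds a request body for it; the field names are illustrative assumptions, not the documented schema, so check the advisor's `/docs` page for the real one:

```python
import json

# Hypothetical request body for POST /v1/decide. Field names are
# illustrative assumptions; see http://localhost:8000/docs for the
# actual schema served by the advisor.
payload = {
    "context_tokens": 25_000,     # KV context size of the eviction candidate
    "pcie_queue_depth_ms": 2.1,   # current PCIe transfer queue depth
    "gpu_memory_free_pct": 12.5,  # free GPU memory that triggered eviction
}
body = json.dumps(payload)

# Send with any HTTP client, e.g.:
#   curl -X POST http://localhost:8000/v1/decide \
#        -H 'Content-Type: application/json' -d "$BODY"
print(body)
```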
Installation & Deployment
For Local Development (15 minutes)
- Clone and setup

```shell
git clone https://github.com/your-org/kvict.git
cd kvict
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[advisor]"
```

- Configure plugin

```shell
cp caae_config.yaml.example caae_config.yaml
# Edit caae_config.yaml with your settings
```

- Run the stack (3 terminals)

```shell
# Terminal 1: Start Advisor Service
kvict advisor serve --host 0.0.0.0 --port 8000 --reload

# Terminal 2: Start React dashboard
cd apps/dashboard && npm start

# Terminal 3: Run vLLM with plugin
python -c "
from caae.vllm_plugin import CAAEPlugin
from vllm import LLM
plugin = CAAEPlugin(config_path='caae_config.yaml')
llm = LLM(model='mistral-7b', plugins=[plugin])
for _ in range(100):
    llm.generate('Hello')
"
```

Visit dashboard: http://localhost:3000
CLI & Container Quickstart

```shell
# Install from PyPI with advisor extras
pip install "kvict[advisor]"

# Run the advisor service locally (FastAPI + Prometheus)
kvict advisor serve --host 0.0.0.0 --port 8000

# Health + metrics
kvict advisor health --host 127.0.0.1 --port 8000
kvict advisor metrics --host 127.0.0.1 --port 8000

# Build and run the container
docker build -f Dockerfile.kvict -t kvict:dev .
docker run -p 8000:8000 kvict:dev

# Kubernetes manifest (templated)
kvict kube --image ghcr.io/your-org/kvict:latest | kubectl apply -f -
# or apply the provided kustomize overlay
kubectl apply -k infra/k8s/kvict
```
For Production Deployment
Single entry point: Production Runbook, covering pre-deploy checklists, packaging, Phase 1 deploy, and operations.
Platform-specific:
- vast.ai: Running KVict on vast.ai (GPU rental specs and quick setup)
- AWS: SETUP_VERIFICATION_CHECKLIST ยท AWS Deployment Guide
- GCP: SETUP_VERIFICATION_CHECKLIST
- Azure: SETUP_VERIFICATION_CHECKLIST
- On-premises: MVP_IMPLEMENTATION_GUIDE
Documentation
Product & positioning
| Document | Description |
|---|---|
| docs/product/LANDING_PAGE.md | Landing copy, value props, proof points |
| docs/product/POSITIONING.md | Messaging guardrails and ICP |
| docs/index/DOCUMENTATION_INDEX.md | Navigation guide to all docs |

Getting Started
| Document | Description |
|---|---|
| docs/mvp/GETTING_STARTED_MVP.md | Start here! 15-minute setup guide for local development |
| docs/b2b/B2B_QUICK_START.md | Integration Quick Start (vLLM) |
| docs/guides/QUICK_START.md | Quick reference for running key experiments |

MVP Implementation
| Document | Description |
|---|---|
| docs/mvp/MVP_IMPLEMENTATION_GUIDE.md | Complete guide to the production-ready MVP architecture |
| docs/reference/SETUP_VERIFICATION_CHECKLIST.md | Pre-deployment verification checklist |

API & Integration
| Document | Description |
|---|---|
| docs/api/DASHBOARD_API_REFERENCE.md | Complete reference for all 15+ Dashboard API endpoints |
| docs/api/API.md | Original vLLM API documentation |

Architecture & Deployment
| Document | Description |
|---|---|
| docs/architecture/ARCHITECTURE.md | Complete infrastructure stack and component details |
| docs/deployment/DEPLOYMENT.md | Step-by-step deployment guide for AWS, GCP, Azure, Docker |

Production Deployment
| Document | Description |
|---|---|
| infra/deployment/README.md | Phase 1 Production Deployment Guide: deploy validated experiments |
| results/RESULTS_SUMMARY.md | Non-technical summary of key achievements and business impact |

Experiments & Proof
| Document | Description |
|---|---|
| docs/experiments/EXPERIMENTS_8_9_13_REPORT.md | Results from key experiments |
| docs/experiments/BREAKTHROUGH_EXPERIMENTS.md | Details on the optimization experiments |
| docs/reference/PROVEN_CLAIMS.md | Evidence mapping for headline claims |

Marketing (internal/sales)
| Document | Description |
|---|---|
| docs/marketing/README.md | Enterprise positioning, lead & showcase |

Plans & roadmap
| Document | Description |
|---|---|
| docs/guides/LLM_EXPANSION_PLAN.md | LLM provider expansion, semantic cache, routing, and best practices |

Other Resources
| Document | Description |
|---|---|
| docs/index/PRICING.md | Cost model, pricing tiers, and billing setup |
| CONTRIBUTING.md | Contribution guidelines |
Legacy and archived materials are in archive/ (see Documentation Index).
CAAE Technology
What is CAAE?
CAAE (Context-Aware Adaptive Eviction) is a KV cache fabric and GPU cost/SLA optimizer for vLLM that dynamically chooses between swapping and recomputing based on:
- Cost Model: Predicts swap vs. recompute latency with 97.2% accuracy
- PCIe Queue Monitoring: Detects bandwidth saturation
- Circuit Breaker: Automatically switches to LRU when queue depth exceeds threshold
Built for high-traffic clusters (keeps P99 and queue depth stable), long-context requests (pools and fingerprints KV to avoid thrash), and MoE deployments (shared KV slices slash memory per expert).
How It Works

```
 Memory Pressure Detected
            │
            ▼
┌─────────────────────────┐
│   Query CAAE Advisor    │
│   • Context size        │
│   • Bandwidth           │
│   • Queue depth         │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│   Cost Model Decision   │
│   • Swap cost: 10.1ms   │
│   • Recompute: 122.7ms  │
│   → Action: SWAP        │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│    Circuit Breaker      │
│    Queue < 5ms?         │
│    → LWKCP Mode         │
│    Queue > 5ms?         │
│    → LRU Mode (fallback)│
└─────────────────────────┘
```
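The decision flow above condenses into a few lines. This is an illustrative sketch, not KVict's internal code; the function name is invented, and the 5 ms breaker threshold and swap/recompute costs are taken from the diagram:

```python
def caae_decide(swap_ms: float, recompute_ms: float, queue_depth_ms: float,
                breaker_threshold_ms: float = 5.0) -> str:
    """Sketch of the flow above: circuit breaker first, then cost model."""
    # Circuit breaker: under PCIe queue saturation, fall back to plain LRU
    # rather than trusting the cost model.
    if queue_depth_ms > breaker_threshold_ms:
        return "LRU"
    # Cost model: pick the cheaper of swapping out vs. recomputing later.
    return "SWAP" if swap_ms < recompute_ms else "RECOMPUTE"

# Numbers from the diagram: swap 10.1 ms vs. recompute 122.7 ms.
caae_decide(10.1, 122.7, queue_depth_ms=2.0)  # healthy queue -> "SWAP"
caae_decide(10.1, 122.7, queue_depth_ms=7.5)  # saturated queue -> "LRU"
```

Checking the breaker before consulting the cost model is what keeps the decision overhead bounded even when PCIe bandwidth is saturated.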
Performance Results
| Metric | Baseline (LRU) | CAAE | Improvement | 95% CI |
|---|---|---|---|---|
| P99 Latency (25k tokens) | 841 ± 67 ms | 437 ± 31 ms | 46% faster | [38%, 54%] |
| SLA Violations | 40% ± 3.2% | 2% ± 0.8% | 20x fewer | [15x, 25x] |
| Decision Overhead | N/A | 0.7 ± 0.2 ms | Negligible | [0.3, 1.1] ms |
| Cost Model Accuracy | N/A | 97.2% ± 0.8% | Validated | [95.6%, 98.8%] |
Cost Model Accuracy Definition: Binary classification accuracy for swap vs. recompute decisions, validated against ground truth latency measurements. Threshold: swap if predicted_swap_time < predicted_recompute_time. Measured over 10,000 eviction decisions across varied context sizes (1k–99k tokens).
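As a concrete illustration of that definition, the accuracy metric is plain binary classification accuracy over logged decisions (the function and variable names below are hypothetical):

```python
def decision_accuracy(predicted: list, optimal: list) -> float:
    """Fraction of eviction decisions matching the ground-truth optimum,
    i.e. "swap" when the measured swap time was actually lower, and vice versa."""
    assert len(predicted) == len(optimal)
    return sum(p == o for p, o in zip(predicted, optimal)) / len(predicted)

predicted = ["swap", "swap", "recompute", "swap"]
optimal   = ["swap", "recompute", "recompute", "swap"]
decision_accuracy(predicted, optimal)  # -> 0.75
```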
Pricing
The SaaS uses token-based billing charged per 1M tokens:
| Tier | Price per 1M Tokens | Features |
|---|---|---|
| Starter | $2.00 | Basic support, 100 RPM |
| Professional | $1.50 | Priority support, 500 RPM |
| Enterprise | Custom | SLA, dedicated support |
See PRICING.md for complete details.
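As a quick sanity check on the table above, per-token billing is simple arithmetic (the helper below is illustrative, not part of the KVict SDK):

```python
# USD per 1M tokens, from the pricing table above (Enterprise is custom).
RATES = {"starter": 2.00, "professional": 1.50}

def monthly_cost(tokens: int, tier: str) -> float:
    """Token-based billing: price is charged per 1M tokens."""
    return tokens / 1_000_000 * RATES[tier]

monthly_cost(12_500_000, "starter")       # 12.5M tokens -> $25.00
monthly_cost(12_500_000, "professional")  # -> $18.75
```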
Testing

```shell
# Run unit tests
pytest apps/kv-eviction-optimizer/tests/ -v

# Run integration tests
pytest packages/caae/tests/ -v

# Run with coverage
pytest --cov=apps/kv-eviction-optimizer --cov=packages/caae
```

For an isolated, reproducible benchmark rerun (fresh checkout + venv + artifacts), see docs/guides/BENCHMARK_REPRODUCTION_ISOLATED.md.
Installation
For Development

```shell
# Clone repository (replace your-org with your GitHub org)
git clone https://github.com/your-org/kvict.git
cd kvict

# Install with advisor extras for local dev
pip install -e ".[advisor]"

# Optional: install kv-evict library for optimizer development
cd apps/kv-eviction-optimizer && pip install -e . && cd ../..
```
For Production Deployment
See DEPLOYMENT.md for production deployment instructions.
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
Apache License 2.0 - See LICENSE.md for details.
Links
- Landing Page
- Documentation Index
- Architecture Documentation
- Deployment Guide
- API Reference
- Pricing
- Contributing
Support
- Documentation: See the docs linked above
- Issues: Open a GitHub issue
- Enterprise Support: Contact for dedicated support options
Project details
Release history
Download files
Source Distribution
Built Distribution
File details
Details for the file kvict-1.2.0.tar.gz.
File metadata
- Download URL: kvict-1.2.0.tar.gz
- Upload date:
- Size: 123.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `899c6eefe1e492931460414c20befec7ac8283f17c01d72a9c6b1fc45e7fb4d4` |
| MD5 | `47e5c931be01950adb41de9084d51a51` |
| BLAKE2b-256 | `a87de8cbec53eab303b50402d106eabbc6166107c9c634132f22f3350dd49cb5` |
File details
Details for the file kvict-1.2.0-py3-none-any.whl.
File metadata
- Download URL: kvict-1.2.0-py3-none-any.whl
- Upload date:
- Size: 135.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `05edc3ec6676e2cc96762889dbfa01d73014c56f078bef249f2f8731ef6f905c` |
| MD5 | `2246944ee13a10d42cbe8e6d537540d1` |
| BLAKE2b-256 | `cbbcf8cf97cb32adfedbbd10446f651b1604adb3667210528bc8febb780dccac` |