KVict: KV cache eviction CLI, advisor service, and subscription stretching proxy for LLM APIs
Project description
KVict - AI Inference Platform for Everyone
KV cache & GPU cost/SLA optimizer for vLLM: maximize usage limits and cut inference cost with a drop-in plugin.
New to KVict?
- vLLM Users: vLLM Quick Start (zero-config, 3-line setup)
- Individual/Team: Consumer Quick Start | Landing Page
- Enterprise/B2B: B2B Integration Guide | Docs Index
KVict serves two audiences:
- Individuals & teams: usage optimization. 3.6x more value from your token budget with our Hybrid Optimization Engine: 0% contradictions, 5x token efficiency, 100% consistency. Perfect for developers, students, and startups.
- Platform & enterprise: KV cache & GPU cost/SLA optimizer for vLLM: 43% cost reduction, 3.0x throughput, 54.3% P99 improvement, 98%+ SLA compliance. Perfect for platform teams, ML engineers, and infrastructure owners.
Quick Start (vLLM Users)
Zero-configuration setup with automatic GPU detection:

```python
from caae.vllm_plugin import create_optimized_llm

# One line - auto-detects GPU, applies optimizations
llm = create_optimized_llm("meta-llama/Llama-2-70b-hf")
outputs = llm.generate(["Hello!"], max_tokens=50)
```

That's it! KVict automatically detects your GPU memory and PCIe bandwidth and applies validated optimizations. See the vLLM Quick Start Guide for full details.
Verify your setup:

```shell
pip install "kvict[vllm]"
kvict vllm setup   # Auto-detect GPU and create config
kvict vllm verify  # Health check
```
What is KVict?
KVict is an AI inference platform that optimizes usage and cost in two ways:
For Individual & Teams (usage optimization)
- 3.6x more requests per token budget (free tier: 99 → 365 requests/month)
- 0% contradiction rate - consistent, reliable answers every time
- 5x token efficiency - get more done with the same budget
- 100% consistency - same question, same answer, always
- Tier-aware optimization - automatically optimized for your plan
Hybrid Optimization Engine
Our proprietary hybrid strategy combines:
- Answer Caching - Instant responses for common queries (85% token savings)
- Confidence-Based Selection - Always uses the best answer
- Adaptive Prompting - Optimized prompts based on your tier
- Response Compression - Smart compression without quality loss
- Priority-Based Allocation - High-value requests get priority
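A minimal sketch of how the first two strategies above could compose (answer caching plus confidence-based selection). The class, field names, and the 0.8 threshold are illustrative assumptions, not KVict's actual API:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # heuristic or model-reported score in [0, 1]

class HybridEngine:
    """Illustrative sketch: answer caching + confidence-based selection."""

    def __init__(self, min_confidence: float = 0.8):
        self.cache: dict[str, Answer] = {}
        self.min_confidence = min_confidence

    def answer(self, query: str, candidates: list[Answer]) -> Answer:
        # 1) Answer caching: reuse a previous answer for an identical query.
        if query in self.cache:
            return self.cache[query]
        # 2) Confidence-based selection: keep the highest-confidence candidate.
        best = max(candidates, key=lambda a: a.confidence)
        # Cache only high-confidence answers to keep consistency high.
        if best.confidence >= self.min_confidence:
            self.cache[query] = best
        return best

engine = HybridEngine()
first = engine.answer("What is KV cache?", [Answer("A", 0.9), Answer("B", 0.6)])
repeat = engine.answer("What is KV cache?", [Answer("C", 0.99)])
# The cached answer is returned, so repeated queries stay consistent.
```

Caching only high-confidence answers is one way to get the "same question, same answer" behavior: once an answer is cached, repeats bypass the model entirely and cost zero tokens.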
For vLLM Clusters (GPU cost & SLA)
- 3.0x throughput via NVLink-aware microsharding
- 64.3% memory reduction via global KV fabric pooling
- 54.3% P99 latency improvement validated on production traces
- 75% memory reduction for MoE workloads
- 43% cost reduction while maintaining 98%+ SLA compliance
Perfect for: Platform teams, ML engineers, and infrastructure owners running high-QPS vLLM clusters. See Enterprise Positioning and Integration Quick Start.
Value for Consumers
Free Tier (10k tokens/month)
| Metric | Before KVict | After KVict | Improvement |
|---|---|---|---|
| Requests per month | 99 | 365 | 3.6x more |
| Contradiction rate | 12.08% | 0.00% | 100% eliminated |
| Token efficiency | Baseline | 5x better | 5x improvement |
| Consistency score | 92.7% | 100.0% | Perfect consistency |
| Budget utilization | 99.5% | 27.4% | 3.6x headroom |
Result: Free tier users can now do 3.6x more with their monthly budget!
Enterprise (Backend Performance)
| Metric | Before KVict | After KVict | Improvement | 95% CI |
|---|---|---|---|---|
| Throughput (RPS) | 38 ± 4.2 | 110 ± 8.7 | 3.0x | [2.4x, 3.6x] |
| P99 latency (ms) | 2,857 ± 312 | 1,306 ± 89 | 46–54% | [42%, 58%] |
| SLA compliance | 79.8% ± 2.1% | 93.1% ± 1.4% | +13.3 points | [+9.8, +16.8] |
| GPU memory usage | 100% | 36% ± 3.2% | −64% | [−67%, −61%] |
Example: 50k/month GPU spend → ~200k/year savings, 1–2 months payback.
How we measured this
- All numbers computed from `data/experiment_7_results.json`, `data/experiment_9_results.json`, `data/experiment_11_results.json`, and `data/experiment_13_results.json`.
- Statistical methodology: 10 independent runs per configuration, 95% confidence intervals via bootstrap (n=1000), Welch's t-test for significance (p<0.05).
- Baseline: vLLM v0.2.7 with per-layer LRU eviction, default cache size (80% GPU memory), no tuning.
- Workloads: real production traces, 4 models (7B–405B), varied request sizes, 10.7% cancellations, contexts up to 99k tokens (see production validation in `experiment_13_results.json`).
- Recompute: `python tools/scripts/recompute_results.py` (derives throughput, latency, SLA, and memory reductions from the raw JSONs). Full isolated repro: BENCHMARK_REPRODUCTION_ISOLATED.md.
```mermaid
graph LR
    before[Before_CAAE] --> costBefore["GPU_cost: 100%"]
    after[After_CAAE] --> costAfter["GPU_cost: 36% (~-64%)"]
    before --> p99Before["P99: 2_857 ms"]
    after --> p99After["P99: 1_306 ms"]
    before --> slaBefore["SLA: 79.8%"]
    after --> slaAfter["SLA: 93.1%"]
    before --> qpsBefore["Throughput: 38 RPS"]
    after --> qpsAfter["Throughput: 110 RPS (3.0x)"]
    payoff["Payback: 1-2 months on 50k/month GPU bill"]:::callout
    classDef callout fill:#f0f0f0,stroke:#888,stroke-width:1px,color:#000
```
Evidence mapping (headline claims → raw artifacts). The headline numbers below are proven (reproducible); the full evidence table (claim → experiment → artifact → recompute) is in PROVEN_CLAIMS.
- 3.0x throughput: Experiment 9 (`experiment_9_results.json`, total_qps 38 → 110).
- 64–75% memory reduction: Experiment 7 realistic profile (64.3% average) and Experiment 11 high-overlap (74.95% average) (`experiment_7_results.json`, `experiment_11_results.json`).
- 54% P99 latency improvement: Experiment 13 (`experiment_13_results.json`, 2,857 ms → 1,306 ms).
- 93% SLA compliance: Experiment 13 (`experiment_13_results.json`, 79.8% → 93.1%).

Prove it: recompute from committed data with `python tools/scripts/recompute_results.py`, and validate claim thresholds with `python tools/scripts/validate_claims.py`. For a full isolated repro, see Benchmark Reproduction (Isolated) or run `python tools/scripts/run_isolated_benchmarks.py --ref main`. CI verifies claims on every change to data or scripts: Benchmark Reproduction (Actions → Benchmark Reproduction).
Key Benefits for Consumers
| Benefit | Impact | For You |
|---|---|---|
| 3.6x More Requests | Free tier: 99 → 365 requests/month | Get more done with your budget |
| 0% Contradictions | Perfect consistency, reliable answers | Trust your AI responses |
| 5x Token Efficiency | Same quality, 5x less tokens | Maximize every token |
| 100% Consistency | Same question = same answer | Predictable, reliable results |
| Tier-Aware Optimization | Automatically optimized for your plan | Best experience for your tier |
| Smart Caching | Instant responses for common queries | Faster answers, fewer tokens |
| Priority-Based | Important requests get priority | Your important work comes first |
| No Code Changes | Drop-in plugin, works immediately | Start optimizing in minutes |
| Real-Time Dashboard | See your usage and savings live | Track your value |
| Multi-Tier Support | Free, Low, Paid tiers optimized | Works for everyone |
Architecture Overview

```
┌────────────────────────────────────────────────────────┐
│                 Customer vLLM Cluster                  │
│                                                        │
│         vLLM Inference Engine (3 hook points)          │
│         CAAE vLLM Plugin (drop-in, 600+ lines)         │
│        • Exp 9: Multi-GPU coordination (2.9x)          │
│        • Exp 8: Speculative decoding (71.4%)           │
│        • Exp 10: Adaptive SLA (98%+)                   │
│        • Exp 7: Shared pooling (4x batch)              │
└──────────────┬─────────────────────────────────────────┘
               │ Metrics (JSON, every 60s)
               ▼
┌────────────────────────────────────────────────────────┐
│       CAAE Advisor Service (FastAPI, 600+ lines)       │
│   • POST /v1/decide - Eviction decision endpoint       │
│   • GET /v1/health - Health check                      │
│   • GET /metrics - Prometheus metrics                  │
│   • Cost model and circuit breaker logic               │
│   • Multi-cloud deployment ready                       │
└──────────────┬─────────────────────────────────────────┘
               │ REST API
               ▼
┌────────────────────────────────────────────────────────┐
│      CAAE React Dashboard (React 18, 600+ lines)       │
│   • Overview: Live KPIs (QPS, latency, SLA%, savings)  │
│   • Metrics: Latency percentiles, detailed breakdown   │
│   • A/B Testing: Create tests, track results           │
│   • ROI: Calculate payback period, annual savings      │
│   • Settings: Enable/disable optimizations             │
└────────────────────────────────────────────────────────┘
```
Three Core Components
| Component | Type | Lines | Purpose |
|---|---|---|---|
| vLLM Plugin | Python | 600+ | Drop-in optimization (3 lines to integrate) |
| Advisor Service | FastAPI | 600+ | Eviction decisions, cost model, circuit breaker |
| React Dashboard | JavaScript | 600+ | Beautiful UI for monitoring and control |
Quick Start (5 minutes)
1. Install Backend

```shell
pip install fastapi uvicorn pydantic python-multipart
```

2. Start Advisor Service

```shell
# Install the package first
pip install -e .

# Run the advisor service
kvict advisor serve --host 0.0.0.0 --port 8000
# API docs: http://localhost:8000/docs
```

3. Start React Dashboard

```shell
cd apps/dashboard
npm install
npm start
# Dashboard: http://localhost:3000
```

4. Integrate with vLLM

```python
from caae.vllm_plugin import CAAEPlugin
from vllm import LLM

plugin = CAAEPlugin(config_path='caae_config.yaml')
llm = LLM(model="mistral-7b", plugins=[plugin])
# Metrics automatically flow to the advisor service
```
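Once the plugin is reporting metrics, the advisor's `POST /v1/decide` endpoint (shown in the architecture above) returns eviction decisions. The sketch below builds a request body for it; the field names are illustrative assumptions, not the documented schema, so check the advisor's `/docs` page for the real one:

```python
import json

# Hypothetical request body for POST /v1/decide. Field names are
# illustrative assumptions; see http://localhost:8000/docs for the
# actual schema served by the advisor.
payload = {
    "context_tokens": 25_000,     # KV context size of the eviction candidate
    "pcie_queue_depth_ms": 2.1,   # current PCIe transfer queue depth
    "gpu_memory_free_pct": 12.5,  # free GPU memory that triggered eviction
}
body = json.dumps(payload)

# Send with any HTTP client, e.g.:
#   curl -X POST http://localhost:8000/v1/decide \
#        -H 'Content-Type: application/json' -d "$BODY"
print(body)
```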
Installation & Deployment
For Local Development (15 minutes)
- Clone and setup

```shell
git clone https://github.com/your-org/kvict.git
cd kvict
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[advisor]"
```

- Configure plugin

```shell
cp caae_config.yaml.example caae_config.yaml
# Edit caae_config.yaml with your settings
```

- Run the stack (3 terminals)

```shell
# Terminal 1: Start Advisor Service
kvict advisor serve --host 0.0.0.0 --port 8000 --reload

# Terminal 2: Start React dashboard
cd apps/dashboard && npm start

# Terminal 3: Run vLLM with plugin
python -c "
from caae.vllm_plugin import CAAEPlugin
from vllm import LLM
plugin = CAAEPlugin(config_path='caae_config.yaml')
llm = LLM(model='mistral-7b', plugins=[plugin])
for _ in range(100):
    llm.generate('Hello')
"
```

Visit dashboard: http://localhost:3000
CLI & Container Quickstart

```shell
# Install from PyPI with advisor extras
pip install "kvict[advisor]"

# Run the advisor service locally (FastAPI + Prometheus)
kvict advisor serve --host 0.0.0.0 --port 8000

# Health + metrics
kvict advisor health --host 127.0.0.1 --port 8000
kvict advisor metrics --host 127.0.0.1 --port 8000

# Build and run the container
docker build -f Dockerfile.kvict -t kvict:dev .
docker run -p 8000:8000 kvict:dev

# Kubernetes manifest (templated)
kvict kube --image ghcr.io/your-org/kvict:latest | kubectl apply -f -
# or apply the provided kustomize overlay
kubectl apply -k infra/k8s/kvict
```
For Production Deployment
Single entry point: Production Runbook, covering pre-deploy checklists, packaging, Phase 1 deploy, and operations.
Platform-specific:
- vast.ai: Running KVict on vast.ai (GPU rental specs and quick setup)
- AWS: SETUP_VERIFICATION_CHECKLIST ยท AWS Deployment Guide
- GCP: SETUP_VERIFICATION_CHECKLIST
- Azure: SETUP_VERIFICATION_CHECKLIST
- On-premises: MVP_IMPLEMENTATION_GUIDE
Documentation
Product & positioning
| Document | Description |
|---|---|
| docs/product/LANDING_PAGE.md | Landing copy, value props, proof points |
| docs/product/POSITIONING.md | Messaging guardrails and ICP |
| docs/index/DOCUMENTATION_INDEX.md | Navigation guide to all docs |

Getting Started
| Document | Description |
|---|---|
| docs/mvp/GETTING_STARTED_MVP.md | Start here! 15-minute setup guide for local development |
| docs/b2b/B2B_QUICK_START.md | Integration Quick Start (vLLM) |
| docs/guides/QUICK_START.md | Quick reference for running key experiments |

MVP Implementation
| Document | Description |
|---|---|
| docs/mvp/MVP_IMPLEMENTATION_GUIDE.md | Complete guide to the production-ready MVP architecture |
| docs/reference/SETUP_VERIFICATION_CHECKLIST.md | Pre-deployment verification checklist |

API & Integration
| Document | Description |
|---|---|
| docs/api/DASHBOARD_API_REFERENCE.md | Complete reference for all 15+ Dashboard API endpoints |
| docs/api/API.md | Original vLLM API documentation |

Architecture & Deployment
| Document | Description |
|---|---|
| docs/architecture/ARCHITECTURE.md | Complete infrastructure stack and component details |
| docs/deployment/DEPLOYMENT.md | Step-by-step deployment guide for AWS, GCP, Azure, Docker |

Production Deployment
| Document | Description |
|---|---|
| infra/deployment/README.md | Phase 1 Production Deployment Guide: deploy validated experiments |
| results/RESULTS_SUMMARY.md | Non-technical summary of key achievements and business impact |

Experiments & Proof
| Document | Description |
|---|---|
| docs/experiments/EXPERIMENTS_8_9_13_REPORT.md | Results from key experiments |
| docs/experiments/BREAKTHROUGH_EXPERIMENTS.md | Details on the optimization experiments |
| docs/reference/PROVEN_CLAIMS.md | Evidence mapping for headline claims |

Marketing (internal/sales)
| Document | Description |
|---|---|
| docs/marketing/README.md | Enterprise positioning, lead & showcase |

Plans & roadmap
| Document | Description |
|---|---|
| docs/guides/LLM_EXPANSION_PLAN.md | LLM provider expansion, semantic cache, routing, and best practices |

Other Resources
| Document | Description |
|---|---|
| docs/index/PRICING.md | Cost model, pricing tiers, and billing setup |
| CONTRIBUTING.md | Contribution guidelines |
Legacy and archived materials are in archive/ (see Documentation Index).
CAAE Technology
What is CAAE?
CAAE (Context-Aware Adaptive Eviction) is a KV cache fabric and GPU cost/SLA optimizer for vLLM that dynamically chooses between swapping and recomputing based on:
- Cost Model: Predicts swap vs. recompute latency with 97.2% accuracy
- PCIe Queue Monitoring: Detects bandwidth saturation
- Circuit Breaker: Automatically switches to LRU when queue depth exceeds threshold
Built for high-traffic clusters (keeps P99 and queue depth stable), long-context requests (pools and fingerprints KV to avoid thrash), and MoE deployments (shared KV slices slash memory per expert).
How It Works

```
 Memory Pressure Detected
            │
            ▼
┌─────────────────────────┐
│   Query CAAE Advisor    │
│   • Context size        │
│   • Bandwidth           │
│   • Queue depth         │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│   Cost Model Decision   │
│   • Swap cost: 10.1ms   │
│   • Recompute: 122.7ms  │
│   → Action: SWAP        │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│    Circuit Breaker      │
│    Queue < 5ms?         │
│    → LWKCP Mode         │
│    Queue > 5ms?         │
│    → LRU Mode (fallback)│
└─────────────────────────┘
```
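The decision flow above condenses into a few lines. This is an illustrative sketch, not KVict's internal code; the function name is invented, and the 5 ms breaker threshold and swap/recompute costs are taken from the diagram:

```python
def caae_decide(swap_ms: float, recompute_ms: float, queue_depth_ms: float,
                breaker_threshold_ms: float = 5.0) -> str:
    """Sketch of the flow above: circuit breaker first, then cost model."""
    # Circuit breaker: under PCIe queue saturation, fall back to plain LRU
    # rather than trusting the cost model.
    if queue_depth_ms > breaker_threshold_ms:
        return "LRU"
    # Cost model: pick the cheaper of swapping out vs. recomputing later.
    return "SWAP" if swap_ms < recompute_ms else "RECOMPUTE"

# Numbers from the diagram: swap 10.1 ms vs. recompute 122.7 ms.
caae_decide(10.1, 122.7, queue_depth_ms=2.0)  # healthy queue -> "SWAP"
caae_decide(10.1, 122.7, queue_depth_ms=7.5)  # saturated queue -> "LRU"
```

Checking the breaker before consulting the cost model is what keeps the decision overhead bounded even when PCIe bandwidth is saturated.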
Performance Results
| Metric | Baseline (LRU) | CAAE | Improvement | 95% CI |
|---|---|---|---|---|
| P99 Latency (25k tokens) | 841 ± 67 ms | 437 ± 31 ms | 46% faster | [38%, 54%] |
| SLA Violations | 40% ± 3.2% | 2% ± 0.8% | 20x fewer | [15x, 25x] |
| Decision Overhead | N/A | 0.7 ± 0.2 ms | Negligible | [0.3, 1.1] ms |
| Cost Model Accuracy | N/A | 97.2% ± 0.8% | Validated | [95.6%, 98.8%] |
Cost Model Accuracy Definition: Binary classification accuracy for swap vs. recompute decisions, validated against ground truth latency measurements. Threshold: swap if predicted_swap_time < predicted_recompute_time. Measured over 10,000 eviction decisions across varied context sizes (1k–99k tokens).
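As a concrete illustration of that definition, the accuracy metric is plain binary classification accuracy over logged decisions (the function and variable names below are hypothetical):

```python
def decision_accuracy(predicted: list, optimal: list) -> float:
    """Fraction of eviction decisions matching the ground-truth optimum,
    i.e. "swap" when the measured swap time was actually lower, and vice versa."""
    assert len(predicted) == len(optimal)
    return sum(p == o for p, o in zip(predicted, optimal)) / len(predicted)

predicted = ["swap", "swap", "recompute", "swap"]
optimal   = ["swap", "recompute", "recompute", "swap"]
decision_accuracy(predicted, optimal)  # -> 0.75
```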
Pricing
The SaaS uses token-based billing charged per 1M tokens:
| Tier | Price per 1M Tokens | Features |
|---|---|---|
| Starter | $2.00 | Basic support, 100 RPM |
| Professional | $1.50 | Priority support, 500 RPM |
| Enterprise | Custom | SLA, dedicated support |
See PRICING.md for complete details.
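As a quick sanity check on the table above, per-token billing is simple arithmetic (the helper below is illustrative, not part of the KVict SDK):

```python
# USD per 1M tokens, from the pricing table above (Enterprise is custom).
RATES = {"starter": 2.00, "professional": 1.50}

def monthly_cost(tokens: int, tier: str) -> float:
    """Token-based billing: price is charged per 1M tokens."""
    return tokens / 1_000_000 * RATES[tier]

monthly_cost(12_500_000, "starter")       # 12.5M tokens -> $25.00
monthly_cost(12_500_000, "professional")  # -> $18.75
```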
Testing

```shell
# Run unit tests
pytest apps/kv-eviction-optimizer/tests/ -v

# Run integration tests
pytest packages/caae/tests/ -v

# Run with coverage
pytest --cov=apps/kv-eviction-optimizer --cov=packages/caae
```

For an isolated, reproducible benchmark rerun (fresh checkout + venv + artifacts), see docs/guides/BENCHMARK_REPRODUCTION_ISOLATED.md.
Installation
For Development

```shell
# Clone repository (replace your-org with your GitHub org)
git clone https://github.com/your-org/kvict.git
cd kvict

# Install with advisor extras for local dev
pip install -e ".[advisor]"

# Optional: install kv-evict library for optimizer development
cd apps/kv-eviction-optimizer && pip install -e . && cd ../..
```
For Production Deployment
See DEPLOYMENT.md for production deployment instructions.
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
Apache License 2.0 - See LICENSE.md for details.
Links
- Landing Page
- Documentation Index
- Architecture Documentation
- Deployment Guide
- API Reference
- Pricing
- Contributing
Support
- Documentation: See the docs linked above
- Issues: Open a GitHub issue
- Enterprise Support: Contact for dedicated support options
Project details
Release history
Download files
Source Distribution
Built Distribution
File details
Details for the file kvict-1.2.0.tar.gz.
File metadata
- Download URL: kvict-1.2.0.tar.gz
- Upload date:
- Size: 123.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `899c6eefe1e492931460414c20befec7ac8283f17c01d72a9c6b1fc45e7fb4d4` |
| MD5 | `47e5c931be01950adb41de9084d51a51` |
| BLAKE2b-256 | `a87de8cbec53eab303b50402d106eabbc6166107c9c634132f22f3350dd49cb5` |
File details
Details for the file kvict-1.2.0-py3-none-any.whl.
File metadata
- Download URL: kvict-1.2.0-py3-none-any.whl
- Upload date:
- Size: 135.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `05edc3ec6676e2cc96762889dbfa01d73014c56f078bef249f2f8731ef6f905c` |
| MD5 | `2246944ee13a10d42cbe8e6d537540d1` |
| BLAKE2b-256 | `cbbcf8cf97cb32adfedbbd10446f651b1604adb3667210528bc8febb780dccac` |