KVCache Auto-Tuner

Automatic KV-Cache Optimization for HuggingFace Transformers

Find the optimal cache strategy, attention backend, and configuration for your model and hardware.

Quick Start | Performance | Features | Installation | Roadmap


What is KVCache Auto-Tuner?

KVCache Auto-Tuner (kvat) automatically benchmarks and optimizes your HuggingFace Transformers inference pipeline. Stop guessing which configuration works best - let the tuner find it for you.

# Install and optimize your model in seconds
pip install "kvat[full]"
kvat tune gpt2 --profile chat-agent

Performance

Baseline vs Optimized

See how kvat improves your Transformers inference:

Performance Improvement with KVCache Auto-Tuner

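| Model          | Without kvat | With kvat   | Improvement |
|----------------|--------------|-------------|-------------|
| GPT-2 (124M)   | 118.1 tok/s  | 120.2 tok/s | +1.8%       |
| Qwen2.5-0.5B   | 28.7 tok/s   | 29.5 tok/s  | +2.7%       |
| Phi-1.5 (1.3B) | 45.2 tok/s   | 45.6 tok/s  | +0.9%       |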

[Charts: Throughput, baseline vs optimized; Performance improvement (%)]

Note: Results vary by model and hardware. Larger improvements are typical for models that benefit from Flash Attention and dynamic caching.
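The Improvement column can be recomputed from the two throughput columns as (with - without) / without. The following is a minimal sketch using the rounded figures from the table; the function name `improvement_pct` is illustrative and not part of kvat.

```python
def improvement_pct(baseline_tok_s: float, tuned_tok_s: float) -> float:
    """Relative throughput gain in percent."""
    return (tuned_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# Rounded throughput figures (tok/s) copied from the table above.
results = {
    "GPT-2 (124M)": (118.1, 120.2),
    "Qwen2.5-0.5B": (28.7, 29.5),
    "Phi-1.5 (1.3B)": (45.2, 45.6),
}

for model, (baseline, tuned) in results.items():
    print(f"{model}: {improvement_pct(baseline, tuned):+.1f}%")
```

Recomputing from rounded inputs can differ from the published column in the last digit: Qwen2.5-0.5B comes out at about +2.8% this way versus the +2.7% listed, presumably because the published percentages were derived from unrounded measurements.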

Multi-Model Benchmarks

Desktop (RTX 4060 - 8GB VRAM):

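| Model        | TTFT   | Throughput  | VRAM  | Best Config        |
|--------------|--------|-------------|-------|--------------------|
| GPT-2        | 9.1ms  | 124.6 tok/s | 283MB | dynamic/sdpa_flash |
| Phi-1.5      | 40.9ms | 52.8 tok/s  | 2.8GB | dynamic/sdpa_flash |
| Qwen2.5-0.5B | 33.9ms | 33.6 tok/s  | 975MB | dynamic/eager      |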

Server (RTX 4000 Ada - 20GB VRAM):

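| Model      | TTFT  | Throughput  | VRAM   | Best Config        |
|------------|-------|-------------|--------|--------------------|
| GPT-2      | 4.2ms | 365.4 tok/s | 264MB  | dynamic/sdpa_flash |
| Qwen2.5-7B | 284ms | 3.3 tok/s   | 13.6GB | dynamic/sdpa_flash |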

For GPT-2, the only model measured on both machines, server throughput is about 2.9x the desktop figure (365.4 vs 124.6 tok/s).
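TTFT and throughput together determine how long a full response takes. The sketch below estimates that time for the GPT-2 rows of the two tables and computes the server-to-desktop throughput ratio. It assumes, which the tables do not state, that throughput is the steady decode rate after the first token; `response_time_s` is an illustrative helper, not a kvat function.

```python
def response_time_s(ttft_ms: float, tok_per_s: float, new_tokens: int) -> float:
    """Estimate wall-clock seconds for one response.

    Assumption: the first token arrives after `ttft_ms`, and the remaining
    tokens are produced at the steady rate `tok_per_s`.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_ms / 1000.0 + max(new_tokens - 1, 0) / tok_per_s


# GPT-2 figures copied from the desktop and server tables above.
desktop = {"ttft_ms": 9.1, "tok_per_s": 124.6}
server = {"ttft_ms": 4.2, "tok_per_s": 365.4}

throughput_ratio = server["tok_per_s"] / desktop["tok_per_s"]
print(f"server/desktop throughput: {throughput_ratio:.2f}x")

for n in (64, 256):
    d = response_time_s(desktop["ttft_ms"], desktop["tok_per_s"], n)
    s = response_time_s(server["ttft_ms"], server["tok_per_s"], n)
    print(f"{n} new tokens: desktop {d:.2f}s, server {s:.2f}s")
```

For short outputs TTFT is a visible share of the total; for long outputs the decode rate dominates, which is why the profiles weight the two metrics differently.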

[Charts: Multi-model performance overview; Time to first token by model (lower is better); Throughput by model (higher is better)]


Quick Start

CLI Usage

# Optimize any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test
kvat tune gpt2 --profile ci-micro -v

# Show system info
kvat info

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

# Configure and run optimization
config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

Features

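| Feature                 | Description                                                                                                     |
|-------------------------|-----------------------------------------------------------------------------------------------------------------|
| Automatic Optimization  | Find the best configuration without manual experimentation                                                      |
| Multiple Profiles       | Built-in presets for Chat, RAG, and Longform workloads                                                          |
| Production-Ready Output | Get drop-in Python code snippets and JSON configs                                                               |
| Beautiful Reports       | Markdown and HTML reports with performance comparisons                                                          |
| Early Stopping          | Smart pruning of dominated configurations                                                                       |
| Extensible              | Adapter-based design intended to support vLLM, llama.cpp, and Ollama (these adapters are planned, see Roadmap) |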
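"Dominated" is not defined further here. One common reading is Pareto dominance: a configuration is dominated if another one is at least as good on every metric and strictly better on at least one. The sketch below illustrates that idea as a post-hoc filter. It is not kvat's actual pruning logic, the class and function names are hypothetical, and the numbers are made up for illustration rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    ttft_ms: float    # lower is better
    tok_per_s: float  # higher is better
    vram_mb: float    # lower is better


def dominates(a: Measurement, b: Measurement) -> bool:
    """True if `a` is no worse than `b` on all metrics and better on one."""
    no_worse = (
        a.ttft_ms <= b.ttft_ms
        and a.tok_per_s >= b.tok_per_s
        and a.vram_mb <= b.vram_mb
    )
    strictly_better = (
        a.ttft_ms < b.ttft_ms
        or a.tok_per_s > b.tok_per_s
        or a.vram_mb < b.vram_mb
    )
    return no_worse and strictly_better


def prune_dominated(candidates: list) -> list:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative values only, not benchmark results.
candidates = [
    Measurement("dynamic/sdpa_flash", ttft_ms=9.0, tok_per_s=120.0, vram_mb=300.0),
    Measurement("dynamic/eager", ttft_ms=12.0, tok_per_s=110.0, vram_mb=320.0),
    Measurement("static/sdpa_flash", ttft_ms=10.0, tok_per_s=125.0, vram_mb=350.0),
]

for kept in prune_dominated(candidates):
    print(kept.name)
```

Here `dynamic/eager` is dropped because `dynamic/sdpa_flash` beats it on all three metrics, while `static/sdpa_flash` survives because it has the highest throughput.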

Optimization Parameters

| Parameter         | Options                                  | Impact                 |
|-------------------|------------------------------------------|------------------------|
| Cache Strategy    | Dynamic, Static, Sliding Window          | Memory & prefill speed |
| Attention Backend | SDPA Flash, Memory Efficient, Math, Eager | Throughput & VRAM      |
| Data Type         | bfloat16, float16, float32               | Speed vs precision     |
| Compilation       | torch.compile modes                      | Startup vs runtime     |
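To give a sense of how large a grid over these parameters gets, the sketch below enumerates every combination with `itertools.product`. The option identifiers (for example `sdpa_mem_efficient`) and the set of compile modes searched are assumptions for illustration. Only `dynamic`, `sdpa_flash`, `eager`, and `bfloat16` appear verbatim in this document; the three compile strings are `torch.compile`'s documented `mode` values.

```python
from itertools import product

# Hypothetical search space mirroring the table above.
SEARCH_SPACE = {
    "cache_strategy": ["dynamic", "static", "sliding_window"],
    "attention_backend": ["sdpa_flash", "sdpa_mem_efficient", "sdpa_math", "eager"],
    "dtype": ["bfloat16", "float16", "float32"],
    # None stands for "do not compile".
    "compile_mode": [None, "default", "reduce-overhead", "max-autotune"],
}


def iter_configs(space: dict):
    """Yield one dict per point in the Cartesian product of all options."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(SEARCH_SPACE))
print(len(configs))  # 3 * 4 * 3 * 4 = 144
```

With 144 points and a full benchmark per point, an exhaustive grid gets expensive quickly, which is where pruning and small profiles such as ci-micro help.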

Built-in Profiles

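| Profile    | Context (tokens) | Output (tokens) | Focus          |
|------------|------------------|-----------------|----------------|
| chat-agent | 2-8K             | 64-256          | TTFT (latency) |
| rag        | 8-32K            | 256-512         | Balanced       |
| longform   | 4-8K             | 1-2K            | Throughput     |
| ci-micro   | 512              | 32              | Quick testing  |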
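The profile table can be read as a small data structure: a context-length range, an output-length range, and a metric to prioritize. The sketch below encodes it that way and looks up which profiles cover a given workload. It is not kvat's `get_profile` implementation; the `WorkloadProfile` class, the `matching_profiles` helper, and the reading of "K" as 1024 tokens are all assumptions.

```python
from dataclasses import dataclass

K = 1024  # assumption: "K" in the table means 1024 tokens


@dataclass(frozen=True)
class WorkloadProfile:
    name: str
    context_tokens: tuple  # (min, max) prompt length
    output_tokens: tuple   # (min, max) generated length
    focus: str


# Hypothetical encoding of the table above.
PROFILES = {
    p.name: p
    for p in [
        WorkloadProfile("chat-agent", (2 * K, 8 * K), (64, 256), "ttft"),
        WorkloadProfile("rag", (8 * K, 32 * K), (256, 512), "balanced"),
        WorkloadProfile("longform", (4 * K, 8 * K), (1 * K, 2 * K), "throughput"),
        WorkloadProfile("ci-micro", (512, 512), (32, 32), "quick-test"),
    ]
}


def matching_profiles(context_len: int, output_len: int) -> list:
    """Names of profiles whose ranges cover the given workload."""
    return [
        p.name
        for p in PROFILES.values()
        if p.context_tokens[0] <= context_len <= p.context_tokens[1]
        and p.output_tokens[0] <= output_len <= p.output_tokens[1]
    ]


print(matching_profiles(4096, 128))
print(matching_profiles(16384, 512))
```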

Installation

# Recommended: Full installation with all dependencies
pip install "kvat[full]"

# Basic installation
pip install kvat

# From source
git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"

Requirements: Python 3.9+, PyTorch 2.0+, Transformers 4.35+
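The requirements can be checked before installing or running kvat. The following is a minimal sketch that uses only the standard library and compares major and minor versions; the helper names are illustrative and not part of kvat.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the Requirements line above.
MINIMUMS = {"torch": (2, 0), "transformers": (4, 35)}


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment() -> list:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info[:2] < (3, 9):
        problems.append("Python 3.9 or newer is required")
    for dist, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        if parse_major_minor(installed) < minimum:
            wanted = ".".join(str(part) for part in minimum)
            problems.append(f"{dist} {installed} is older than {wanted}")
    return problems


print(check_environment() or "environment looks fine")
```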


Output Files

| File                | Description                         |
|---------------------|-------------------------------------|
| best_plan.json      | Complete configuration with metrics |
| optimized_config.py | Drop-in Python code                 |
| report.md           | Human-readable summary              |
| report.html         | Visual report with charts           |
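The schema of `best_plan.json` is not documented here, so the sketch below loads the file and lists its top-level keys rather than assuming any field names. The `./results` location matches the `output_dir` used in the Python API example; that the file is written directly into that directory is an assumption, and both helper names are illustrative.

```python
import json
from pathlib import Path


def load_plan(path) -> dict:
    """Read a JSON plan file and make sure it is a JSON object."""
    with Path(path).open(encoding="utf-8") as handle:
        plan = json.load(handle)
    if not isinstance(plan, dict):
        raise ValueError("expected a JSON object at the top level")
    return plan


def top_level_summary(plan: dict) -> dict:
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in plan.items()}


plan_path = Path("./results/best_plan.json")  # assumed location
if plan_path.exists():
    print(top_level_summary(load_plan(plan_path)))
else:
    print(f"{plan_path} not found; run a tuning session first")
```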

Example Output

+-----------------------------------------------------------------------------+
| Best Configuration                                                          |
|                                                                             |
| Cache Strategy: dynamic                                                     |
| Attention Backend: sdpa_flash                                               |
| Data Type: bfloat16                                                         |
| Score: 100.00                                                               |
+-----------------------------------------------------------------------------+
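A reported configuration like the one above eventually has to be expressed as arguments to Transformers. The sketch below shows one plausible translation and is not the content of the generated `optimized_config.py`. It relies on two real Transformers parameters, `attn_implementation` for `from_pretrained` and `cache_implementation` for `generate`, but the accepted values depend on the installed version, so check the documentation for your release. The function name and the mapping from kvat's option names are assumptions.

```python
def to_transformers_kwargs(cache_strategy: str, attention_backend: str):
    """Translate a reported configuration into keyword arguments.

    Returns (load_kwargs, generate_kwargs), intended for
    AutoModelForCausalLM.from_pretrained(...) and model.generate(...).
    """
    if cache_strategy not in {"dynamic", "static", "sliding_window"}:
        raise ValueError(f"unknown cache strategy: {cache_strategy!r}")

    # Every SDPA variant loads through the same "sdpa" implementation;
    # selecting a specific SDPA kernel is a PyTorch-level setting that
    # this sketch does not cover.
    attn = "eager" if attention_backend == "eager" else "sdpa"
    load_kwargs = {"attn_implementation": attn}

    # A dynamic cache is the default, so only other strategies are passed.
    generate_kwargs = {}
    if cache_strategy != "dynamic":
        generate_kwargs["cache_implementation"] = cache_strategy
    return load_kwargs, generate_kwargs


load_kwargs, generate_kwargs = to_transformers_kwargs("dynamic", "sdpa_flash")
print(load_kwargs, generate_kwargs)

# Usage sketch (needs torch and transformers; not executed here):
#   import torch
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("gpt2")
#   model = AutoModelForCausalLM.from_pretrained(
#       "gpt2", torch_dtype=torch.bfloat16, **load_kwargs)
#   inputs = tokenizer("Hello", return_tensors="pt")
#   output = model.generate(**inputs, max_new_tokens=32, **generate_kwargs)
```

The data type is passed separately in the usage sketch because it is a `torch.dtype` object rather than a plain string.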

Roadmap

v0.1.0 (Current)

  • Core tuning engine with grid search
  • HuggingFace Transformers adapter
  • CLI (kvat tune, kvat apply, kvat compare)
  • Built-in profiles (chat-agent, rag, longform)
  • CUDA/GPU memory tracking
  • Windows & Linux support

v0.2.0 (Next)

  • Batch size optimization
  • CPU offload strategies
  • kvat watch - Continuous monitoring
  • Profile recommendations based on hardware

v0.3.0 (Planned)

  • Ollama adapter - Local model optimization
  • llama.cpp adapter - GGUF model support
  • vLLM adapter - Production serving
  • Quantized KV-cache (INT8/INT4)

v1.0.0 (Future)

  • HuggingFace Hub integration
  • Real-time inference monitoring
  • A/B testing framework

Contributing

Contributions are welcome! See CONTRIBUTING.md for details.

pip install -e ".[dev]"
pytest tests/ -v
ruff check kvat/

License

Apache 2.0 - See LICENSE for details.

Citation

@software{kvat,
  title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
  author = {Keyvanhardani},
  year = {2025},
  url = {https://github.com/Keyvanhardani/kvcache-autotune}
}

Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace community
