KVCache Auto-Tuner
Automatic KV-Cache Optimization for HuggingFace Transformers
Find the optimal cache strategy, attention backend, and configuration for your model and hardware.
What is KVCache Auto-Tuner?
KVCache Auto-Tuner (kvat) automatically benchmarks and optimizes your HuggingFace Transformers inference pipeline. Stop guessing which configuration works best - let the tuner find it for you.
# Install and optimize your model in seconds
pip install "kvat[full]"
kvat tune gpt2 --profile chat-agent
Performance
Baseline vs Optimized
See how kvat improves your Transformers inference:
| Model | Without kvat | With kvat | Improvement |
|---|---|---|---|
| GPT-2 (124M) | 118.1 tok/s | 120.2 tok/s | +1.8% |
| Qwen2.5-0.5B | 28.7 tok/s | 29.5 tok/s | +2.7% |
| Phi-1.5 (1.3B) | 45.2 tok/s | 45.6 tok/s | +0.9% |
Detailed comparison charts (not reproduced here): Throughput Comparison, Performance Gain.
Note: Results vary by model and hardware. Larger improvements are typical for models that benefit from Flash Attention and dynamic caching.
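The Improvement column is the relative change in throughput. A minimal sketch that recomputes it from the rounded figures in the table above (the helper name `improvement_pct` is illustrative and not part of kvat). Because the inputs are already rounded, a recomputed value can differ from the published one by about 0.1 percentage point:

```python
def improvement_pct(baseline_tok_s: float, tuned_tok_s: float) -> float:
    """Relative throughput change in percent."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (tuned_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# (baseline, with kvat) in tok/s, copied from the table above
results = {
    "GPT-2 (124M)": (118.1, 120.2),
    "Qwen2.5-0.5B": (28.7, 29.5),
    "Phi-1.5 (1.3B)": (45.2, 45.6),
}

for model, (baseline, tuned) in results.items():
    print(f"{model}: {improvement_pct(baseline, tuned):+.1f}%")
```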
Multi-Model Benchmarks
Desktop (RTX 4060 - 8GB VRAM):
| Model | TTFT | Throughput | VRAM | Best Config |
|---|---|---|---|---|
| GPT-2 | 9.1ms | 124.6 tok/s | 283MB | dynamic/sdpa_flash |
| Phi-1.5 | 40.9ms | 52.8 tok/s | 2.8GB | dynamic/sdpa_flash |
| Qwen2.5-0.5B | 33.9ms | 33.6 tok/s | 975MB | dynamic/eager |
Server (RTX 4000 Ada - 20GB VRAM):
| Model | TTFT | Throughput | VRAM | Best Config |
|---|---|---|---|---|
| GPT-2 | 4.2ms | 365.4 tok/s | 264MB | dynamic/sdpa_flash |
| Qwen2.5-7B | 284ms | 3.3 tok/s | 13.6GB | dynamic/sdpa_flash |
For GPT-2, the only model benchmarked on both machines, server throughput is roughly 2.9x the desktop figure (365.4 vs 124.6 tok/s).
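TTFT and throughput can be combined into a rough end-to-end estimate. A back-of-the-envelope sketch, assuming the reported throughput is the steady-state decode rate and ignoring tokenization and sampling overhead (the function name is illustrative, not a kvat API):

```python
def estimate_latency_s(ttft_ms: float, tok_per_s: float, new_tokens: int) -> float:
    """Approximate wall-clock time to generate `new_tokens` tokens.

    The first token arrives after TTFT; the remaining tokens are assumed
    to arrive at the steady decode rate.
    """
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    return ttft_ms / 1000.0 + (new_tokens - 1) / tok_per_s


# GPT-2 figures from the tables above, for a 256-token reply
desktop = estimate_latency_s(9.1, 124.6, 256)  # RTX 4060
server = estimate_latency_s(4.2, 365.4, 256)   # RTX 4000 Ada
print(f"desktop: {desktop:.2f} s, server: {server:.2f} s")
```

For replies of this length the decode term dominates, so the throughput column matters more than TTFT; for very short replies the balance shifts toward TTFT.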
Multi-model charts (not reproduced here): TTFT Comparison (lower is better), Throughput Comparison (higher is better).
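The TTFT and throughput columns above describe two different phases of generation. As a rough illustration of what they measure (this is not kvat's benchmark code; `measure_stream` and `fake_stream` are hypothetical names), the sketch below times any iterable that yields tokens one at a time:

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for a stream of tokens."""
    start = clock()
    first = None
    last = None
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last  # timestamp of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # Decode rate over the tokens that followed the first one
    decode_time = last - first
    rate = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, rate


def fake_stream(n: int, delay_s: float = 0.01):
    """Stand-in for a model: yields n tokens with a fixed delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


ttft, rate = measure_stream(fake_stream(20))
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {rate:.1f} tok/s")
```

The injectable `clock` argument is a design choice that makes the function testable without real delays.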
Quick Start
CLI Usage
# Optimize any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent
# Quick test
kvat tune gpt2 --profile ci-micro -v
# Show system info
kvat info
Python API
from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch
# Configure and run optimization
config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)
adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()
Features
| Feature | Description |
|---|---|
| Automatic Optimization | Find the best configuration without manual experimentation |
| Multiple Profiles | Built-in presets for Chat, RAG, and Longform workloads |
| Production-Ready Output | Get drop-in Python code snippets and JSON configs |
| Beautiful Reports | Markdown and HTML reports with performance comparisons |
| Early Stopping | Smart pruning of dominated configurations |
| Extensible | Adapter-based design; vLLM, llama.cpp, and Ollama adapters are planned (see Roadmap) |
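A configuration is "dominated" when some other configuration is at least as good on every metric and strictly better on at least one. This README does not spell out kvat's own pruning rule, so the following is only the textbook Pareto filter, with hypothetical names and made-up numbers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    ttft_ms: float    # lower is better
    tok_per_s: float  # higher is better
    vram_mb: float    # lower is better


def dominates(a: Measurement, b: Measurement) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = (
        a.ttft_ms <= b.ttft_ms
        and a.tok_per_s >= b.tok_per_s
        and a.vram_mb <= b.vram_mb
    )
    strictly_better = (
        a.ttft_ms < b.ttft_ms
        or a.tok_per_s > b.tok_per_s
        or a.vram_mb < b.vram_mb
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Measurement]) -> list[Measurement]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up numbers, for illustration only
candidates = [
    Measurement("dynamic/sdpa_flash", 9.0, 125.0, 280.0),
    Measurement("dynamic/eager", 12.0, 110.0, 300.0),     # worse on every axis
    Measurement("static/sdpa_flash", 8.5, 120.0, 350.0),  # faster TTFT, more VRAM
]
print([m.name for m in pareto_front(candidates)])
```

Here `dynamic/eager` is dropped, while the other two survive because each wins on a different metric.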
Optimization Parameters
| Parameter | Options | Impact |
|---|---|---|
| Cache Strategy | Dynamic, Static, Sliding Window | Memory & prefill speed |
| Attention Backend | SDPA Flash, Memory Efficient, Math, Eager | Throughput & VRAM |
| Data Type | bfloat16, float16, float32 | Speed vs precision |
| Compilation | torch.compile modes | Startup vs runtime |
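To get a feel for the size of the search space, the options in the table can be enumerated as a plain grid. This is a sketch only: the lowercase identifiers are illustrative, `torch.compile` modes are left out because the README does not list them, and kvat's actual search may skip or prune combinations:

```python
from itertools import product

# Option names mirror the table above; the identifiers are illustrative.
SEARCH_SPACE = {
    "cache_strategy": ["dynamic", "static", "sliding_window"],
    "attention_backend": ["sdpa_flash", "memory_efficient", "math", "eager"],
    "dtype": ["bfloat16", "float16", "float32"],
}


def grid(space: dict[str, list[str]]):
    """Yield every combination of options as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(grid(SEARCH_SPACE))
print(len(configs))  # 3 * 4 * 3 = 36 candidate configurations
print(configs[0])
```

Each extra parameter multiplies the number of benchmark runs, which is why pruning matters on larger models.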
Built-in Profiles
| Profile | Context | Output | Focus |
|---|---|---|---|
| chat-agent | 2-8K | 64-256 | TTFT (latency) |
| rag | 8-32K | 256-512 | Balanced |
| longform | 4-8K | 1-2K | Throughput |
| ci-micro | 512 | 32 | Quick testing |
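If you are unsure which profile matches your workload, the ranges in the table can be turned into a simple lookup. The helper below is hypothetical (kvat does not ship it as far as this README shows) and assumes that the columns are token counts and that "K" means 1024:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    name: str
    max_context: int  # upper end of the Context column, in tokens
    max_output: int   # upper end of the Output column, in tokens
    focus: str


K = 1024  # assumption: "K" in the table means 1024 tokens

# Ordered from the smallest workload to the largest
PROFILES = [
    Profile("ci-micro", 512, 32, "quick testing"),
    Profile("chat-agent", 8 * K, 256, "ttft"),
    Profile("longform", 8 * K, 2 * K, "throughput"),
    Profile("rag", 32 * K, 512, "balanced"),
]


def suggest_profile(context_tokens: int, output_tokens: int) -> Optional[str]:
    """Return the first profile whose upper bounds cover the workload."""
    for p in PROFILES:
        if context_tokens <= p.max_context and output_tokens <= p.max_output:
            return p.name
    return None


print(suggest_profile(4000, 128))   # fits chat-agent
print(suggest_profile(20000, 400))  # fits rag
print(suggest_profile(6000, 1500))  # fits longform
```

A workload that exceeds every profile returns `None`; in that case pick the closest profile by hand.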
Installation
# Recommended: Full installation with all dependencies
pip install "kvat[full]"
# Basic installation
pip install kvat
# From source
git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
Requirements: Python 3.9+, PyTorch 2.0+, Transformers 4.35+
Output Files
| File | Description |
|---|---|
| best_plan.json | Complete configuration with metrics |
| optimized_config.py | Drop-in Python code |
| report.md | Human-readable summary |
| report.html | Visual report with charts |
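The JSON plan can be inspected with the standard library. The schema of `best_plan.json` is not documented in this README, so the sketch below only lists the top-level keys; the path `./results/best_plan.json` is an assumption based on the `output_dir` used in the Python API example, and `load_plan` is an illustrative helper:

```python
import json
from pathlib import Path


def load_plan(path: str) -> dict:
    """Load a tuning plan and check that it is a JSON object."""
    plan = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(plan, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return plan


if __name__ == "__main__":
    plan_path = Path("./results/best_plan.json")  # assumed location under output_dir
    if plan_path.exists():
        plan = load_plan(str(plan_path))
        # The schema is not documented here, so just list what is present.
        for key in sorted(plan):
            print(key)
    else:
        print(f"{plan_path} not found; run `kvat tune` first")
```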
Example Output
+-----------------------------------------------------------------------------+
| Best Configuration |
| |
| Cache Strategy: dynamic |
| Attention Backend: sdpa_flash |
| Data Type: bfloat16 |
| Score: 100.00 |
+-----------------------------------------------------------------------------+
Roadmap
v0.1.0 (Current)
- Core tuning engine with grid search
- HuggingFace Transformers adapter
- CLI interface (`kvat tune`, `kvat apply`, `kvat compare`)
- Built-in profiles (chat-agent, rag, longform)
- CUDA/GPU memory tracking
- Windows & Linux support
v0.2.0 (Next)
- Batch size optimization
- CPU offload strategies
- `kvat watch` - Continuous monitoring
- Profile recommendations based on hardware
v0.3.0 (Planned)
- Ollama adapter - Local model optimization
- llama.cpp adapter - GGUF model support
- vLLM adapter - Production serving
- Quantized KV-cache (INT8/INT4)
v1.0.0 (Future)
- HuggingFace Hub integration
- Real-time inference monitoring
- A/B testing framework
Contributing
Contributions are welcome! See CONTRIBUTING.md for details.
pip install -e ".[dev]"
pytest tests/ -v
ruff check kvat/
License
Apache 2.0 - See LICENSE for details.
Citation
@software{kvat,
title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
author = {Keyvanhardani},
year = {2025},
url = {https://github.com/Keyvanhardani/kvcache-autotune}
}
Made in Germany with dedication for the HuggingFace community