Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.
KVCache Auto-Tuner
Why kvat?
When you run LLMs with HuggingFace Transformers, there are dozens of configuration options that affect performance:
| Setting | Options | What it affects |
|---|---|---|
| Cache Strategy | dynamic, static, sliding_window | Memory usage, prefill speed |
| Attention Backend | sdpa_flash, eager, math, mem_efficient | Throughput, VRAM |
| Data Type | bfloat16, float16, float32 | Speed vs precision |
The problem: The optimal combination depends on YOUR specific model + YOUR GPU + YOUR use case. Nobody knows which config is best without testing.
The solution: kvat automatically benchmarks all combinations and tells you the fastest configuration.
# Before: Guessing and manual testing
model = AutoModelForCausalLM.from_pretrained("gpt2") # Default config - slow
# After: Let kvat find the best config in 2 minutes
pip install "kvat[full]"
kvat tune gpt2 --profile ci-micro
# Output: "Best: dynamic/sdpa_flash/bfloat16 = 120 tok/s (+1.8% faster)"
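To make "benchmarks all combinations" concrete, here is a minimal sketch of a grid search over the three settings from the table above. It is not kvat's actual implementation: `Candidate`, `all_candidates`, and `pick_fastest` are hypothetical names, and `measure` stands in for whatever benchmark returns tokens per second for a given combination.

```python
from itertools import product
from typing import Callable, NamedTuple


class Candidate(NamedTuple):
    """One combination of settings (hypothetical type, not part of kvat)."""
    cache_strategy: str
    attention_backend: str
    dtype: str


# Option values taken from the settings table above.
CACHE_STRATEGIES = ["dynamic", "static", "sliding_window"]
ATTENTION_BACKENDS = ["sdpa_flash", "eager", "math", "mem_efficient"]
DTYPES = ["bfloat16", "float16", "float32"]


def all_candidates() -> list[Candidate]:
    """Enumerate the full Cartesian product of the three settings."""
    return [
        Candidate(cache, attn, dtype)
        for cache, attn, dtype in product(CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES)
    ]


def pick_fastest(
    candidates: list[Candidate],
    measure: Callable[[Candidate], float],
) -> tuple[Candidate, float]:
    """Score every candidate with `measure` (tok/s) and return the best pair."""
    scored = [(cand, measure(cand)) for cand in candidates]
    return max(scored, key=lambda pair: pair[1])
```

With 3 cache strategies, 4 attention backends, and 3 data types, the grid has 36 entries, which is why testing by hand gets tedious quickly.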
Installation
pip install "kvat[full]"
Quick Start
# Tune any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent
# Quick test (recommended for first try)
kvat tune gpt2 --profile ci-micro
# Show your system info
kvat info
Real Benchmark Results
Desktop (RTX 4060 - 8GB VRAM)
| Model | Baseline | With kvat | Improvement |
|---|---|---|---|
| GPT-2 (124M) | 118.1 tok/s | 120.2 tok/s | +1.8% |
| Qwen2.5-0.5B | 28.7 tok/s | 29.5 tok/s | +2.7% |
| Phi-1.5 (1.3B) | 45.2 tok/s | 45.6 tok/s | +0.9% |
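The Improvement column is the relative throughput gain over the baseline. The sketch below shows that arithmetic for the GPT-2 row; `improvement_percent` is a hypothetical helper written for illustration, not a kvat function.

```python
def improvement_percent(baseline_tok_s: float, tuned_tok_s: float) -> float:
    """Relative throughput gain of the tuned run over the baseline, in percent."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (tuned_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# GPT-2 row from the table above: 118.1 tok/s -> 120.2 tok/s
print(round(improvement_percent(118.1, 120.2), 1))  # prints 1.8
```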
Desktop Benchmark Charts: Throughput (tokens/second) and Performance Gain %.
Profiles
| Profile | Context Length | Output Length | Best For |
|---|---|---|---|
| ci-micro | 512 | 32 | Quick testing |
| chat-agent | 2-8K | 64-256 | Chatbots, low latency |
| rag | 8-32K | 256-512 | RAG pipelines |
| longform | 4-8K | 1-2K | Long text generation |
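A profile is essentially a pair of token ranges. The sketch below models the table as data and shows one way a profile's context range could be capped at a model's maximum sequence length, for example GPT-2's 1024 positions. Everything here is an assumption for illustration: `WorkloadProfile`, `PROFILES`, and `clamp_context` are hypothetical names, "K" is taken to mean 1024 tokens, and kvat's own profile objects (returned by `get_profile`) may look different.

```python
from dataclasses import dataclass

K = 1024  # assumption: "K" in the table means 1024 tokens


@dataclass(frozen=True)
class WorkloadProfile:
    """Token ranges for one workload (hypothetical type, not kvat's schema)."""
    name: str
    context_tokens: tuple[int, int]  # (min, max) prompt length
    output_tokens: tuple[int, int]   # (min, max) generated length


# Values copied from the Profiles table above.
PROFILES = {
    "ci-micro": WorkloadProfile("ci-micro", (512, 512), (32, 32)),
    "chat-agent": WorkloadProfile("chat-agent", (2 * K, 8 * K), (64, 256)),
    "rag": WorkloadProfile("rag", (8 * K, 32 * K), (256, 512)),
    "longform": WorkloadProfile("longform", (4 * K, 8 * K), (1 * K, 2 * K)),
}


def clamp_context(profile: WorkloadProfile, model_max_tokens: int) -> tuple[int, int]:
    """Cap the profile's context range at what the model can actually accept."""
    low, high = profile.context_tokens
    return (min(low, model_max_tokens), min(high, model_max_tokens))


# The chat-agent range exceeds GPT-2's 1024-token limit, so both ends are capped.
print(clamp_context(PROFILES["chat-agent"], 1024))  # prints (1024, 1024)
```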
Output
After tuning, kvat generates:
results/
├── best_plan.json # Full config as JSON
├── optimized_config.py # Ready-to-use Python code
├── report.md # Human-readable report
└── report.html # Visual report with charts
Example optimized_config.py:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"gpt2",
torch_dtype=torch.bfloat16,
attn_implementation="sdpa",
device_map="auto",
)
# Cache strategy: dynamic (default in Transformers 4.35+)
# Measured: 120.2 tok/s, TTFT: 9.1ms
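The two numbers in the "Measured" comment are throughput (tokens per second) and time to first token (TTFT). As a rough sketch of how such metrics can be collected, the helper below times any token stream. It is not how kvat measures: `measure_stream` and `fake_model` are hypothetical, and the throughput here is assumed to include the time before the first token.

```python
import time
from typing import Callable, Iterable


def measure_stream(stream_tokens: Callable[[], Iterable[object]]) -> dict[str, float]:
    """Time a token stream and report token count, TTFT (ms), and tok/s."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens():
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("the stream produced no tokens")
    elapsed = end - start
    return {
        "tokens": float(count),
        "ttft_ms": (first_token_at - start) * 1000.0,
        # Assumption: throughput is measured over the whole run, prefill included.
        "tok_per_s": count / elapsed if elapsed > 0 else float("inf"),
    }


def fake_model() -> Iterable[int]:
    """Stand-in generator that emits 5 tokens, about 10 ms apart."""
    for token_id in range(5):
        time.sleep(0.01)
        yield token_id


metrics = measure_stream(fake_model)
print(f"{metrics['tok_per_s']:.1f} tok/s, TTFT: {metrics['ttft_ms']:.1f}ms")
```

For a real model, `stream_tokens` would wrap a streaming generation call. On GPU, the device would also need to be synchronized before each timestamp for the numbers to be meaningful.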
Python API
from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch
config = TuneConfig(
model_id="meta-llama/Llama-3.2-1B",
device=DeviceType.CUDA,
profile=get_profile("chat-agent"),
output_dir="./results",
)
adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()
print(f"Best config: {result.best_config}")
print(f"Throughput: {result.best_score} tok/s")
npm Package (JavaScript/TypeScript)
npm install kvat
const kvat = require('kvat');
// Run tuning (await must be inside an async function in CommonJS)
async function main() {
  const result = await kvat.tune('gpt2', {
    profile: 'ci-micro',
    outputDir: './results'
  });
}
main();
Roadmap
v0.1.1 - Current
- Auto context length limiting (fixes CUDA errors)
- PyPI + npm packages
- Baseline vs Optimized benchmarking
v0.2.0 - Next
- Ollama adapter
- llama.cpp adapter (GGUF models)
- Batch size optimization
v0.3.0 - Planned
- vLLM adapter
- Quantized KV-cache (INT8/INT4)
Contributing
git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
pytest tests/ -v
License
Apache 2.0
Citation
@software{kvat,
title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
author = {Keyvanhardani},
year = {2026},
url = {https://github.com/Keyvanhardani/kvcache-autotune}
}
Made in Germany with dedication for the HuggingFace Community