Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.

KVCache Auto-Tuner


Why kvat?

When you run LLMs with HuggingFace Transformers, several configuration options affect performance, and they multiply into dozens of possible combinations:

| Setting | Options | What it affects |
|---|---|---|
| Cache Strategy | dynamic, static, sliding_window | Memory usage, prefill speed |
| Attention Backend | sdpa_flash, eager, math, mem_efficient | Throughput, VRAM |
| Data Type | bfloat16, float16, float32 | Speed vs precision |

The problem: The optimal combination depends on YOUR specific model + YOUR GPU + YOUR use case. There is no reliable way to know which config is best without measuring it.

The solution: kvat automatically benchmarks all combinations and tells you the fastest configuration.

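# Before: guessing and manual testing (Python)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")  # default config, not tuned

# After: let kvat find the best config in about 2 minutes (shell)
pip install "kvat[full]"  # quotes keep shells such as zsh from expanding the brackets
kvat tune gpt2 --profile ci-micro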
# Output: "Best: dynamic/sdpa_flash/bfloat16 = 120 tok/s (+1.8%)"
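
The search described above can be pictured as a grid over the three settings in the table. The following is a minimal sketch, not kvat's actual implementation: `Candidate`, `candidate_configs`, `pick_fastest`, and `stub_benchmark` are hypothetical names, and the stub scores are placeholder numbers rather than measurements.

```python
from itertools import product
from typing import Callable, Iterator, NamedTuple

# Option values copied from the settings table; kvat's real search space may differ.
CACHE_STRATEGIES = ("dynamic", "static", "sliding_window")
ATTENTION_BACKENDS = ("sdpa_flash", "eager", "math", "mem_efficient")
DTYPES = ("bfloat16", "float16", "float32")


class Candidate(NamedTuple):
    cache: str
    attention: str
    dtype: str


def candidate_configs() -> Iterator[Candidate]:
    """Yield every combination of cache strategy, attention backend, and dtype."""
    for cache, attention, dtype in product(CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES):
        yield Candidate(cache, attention, dtype)


def pick_fastest(benchmark: Callable[[Candidate], float]) -> tuple[Candidate, float]:
    """Score each candidate with the given benchmark and keep the highest tok/s."""
    scored = [(c, benchmark(c)) for c in candidate_configs()]
    return max(scored, key=lambda pair: pair[1])


def stub_benchmark(c: Candidate) -> float:
    """Placeholder scoring function (made-up numbers) so the sketch runs without a GPU."""
    return 100.0 + 10.0 * (c.attention == "sdpa_flash") + 5.0 * (c.dtype == "bfloat16")


best, score = pick_fastest(stub_benchmark)
print(f"{len(list(candidate_configs()))} candidates, best: {'/'.join(best)} = {score} tok/s")
```

With the options listed in the table, the grid has 3 x 4 x 3 = 36 candidates. In a real run, the benchmark callback would load the model with each configuration and time generation.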

Installation

pip install "kvat[full]"

Quick Start

# Tune any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test (recommended for first try)
kvat tune gpt2 --profile ci-micro

# Show your system info
kvat info

Real Benchmark Results

Desktop (RTX 4060 - 8GB VRAM)

| Model | Baseline | With kvat | Improvement |
|---|---|---|---|
| GPT-2 (124M) | 118.1 tok/s | 120.2 tok/s | +1.8% |
| Qwen2.5-0.5B | 28.7 tok/s | 29.5 tok/s | +2.7% |
| Phi-1.5 (1.3B) | 45.2 tok/s | 45.6 tok/s | +0.9% |

Desktop Benchmark Charts (images not reproduced here): baseline vs. optimized throughput in tokens/second, and performance gain in percent.
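
The Improvement column appears to be the relative change in throughput between the baseline and the tuned run. The sketch below assumes that definition, which the README does not state explicitly; `improvement_percent` is a hypothetical helper name, not part of kvat.

```python
def improvement_percent(baseline_tok_s: float, tuned_tok_s: float) -> float:
    """Relative throughput gain of the tuned run over the baseline, in percent."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (tuned_tok_s - baseline_tok_s) / baseline_tok_s * 100.0


# GPT-2 row from the table: 118.1 tok/s baseline, 120.2 tok/s tuned.
gain = improvement_percent(118.1, 120.2)
print(f"+{gain:.1f}%")  # prints +1.8%
```

Gains of this size are small enough that run-to-run noise matters, so it may be worth repeating a measurement before acting on it.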


Profiles

| Profile | Context Length | Output Length | Best For |
|---|---|---|---|
| ci-micro | 512 | 32 | Quick testing |
| chat-agent | 2-8K | 64-256 | Chatbots, low latency |
| rag | 8-32K | 256-512 | RAG pipelines |
| longform | 4-8K | 1-2K | Long text generation |
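
To make the table concrete, the sketch below encodes each profile as a pair of token ranges and looks up which profiles cover a given prompt length. It is illustrative only: `WorkloadProfile`, `PROFILES`, and `profiles_fitting` are hypothetical names rather than kvat's API, and it assumes "K" means 1024 tokens, which the table does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical container for one row of the profiles table (token counts)."""
    name: str
    context_tokens: tuple[int, int]  # (min, max) prompt length
    output_tokens: tuple[int, int]   # (min, max) generated length


# Assumption: "K" in the table means 1024 tokens.
K = 1024
PROFILES = {
    p.name: p
    for p in (
        WorkloadProfile("ci-micro", (512, 512), (32, 32)),
        WorkloadProfile("chat-agent", (2 * K, 8 * K), (64, 256)),
        WorkloadProfile("rag", (8 * K, 32 * K), (256, 512)),
        WorkloadProfile("longform", (4 * K, 8 * K), (1 * K, 2 * K)),
    )
}


def profiles_fitting(prompt_tokens: int) -> list[str]:
    """Names of profiles whose context range covers the given prompt length."""
    return [
        name
        for name, p in PROFILES.items()
        if p.context_tokens[0] <= prompt_tokens <= p.context_tokens[1]
    ]


print(profiles_fitting(6000))  # prints ['chat-agent', 'longform']
```

When more than one profile fits, the expected output length is the natural tie-breaker: short replies point to chat-agent, multi-paragraph output to longform.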

Output

After tuning, kvat generates:

results/
├── best_plan.json      # Full config as JSON
├── optimized_config.py # Ready-to-use Python code
├── report.md           # Human-readable report
└── report.html         # Visual report with charts
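
The JSON plan can be inspected from Python before it is wired into an application. The schema of best_plan.json is not documented in this README, so the sketch below only parses the file and lists its top-level keys; `load_plan` is a hypothetical helper, and the `results/` path assumes the default layout shown above.

```python
import json
from pathlib import Path
from typing import Any


def load_plan(path: Path) -> dict[str, Any]:
    """Parse a best_plan.json file; raise ValueError if the top level is not an object."""
    with path.open(encoding="utf-8") as fh:
        plan = json.load(fh)
    if not isinstance(plan, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return plan


plan_path = Path("results") / "best_plan.json"
if plan_path.exists():
    plan = load_plan(plan_path)
    # Print only the top-level keys, since the schema is not documented here.
    print(sorted(plan))
else:
    print(f"{plan_path} not found; run `kvat tune` first")
```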

Example optimized_config.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
# Cache strategy: dynamic (default in Transformers 4.35+)
# Measured: 120.2 tok/s, TTFT: 9.1ms
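
The two figures in the last comment, throughput in tok/s and time to first token (TTFT), can be obtained by timing a token stream. The sketch below uses the common definitions (TTFT is the delay until the first token arrives; throughput is total tokens divided by total wall time). How kvat itself measures them is not stated here, and `measure_stream` and `fake_stream` are hypothetical names, with `fake_stream` standing in for a real model.

```python
import time
from typing import Callable, Iterable


def measure_stream(token_stream: Callable[[], Iterable[object]]) -> dict[str, float]:
    """Time a token stream: TTFT in ms, and tokens per second over the whole run."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream():
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return {
        "ttft_ms": (first_token_at - start) * 1000.0,
        "tok_per_s": count / (end - start),
        "tokens": float(count),
    }


def fake_stream(n_tokens: int = 32, delay_s: float = 0.001):
    """Stand-in for a real model: yields n_tokens with a fixed delay before each."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield i


stats = measure_stream(fake_stream)
print(f"{stats['tok_per_s']:.1f} tok/s, TTFT: {stats['ttft_ms']:.1f}ms")
```

On a GPU, a real measurement would also need warm-up runs and device synchronization before each timestamp; both are omitted here for brevity.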

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

print(f"Best config: {result.best_config}")
print(f"Throughput: {result.best_score} tok/s")

npm Package (JavaScript/TypeScript)

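npm install kvat

const kvat = require('kvat');

// Run tuning; await needs an async function in CommonJS modules
async function main() {
  const result = await kvat.tune('gpt2', {
    profile: 'ci-micro',
    outputDir: './results'
  });
  console.log(result);
}

main().catch(console.error);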

Roadmap

v0.1.3 - Current

  • Auto context length limiting (fixes CUDA errors)
  • PyPI + npm + GitHub Packages
  • Baseline vs Optimized benchmarking
  • Multi-language READMEs (EN, DE, FR, ES, FA, AR)
  • Improved report branding

v0.2.0 - Next

  • Ollama adapter
  • llama.cpp adapter (GGUF models)
  • Batch size optimization

v0.3.0 - Planned

  • vLLM adapter
  • Quantized KV-cache (INT8/INT4)

Contributing

git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
pytest tests/ -v

License

Apache 2.0

Citation

@software{kvat,
  title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
  author = {Keyvanhardani},
  year = {2026},
  url = {https://github.com/Keyvanhardani/kvcache-autotune}
}

Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace Community
