Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.

KVCache Auto-Tuner


Why kvat?

When you run LLMs with HuggingFace Transformers, there are dozens of configuration options that affect performance:

Setting           | Options                                | What it affects
Cache Strategy    | dynamic, static, sliding_window        | Memory usage, prefill speed
Attention Backend | sdpa_flash, eager, math, mem_efficient | Throughput, VRAM
Data Type         | bfloat16, float16, float32             | Speed vs precision

The problem: The optimal combination depends on YOUR specific model + YOUR GPU + YOUR use case. Nobody knows which config is best without testing.

The solution: kvat automatically benchmarks all combinations and tells you the fastest configuration.

# Before: guessing and manual testing (Python)
model = AutoModelForCausalLM.from_pretrained("gpt2")  # Default config - slow

# After: let kvat find the best config in about 2 minutes (shell)
pip install "kvat[full]"
kvat tune gpt2 --profile ci-micro
# Output: "Best: dynamic/sdpa_flash/bfloat16 = 120 tok/s (+1.8% faster)"
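
The search space implied by the settings table is small enough to enumerate exhaustively. The following is a minimal sketch of that idea, not kvat's actual implementation: the option names are copied from the table, and `measure` is a hypothetical callback standing in for a real benchmark run.

```python
from itertools import product
from typing import Callable

# Option names copied from the settings table; kvat may use a different set internally.
CACHE_STRATEGIES = ["dynamic", "static", "sliding_window"]
ATTENTION_BACKENDS = ["sdpa_flash", "eager", "math", "mem_efficient"]
DTYPES = ["bfloat16", "float16", "float32"]

Config = tuple[str, str, str]


def all_configs() -> list[Config]:
    """Every (cache, attention, dtype) combination: 3 * 4 * 3 = 36."""
    return list(product(CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES))


def pick_fastest(measure: Callable[[Config], float]) -> tuple[Config, float]:
    """Return the config with the highest tokens/second according to `measure`.

    `measure` is a hypothetical stand-in for running a real benchmark.
    """
    scores = {cfg: measure(cfg) for cfg in all_configs()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

In practice some combinations may be unsupported on a given model or GPU, so a real tuner would also need to skip or catch failing configurations.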

Installation

pip install "kvat[full]"  # quotes keep shells such as zsh from expanding the brackets

Quick Start

# Tune any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test (recommended for first try)
kvat tune gpt2 --profile ci-micro

# Show your system info
kvat info

Real Benchmark Results

Desktop (RTX 4060 - 8GB VRAM)

Model          | Baseline    | With kvat   | Improvement
GPT-2 (124M)   | 118.1 tok/s | 120.2 tok/s | +1.8%
Qwen2.5-0.5B   | 28.7 tok/s  | 29.5 tok/s  | +2.7%
Phi-1.5 (1.3B) | 45.2 tok/s  | 45.6 tok/s  | +0.9%
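
The Improvement column is a relative throughput gain. As a worked check, assuming the usual definition (tuned minus baseline, divided by baseline), it can be recomputed from the table values:

```python
def improvement_pct(baseline_tps: float, tuned_tps: float) -> float:
    """Relative throughput gain in percent."""
    return (tuned_tps - baseline_tps) / baseline_tps * 100.0


# Throughput values (tok/s) copied from the RTX 4060 table.
results = {
    "GPT-2 (124M)": (118.1, 120.2),
    "Qwen2.5-0.5B": (28.7, 29.5),
    "Phi-1.5 (1.3B)": (45.2, 45.6),
}

for model, (baseline, tuned) in results.items():
    print(f"{model}: {improvement_pct(baseline, tuned):+.1f}%")
```

Recomputed from the rounded table values, the Qwen2.5-0.5B row comes out closer to 2.8% than 2.7%; the published figure was presumably derived from unrounded measurements.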

[Desktop benchmark charts: Baseline vs Optimized throughput comparison (tokens/second) and performance gain (%)]

Profiles

Profile    | Context Length (tokens) | Output Length (tokens) | Best For
ci-micro   | 512                     | 32                     | Quick testing
chat-agent | 2-8K                    | 64-256                 | Chatbots, low latency
rag        | 8-32K                   | 256-512                | RAG pipelines
longform   | 4-8K                    | 1-2K                   | Long text generation
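
To make the profile ranges concrete, the table can be mirrored as plain data. This is only an illustrative sketch: `ProfileSketch` and `profiles_covering` are hypothetical names, not part of kvat, and it assumes "K" means 1024 tokens and that ranges such as "2-8K" mean 2K to 8K.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    """Hypothetical mirror of one row of the profiles table (not kvat's schema)."""
    name: str
    context_tokens: tuple[int, int]  # (min, max) context length
    output_tokens: tuple[int, int]   # (min, max) output length


K = 1024  # assumption: "K" in the table means 1024 tokens

PROFILES = {
    p.name: p
    for p in [
        ProfileSketch("ci-micro", (512, 512), (32, 32)),
        ProfileSketch("chat-agent", (2 * K, 8 * K), (64, 256)),
        ProfileSketch("rag", (8 * K, 32 * K), (256, 512)),
        ProfileSketch("longform", (4 * K, 8 * K), (1 * K, 2 * K)),
    ]
}


def profiles_covering(context_len: int, output_len: int) -> list[str]:
    """Names of profiles whose ranges include the given workload."""
    return [
        p.name
        for p in PROFILES.values()
        if p.context_tokens[0] <= context_len <= p.context_tokens[1]
        and p.output_tokens[0] <= output_len <= p.output_tokens[1]
    ]
```

For example, `profiles_covering(4096, 128)` selects only `chat-agent` under these assumptions, which matches the "Chatbots, low latency" use case.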

Output

After tuning, kvat generates:

results/
├── best_plan.json      # Full config as JSON
├── optimized_config.py # Ready-to-use Python code
├── report.md           # Human-readable report
└── report.html         # Visual report with charts
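
The schema of best_plan.json is not documented in this section, so the sketch below makes no assumptions about its keys beyond the top level being a JSON object. It only loads the file and lists what is there; `load_plan` and `summarize` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_plan(results_dir: str = "./results") -> dict:
    """Read best_plan.json from a kvat results directory."""
    plan_path = Path(results_dir) / "best_plan.json"
    with plan_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(plan: dict) -> str:
    """List top-level keys and value types without assuming any schema."""
    return "\n".join(f"{key}: {type(value).__name__}" for key, value in plan.items())
```

Calling `print(summarize(load_plan()))` after a tuning run is a quick way to see which fields a downstream script could rely on.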

Example optimized_config.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
# Cache strategy: dynamic (default in Transformers 4.35+)
# Measured: 120.2 tok/s, TTFT: 9.1ms
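
The two numbers in the generated comment are throughput (tokens per second) and TTFT (time to first token). How kvat measures them is not described here; the sketch below shows one common way to time any token stream, with `fake_stream` as a hypothetical stand-in for a streaming generate call. Note that definitions vary, for instance on whether throughput includes prefill time.

```python
import time
from typing import Iterable, Iterator


def time_stream(tokens: Iterable[str]) -> tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, tokens_per_second, count)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        count += 1
    total = time.perf_counter() - start
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    # Throughput here covers the whole run, including time before the first token.
    return first_token_at - start, count / total, count


def fake_stream(n: int, delay_s: float = 0.001) -> Iterator[str]:
    """Hypothetical stand-in for a streaming generate() call."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"
```

With a real model, the stream could come from any iterator that yields tokens as they are generated; on a GPU, the device should be synchronized before reading the clock so that asynchronous kernels do not skew the result.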

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

print(f"Best config: {result.best_config}")
print(f"Throughput: {result.best_score} tok/s")

npm Package (JavaScript/TypeScript)

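npm install kvat

const kvat = require('kvat');

// Run tuning (await must be inside an async function in CommonJS)
async function main() {
  const result = await kvat.tune('gpt2', {
    profile: 'ci-micro',
    outputDir: './results'
  });
  console.log(result);
}

main().catch(console.error);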

Roadmap

v0.1.1 - Current

  • Auto context length limiting (fixes CUDA errors)
  • PyPI + npm packages
  • Baseline vs Optimized benchmarking

v0.2.0 - Next

  • Ollama adapter
  • llama.cpp adapter (GGUF models)
  • Batch size optimization

v0.3.0 - Planned

  • vLLM adapter
  • Quantized KV-cache (INT8/INT4)

Contributing

git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
pytest tests/ -v

License

Apache 2.0

Citation

@software{kvat,
  title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
  author = {Keyvanhardani},
  year = {2026},
  url = {https://github.com/Keyvanhardani/kvcache-autotune}
}

Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace Community
