
Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.


KVCache Auto-Tuner




Why kvat?

When you run LLMs with HuggingFace Transformers, there are dozens of configuration options that affect performance:

| Setting | Options | What it affects |
| --- | --- | --- |
| Cache Strategy | dynamic, static, sliding_window | Memory usage, prefill speed |
| Attention Backend | sdpa_flash, eager, math, mem_efficient | Throughput, VRAM |
| Data Type | bfloat16, float16, float32 | Speed vs precision |

The problem: The optimal combination depends on YOUR specific model + YOUR GPU + YOUR use case. Nobody knows which config is best without testing.

The solution: kvat automatically benchmarks all combinations and tells you the fastest configuration.

# Before: guessing and manual testing (Python)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2")  # Default config, untuned

# After: let kvat find the best config in 2 minutes (shell)
pip install "kvat[full]"
kvat tune gpt2 --profile ci-micro
# Output: "Best: dynamic/sdpa_flash/bfloat16 = 120 tok/s (+1.8% faster)"
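
To give a sense of the search space, here is a minimal sketch that enumerates every combination from the settings table. The constant and function names are illustrative assumptions, not kvat's internal identifiers, and kvat may prune or extend this grid in practice.

```python
from itertools import product

# Option values copied from the settings table; the names below are
# illustrative and are not kvat's internal identifiers.
CACHE_STRATEGIES = ["dynamic", "static", "sliding_window"]
ATTENTION_BACKENDS = ["sdpa_flash", "eager", "math", "mem_efficient"]
DTYPES = ["bfloat16", "float16", "float32"]


def candidate_configs():
    """Return every cache/attention/dtype combination as a label string."""
    return [
        f"{cache}/{attn}/{dtype}"
        for cache, attn, dtype in product(CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES)
    ]


configs = candidate_configs()
print(len(configs))  # 3 * 4 * 3 = 36 combinations
print(configs[0])    # dynamic/sdpa_flash/bfloat16
```

Even this small grid yields 36 candidates per model, which is why benchmarking them by hand is tedious.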

Installation

pip install "kvat[full]"

Quick Start

# Tune any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test (recommended for first try)
kvat tune gpt2 --profile ci-micro

# Show your system info
kvat info

Benchmark Results

[Figure: Server Throughput]

Server (RTX 4000 SFF Ada - 20GB VRAM)

| Model | Throughput | TTFT | Best Config |
| --- | --- | --- | --- |
| GPT-2 (124M) | 407.1 tok/s | 4.0 ms | dynamic/sdpa_flash |
| Qwen2.5-0.5B | 140.7 tok/s | 10.9 ms | dynamic/sdpa_flash |
| TinyLlama-1.1B | 93.0 tok/s | 30.6 ms | static/eager |
| Phi-1.5 (1.3B) | 78.8 tok/s | 37.2 ms | static/eager |
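
For readers unfamiliar with the two metrics in this table, the sketch below shows one common way to measure throughput (tokens per second) and TTFT (time to first token) over any token stream. It is an assumption about how such numbers are typically obtained, not a description of kvat's measurement code; `measure_stream` and `fake_stream` are hypothetical helpers.

```python
import time
from typing import Iterable, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a token stream; return (ttft_seconds, tokens_per_second, count)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at - start, count / (end - start), count


def fake_stream(n: int, delay: float = 0.001):
    """Stand-in for a model: yields n tokens with a fixed delay before each."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"


ttft, tok_per_s, n = measure_stream(fake_stream(20))
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tok_per_s:.1f} tok/s over {n} tokens")
```

With a real model, the stream would come from a streaming generation call, and a warm-up run is usually excluded from the timing.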

[Figure: Server Dashboard]

Desktop (RTX 4060 - 8GB VRAM)

| Model | Baseline | With kvat | Improvement |
| --- | --- | --- | --- |
| GPT-2 (124M) | 118.1 tok/s | 120.2 tok/s | +1.8% |
| Qwen2.5-0.5B | 28.7 tok/s | 29.5 tok/s | +2.7% |
| Phi-1.5 (1.3B) | 45.2 tok/s | 45.6 tok/s | +0.9% |
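
The improvement column is a relative gain over the baseline. A minimal sketch of that arithmetic, using two rows of the desktop table as inputs (the function name is an illustrative assumption, not part of kvat):

```python
def improvement_percent(baseline: float, optimized: float) -> float:
    """Relative throughput gain of optimized over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (optimized - baseline) / baseline * 100.0


# Rows taken from the desktop table (tok/s).
print(round(improvement_percent(118.1, 120.2), 1))  # GPT-2   -> 1.8
print(round(improvement_percent(45.2, 45.6), 1))    # Phi-1.5 -> 0.9
```

Recomputing from the rounded table values can differ from a published percentage in the last digit, presumably because the published figures come from unrounded measurements.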

Desktop Benchmark Charts

[Figure: Throughput Comparison (Baseline vs Optimized), throughput in tokens/second]

[Figure: Performance Gain % (Improvement %)]


Profiles

| Profile | Context Length | Output Length | Best For |
| --- | --- | --- | --- |
| ci-micro | 512 | 32 | Quick testing |
| chat-agent | 2-8K | 64-256 | Chatbots, low latency |
| rag | 8-32K | 256-512 | RAG pipelines |
| longform | 4-8K | 1-2K | Long text generation |
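
If you are unsure which profile matches your workload, the table can be read as a set of token ranges. The sketch below mirrors the table in a hypothetical `ProfileRange` dataclass and looks up which profiles cover a given context length. It is not kvat's profile schema, and it assumes "K" means 1024 tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProfileRange:
    """Hypothetical mirror of one row of the profiles table (not kvat's schema)."""
    name: str
    min_context: int
    max_context: int
    min_output: int
    max_output: int


K = 1024  # assumption: "K" in the table means 1024 tokens

PROFILES = [
    ProfileRange("ci-micro", 512, 512, 32, 32),
    ProfileRange("chat-agent", 2 * K, 8 * K, 64, 256),
    ProfileRange("rag", 8 * K, 32 * K, 256, 512),
    ProfileRange("longform", 4 * K, 8 * K, 1 * K, 2 * K),
]


def profiles_covering(context_tokens: int) -> List[str]:
    """Names of profiles whose context range includes context_tokens."""
    return [p.name for p in PROFILES if p.min_context <= context_tokens <= p.max_context]


print(profiles_covering(16 * K))  # ['rag']
print(profiles_covering(4 * K))   # ['chat-agent', 'longform']
```

Where two profiles overlap on context length, the output length column is the deciding factor: chat-agent targets short replies, longform targets 1-2K tokens of output.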

Output

After tuning, kvat generates:

results/
├── best_plan.json      # Full config as JSON
├── optimized_config.py # Ready-to-use Python code
├── report.md           # Human-readable report
└── report.html         # Visual report with charts
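
The JSON plan can be consumed from other tooling. The sketch below only loads the file and lists its top-level keys, because the schema of best_plan.json is not documented in this README; `load_best_plan` and `summarize` are hypothetical helper names, and the sketch assumes the file holds a JSON object.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_best_plan(results_dir: str = "results") -> Dict[str, Any]:
    """Read best_plan.json from a kvat output directory and return it as a dict."""
    plan_path = Path(results_dir) / "best_plan.json"
    with plan_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(plan: Dict[str, Any]) -> str:
    """List top-level keys only, since the file's schema is not assumed here."""
    return ", ".join(sorted(plan))
```

After a tuning run, `print(summarize(load_best_plan("results")))` shows which fields are available before you write code that depends on them.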

Example optimized_config.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
# Cache strategy: dynamic (default in Transformers 4.35+)
# Measured: 120.2 tok/s, TTFT: 9.1ms

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

print(f"Best config: {result.best_config}")
print(f"Throughput: {result.best_score} tok/s")

npm Package (JavaScript/TypeScript)

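npm install kvat

const kvat = require('kvat');

// Run tuning inside an async function: CommonJS modules have no top-level await
async function main() {
  const result = await kvat.tune('gpt2', {
    profile: 'ci-micro',
    outputDir: './results'
  });
  console.log(result);
}

main().catch(console.error);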

Roadmap

v0.1.3 - Current

  • Auto context length limiting (fixes CUDA errors)
  • PyPI + npm + GitHub Packages
  • Baseline vs Optimized benchmarking
  • Multi-language READMEs (EN, DE, FR, ES, FA, AR)
  • Multi-language report generation (6 languages)
  • Server benchmarks (RTX 4000 SFF Ada)
  • Improved report branding

v0.2.0 - Next

  • Ollama adapter
  • llama.cpp adapter (GGUF models)
  • Batch size optimization

v0.3.0 - Planned

  • vLLM adapter
  • Quantized KV-cache (INT8/INT4)

Contributing

git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
pytest tests/ -v

License

Apache 2.0

Citation

@software{kvat,
  title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
  author = {Keyvanhardani},
  year = {2026},
  url = {https://github.com/Keyvanhardani/kvcache-autotune}
}

Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace Community
