
Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.


KVCache Auto-Tuner



Why kvat?

When you run LLMs with HuggingFace Transformers, there are dozens of configuration options that affect performance:

| Setting | Options | What it affects |
|---|---|---|
| Cache Strategy | dynamic, static, sliding_window | Memory usage, prefill speed |
| Attention Backend | sdpa_flash, eager, math, mem_efficient | Throughput, VRAM |
| Data Type | bfloat16, float16, float32 | Speed vs precision |

The problem: The optimal combination depends on YOUR specific model + YOUR GPU + YOUR use case. Nobody knows which config is best without testing.

The solution: kvat automatically benchmarks all combinations and tells you the fastest configuration.

# Before: Guessing and manual testing
model = AutoModelForCausalLM.from_pretrained("gpt2")  # Default config - slow

# After: Let kvat find the best config in 2 minutes
pip install "kvat[full]"
kvat tune gpt2 --profile ci-micro
# Output: "Best: dynamic/sdpa_flash/bfloat16 = 120 tok/s (+1.8% faster)"
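To make "benchmarks all combinations" concrete, here is a minimal sketch of an exhaustive search over the three settings listed above. It is not kvat's implementation: `pick_fastest`, `all_combinations`, and `fake_benchmark` are hypothetical names, and the scores returned by `fake_benchmark` are made up purely so the example runs without a GPU.

```python
from itertools import product

# Option values taken from the settings table above.
CACHE_STRATEGIES = ["dynamic", "static", "sliding_window"]
ATTENTION_BACKENDS = ["sdpa_flash", "eager", "math", "mem_efficient"]
DTYPES = ["bfloat16", "float16", "float32"]


def all_combinations():
    """Return every (cache, attention, dtype) triple in the search space."""
    return list(product(CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES))


def pick_fastest(benchmark):
    """Run `benchmark` on every combination and return (config, tok_per_s).

    `benchmark` is a hypothetical callable that returns tokens/second,
    or None when a combination cannot run on the current hardware.
    """
    results = {}
    for combo in all_combinations():
        score = benchmark(*combo)
        if score is not None:
            results[combo] = score
    if not results:
        raise RuntimeError("no combination ran successfully")
    best = max(results, key=results.get)
    return best, results[best]


def fake_benchmark(cache, attention, dtype):
    """Stand-in for a real measurement; the numbers are made up."""
    if cache == "sliding_window":
        return None  # pretend the model does not support it
    score = 100.0
    score += 15.0 if attention == "sdpa_flash" else 0.0
    score += 5.0 if dtype == "bfloat16" else 0.0
    score += 1.0 if cache == "dynamic" else 0.0
    return score


print(len(all_combinations()))
print(pick_fastest(fake_benchmark))
```

With three cache strategies, four attention backends, and three data types, the full grid has 36 entries, which is why testing by hand quickly becomes tedious.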

Installation

pip install "kvat[full]"

Quick Start

# Tune any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test (recommended for first try)
kvat tune gpt2 --profile ci-micro

# Show your system info
kvat info

Real Benchmark Results

Desktop (RTX 4060 - 8GB VRAM)

| Model | Baseline | With kvat | Improvement |
|---|---|---|---|
| GPT-2 (124M) | 118.1 tok/s | 120.2 tok/s | +1.8% |
| Qwen2.5-0.5B | 28.7 tok/s | 29.5 tok/s | +2.7% |
| Phi-1.5 (1.3B) | 45.2 tok/s | 45.6 tok/s | +0.9% |
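The Improvement column is a relative throughput gain. As a quick check of the arithmetic, the sketch below recomputes it from the baseline and tuned figures in the table; `improvement_pct` is an illustrative helper, not part of kvat.

```python
def improvement_pct(baseline, optimized):
    """Relative throughput gain of `optimized` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (optimized - baseline) / baseline * 100.0


# Throughput figures (tok/s) from the desktop table.
desktop = {
    "GPT-2 (124M)": (118.1, 120.2),
    "Phi-1.5 (1.3B)": (45.2, 45.6),
}

for model, (baseline, optimized) in desktop.items():
    print(f"{model}: {improvement_pct(baseline, optimized):+.1f}%")
```

The same formula gives about 2.8% for the Qwen2.5-0.5B row when fed the rounded table values, so the published 2.7% was presumably computed from unrounded measurements.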

Server (RTX 4000 Ada - 20GB VRAM)

| Model | TTFT | Throughput | VRAM |
|---|---|---|---|
| GPT-2 | 4.2 ms | 365.4 tok/s | 264 MB |
| Qwen2.5-7B | 284 ms | 3.3 tok/s | 13.6 GB |

For GPT-2, the server reaches about 3x the desktop throughput (365.4 vs. 120.2 tok/s).

Desktop Benchmark Charts

[Figure: Baseline vs Optimized throughput comparison, in tokens/second]
[Figure: Improvement %, performance gain]

Server Benchmark Charts (RTX 4000 Ada)

[Figure: Server throughput (tok/s)]
[Figure: Server performance gain]


Profiles

| Profile | Context Length (tokens) | Output Length (tokens) | Best For |
|---|---|---|---|
| ci-micro | 512 | 32 | Quick testing |
| chat-agent | 2-8K | 64-256 | Chatbots, low latency |
| rag | 8-32K | 256-512 | RAG pipelines |
| longform | 4-8K | 1-2K | Long text generation |
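One way to think about profiles is as named ranges of prompt and output lengths. The sketch below encodes the table that way and looks up which profiles cover a given workload. The `Profile` class and `matching_profiles` helper are hypothetical and are not kvat's own types; reading "K" as 1024 tokens is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for a tuning profile; not kvat's own class."""
    name: str
    context_tokens: tuple  # (min, max) prompt length in tokens
    output_tokens: tuple   # (min, max) generated length in tokens


# Ranges copied from the profiles table, reading "K" as 1024 tokens (an assumption).
K = 1024
PROFILES = [
    Profile("ci-micro", (512, 512), (32, 32)),
    Profile("chat-agent", (2 * K, 8 * K), (64, 256)),
    Profile("rag", (8 * K, 32 * K), (256, 512)),
    Profile("longform", (4 * K, 8 * K), (1 * K, 2 * K)),
]


def matching_profiles(context_len, output_len):
    """Names of profiles whose ranges cover the given workload."""
    return [
        p.name
        for p in PROFILES
        if p.context_tokens[0] <= context_len <= p.context_tokens[1]
        and p.output_tokens[0] <= output_len <= p.output_tokens[1]
    ]


print(matching_profiles(4096, 128))    # a short chat turn
print(matching_profiles(16384, 300))   # a retrieval-heavy prompt
```

For the built-in profiles themselves, `get_profile("chat-agent")` from the Python API section is the supported route.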

Output

After tuning, kvat generates:

results/
├── best_plan.json      # Full config as JSON
├── optimized_config.py # Ready-to-use Python code
├── report.md           # Human-readable report
└── report.html         # Visual report with charts
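Since best_plan.json is plain JSON, it can be inspected with the standard library. The sketch below only loads the file and lists its top-level keys; it deliberately assumes nothing about the schema, which is not documented here. `load_plan` and `summarize` are illustrative helper names.

```python
import json
from pathlib import Path


def load_plan(results_dir="results"):
    """Load best_plan.json from a kvat output directory and return the parsed JSON."""
    plan_path = Path(results_dir) / "best_plan.json"
    if not plan_path.is_file():
        raise FileNotFoundError(f"{plan_path} not found; run `kvat tune` first")
    with plan_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(plan):
    """List top-level keys only, since the schema is not assumed here."""
    return sorted(plan) if isinstance(plan, dict) else []


# Typical use after a tuning run:
# plan = load_plan("results")
# print(summarize(plan))
```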

Example optimized_config.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
# Cache strategy: dynamic (default in Transformers 4.35+)
# Measured: 120.2 tok/s, TTFT: 9.1ms
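The two figures in the trailing comment, throughput and TTFT (time to first token), can be illustrated with a small timing harness. This is a sketch of one common way to measure them, not necessarily how kvat does it: here throughput counts all generated tokens over the total wall time, including prefill. `measure` and `fake_stream` are hypothetical names, and `fake_stream` merely sleeps to imitate a model.

```python
import time


def measure(stream_tokens):
    """Time a token stream and return (ttft_ms, tokens_per_second).

    `stream_tokens` is a hypothetical zero-argument callable that yields
    tokens one at a time, e.g. a wrapper around a streaming generate call.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens():
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if count == 0:
        raise ValueError("stream produced no tokens")
    ttft_ms = (first_token_at - start) * 1000.0
    tokens_per_second = count / (end - start)
    return ttft_ms, tokens_per_second


def fake_stream(n_tokens=20, prefill_s=0.01, per_token_s=0.001):
    """Stand-in generator: one slow 'prefill' step, then fast decode steps."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield i


ttft_ms, tok_s = measure(fake_stream)
print(f"TTFT: {ttft_ms:.1f} ms, throughput: {tok_s:.1f} tok/s")
```

On a real GPU, timings like these are only meaningful after a warm-up run and with the device synchronized before each timestamp.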

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

print(f"Best config: {result.best_config}")
print(f"Throughput: {result.best_score} tok/s")

npm Package (JavaScript/TypeScript)

npm install kvat

const kvat = require('kvat');

async function main() {
  // Run tuning
  const result = await kvat.tune('gpt2', {
    profile: 'ci-micro',
    outputDir: './results'
  });
}

main();

Roadmap

v0.1.1 - Current

  • Auto context length limiting (fixes CUDA errors)
  • PyPI + npm packages
  • Baseline vs Optimized benchmarking
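The roadmap does not describe how auto context length limiting works. One plausible reading is that the requested prompt length is capped so that prompt plus output stays within the model's maximum number of positions (usually exposed as `max_position_embeddings` on a Transformers model config), since exceeding it can trigger CUDA indexing errors. The sketch below shows that idea only; `clamp_context` is a hypothetical helper, not kvat's code.

```python
def clamp_context(requested_context, output_len, max_positions):
    """Shrink the prompt length so prompt + output fits the model's position limit."""
    if output_len >= max_positions:
        raise ValueError("output length alone exceeds the model's position limit")
    return min(requested_context, max_positions - output_len)


# GPT-2 supports 1024 positions; 2048 and 64 are the low end of the chat-agent profile.
print(clamp_context(2048, 64, 1024))
# ci-micro (512 context, 32 output) already fits, so it is unchanged.
print(clamp_context(512, 32, 1024))
```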

v0.2.0 - Next

  • Ollama adapter
  • llama.cpp adapter (GGUF models)
  • Batch size optimization

v0.3.0 - Planned

  • vLLM adapter
  • Quantized KV-cache (INT8/INT4)

Contributing

git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
pytest tests/ -v

License

Apache 2.0


Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace Community


