Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.

Maintainers

Keyvan

Project description

KVCache Auto-Tuner

Why kvat?

When you run LLMs with HuggingFace Transformers, several configuration options interact to determine performance:

Setting	Options	What it affects
Cache Strategy	dynamic, static, sliding_window	Memory usage, prefill speed
Attention Backend	sdpa_flash, eager, math, mem_efficient	Throughput, VRAM
Data Type	bfloat16, float16, float32	Speed vs precision

The problem: The optimal combination depends on YOUR specific model + YOUR GPU + YOUR use case. Nobody knows which config is best without testing.

The solution: kvat automatically benchmarks all combinations and tells you the fastest configuration.

# Before (Python): guessing and manual testing
model = AutoModelForCausalLM.from_pretrained("gpt2")  # Default config, untuned

# After (shell): let kvat find the best config in 2 minutes
pip install "kvat[full]"
kvat tune gpt2 --profile ci-micro
# Output: "Best: dynamic/sdpa_flash/bfloat16 = 120 tok/s (+1.8% faster)"
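
To make "benchmarks all combinations" concrete, here is a minimal sketch of the search space implied by the table above. It is not kvat's actual search code: the function names and the `measure` callable are hypothetical, and kvat may prune or order candidates differently.

```python
from itertools import product

# Option values copied from the settings table above.
CACHE_STRATEGIES = ["dynamic", "static", "sliding_window"]
ATTENTION_BACKENDS = ["sdpa_flash", "eager", "math", "mem_efficient"]
DTYPES = ["bfloat16", "float16", "float32"]


def candidate_configs():
    """Yield every (cache, attention, dtype) combination as a dict."""
    for cache, attn, dtype in product(CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES):
        yield {"cache": cache, "attention": attn, "dtype": dtype}


def pick_best(configs, measure):
    """Return the config with the highest score.

    `measure` is a hypothetical callable returning tokens/second for one
    config; a real run would load the model and time generation.
    """
    return max(configs, key=measure)


configs = list(candidate_configs())
print(len(configs))  # 3 * 4 * 3 = 36 combinations
```

Even this small grid has 36 entries, which is why testing it by hand for each model and GPU is tedious.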

Installation

pip install "kvat[full]"

Quick Start

# Tune any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test (recommended for first try)
kvat tune gpt2 --profile ci-micro

# Show your system info
kvat info

Real Benchmark Results

Desktop (RTX 4060 - 8GB VRAM)

Model	Baseline	With kvat	Improvement
GPT-2 (124M)	118.1 tok/s	120.2 tok/s	+1.8%
Qwen2.5-0.5B	28.7 tok/s	29.5 tok/s	+2.7%
Phi-1.5 (1.3B)	45.2 tok/s	45.6 tok/s	+0.9%
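
The Improvement column is presumably the relative throughput gain over the baseline. The sketch below recomputes it from the rounded figures in the table; the function name is illustrative and the formula is an assumption about how the column was derived.

```python
def improvement_pct(baseline: float, tuned: float) -> float:
    """Relative throughput gain of `tuned` over `baseline`, in percent."""
    return (tuned - baseline) / baseline * 100.0


# (model, baseline tok/s, tuned tok/s) copied from the desktop table
rows = [
    ("GPT-2 (124M)", 118.1, 120.2),
    ("Qwen2.5-0.5B", 28.7, 29.5),
    ("Phi-1.5 (1.3B)", 45.2, 45.6),
]

for model, baseline, tuned in rows:
    print(f"{model}: {improvement_pct(baseline, tuned):+.1f}%")
```

Recomputing from rounded throughput values can differ from the published column in the last digit; the Qwen2.5-0.5B row, for example, comes out at about 2.8% here.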

Server (RTX 4000 Ada - 20GB VRAM)

Model	TTFT	Throughput	VRAM
GPT-2	4.2ms	365.4 tok/s	264MB
Qwen2.5-7B	284ms	3.3 tok/s	13.6GB

For GPT-2, the server reaches about 3x the desktop throughput (365.4 vs 120.2 tok/s).
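
For readers unfamiliar with the TTFT and Throughput columns, the sketch below shows one common way to measure them on a token stream. It is an illustration only: `measure_stream` and `fake_model` are hypothetical names, and kvat's own timing (for example, whether throughput includes prefill) may differ.

```python
import time
from typing import Callable, Iterator


def measure_stream(generate: Callable[[], Iterator[str]]) -> dict:
    """Time a token stream.

    TTFT is the delay until the first token arrives; throughput is
    tokens emitted per second over the whole run.
    """
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate():
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "tokens": n_tokens,
        "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
        "tok_per_s": n_tokens / total if total > 0 else float("inf"),
    }


def fake_model() -> Iterator[str]:
    # Stand-in for a real streaming generate call (hypothetical).
    for i in range(32):
        yield f"tok{i}"


stats = measure_stream(fake_model)
print(stats["tokens"], "tokens measured")
```

With a real model, `fake_model` would be replaced by a function that streams generated tokens.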

Desktop Benchmark Charts

[Charts: Baseline vs Optimized; Throughput (tokens/second); Performance Gain %]

Server Benchmark Charts (RTX 4000 Ada)

[Charts: Server Performance; Server Throughput (tok/s); Server Performance Gain]

Profiles

Profile	Context Length	Output Length	Best For
`ci-micro`	512	32	Quick testing
`chat-agent`	2-8K	64-256	Chatbots, low latency
`rag`	8-32K	256-512	RAG pipelines
`longform`	4-8K	1-2K	Long text generation
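
As a rough aid for choosing a profile, the table can be encoded as ranges and matched against a workload. This is a sketch, not kvat's profile implementation: the `Profile` class and `suggest_profiles` helper are hypothetical, and reading "K" as 1024 tokens is an assumption.

```python
from dataclasses import dataclass

K = 1024  # assumption: "K" in the table means 1024 tokens


@dataclass(frozen=True)
class Profile:
    name: str
    context: tuple  # (min, max) context length in tokens
    output: tuple   # (min, max) output length in tokens


# Ranges copied from the profiles table above.
PROFILES = [
    Profile("ci-micro", (512, 512), (32, 32)),
    Profile("chat-agent", (2 * K, 8 * K), (64, 256)),
    Profile("rag", (8 * K, 32 * K), (256, 512)),
    Profile("longform", (4 * K, 8 * K), (1 * K, 2 * K)),
]


def suggest_profiles(context_len: int, output_len: int) -> list:
    """Return names of profiles whose ranges cover the given workload."""
    return [
        p.name
        for p in PROFILES
        if p.context[0] <= context_len <= p.context[1]
        and p.output[0] <= output_len <= p.output[1]
    ]


print(suggest_profiles(4096, 128))
```

A workload that falls outside every range returns an empty list, in which case the nearest profile is a reasonable starting point.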

Output

After tuning, kvat generates:

results/
├── best_plan.json      # Full config as JSON
├── optimized_config.py # Ready-to-use Python code
├── report.md           # Human-readable report
└── report.html         # Visual report with charts

Example optimized_config.py:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
# Cache strategy: dynamic (default in Transformers 4.35+)
# Measured: 120.2 tok/s, TTFT: 9.1ms

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

print(f"Best config: {result.best_config}")
print(f"Throughput: {result.best_score} tok/s")

npm Package (JavaScript/TypeScript)

npm install kvat

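const kvat = require('kvat');

// Run tuning (top-level await is unavailable in CommonJS, so wrap the call)
async function main() {
  const result = await kvat.tune('gpt2', {
    profile: 'ci-micro',
    outputDir: './results'
  });
}

main().catch(console.error);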

Roadmap

v0.1.1 - Current

Auto context length limiting (fixes CUDA errors)
PyPI + npm packages
Baseline vs Optimized benchmarking

v0.2.0 - Next

Ollama adapter
llama.cpp adapter (GGUF models)
Batch size optimization

v0.3.0 - Planned

vLLM adapter
Quantized KV-cache (INT8/INT4)

Contributing

git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"
pytest tests/ -v

License

Apache 2.0

Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace Community

Release history

0.1.4 - Jan 13, 2026
0.1.3 - Jan 13, 2026
0.1.2 - Jan 13, 2026
0.1.1 - Jan 13, 2026 (this version)
0.1.0 - Jan 13, 2026

Download files

Source Distribution

kvat-0.1.1.tar.gz (43.5 kB)

Uploaded Jan 13, 2026 Source

Built Distribution

kvat-0.1.1-py3-none-any.whl (45.0 kB)

Uploaded Jan 13, 2026 Python 3

File details

Details for the file kvat-0.1.1.tar.gz.

File metadata

Download URL: kvat-0.1.1.tar.gz
Upload date: Jan 13, 2026
Size: 43.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kvat-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`86c651cd34f4953a53cb6935e48810451a71675e882213c06f7bd7aa7bde6944`
MD5	`caefb12b5f6ca3f302635845d3e26c98`
BLAKE2b-256	`0ef691b71577885452c52b8b2995ce167e4fadbaa09f7004239c144fe366bdb5`
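
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib`. The helper names below are illustrative, and the example assumes the sdist was saved under its original file name in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest copied from the table above for kvat-0.1.1.tar.gz.
EXPECTED_SDIST_SHA256 = "86c651cd34f4953a53cb6935e48810451a71675e882213c06f7bd7aa7bde6944"

# Example (assumes the file exists locally):
# print(matches("kvat-0.1.1.tar.gz", EXPECTED_SDIST_SHA256))
```

pip can also enforce hashes at install time through a requirements file with `--require-hashes`.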

Provenance

The following attestation bundles were made for kvat-0.1.1.tar.gz:

Publisher: publish.yml on Keyvanhardani/kvcache-autotune

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kvat-0.1.1.tar.gz
- Subject digest: 86c651cd34f4953a53cb6935e48810451a71675e882213c06f7bd7aa7bde6944
- Sigstore transparency entry: 818022416
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: Keyvanhardani/kvcache-autotune@0884584886645130de58c81da0abe39a28d95720
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Keyvanhardani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0884584886645130de58c81da0abe39a28d95720
- Trigger Event: push

File details

Details for the file kvat-0.1.1-py3-none-any.whl.

File metadata

Download URL: kvat-0.1.1-py3-none-any.whl
Upload date: Jan 13, 2026
Size: 45.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kvat-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68070879a0b2a20b48442250ea68936b444af72bdb73e5001cfcb2956af14e62`
MD5	`49f344e359d7c0d693dfac6059f4b526`
BLAKE2b-256	`cd194ed182b7ab527fc7a3f77540cc93b2c69ee37fa035ecf82a5e292aafab5d`

Provenance

The following attestation bundles were made for kvat-0.1.1-py3-none-any.whl:

Publisher: publish.yml on Keyvanhardani/kvcache-autotune

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kvat-0.1.1-py3-none-any.whl
- Subject digest: 68070879a0b2a20b48442250ea68936b444af72bdb73e5001cfcb2956af14e62
- Sigstore transparency entry: 818022444
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: Keyvanhardani/kvcache-autotune@0884584886645130de58c81da0abe39a28d95720
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Keyvanhardani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0884584886645130de58c81da0abe39a28d95720
- Trigger Event: push
