Automatic KV-cache optimization for HuggingFace Transformers - Find the optimal cache strategy, attention backend, and configuration for your model and hardware.

Maintainers

Keyvan

KVCache Auto-Tuner

Automatic KV-Cache Optimization for HuggingFace Transformers

Find the optimal cache strategy, attention backend, and configuration for your model and hardware.

What is KVCache Auto-Tuner?

KVCache Auto-Tuner (kvat) automatically benchmarks and optimizes your HuggingFace Transformers inference pipeline. Stop guessing which configuration works best - let the tuner find it for you.

# Install and optimize your model in seconds
pip install "kvat[full]"
kvat tune gpt2 --profile chat-agent

Performance

Baseline vs Optimized

See how kvat improves your Transformers inference:

Model	Without kvat	With kvat	Improvement
GPT-2 (124M)	118.1 tok/s	120.2 tok/s	+1.8%
Qwen2.5-0.5B	28.7 tok/s	29.5 tok/s	+2.7%
Phi-1.5 (1.3B)	45.2 tok/s	45.6 tok/s	+0.9%
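
The Improvement column is a relative throughput change, (with - without) / without. The following is a minimal sketch that recomputes it from the rounded figures in the table above; the names `results` and `improvement_pct` are illustrative and not part of kvat. Because the published throughputs are rounded to one decimal place, a recomputed value can differ from the table in the last digit (Qwen2.5-0.5B comes out at 2.8% here against the reported 2.7%).

```python
# Throughput in tok/s copied from the table above: (without kvat, with kvat).
results = {
    "GPT-2 (124M)": (118.1, 120.2),
    "Qwen2.5-0.5B": (28.7, 29.5),
    "Phi-1.5 (1.3B)": (45.2, 45.6),
}


def improvement_pct(baseline: float, optimized: float) -> float:
    """Relative throughput change in percent."""
    return 100.0 * (optimized - baseline) / baseline


# Round to one decimal place, matching the precision used in the table.
gains = {model: round(improvement_pct(b, o), 1) for model, (b, o) in results.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain}%")
```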

Charts (not reproduced here): Throughput Comparison; Performance Gain.

Note: Results vary by model and hardware. Larger improvements are typical for models that benefit from Flash Attention and dynamic caching.

Multi-Model Benchmarks

Desktop (RTX 4060 - 8GB VRAM):

Model	TTFT	Throughput	VRAM	Best Config
GPT-2	9.1ms	124.6 tok/s	283MB	dynamic/sdpa_flash
Phi-1.5	40.9ms	52.8 tok/s	2.8GB	dynamic/sdpa_flash
Qwen2.5-0.5B	33.9ms	33.6 tok/s	975MB	dynamic/eager

Server (RTX 4000 Ada - 20GB VRAM):

Model	TTFT	Throughput	VRAM	Best Config
GPT-2	4.2ms	365.4 tok/s	264MB	dynamic/sdpa_flash
Qwen2.5-7B	284ms	3.3 tok/s	13.6GB	dynamic/sdpa_flash

For GPT-2, the only model benchmarked on both machines, server throughput is about 2.9x the desktop figure (365.4 vs 124.6 tok/s).

Charts (not reproduced here): Multi-Model Performance Overview; TTFT Comparison (lower is better); Throughput Comparison (higher is better).

Quick Start

CLI Usage

# Optimize any HuggingFace model
kvat tune meta-llama/Llama-3.2-1B --profile chat-agent

# Quick test
kvat tune gpt2 --profile ci-micro -v

# Show system info
kvat info

Python API

from kvat.core.schema import TuneConfig, DeviceType
from kvat.core.profiles import get_profile
from kvat.engines.transformers import TransformersAdapter
from kvat.core.search import TuningSearch

# Configure and run optimization
config = TuneConfig(
    model_id="meta-llama/Llama-3.2-1B",
    device=DeviceType.CUDA,
    profile=get_profile("chat-agent"),
    output_dir="./results",
)

adapter = TransformersAdapter()
search = TuningSearch(config=config, adapter=adapter)
result = search.run()

Features

Feature	Description
Automatic Optimization	Find the best configuration without manual experimentation
Multiple Profiles	Built-in presets for Chat, RAG, and Longform workloads
Production-Ready Output	Get drop-in Python code snippets and JSON configs
Beautiful Reports	Markdown and HTML reports with performance comparisons
Early Stopping	Smart pruning of dominated configurations
Extensible	Adapter-based design; vLLM, llama.cpp, and Ollama adapters are planned (see Roadmap)
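
The "Early Stopping" row mentions pruning dominated configurations. How kvat defines dominance is not described here; one common reading is Pareto dominance, where a configuration is dropped if another one is at least as good on every metric and strictly better on at least one. The sketch below illustrates that idea only. The `Candidate` class, its fields, and the numbers are hypothetical and are not kvat's schema or measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical benchmark result for one configuration (not kvat's schema)."""
    name: str
    ttft_ms: float         # time to first token, lower is better
    throughput_tps: float  # tokens per second, higher is better
    vram_mb: float         # peak memory, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = (
        a.ttft_ms <= b.ttft_ms
        and a.throughput_tps >= b.throughput_tps
        and a.vram_mb <= b.vram_mb
    )
    strictly_better = (
        a.ttft_ms < b.ttft_ms
        or a.throughput_tps > b.throughput_tps
        or a.vram_mb < b.vram_mb
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up numbers, for illustration only.
candidates = [
    Candidate("dynamic/sdpa_flash", ttft_ms=9.0, throughput_tps=125.0, vram_mb=280.0),
    Candidate("dynamic/eager", ttft_ms=11.0, throughput_tps=118.0, vram_mb=300.0),
    Candidate("static/sdpa_flash", ttft_ms=10.0, throughput_tps=130.0, vram_mb=350.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy data, `dynamic/eager` is worse than `dynamic/sdpa_flash` on all three metrics and is pruned, while `static/sdpa_flash` survives because it trades more VRAM for higher throughput.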

Optimization Parameters

Parameter	Options	Impact
Cache Strategy	Dynamic, Static, Sliding Window	Memory & prefill speed
Attention Backend	SDPA Flash, Memory Efficient, Math, Eager	Throughput & VRAM
Data Type	bfloat16, float16, float32	Speed vs precision
Compilation	torch.compile modes	Startup vs runtime
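
To give a sense of how large the search space is, the sketch below enumerates the first three rows of the table as a plain grid (the roadmap lists grid search as the core engine). The option labels are illustrative spellings of the table entries, not necessarily the identifiers kvat uses internally, and `torch.compile` modes are left out because the table does not list them.

```python
from itertools import product

# Illustrative labels for the options in the table above (assumed spellings).
CACHE_STRATEGIES = ["dynamic", "static", "sliding_window"]
ATTENTION_BACKENDS = ["sdpa_flash", "sdpa_mem_efficient", "sdpa_math", "eager"]
DTYPES = ["bfloat16", "float16", "float32"]


def build_grid() -> list[dict]:
    """Enumerate every (cache, attention, dtype) combination."""
    return [
        {"cache": cache, "attention": attention, "dtype": dtype}
        for cache, attention, dtype in product(
            CACHE_STRATEGIES, ATTENTION_BACKENDS, DTYPES
        )
    ]


grid = build_grid()
print(len(grid))  # 3 * 4 * 3 = 36 combinations
```

Each added dimension multiplies the number of benchmark runs, which is why pruning and short profiles such as `ci-micro` matter in practice.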

Built-in Profiles

Profile	Context	Output	Focus
`chat-agent`	2-8K	64-256	TTFT (latency)
`rag`	8-32K	256-512	Balanced
`longform`	4-8K	1-2K	Throughput
`ci-micro`	512	32	Quick testing
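
The profile table can be read as a small data structure: a context range, an output range, and a focus metric. The sketch below encodes the rows that way and finds which profiles cover a given workload. `WorkloadProfile`, `fits`, and the focus labels are hypothetical and do not mirror `kvat.core.profiles`; it also assumes that "K" means 1024 tokens, which the table does not state.

```python
from dataclasses import dataclass

K = 1024  # assumption: "K" in the table means 1024 tokens


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical container for one row of the profiles table."""
    name: str
    context_tokens: tuple[int, int]  # (min, max) prompt length
    output_tokens: tuple[int, int]   # (min, max) generated length
    focus: str


PROFILES = {
    p.name: p
    for p in [
        WorkloadProfile("chat-agent", (2 * K, 8 * K), (64, 256), "ttft"),
        WorkloadProfile("rag", (8 * K, 32 * K), (256, 512), "balanced"),
        WorkloadProfile("longform", (4 * K, 8 * K), (1 * K, 2 * K), "throughput"),
        WorkloadProfile("ci-micro", (512, 512), (32, 32), "quick-test"),
    ]
}


def fits(profile: WorkloadProfile, prompt_len: int, output_len: int) -> bool:
    """True if both lengths fall inside the profile's ranges."""
    lo_ctx, hi_ctx = profile.context_tokens
    lo_out, hi_out = profile.output_tokens
    return lo_ctx <= prompt_len <= hi_ctx and lo_out <= output_len <= hi_out


# A 4096-token prompt with a 128-token reply.
matching = [name for name, p in PROFILES.items() if fits(p, 4096, 128)]
print(matching)
```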

Installation

# Recommended: Full installation with all dependencies
pip install "kvat[full]"

# Basic installation
pip install kvat

# From source
git clone https://github.com/Keyvanhardani/kvcache-autotune.git
cd kvcache-autotune
pip install -e ".[full,dev]"

Requirements: Python 3.9+, PyTorch 2.0+, Transformers 4.35+
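
Before a long tuning run it can help to confirm the environment meets these minimums. The sketch below uses only the standard library and reads installed versions from package metadata, so it does not import PyTorch or Transformers. The helper names are illustrative and not part of kvat, and the version parsing is deliberately simple (it compares only the major and minor numbers).

```python
import sys
from importlib import metadata


def version_tuple(version: str, width: int = 2) -> tuple:
    """Parse the leading numeric components, e.g. '2.1.0+cu121' -> (2, 1)."""
    parts = []
    for piece in version.split(".")[:width]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare major.minor of the installed version against the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Minimums taken from the Requirements line above, keyed by distribution name.
MINIMUMS = {"torch": "2.0", "transformers": "4.35"}


def check_environment() -> dict:
    """Return a {name: status} report for Python and each required package."""
    report = {"python": "ok" if sys.version_info >= (3, 9) else "too old"}
    for dist, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[dist] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[dist] = "ok" if ok else f"{installed} < {minimum}"
    return report


print(check_environment())
```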

Output Files

File	Description
`best_plan.json`	Complete configuration with metrics
`optimized_config.py`	Drop-in Python code
`report.md`	Human-readable summary
`report.html`	Visual report with charts

Example Output

+-----------------------------------------------------------------------------+
| Best Configuration                                                          |
|                                                                             |
| Cache Strategy: dynamic                                                     |
| Attention Backend: sdpa_flash                                               |
| Data Type: bfloat16                                                         |
| Score: 100.00                                                               |
+-----------------------------------------------------------------------------+

Roadmap

v0.1.0 (Current)

- Core tuning engine with grid search
- HuggingFace Transformers adapter
- CLI interface (kvat tune, kvat apply, kvat compare)
- Built-in profiles (chat-agent, rag, longform)
- CUDA/GPU memory tracking
- Windows & Linux support

v0.2.0 (Next)

- Batch size optimization
- CPU offload strategies
- kvat watch - Continuous monitoring
- Profile recommendations based on hardware

v0.3.0 (Planned)

- Ollama adapter - Local model optimization
- llama.cpp adapter - GGUF model support
- vLLM adapter - Production serving
- Quantized KV-cache (INT8/INT4)

v1.0.0 (Future)

- HuggingFace Hub integration
- Real-time inference monitoring
- A/B testing framework

Contributing

Contributions are welcome! See CONTRIBUTING.md for details.

pip install -e ".[dev]"
pytest tests/ -v
ruff check kvat/

License

Apache 2.0 - See LICENSE for details.

Citation

@software{kvat,
  title = {KVCache Auto-Tuner: Automatic KV-Cache Optimization for Transformers},
  author = {Keyvanhardani},
  year = {2025},
  url = {https://github.com/Keyvanhardani/kvcache-autotune}
}

Keyvan.ai | LinkedIn

Made in Germany with dedication for the HuggingFace community

Release history

- 0.1.4 (Jan 13, 2026)
- 0.1.3 (Jan 13, 2026)
- 0.1.2 (Jan 13, 2026)
- 0.1.1 (Jan 13, 2026)
- 0.1.0 (Jan 13, 2026), this version

Download files

Source Distribution

kvat-0.1.0.tar.gz (44.0 kB), uploaded Jan 13, 2026, Source

Built Distribution

kvat-0.1.0-py3-none-any.whl (45.5 kB), uploaded Jan 13, 2026, Python 3

File details

Details for the file kvat-0.1.0.tar.gz.

File metadata

Download URL: kvat-0.1.0.tar.gz
Upload date: Jan 13, 2026
Size: 44.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kvat-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b1610a8c40aa993f5bb35107a7d53e55d1162fd94d046b60ea3c4230b4b35182`
MD5	`91cfa9fd017746d38c8b68cc80698ec6`
BLAKE2b-256	`0e159336e668bdc360f61ce001e6cb33bc7bb5cbd5fbe78d444600aa15a3f6fa`
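
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the function names and the local file path in the usage note are assumptions, and the same approach applies to the wheel with its own digest.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 published above for kvat-0.1.0.tar.gz.
EXPECTED = "b1610a8c40aa993f5bb35107a7d53e55d1162fd94d046b60ea3c4230b4b35182"


def verify(path: str, expected: str = EXPECTED) -> bool:
    """True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

After downloading the source distribution into the current directory, `verify("kvat-0.1.0.tar.gz")` should return `True` if the file is intact.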

Provenance

The following attestation bundles were made for kvat-0.1.0.tar.gz:

Publisher: publish.yml on Keyvanhardani/kvcache-autotune

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kvat-0.1.0.tar.gz
- Subject digest: b1610a8c40aa993f5bb35107a7d53e55d1162fd94d046b60ea3c4230b4b35182
- Sigstore transparency entry: 817831416
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: Keyvanhardani/kvcache-autotune@68f7ad71890bf79c4204eed72dcb9948ec09bbd1
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Keyvanhardani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@68f7ad71890bf79c4204eed72dcb9948ec09bbd1
- Trigger Event: release

File details

Details for the file kvat-0.1.0-py3-none-any.whl.

File metadata

Download URL: kvat-0.1.0-py3-none-any.whl
Upload date: Jan 13, 2026
Size: 45.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kvat-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b2bb7e79c016d770c88fe39958de5dd979d1040559fb0d82ff2e1f3030311ca`
MD5	`60b4319cd1570b0595319b1deee58ec1`
BLAKE2b-256	`9efc1df416fa9f88d19cf639b4e6069ff0dd100b49392f28c673562d81f5dbc2`

Provenance

The following attestation bundles were made for kvat-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Keyvanhardani/kvcache-autotune

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kvat-0.1.0-py3-none-any.whl
- Subject digest: 5b2bb7e79c016d770c88fe39958de5dd979d1040559fb0d82ff2e1f3030311ca
- Sigstore transparency entry: 817831472
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: Keyvanhardani/kvcache-autotune@68f7ad71890bf79c4204eed72dcb9948ec09bbd1
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Keyvanhardani
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@68f7ad71890bf79c4204eed72dcb9948ec09bbd1
- Trigger Event: release
