kvcache-bench
Benchmark every KV cache compression method on your GPU. One command, real numbers.
```shell
kvcache-bench --model qwen3.5:9b
```
| KV Type | Context | Prompt | Gen tok/s | Prefill tok/s | VRAM +MB | Quality |
|---------|---------|-----------|-----------|---------------|----------|---------|
| f16 | 4096 | short | 80.1 | 712.3 | +142 | PASS |
| q8_0 | 4096 | short | 79.5 | 723.5 | +71 | PASS |
| q4_0 | 4096 | short | 78.2 | 698.1 | +36 | PASS |
Why
When you run a local LLM, the KV cache eats your VRAM. Ollama and llama.cpp support different KV cache quantization types (f16, q8_0, q4_0), but nobody tells you what the actual tradeoff is on YOUR hardware.
Current state of the world:
- You Google "ollama kv cache quantization" and find forum posts with conflicting advice
- You manually test each config, eyeball nvidia-smi, and guess
- No tool compares them systematically
kvcache-bench fixes this. It tests every KV cache type on your GPU and gives you a comparison table with speed, VRAM, and quality.
Install
```shell
pip install kvcache-bench
```
Usage
```shell
# Auto-detect your first model, test all KV types
kvcache-bench

# Specific model
kvcache-bench --model qwen3.5:9b

# Test at multiple context lengths (where KV savings matter most)
kvcache-bench --model llama3.1:8b --context 4096,8192,16384

# Include tool calling test
kvcache-bench --model qwen3.5:9b --prompts short,code,reasoning,tool_call

# Save results as JSON
kvcache-bench --model qwen3.5:9b --json results.json

# Just show GPU info
kvcache-bench --gpu

# List available models
kvcache-bench --list-models
```
What It Tests
For each KV cache type (f16, q8_0, q4_0), it measures:
| Metric | How |
|---|---|
| Generation speed | Tokens per second during generation |
| Prefill speed | Tokens per second processing the prompt |
| VRAM delta | Extra VRAM used beyond model weights (measured via nvidia-smi) |
| Quality | Auto-checked against expected answers (Paris, code structure, reasoning) |
How It Works
- Detects your GPU and Ollama installation
- For each KV cache type: restarts Ollama with OLLAMA_KV_CACHE_TYPE=<type>, warms up the model, and runs the benchmark prompts
- Measures VRAM before and during inference via nvidia-smi
- Extracts timing from Ollama's API response (prompt_eval_duration, eval_duration)
- Checks response quality with simple auto-graders
- Produces a markdown table (and optional JSON)
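The timing step above can be illustrated with a short helper. Ollama's /api/generate response reports prompt_eval_count/prompt_eval_duration for prefill and eval_count/eval_duration for generation, with durations in nanoseconds; the sketch below shows the conversion to tok/s, not the tool's own code.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tok/s."""
    return count / (duration_ns / 1e9)

def extract_speeds(resp: dict) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) from an Ollama
    /api/generate response body."""
    prefill = tokens_per_second(resp["prompt_eval_count"],
                                resp["prompt_eval_duration"])
    gen = tokens_per_second(resp["eval_count"], resp["eval_duration"])
    return prefill, gen
```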
What the Research Says
Based on llama.cpp community benchmarks and our testing:
| KV Type | VRAM Savings | Perplexity Impact | Best For |
|---|---|---|---|
| f16 | Baseline | None | When you have VRAM to spare |
| q8_0 | 2x | +0.004 (negligible) | Default recommendation: half the KV VRAM at negligible quality cost |
| q4_0 | 4x | +0.2 (noticeable) | When you need max context length or are VRAM-constrained |
The sweet spot for most users: q8_0. Halves your KV cache VRAM with essentially zero quality loss.
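The savings in the table follow from the per-element sizes of llama.cpp's cache types: f16 stores 2 bytes per element, q8_0 packs 32 values into a 34-byte block (~1.06 B/elem), and q4_0 into an 18-byte block (~0.56 B/elem). A back-of-the-envelope calculation, using an illustrative 8B-class shape (the layer/head numbers here are assumptions, not measured from any particular model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elem):
    """KV cache size: keys + values, every layer, every position."""
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem

# Per-element sizes for llama.cpp cache types (block formats above).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

# Illustrative shape: 32 layers, 8 KV heads (GQA), head_dim 128.
for kv_type, bpe in BYTES_PER_ELEM.items():
    mb = kv_cache_bytes(32, 8, 128, 8192, bpe) / 2**20
    print(f"{kv_type}: {mb:.0f} MiB at 8192-token context")
# → f16: 1024 MiB, q8_0: 544 MiB, q4_0: 288 MiB
```

Note the exact ratios work out to about 1.9x and 3.6x rather than the rounded 2x/4x, because each quantized block carries a scale factor alongside the packed values.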
Requirements
- Python 3.10+
- NVIDIA GPU with nvidia-smi
- Ollama installed and running
Roadmap
- Mixed K/V types (q8 keys + q4 values)
- Context length sweep charts
- HuggingFace backend (vLLM, TGI)
- TurboQuant integration
- Multi-model matrix
- HuggingFace Spaces leaderboard
- Community result submissions
License
Apache 2.0
Related
- turboquant -- TurboQuant KV cache compression (sub-4-bit)
- NVIDIA kvpress -- KV cache eviction/pruning methods
- llama.cpp -- Where KV cache quantization lives