kvcache-bench
Benchmark every KV cache compression method on your GPU. One command, real numbers.
```shell
kvcache-bench --model qwen3.5:9b
```
| KV Type | Context | Prompt | Gen tok/s | Prefill tok/s | VRAM +MB | Quality |
|---------|---------|-----------|-----------|---------------|----------|---------|
| f16 | 4096 | short | 80.1 | 712.3 | +142 | PASS |
| q8_0 | 4096 | short | 79.5 | 723.5 | +71 | PASS |
| q4_0 | 4096 | short | 78.2 | 698.1 | +36 | PASS |
Why
When you run a local LLM, the KV cache eats your VRAM. Ollama and llama.cpp support different KV cache quantization types (f16, q8_0, q4_0), but nobody tells you what the actual tradeoff is on YOUR hardware.
Current state of the world:
- You Google "ollama kv cache quantization" and find forum posts with conflicting advice
- You manually test each config, eyeball nvidia-smi, and guess
- No tool compares them systematically
kvcache-bench fixes this. It tests every KV cache type on your GPU and gives you a comparison table with speed, VRAM, and quality.
Install
```shell
pip install kvcache-bench
```
Usage
```shell
# Auto-detect your first model, test all KV types
kvcache-bench

# Specific model
kvcache-bench --model qwen3.5:9b

# Test at multiple context lengths (where KV savings matter most)
kvcache-bench --model llama3.1:8b --context 4096,8192,16384

# Include tool calling test
kvcache-bench --model qwen3.5:9b --prompts short,code,reasoning,tool_call

# Save results as JSON
kvcache-bench --model qwen3.5:9b --json results.json

# Just show GPU info
kvcache-bench --gpu

# List available models
kvcache-bench --list-models
```
What It Tests
For each KV cache type (f16, q8_0, q4_0), it measures:
| Metric | How |
|---|---|
| Generation speed | Tokens per second during generation |
| Prefill speed | Tokens per second processing the prompt |
| VRAM delta | Extra VRAM used beyond model weights (measured via nvidia-smi) |
| Quality | Auto-checked against expected answers (Paris, code structure, reasoning) |
How It Works
- Detects your GPU and Ollama installation
- For each KV cache type: restarts Ollama with OLLAMA_KV_CACHE_TYPE=<type>, warms up the model, and runs the benchmark prompts
- Measures VRAM before and during inference via nvidia-smi
- Extracts timing from Ollama's API response (prompt_eval_duration, eval_duration)
- Checks response quality with simple auto-graders
- Produces a markdown table (and optional JSON)
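The timing step above can be illustrated with a short helper. Ollama's /api/generate response reports prompt_eval_count/prompt_eval_duration for prefill and eval_count/eval_duration for generation, with durations in nanoseconds; the sketch below shows the conversion to tok/s, not the tool's own code.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tok/s."""
    return count / (duration_ns / 1e9)

def extract_speeds(resp: dict) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) from an Ollama
    /api/generate response body."""
    prefill = tokens_per_second(resp["prompt_eval_count"],
                                resp["prompt_eval_duration"])
    gen = tokens_per_second(resp["eval_count"], resp["eval_duration"])
    return prefill, gen
```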
What the Research Says
Based on llama.cpp community benchmarks and our testing:
| KV Type | VRAM Savings | Perplexity Impact | Best For |
|---|---|---|---|
| f16 | Baseline | None | When you have VRAM to spare |
| q8_0 | 2x | +0.004 (negligible) | Default recommendation: half the KV VRAM at negligible quality cost |
| q4_0 | 4x | +0.2 (noticeable) | When you need max context length or are VRAM-constrained |
The sweet spot for most users: q8_0. Halves your KV cache VRAM with essentially zero quality loss.
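The savings in the table follow from the per-element sizes of llama.cpp's cache types: f16 stores 2 bytes per element, q8_0 packs 32 values into a 34-byte block (~1.06 B/elem), and q4_0 into an 18-byte block (~0.56 B/elem). A back-of-the-envelope calculation, using an illustrative 8B-class shape (the layer/head numbers here are assumptions, not measured from any particular model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elem):
    """KV cache size: keys + values, every layer, every position."""
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem

# Per-element sizes for llama.cpp cache types (block formats above).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

# Illustrative shape: 32 layers, 8 KV heads (GQA), head_dim 128.
for kv_type, bpe in BYTES_PER_ELEM.items():
    mb = kv_cache_bytes(32, 8, 128, 8192, bpe) / 2**20
    print(f"{kv_type}: {mb:.0f} MiB at 8192-token context")
# → f16: 1024 MiB, q8_0: 544 MiB, q4_0: 288 MiB
```

Note the exact ratios work out to about 1.9x and 3.6x rather than the rounded 2x/4x, because each quantized block carries a scale factor alongside the packed values.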
Requirements
- Python 3.10+
- NVIDIA GPU with nvidia-smi
- Ollama installed and running
Roadmap
- Mixed K/V types (q8 keys + q4 values)
- Context length sweep charts
- HuggingFace backend (vLLM, TGI)
- TurboQuant integration
- Multi-model matrix
- HuggingFace Spaces leaderboard
- Community result submissions
License
Apache 2.0
Related
- turboquant -- TurboQuant KV cache compression (sub-4-bit)
- NVIDIA kvpress -- KV cache eviction/pruning methods
- llama.cpp -- Where KV cache quantization lives