
kvcache-bench

Benchmark every KV cache compression method on your GPU. One command, real numbers.

```shell
kvcache-bench --model qwen3.5:9b
```
| KV Type | Context | Prompt    | Gen tok/s | Prefill tok/s | VRAM +MB | Quality |
|---------|---------|-----------|-----------|---------------|----------|---------|
| f16     | 4096    | short     | 80.1      | 712.3         | +142     | PASS    |
| q8_0    | 4096    | short     | 79.5      | 723.5         | +71      | PASS    |
| q4_0    | 4096    | short     | 78.2      | 698.1         | +36      | PASS    |

Why

When you run a local LLM, the KV cache eats your VRAM. Ollama and llama.cpp support different KV cache quantization types (f16, q8_0, q4_0), but nobody tells you what the actual tradeoff is on YOUR hardware.

Current state of the world:

  • You Google "ollama kv cache quantization" and find forum posts with conflicting advice
  • You manually test each config, eyeball nvidia-smi, and guess
  • No tool compares them systematically

kvcache-bench fixes this. It tests every KV cache type on your GPU and gives you a comparison table with speed, VRAM, and quality.

Install

```shell
pip install kvcache-bench
```

Usage

```shell
# Auto-detect your first model, test all KV types
kvcache-bench

# Specific model
kvcache-bench --model qwen3.5:9b

# Test at multiple context lengths (where KV savings matter most)
kvcache-bench --model llama3.1:8b --context 4096,8192,16384

# Include tool calling test
kvcache-bench --model qwen3.5:9b --prompts short,code,reasoning,tool_call

# Save results as JSON
kvcache-bench --model qwen3.5:9b --json results.json

# Just show GPU info
kvcache-bench --gpu

# List available models
kvcache-bench --list-models
```

What It Tests

For each KV cache type (f16, q8_0, q4_0), it measures:

| Metric | How it's measured |
|------------------|-------------------------------------------------------------------|
| Generation speed | Tokens per second during generation |
| Prefill speed | Tokens per second processing the prompt |
| VRAM delta | Extra VRAM used beyond model weights (measured via nvidia-smi) |
| Quality | Auto-checked against expected answers (Paris, code structure, reasoning) |
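
The VRAM delta boils down to two snapshots of `nvidia-smi` output, one before and one during inference. A minimal sketch of that measurement (the function names here are illustrative, not the package's actual API):

```python
import subprocess

def gpu_memory_used_mb() -> int:
    """Query current GPU memory usage in MiB via nvidia-smi's CSV query mode.

    Assumes at least one NVIDIA GPU is visible; takes the first one.
    """
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0].strip())

def vram_delta_mb(baseline_mb: int, during_inference_mb: int) -> int:
    """VRAM attributable to the KV cache: usage during inference minus baseline."""
    return during_inference_mb - baseline_mb
```

Subtracting a baseline taken after model load (rather than raw usage) is what isolates the KV cache from the model weights themselves.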

How It Works

  1. Detects your GPU and Ollama installation
  2. For each KV cache type: restarts Ollama with OLLAMA_KV_CACHE_TYPE=<type>, warms up the model, runs benchmark prompts
  3. Measures VRAM before and during inference via nvidia-smi
  4. Extracts timing from Ollama's API response (prompt_eval_duration, eval_duration)
  5. Checks response quality with simple auto-graders
  6. Produces a markdown table (and optional JSON)
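
Step 4 is simple arithmetic over the nanosecond durations Ollama reports. A sketch, using the real field names from Ollama's `/api/generate` response (the numbers below are illustrative):

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert Ollama's token count + nanosecond duration into tok/s."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0

# Trimmed example of an Ollama /api/generate response body.
resp = {
    "prompt_eval_count": 26,            # prompt tokens
    "prompt_eval_duration": 36_500_000, # ns spent on prefill
    "eval_count": 290,                  # generated tokens
    "eval_duration": 3_600_000_000,     # ns spent generating
}

prefill_tps = tokens_per_second(resp["prompt_eval_count"],
                                resp["prompt_eval_duration"])
gen_tps = tokens_per_second(resp["eval_count"], resp["eval_duration"])
```

Because the timings come from the server itself, they exclude client-side network and JSON-parsing overhead.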

What the Research Says

Based on llama.cpp community benchmarks and our testing:

| KV Type | VRAM Savings | Perplexity Impact | Best For |
|---------|--------------|----------------------|----------|
| f16 | Baseline | None | When you have VRAM to spare |
| q8_0 | 2x | +0.004 (negligible) | Default recommendation: free VRAM, near-zero quality cost |
| q4_0 | 4x | +0.2 (noticeable) | When you need max context length or are VRAM-constrained |

The sweet spot for most users: q8_0. Halves your KV cache VRAM with essentially zero quality loss.
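
The savings ratios fall directly out of element width: f16 stores 2 bytes per KV element, q8_0 roughly 1, q4_0 roughly 0.5. A back-of-the-envelope sizer, assuming llama3.1:8b's shape (32 layers, 8 KV heads via GQA, head dim 128) and ignoring the small per-block overhead of the quantized formats:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size: keys + values, every layer, every position.

    Quantized formats carry per-block scale overhead not counted here,
    so q8_0/q4_0 figures are slight underestimates.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# llama3.1:8b at 8192 context
for name, width in [("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    gib = kv_cache_bytes(32, 8, 128, 8192, width) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

At 8k context that works out to about 1 GiB of KV cache at f16, which is why the delta grows with `--context` and why q8_0's 2x cut is worth having by default.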

Requirements

  • Python 3.10+
  • NVIDIA GPU with nvidia-smi
  • Ollama installed and running

Roadmap

  • Mixed K/V types (q8 keys + q4 values)
  • Context length sweep charts
  • HuggingFace backend (vLLM, TGI)
  • TurboQuant integration
  • Multi-model matrix
  • HuggingFace Spaces leaderboard
  • Community result submissions

License

Apache 2.0
