quant-sim
Which quantization should I use? One command tells you.
pip install quant-sim
quant-sim qwen2.5:7b
Benchmarks every quantization level of a model on YOUR GPU. Measures speed, quality, and VRAM. Tells you the best tradeoff.
Why
Every model on Ollama has 5+ quantization levels. You ask Reddit "should I use Q4_K_M or Q5_K_M?" and get 10 different answers. The right answer depends on YOUR GPU, YOUR tasks, YOUR quality threshold.
No existing tool benchmarks speed AND quality across quant levels automatically:
- ollamabench, llm-benchmark, LocalScore: speed only
- lm-evaluation-harness: quality only, manual setup
- ollama-grid-search: prompt tuning, not quant comparison
quant-sim does both in one command.
Example: Compare All Quant Levels
$ quant-sim qwen2.5:7b --quick --speed-only
| Quant | Size | VRAM | Speed | Quality | Note |
|---|---|---|---|---|---|
| Q3_K_S | 3.3G | 15004M | 128.8/s | -- | |
| Q4_K_M | 4.4G | 7885M | 134.2/s | -- | * BEST * |
| Q5_K_M | 5.1G | 8532M | 105.8/s | -- | |
| Q6_K | 5.8G | 9160M | 89.0/s | -- | |
| Q8_0 | 7.5G | 10813M | 69.7/s | -- | |
Recommendation: Use Q4_K_M (qwen2.5:7b-instruct-q4_k_m).
134 tok/s, 4.4 GB.
Example: Benchmark All Local Models
$ quant-sim --local --quick --speed-only
| Quant | Size | VRAM | Speed | Note |
|---|---|---|---|---|
| Q4_K_M | 7.5G | 7857M | 117.9/s | * BEST * |
| Q4_K_M | 4.4G | 7888M | 112.0/s | |
| Q5_K_M | 5.1G | 8532M | 101.2/s | |
| Q4_K_M | 4.9G | 8619M | 98.6/s | |
| Q6_K | 5.8G | 9220M | 89.1/s | |
| Q4_K_M | 6.1G | 10717M | 80.4/s | |
| Q4_K_M | 6.1G | 10723M | 75.9/s | |
| Q8_0 | 7.5G | 11096M | 72.7/s | |
| Q4_K_M | 8.6G | 12375M | 50.9/s | |
| Q3_K_M | 13.4G | 15843M | 2.1/s | (CPU offload) |
Real output from RTX 4080 16GB with 11 models installed.
Install
pip install quant-sim  # coming soon; for now, install from source with pip install -e .
Requires: Ollama running locally, NVIDIA GPU.
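Before benchmarking, you can verify both prerequisites yourself. The sketch below is a hypothetical helper, not part of quant-sim; it assumes Ollama's default local endpoint (`http://localhost:11434`, where `/api/tags` lists installed models) and checks that `nvidia-smi` is on the `PATH`:

```python
import shutil
import urllib.request
import urllib.error

def check_prerequisites(ollama_url="http://localhost:11434"):
    """Return a dict of prerequisite-name -> bool."""
    checks = {}
    # quant-sim reads VRAM via nvidia-smi, so the driver tools must be installed.
    checks["nvidia_smi"] = shutil.which("nvidia-smi") is not None
    # Is the Ollama server reachable? /api/tags is a cheap read-only endpoint.
    try:
        with urllib.request.urlopen(f"{ollama_url}/api/tags", timeout=2) as r:
            checks["ollama"] = r.status == 200
    except (urllib.error.URLError, OSError):
        checks["ollama"] = False
    return checks

if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```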
Usage
# Benchmark a model (auto-discovers quant variants, pulls if needed)
quant-sim qwen2.5:7b
# Benchmark ALL locally installed models (no downloads)
quant-sim --local
# Quick mode (~2 min instead of ~10 min)
quant-sim qwen2.5:7b --quick
# Speed only (skip quality test)
quant-sim --local --quick --speed-only
# Don't download anything (only test what's already installed)
quant-sim qwen2.5:7b --no-pull
# Compare specific tags
quant-sim test --tags "qwen3:8b,qwen3:14b,qwen3.5:9b"
# Save results as JSON
quant-sim qwen2.5:7b --json results.json
# Show GPU info
quant-sim --gpu
# List local models
quant-sim --list
What It Measures
| Metric | How |
|---|---|
| Speed | Tokens/sec via Ollama chat API (prompt + generation) |
| Quality | 20 built-in questions: facts, math, coding, reasoning |
| VRAM | Peak GPU memory via nvidia-smi during inference |
| Size | Model file size from Ollama |
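As a rough illustration of the Speed and VRAM rows (a sketch, not quant-sim's actual code): Ollama's non-streaming `/api/chat` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tokens/sec follows directly, and GPU memory can be sampled with `nvidia-smi`:

```python
import json
import subprocess
import urllib.request

def tokens_per_sec(eval_count, eval_duration_ns):
    # Ollama reports eval_duration in nanoseconds.
    return eval_count / (eval_duration_ns / 1e9)

def generation_speed(model, prompt, url="http://localhost:11434/api/chat"):
    """Run one non-streaming chat call and return generation tok/s."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        resp = json.load(r)
    return tokens_per_sec(resp["eval_count"], resp["eval_duration"])

def vram_used_mib():
    """One nvidia-smi sample (MiB); peak tracking would poll this in a loop."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return max(int(line) for line in out.splitlines() if line.strip())
```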
Quality Test
Built-in 20-question test covering:
- Facts (5): capitals, science, literature
- Math (5): arithmetic, word problems
- Coding (5): Python functions, one-liners
- Reasoning (5): logic puzzles, trick questions
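A grader in the spirit described below ("keyword matching, code syntax checking") might look like this. This is a hypothetical sketch, not quant-sim's grader: keyword matching for facts/math/reasoning answers, and a parse check for coding answers:

```python
import ast

def grade(answer, expected_keywords=None, is_code=False):
    """Return True if the answer passes this simple automatic check."""
    if is_code:
        # A coding answer passes if it at least parses as valid Python.
        try:
            ast.parse(answer)
            return True
        except SyntaxError:
            return False
    # Otherwise require every expected keyword to appear, case-insensitively.
    text = answer.lower()
    return all(kw.lower() in text for kw in (expected_keywords or []))
```

For example, `grade("The capital of France is Paris.", ["paris"])` passes, while a coding answer with a syntax error fails.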
How It Works
- Discovers available quantization variants of your model
- For each variant: loads model, measures VRAM, runs speed prompts, runs quality questions
- Grades quality answers automatically (keyword matching, code syntax checking)
- Recommends the best tradeoff: highest quality above 80%, then fastest
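One reading of the recommendation rule above (a sketch under that assumption, not quant-sim's actual code): among variants whose quality score clears 80%, pick the fastest; if none clear the bar, fall back to the highest-quality variant:

```python
def recommend(results):
    """results: list of dicts with 'quant', 'speed' (tok/s), 'quality' (0-1)."""
    passing = [r for r in results if r["quality"] >= 0.80]
    # Fall back to the single best-quality variant if nothing passes.
    pool = passing or [max(results, key=lambda r: r["quality"])]
    return max(pool, key=lambda r: r["speed"])

results = [
    {"quant": "Q4_K_M", "speed": 134.2, "quality": 0.85},
    {"quant": "Q5_K_M", "speed": 105.8, "quality": 0.90},
    {"quant": "Q3_K_S", "speed": 128.8, "quality": 0.70},
]
print(recommend(results)["quant"])  # Q4_K_M: clears 80% and is fastest
```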
Community Leaderboard
Share your benchmarks. Compare your GPU against others.
# Submit results after benchmarking
quant-sim --local --quick --submit
# View community results
quant-sim --leaderboard
Results are stored as GitHub issues, so no backend server is needed. Set the GITHUB_TOKEN environment variable to submit.
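An issues-as-database submission boils down to one authenticated POST to GitHub's "create an issue" endpoint (`POST /repos/{owner}/{repo}/issues`). The sketch below only builds the payload; the repo name, title format, and label are illustrative assumptions, not quant-sim's actual schema:

```python
import json

def build_submission(gpu, model, quant, speed_tps, vram_mib):
    """Build a GitHub issue payload carrying one benchmark result.

    Hypothetical format: machine-readable JSON in the body, a 'benchmark'
    label so the leaderboard can filter issues.
    """
    return {
        "title": f"[bench] {gpu} | {model} {quant}",
        "body": json.dumps({
            "gpu": gpu,
            "model": model,
            "quant": quant,
            "speed_tps": speed_tps,
            "vram_mib": vram_mib,
        }, indent=2),
        "labels": ["benchmark"],
    }
```

Posting it would then use the GITHUB_TOKEN in an `Authorization: Bearer ...` header against the GitHub REST API.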
Requirements
- Python 3.10+
- Ollama installed and running
- NVIDIA GPU with nvidia-smi
License
Apache 2.0