ollama-benchmarker
Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation
Features
- 🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 Reranking Benchmarks: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
- 🤖 Ollama Integration: Direct integration with local Ollama API
- ⚡ Interactive Selection: Choose which models and benchmarks to run
- 📋 Structured Output: JSON output, rich tables, and ToolResult API pattern
- 🔧 Agent Integration: OpenAI function-calling tools for AI agent use
Requirements
- Python 3.10+
- Ollama running locally (default: http://localhost:11434)
- Internet connection (to download Hugging Face datasets; first run only)
Installation
pip install -e .
For development:
pip install -e ".[dev]"
Quick Start
# One-command auto benchmark: detect GPU, pick models, run all applicable tests
ollama-bench auto
# Auto benchmark without interactive prompt (GPU-fitable models only)
ollama-bench auto --no-confirm
# Override VRAM for auto benchmark
ollama-bench auto --vram 24 --no-confirm
# List available benchmarks
ollama-bench list benchmarks
# List available models on Ollama
ollama-bench list models
# List only models that fit in GPU VRAM (no CPU offload needed)
ollama-bench list models --gpu-only
# Show GPU info and model VRAM analysis
ollama-bench gpu
# Run all LLM benchmarks on a model
ollama-bench run -m llama3 --category llm
# Run specific benchmarks
ollama-bench run -m llama3 -b mmlu hellaswag gsm8k
# Auto-select GPU-fitable models and run benchmarks
ollama-bench run --gpu-only --category llm
# Run full suite (all categories)
ollama-bench suite -m llama3
# Run suite only on models that fit in VRAM
ollama-bench suite --gpu-only
# Run with limited samples for quick testing
ollama-bench run -m llama3 -b mmlu -n 100
# Output results to JSON file
ollama-bench run -m llama3 --category llm --json -o results.json
Usage
CLI Commands
List Benchmarks
# List all benchmarks
ollama-bench list benchmarks
# List by category
ollama-bench list benchmarks --category llm
ollama-bench list benchmarks --category embedding
ollama-bench list benchmarks --category reranking
List Models
ollama-bench list models
# List only models that fit in GPU VRAM (no CPU offload)
ollama-bench list models --gpu-only
GPU & VRAM
# Show GPU info and which models can run fully on GPU
ollama-bench gpu
# Override VRAM size (useful when nvidia-smi not available)
ollama-bench gpu --vram 24
# Output GPU info as JSON
ollama-bench gpu --json
The gpu command:
- Detects NVIDIA GPUs via nvidia-smi
- Estimates VRAM requirements based on model parameter count and quantization (see the sketch below)
- Shows which models fit in available VRAM (no CPU offload needed)
- Lists models that require CPU offload
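Under the hood, such an estimate boils down to bytes-per-weight for the quantization, times parameter count, plus headroom for KV cache and activations. The sketch below only illustrates the idea; the bytes-per-weight table and the 20% overhead factor are assumptions, not the estimator ollama-bench ships.

```python
# Illustrative, quantization-aware VRAM estimate. The bytes-per-weight
# values and the 20% overhead margin are assumptions for this sketch.
BYTES_PER_WEIGHT = {
    "F16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.68,
    "Q4_K_M": 0.57,
}

def estimate_vram_gb(params_billions: float, quant: str = "Q4_K_M",
                     overhead: float = 1.2) -> float:
    """Rough VRAM need: weight bytes plus ~20% for KV cache/activations."""
    return params_billions * BYTES_PER_WEIGHT[quant] * overhead

# An 8B model at Q4_K_M comes out around 5.5 GB, so it fits an 8 GB GPU.
print(f"{estimate_vram_gb(8):.1f} GB")
```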
Auto Benchmark (Hardware-Aware)
# Interactive mode: detect GPU, show models, let you pick, run all applicable benchmarks
ollama-bench auto
# Non-interactive: auto-select all GPU-fitable models and run
ollama-bench auto --no-confirm
# Override VRAM and run non-interactive
ollama-bench auto --vram 16 --no-confirm
# Limit samples for quick test
ollama-bench auto --no-confirm -n 50
# Output results as JSON
ollama-bench auto --no-confirm --json -o auto_results.json
The auto command:
- Detects your GPU VRAM via nvidia-smi
- Lists all Ollama models with VRAM estimates
- Presents an interactive multi-select menu (GPU-fitable models pre-selected)
- You pick models by number (e.g. 1,3,5 or 1-3 or all or gpu)
- Automatically matches each model to its applicable benchmarks (LLM/embedding/reranking)
- Runs all benchmarks and outputs results
Interactive selection options:
- Enter numbers: 1,3,5 or 1-3 (range; see the parsing sketch below)
- all - select all models
- gpu or Enter - select only GPU-fitable models (default)
- q - quit
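A parser for this selection grammar might look like the sketch below; it is illustrative, not the tool's actual implementation.

```python
# Illustrative parser for the selection grammar above
# ("1,3,5", "1-3", "all", "gpu"/Enter, "q"); not ollama-bench's own code.
def parse_selection(text: str, total: int) -> list[int] | str:
    text = text.strip().lower()
    if text in ("", "gpu"):
        return "gpu"   # default: GPU-fitable models only
    if text == "all":
        return list(range(1, total + 1))
    if text == "q":
        return "quit"
    chosen: set[int] = set()
    for part in text.split(","):
        if "-" in part:  # a range such as "1-3"
            lo, hi = (int(x) for x in part.split("-"))
            chosen.update(range(lo, hi + 1))
        else:
            chosen.add(int(part))
    return sorted(i for i in chosen if 1 <= i <= total)

print(parse_selection("1,3,5", 6))  # [1, 3, 5]
print(parse_selection("1-3", 6))    # [1, 2, 3]
```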
Run Benchmarks
# Run specific benchmarks on specific models
ollama-bench run -m llama3 -b mmlu hellaswag
# Run all LLM benchmarks
ollama-bench run -m llama3 --category llm
# Run on multiple models
ollama-bench run -m llama3 mistral qwen2 --category llm
# Auto-select models that fit in GPU VRAM
ollama-bench run --gpu-only --category llm
# Limit sample count
ollama-bench run -m llama3 -b mmlu -n 50
# Use custom Ollama host
ollama-bench run -m llama3 -b mmlu --host http://192.168.1.100:11434
Run Full Suite
# Full suite on a model
ollama-bench suite -m llama3
# Auto-select GPU-fitable models and run full suite
ollama-bench suite --gpu-only
# Only embedding benchmarks
ollama-bench suite -m nomic-embed-text --category embedding
# Only reranking
ollama-bench suite -m jina-reranker-v2-small --category reranking
CLI Flags
| Flag | Description |
|---|---|
| -V, --version | Show version |
| -v, --verbose | Verbose output (debug logging) |
| -o, --output | Output to file path (JSON) |
| --json | JSON output format |
| -q, --quiet | Suppress non-essential output |
Python API
from ollama_benchmarker import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run
# List benchmarks
result = list_benchmarks(category="llm")
print(result.success) # True
print(result.data) # List of benchmark info dicts
# Discover models
models = discover_models()
print(models.data) # List of model info dicts
# Only models that fit in GPU VRAM
models = discover_models(gpu_only=True)
print(models.data) # Only models that don't need CPU offload
# GPU information and VRAM analysis
info = gpu_info()
print(info.data["gpus"]) # GPU details
print(info.data["gpu_fitable_models"]) # Models that fit in VRAM
print(info.data["offload_required_models"]) # Models needing CPU offload
# Auto-detect hardware, select GPU-fitable models, run all benchmarks
result = auto_run()
print(result.data["selected_models"]) # Number of models selected
print(result.data["results"]) # Full benchmark results
# Run a benchmark
result = run_benchmark(benchmark_name="mmlu", model="llama3")
print(result.data["metric_value"]) # Accuracy score
# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
print(result.data) # Full results for all models
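Every call above returns a ToolResult exposing success and data attributes. A minimal sketch of that shape, assuming a dataclass; the error field is a guess at how failures might be reported, not confirmed API:

```python
from dataclasses import dataclass
from typing import Any

# Sketch of the ToolResult shape implied by the calls above; the real
# class may differ, and the error field is an assumption.
@dataclass
class ToolResult:
    success: bool              # True when the call completed
    data: Any = None           # payload: the dicts/lists shown above
    error: str | None = None   # assumed: message populated on failure

result = ToolResult(success=True, data=[{"name": "mmlu", "category": "llm"}])
if result.success:
    print(result.data)
```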
Agent Integration (OpenAI Function Calling)
from ollama_benchmarker.tools import TOOLS, dispatch
# Use TOOLS in your OpenAI function-calling setup
# When you receive a tool call:
result = dispatch("ollama_bench_run_benchmark", {
"benchmark_name": "mmlu",
"model": "llama3"
})
# One-command auto benchmark: detect hardware, select models, run all
result = dispatch("ollama_bench_auto_run", {"gpu_only": True})
# Check GPU VRAM and filter models
result = dispatch("ollama_bench_gpu_info", {})
# Discover only GPU-fitable models
result = dispatch("ollama_bench_discover_models", {"gpu_only": True})
Benchmark Categories
LLM Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Single-true truthfulness |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Multiple-true truthfulness |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |
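For reference, pass@1 is the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) evaluated at k=1: with n completions per problem, c of which pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): the probability that at
    least one of k samples drawn from n (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 passing completions out of 10 samples per problem
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```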
Embedding Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-Means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |
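The retrieval and STS tasks both reduce to cosine similarity over embedding vectors. Below is a self-contained sketch of cosine-similarity retrieval against a local Ollama server; the /api/embeddings endpoint is part of the standard Ollama API, but this code is an illustration, not the suite's internals.

```python
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]}.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

query = embed("how does self-attention work")
docs = ["Transformers rely on self-attention.", "Pasta recipes for beginners."]
best = max(docs, key=lambda d: cosine(query, embed(d)))
print(best)  # the document most similar to the query
```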
Reranking Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |
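The reranking metrics above (and the retrieval metric in the embedding table) are NDCG@k: the discounted cumulative gain of the produced ranking, DCG@k = sum over the top k positions of rel_i / log2(i + 1), divided by the DCG of the ideal ordering. A standalone sketch:

```python
from math import log2

def dcg(rels: list[float], k: int) -> float:
    # Gain at rank i (1-based) is discounted by log2(i + 1).
    return sum(r / log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels: list[float], k: int) -> float:
    """NDCG@k: DCG of the produced ranking over the ideal DCG."""
    ideal = dcg(sorted(ranked_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0

# The reranker placed a relevance-2 document third; ideal puts it first.
print(round(ndcg_at_k([1, 0, 2, 0, 1], 5), 3))  # 0.762
```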
References
All benchmark papers are available in the pdf/ directory:
| Paper | arXiv ID |
|---|---|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2307.09371 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Lint
ruff format . && ruff check .
# Type check
mypy ollama_benchmarker/
License
GPL-3.0-or-later
File details
Details for the file ollama_benchmarker-0.1.3.tar.gz.
File metadata
- Download URL: ollama_benchmarker-0.1.3.tar.gz
- Upload date:
- Size: 50.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8ba540a3ce0a5d92693cd417878c1630cdc0b32f289a056976b90d574e7c10a |
| MD5 | 963615e17f9eaf4b63207e3ccce64c54 |
| BLAKE2b-256 | 024cbc10e2f96f88e815668533152d4e15d4681695b25ca7a3eb68b640308db6 |
File details
Details for the file ollama_benchmarker-0.1.3-py3-none-any.whl.
File metadata
- Download URL: ollama_benchmarker-0.1.3-py3-none-any.whl
- Upload date:
- Size: 57.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4707b6118d98739917ed4a79cb58221289c162c2890230594873a187d456295d |
| MD5 | 077f62fc519ff9caccdad3cf86e6a867 |
| BLAKE2b-256 | e0a01eec82c6caa7073270954a6f17aee9858d51f5d69099e234976b6cffc856 |