# ollama-benchmarker

Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation.
## Features

- 🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 Reranking Benchmarks: LLM pointwise reranking, embedding reranking, LLM listwise reranking
- 🤖 Ollama Integration: Direct integration with the local Ollama API
- ⚡ Interactive Selection: Choose which models and benchmarks to run
- 📋 Structured Output: JSON output, rich tables, and a ToolResult API pattern
- 🔧 Agent Integration: OpenAI function-calling tools for AI agent use
## Requirements

- Python 3.10+
- Ollama running locally (default: http://localhost:11434); a quick reachability check is sketched below
- Internet connection, to download HuggingFace datasets on the first run
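Before kicking off a long run, it can help to confirm that the Ollama server is actually reachable. This is a minimal sketch using only the standard library and Ollama's stock `/api/tags` model-listing endpoint; the host constant mirrors the default above, and the helper name is ours:

```python
# Minimal reachability check for a local Ollama server (standard library only).
# /api/tags is Ollama's stock model-listing endpoint.
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # the default host noted above

def ollama_models(host: str = OLLAMA_HOST) -> list[str]:
    """Return the names of locally available models, or raise if unreachable."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

if __name__ == "__main__":
    print(ollama_models())  # e.g. ['llama3:latest', 'nomic-embed-text:latest']
```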
## Installation

```bash
pip install -e .
```

For development:

```bash
pip install -e ".[dev]"
```
## Quick Start

```bash
# List available benchmarks
ollama-bench list benchmarks

# List available models on Ollama
ollama-bench list models

# Run all LLM benchmarks on a model
ollama-bench run -m llama3 --category llm

# Run specific benchmarks
ollama-bench run -m llama3 -b mmlu hellaswag gsm8k

# Run the full suite (all categories)
ollama-bench suite -m llama3

# Run with limited samples for quick testing
ollama-bench run -m llama3 -b mmlu -n 100

# Output results to a JSON file (see the loading sketch below)
ollama-bench run -m llama3 --category llm --json -o results.json
```
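The exact schema of `results.json` is not documented here, so a hedged first step is to load the file and pretty-print it rather than assume field names:

```python
# Hedged sketch: inspect the JSON written by --json -o results.json.
# The schema is not assumed; pretty-print it and read the field names off the output.
import json

with open("results.json") as fh:
    results = json.load(fh)

print(json.dumps(results, indent=2, ensure_ascii=False))
```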
## Usage

### CLI Commands

#### List Benchmarks

```bash
# List all benchmarks
ollama-bench list benchmarks

# List by category
ollama-bench list benchmarks --category llm
ollama-bench list benchmarks --category embedding
ollama-bench list benchmarks --category reranking
```

#### List Models

```bash
ollama-bench list models
```
#### Run Benchmarks

```bash
# Run specific benchmarks on specific models
ollama-bench run -m llama3 -b mmlu hellaswag

# Run all LLM benchmarks
ollama-bench run -m llama3 --category llm

# Run on multiple models
ollama-bench run -m llama3 mistral qwen2 --category llm

# Limit the sample count
ollama-bench run -m llama3 -b mmlu -n 50

# Use a custom Ollama host
ollama-bench run -m llama3 -b mmlu --host http://192.168.1.100:11434
```
#### Run Full Suite

```bash
# Full suite on a model
ollama-bench suite -m llama3

# Only embedding benchmarks
ollama-bench suite -m nomic-embed-text --category embedding

# Only reranking benchmarks
ollama-bench suite -m jina-reranker-v2-small --category reranking
```
### CLI Flags

| Flag | Description |
|---|---|
| `-V, --version` | Show version |
| `-v, --verbose` | Verbose output (debug logging) |
| `-o, --output` | Write output to a file path (JSON) |
| `--json` | Emit output in JSON format |
| `-q, --quiet` | Suppress non-essential output |
### Python API

```python
from ollama_benchmarker import (
    ToolResult,
    discover_models,
    list_benchmarks,
    run_benchmark,
    run_suite,
)

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)  # True
print(result.data)     # List of benchmark info dicts

# Discover models
models = discover_models()
print(models.data)  # List of model info dicts

# Run a benchmark
result = run_benchmark(benchmark_name="mmlu", model="llama3")
print(result.data["metric_value"])  # Accuracy score

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
print(result.data)  # Full results for all models
```
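Every call above returns a `ToolResult`, so a defensive pattern is to branch on `success` before reading `data`. This sketch compares several models on one benchmark, reusing the `metric_value` key from the example above; the failure field fetched via `getattr` is an assumption about the ToolResult pattern, not a documented attribute:

```python
from ollama_benchmarker import run_benchmark

models = ["llama3", "mistral", "qwen2"]
scores: dict[str, float] = {}

for model in models:
    result = run_benchmark(benchmark_name="mmlu", model=model)
    if result.success:
        scores[model] = result.data["metric_value"]
    else:
        # The failure field is assumed; inspect ToolResult for the real attribute.
        print(f"{model}: benchmark failed ({getattr(result, 'error', 'unknown error')})")

# Highest accuracy first
for model, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:20s} {score:.3f}")
```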
### Agent Integration (OpenAI Function Calling)

```python
from ollama_benchmarker.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup.
# When you receive a tool call:
result = dispatch("ollama_bench_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3",
})
```
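End to end, wiring `TOOLS` into an OpenAI chat loop might look like the sketch below. It assumes `TOOLS` is a list of OpenAI-format tool schemas (as the feature list suggests) and that `dispatch` returns a `ToolResult` whose `data` is JSON-serializable; the model name and user prompt are placeholders:

```python
import json

from openai import OpenAI

from ollama_benchmarker.tools import TOOLS, dispatch

client = OpenAI()
messages = [{"role": "user", "content": "Benchmark llama3 on MMLU and report the score."}]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any function-calling-capable model works
    messages=messages,
    tools=TOOLS,
)
assistant = response.choices[0].message

if assistant.tool_calls:
    messages.append(assistant)  # keep the assistant turn that requested the calls
    for call in assistant.tool_calls:
        # Route each call through the package's dispatcher.
        result = dispatch(call.function.name, json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            # success/data follow the ToolResult usage shown in the Python API section.
            "content": json.dumps(result.data if result.success else {"success": False}),
        })
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(final.choices[0].message.content)
```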
## Benchmark Categories

### LLM Benchmarks

| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Truthfulness, single correct answer |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Truthfulness, multiple correct answers |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |
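Most accuracy rows above reduce to the same loop: render the question and choices into a prompt, generate a short completion, extract an answer letter, and grade it against gold. The sketch below illustrates that pattern against Ollama's stock `/api/generate` endpoint; it is not this package's internal harness, and the prompt format is an assumption:

```python
# Illustrative multiple-choice grader (not the package's internal harness).
import json
import re
import urllib.request

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Call Ollama's stock /api/generate endpoint and return the completion text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]

def grade(model: str, question: str, choices: list[str], gold: str) -> bool:
    """Render an MMLU-style prompt (format assumed), then grade the extracted letter."""
    letters = "ABCD"[: len(choices)]
    prompt = (question + "\n"
              + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
              + "\nAnswer with the letter only.")
    match = re.search(r"\b([A-D])\b", generate(model, prompt))
    return bool(match) and match.group(1) == gold

# Accuracy is then: correct grades / total sampled questions.
```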
### Embedding Benchmarks

| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |
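Underneath all four MTEB-style tasks sits the same primitive: embed texts with the model, then compare the vectors, usually by cosine similarity. A minimal illustrative sketch against Ollama's stock `/api/embeddings` endpoint (again, not this package's internal code):

```python
# Illustrative cosine-similarity scoring over Ollama embeddings.
import json
import math
import urllib.request

def embed(model: str, text: str, host: str = "http://localhost:11434") -> list[float]:
    """Call Ollama's stock /api/embeddings endpoint and return the vector."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(f"{host}/api/embeddings", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

v1 = embed("nomic-embed-text", "A man is playing a guitar.")
v2 = embed("nomic-embed-text", "Someone strums an instrument.")
print(cosine(v1, v2))  # STS scores these pairwise, then correlates with gold (Spearman)
```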
### Reranking Benchmarks

| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |
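All three rerankers share a single metric, so it is worth seeing NDCG@5 computed from first principles: DCG discounts each document's gold relevance by log2 of its 1-based rank plus one, and NDCG normalizes by the DCG of the ideal ordering. A self-contained sketch:

```python
# NDCG@k from first principles: the metric shared by all three reranking benchmarks.
import math

def dcg(relevances: list[float], k: int) -> float:
    # Rank positions are 1-based, so the discount at list index i is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: list[float], k: int = 5) -> float:
    ideal = sorted(ranked_relevances, reverse=True)
    return dcg(ranked_relevances, k) / dcg(ideal, k) if any(ranked_relevances) else 0.0

# Gold relevance of each document, in the order the reranker returned them.
print(ndcg_at_k([0, 2, 1, 0, 3, 0], k=5))  # ≈ 0.614
```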
## References

All benchmark papers are available in the `pdf/` directory:

| Paper | arXiv ID |
|---|---|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2307.09371 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff format . && ruff check .

# Type check
mypy ollama_benchmarker/
```
## License

GPL-3.0-or-later
## File details

### ollama_benchmarker-0.1.0.tar.gz

- Size: 44.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7667b59f0e9d06b194cfaecb0efaa1872b12ed73b6c9706cd64bc8ce5889c5b5 |
| MD5 | 92487d06638aa6bb27d9b4feb5b66174 |
| BLAKE2b-256 | 6847c751e6ea5651869e52e5a8695c5fd3128546aba992d33c78f852e35064b5 |

### ollama_benchmarker-0.1.0-py3-none-any.whl

- Size: 51.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b6aedf3e382c01dc3de0fa8dabda4dc01bab615d3edfc8e37ff3d94b7c4269e |
| MD5 | 18c771df4b46e4d09de538985db7ba69 |
| BLAKE2b-256 | a32afc3c3c85fe2438f561570efa4566a13d20aaad7eb04dd007f0603f5affeb |