ollama-benchmarker
Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation
Features
- 🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 Reranking Benchmarks: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
- 🤖 Ollama Integration: Direct integration with local Ollama API
- ⚡ Interactive Selection: Choose which models and benchmarks to run
- 📋 Structured Output: JSON output, rich tables, and ToolResult API pattern
- 🔧 Agent Integration: OpenAI function-calling tools for AI agent use
Requirements
- Python 3.10+
- Ollama running locally (default: http://localhost:11434)
- Internet connection (to download Hugging Face datasets; first run only)
Installation
pip install -e .
For development:
pip install -e ".[dev]"
Quick Start
# One-command auto benchmark: detect GPU, pick models, run all applicable tests
ollama-bench auto
# Auto benchmark without interactive prompt (GPU-fitable models only)
ollama-bench auto --no-confirm
# Override VRAM for auto benchmark
ollama-bench auto --vram 24 --no-confirm
# List available benchmarks
ollama-bench list benchmarks
# List available models on Ollama
ollama-bench list models
# List only models that fit in GPU VRAM (no CPU offload needed)
ollama-bench list models --gpu-only
# Show GPU info and model VRAM analysis
ollama-bench gpu
# Run all LLM benchmarks on a model
ollama-bench run -m llama3 --category llm
# Run specific benchmarks
ollama-bench run -m llama3 -b mmlu hellaswag gsm8k
# Auto-select GPU-fitable models and run benchmarks
ollama-bench run --gpu-only --category llm
# Run full suite (all categories)
ollama-bench suite -m llama3
# Run suite only on models that fit in VRAM
ollama-bench suite --gpu-only
# Run with limited samples for quick testing
ollama-bench run -m llama3 -b mmlu -n 100
# Output results to JSON file
ollama-bench run -m llama3 --category llm --json -o results.json
Usage
CLI Commands
List Benchmarks
# List all benchmarks
ollama-bench list benchmarks
# List by category
ollama-bench list benchmarks --category llm
ollama-bench list benchmarks --category embedding
ollama-bench list benchmarks --category reranking
List Models
ollama-bench list models
# List only models that fit in GPU VRAM (no CPU offload)
ollama-bench list models --gpu-only
GPU & VRAM
# Show GPU info and which models can run fully on GPU
ollama-bench gpu
# Override VRAM size (useful when nvidia-smi not available)
ollama-bench gpu --vram 24
# Output GPU info as JSON
ollama-bench gpu --json
The gpu command:
- Detects NVIDIA GPUs via nvidia-smi
- Estimates VRAM requirements based on model parameter count and quantization (see the sketch below)
- Shows which models fit in available VRAM (no CPU offload needed)
- Lists models that require CPU offload
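Under the hood, such an estimate boils down to bytes-per-weight for the quantization, times parameter count, plus headroom for KV cache and activations. The sketch below only illustrates the idea; the bytes-per-weight table and the 20% overhead factor are assumptions, not the estimator ollama-bench ships.

```python
# Illustrative, quantization-aware VRAM estimate. The bytes-per-weight
# values and the 20% overhead margin are assumptions for this sketch.
BYTES_PER_WEIGHT = {
    "F16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.68,
    "Q4_K_M": 0.57,
}

def estimate_vram_gb(params_billions: float, quant: str = "Q4_K_M",
                     overhead: float = 1.2) -> float:
    """Rough VRAM need: weight bytes plus ~20% for KV cache/activations."""
    return params_billions * BYTES_PER_WEIGHT[quant] * overhead

# An 8B model at Q4_K_M comes out around 5.5 GB, so it fits an 8 GB GPU.
print(f"{estimate_vram_gb(8):.1f} GB")
```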
Auto Benchmark (Hardware-Aware)
# Interactive mode: detect GPU, show models, let you pick, run all applicable benchmarks
ollama-bench auto
# Non-interactive: auto-select all GPU-fitable models and run
ollama-bench auto --no-confirm
# Override VRAM and run non-interactive
ollama-bench auto --vram 16 --no-confirm
# Limit samples for quick test
ollama-bench auto --no-confirm -n 50
# Output results as JSON
ollama-bench auto --no-confirm --json -o auto_results.json
The auto command:
- Detects your GPU VRAM via nvidia-smi
- Lists all Ollama models with VRAM estimates
- Presents an interactive multi-select menu (GPU-fitable models pre-selected)
- You pick models by number (e.g. 1,3,5 or 1-3 or all or gpu)
- Automatically matches each model to its applicable benchmarks (LLM/embedding/reranking)
- Runs all benchmarks and outputs results
Interactive selection options:
- Enter numbers: 1,3,5 or 1-3 (range; see the parsing sketch below)
- all - select all models
- gpu or Enter - select only GPU-fitable models (default)
- q - quit
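A parser for this selection grammar might look like the sketch below; it is illustrative, not the tool's actual implementation.

```python
# Illustrative parser for the selection grammar above
# ("1,3,5", "1-3", "all", "gpu"/Enter, "q"); not ollama-bench's own code.
def parse_selection(text: str, total: int) -> list[int] | str:
    text = text.strip().lower()
    if text in ("", "gpu"):
        return "gpu"   # default: GPU-fitable models only
    if text == "all":
        return list(range(1, total + 1))
    if text == "q":
        return "quit"
    chosen: set[int] = set()
    for part in text.split(","):
        if "-" in part:  # a range such as "1-3"
            lo, hi = (int(x) for x in part.split("-"))
            chosen.update(range(lo, hi + 1))
        else:
            chosen.add(int(part))
    return sorted(i for i in chosen if 1 <= i <= total)

print(parse_selection("1,3,5", 6))  # [1, 3, 5]
print(parse_selection("1-3", 6))    # [1, 2, 3]
```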
Run Benchmarks
# Run specific benchmarks on specific models
ollama-bench run -m llama3 -b mmlu hellaswag
# Run all LLM benchmarks
ollama-bench run -m llama3 --category llm
# Run on multiple models
ollama-bench run -m llama3 mistral qwen2 --category llm
# Auto-select models that fit in GPU VRAM
ollama-bench run --gpu-only --category llm
# Limit sample count
ollama-bench run -m llama3 -b mmlu -n 50
# Use custom Ollama host
ollama-bench run -m llama3 -b mmlu --host http://192.168.1.100:11434
Run Full Suite
# Full suite on a model
ollama-bench suite -m llama3
# Auto-select GPU-fitable models and run full suite
ollama-bench suite --gpu-only
# Only embedding benchmarks
ollama-bench suite -m nomic-embed-text --category embedding
# Only reranking
ollama-bench suite -m jina-reranker-v2-small --category reranking
CLI Flags
| Flag | Description |
|---|---|
| -V, --version | Show version |
| -v, --verbose | Verbose output (debug logging) |
| -o, --output | Output to file path (JSON) |
| --json | JSON output format |
| -q, --quiet | Suppress non-essential output |
Python API
from ollama_benchmarker import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run
# List benchmarks
result = list_benchmarks(category="llm")
print(result.success) # True
print(result.data) # List of benchmark info dicts
# Discover models
models = discover_models()
print(models.data) # List of model info dicts
# Only models that fit in GPU VRAM
models = discover_models(gpu_only=True)
print(models.data) # Only models that don't need CPU offload
# GPU information and VRAM analysis
info = gpu_info()
print(info.data["gpus"]) # GPU details
print(info.data["gpu_fitable_models"]) # Models that fit in VRAM
print(info.data["offload_required_models"]) # Models needing CPU offload
# Auto-detect hardware, select GPU-fitable models, run all benchmarks
result = auto_run()
print(result.data["selected_models"]) # Number of models selected
print(result.data["results"]) # Full benchmark results
# Run a benchmark
result = run_benchmark(benchmark_name="mmlu", model="llama3")
print(result.data["metric_value"]) # Accuracy score
# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
print(result.data) # Full results for all models
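Every call above returns a ToolResult exposing success and data attributes. A minimal sketch of that shape, assuming a dataclass; the error field is a guess at how failures might be reported, not confirmed API:

```python
from dataclasses import dataclass
from typing import Any

# Sketch of the ToolResult shape implied by the calls above; the real
# class may differ, and the error field is an assumption.
@dataclass
class ToolResult:
    success: bool              # True when the call completed
    data: Any = None           # payload: the dicts/lists shown above
    error: str | None = None   # assumed: message populated on failure

result = ToolResult(success=True, data=[{"name": "mmlu", "category": "llm"}])
if result.success:
    print(result.data)
```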
Agent Integration (OpenAI Function Calling)
from ollama_benchmarker.tools import TOOLS, dispatch
# Use TOOLS in your OpenAI function-calling setup
# When you receive a tool call:
result = dispatch("ollama_bench_run_benchmark", {
"benchmark_name": "mmlu",
"model": "llama3"
})
# One-command auto benchmark: detect hardware, select models, run all
result = dispatch("ollama_bench_auto_run", {"gpu_only": True})
# Check GPU VRAM and filter models
result = dispatch("ollama_bench_gpu_info", {})
# Discover only GPU-fitable models
result = dispatch("ollama_bench_discover_models", {"gpu_only": True})
Benchmark Categories
LLM Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Single-true truthfulness |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Multiple-true truthfulness |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |
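For reference, pass@1 is the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) evaluated at k=1: with n completions per problem, c of which pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): the probability that at
    least one of k samples drawn from n (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 passing completions out of 10 samples per problem
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```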
Embedding Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-Means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |
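The retrieval and STS tasks both reduce to cosine similarity over embedding vectors. Below is a self-contained sketch of cosine-similarity retrieval against a local Ollama server; the /api/embeddings endpoint is part of the standard Ollama API, but this code is an illustration, not the suite's internals.

```python
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]}.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

query = embed("how does self-attention work")
docs = ["Transformers rely on self-attention.", "Pasta recipes for beginners."]
best = max(docs, key=lambda d: cosine(query, embed(d)))
print(best)  # the document most similar to the query
```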
Reranking Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |
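The reranking metrics above (and the retrieval metric in the embedding table) are NDCG@k: the discounted cumulative gain of the produced ranking, DCG@k = sum over the top k positions of rel_i / log2(i + 1), divided by the DCG of the ideal ordering. A standalone sketch:

```python
from math import log2

def dcg(rels: list[float], k: int) -> float:
    # Gain at rank i (1-based) is discounted by log2(i + 1).
    return sum(r / log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels: list[float], k: int) -> float:
    """NDCG@k: DCG of the produced ranking over the ideal DCG."""
    ideal = dcg(sorted(ranked_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0

# The reranker placed a relevance-2 document third; ideal puts it first.
print(round(ndcg_at_k([1, 0, 2, 0, 1], 5), 3))  # 0.762
```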
References
All benchmark papers are available in the pdf/ directory:
| Paper | arXiv ID |
|---|---|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2307.09371 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Lint
ruff format . && ruff check .
# Type check
mypy ollama_benchmarker/
License
GPL-3.0-or-later
File details
Details for the file ollama_benchmarker-0.1.3.tar.gz.
File metadata
- Download URL: ollama_benchmarker-0.1.3.tar.gz
- Upload date:
- Size: 50.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8ba540a3ce0a5d92693cd417878c1630cdc0b32f289a056976b90d574e7c10a |
| MD5 | 963615e17f9eaf4b63207e3ccce64c54 |
| BLAKE2b-256 | 024cbc10e2f96f88e815668533152d4e15d4681695b25ca7a3eb68b640308db6 |
File details
Details for the file ollama_benchmarker-0.1.3-py3-none-any.whl.
File metadata
- Download URL: ollama_benchmarker-0.1.3-py3-none-any.whl
- Upload date:
- Size: 57.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4707b6118d98739917ed4a79cb58221289c162c2890230594873a187d456295d |
| MD5 | 077f62fc519ff9caccdad3cf86e6a867 |
| BLAKE2b-256 | e0a01eec82c6caa7073270954a6f17aee9858d51f5d69099e234976b6cffc856 |