
ollama-benchmarker

Comprehensive benchmark suite for Ollama models: LLM, embedding, and reranking evaluation

Features

  • 🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
  • 📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
  • 🔄 Reranking Benchmarks: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
  • 🤖 Ollama Integration: Direct integration with local Ollama API
  • ⚡ Interactive Selection: Choose which models and benchmarks to run
  • 📋 Structured Output: JSON output, rich tables, and ToolResult API pattern
  • 🔧 Agent Integration: OpenAI function-calling tools for AI agent use

Requirements

  • Python 3.10+
  • Ollama running locally (default: http://localhost:11434)
  • Internet connection (to download HuggingFace datasets on first run)

Installation

pip install -e .

For development:

pip install -e ".[dev]"

Quick Start

# One-command auto benchmark: detect GPU, pick models, run all applicable tests
ollama-bench auto

# Auto benchmark without interactive prompt (GPU-fitable models only)
ollama-bench auto --no-confirm

# Override VRAM for auto benchmark
ollama-bench auto --vram 24 --no-confirm

# List available benchmarks
ollama-bench list benchmarks

# List available models on Ollama
ollama-bench list models

# List only models that fit in GPU VRAM (no CPU offload needed)
ollama-bench list models --gpu-only

# Show GPU info and model VRAM analysis
ollama-bench gpu

# Run all LLM benchmarks on a model
ollama-bench run -m llama3 --category llm

# Run specific benchmarks
ollama-bench run -m llama3 -b mmlu hellaswag gsm8k

# Auto-select GPU-fitable models and run benchmarks
ollama-bench run --gpu-only --category llm

# Run full suite (all categories)
ollama-bench suite -m llama3

# Run suite only on models that fit in VRAM
ollama-bench suite --gpu-only

# Run with limited samples for quick testing
ollama-bench run -m llama3 -b mmlu -n 100

# Output results to JSON file
ollama-bench run -m llama3 --category llm --json -o results.json

Usage

CLI Commands

List Benchmarks

# List all benchmarks
ollama-bench list benchmarks

# List by category
ollama-bench list benchmarks --category llm
ollama-bench list benchmarks --category embedding
ollama-bench list benchmarks --category reranking

List Models

ollama-bench list models

# List only models that fit in GPU VRAM (no CPU offload)
ollama-bench list models --gpu-only

GPU & VRAM

# Show GPU info and which models can run fully on GPU
ollama-bench gpu

# Override VRAM size (useful when nvidia-smi not available)
ollama-bench gpu --vram 24

# Output GPU info as JSON
ollama-bench gpu --json

The gpu command:

  • Detects NVIDIA GPUs via nvidia-smi
  • Estimates VRAM requirements based on model parameter count and quantization
  • Shows which models fit in available VRAM (no CPU offload needed)
  • Lists models that require CPU offload
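The VRAM estimate above follows from parameter count and quantization width. A minimal sketch of that arithmetic (the constants here are illustrative assumptions, not the tool's actual formula):

```python
def estimate_vram_gb(param_count_b: float, quant_bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: model weights at quant_bits per parameter,
    plus ~20% headroom for KV cache and activations (assumed constants)."""
    weight_gb = param_count_b * quant_bits / 8  # a 7B model at 4-bit ≈ 3.5 GB of weights
    return round(weight_gb * overhead, 1)

# Under this estimate, a 7B model at Q4 fits in an 8 GB card:
print(estimate_vram_gb(7, 4))  # 4.2
```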

Auto Benchmark (Hardware-Aware)

# Interactive mode: detect GPU, show models, let you pick, run all applicable benchmarks
ollama-bench auto

# Non-interactive: auto-select all GPU-fitable models and run
ollama-bench auto --no-confirm

# Override VRAM and run non-interactive
ollama-bench auto --vram 16 --no-confirm

# Limit samples for quick test
ollama-bench auto --no-confirm -n 50

# Output results as JSON
ollama-bench auto --no-confirm --json -o auto_results.json

The auto command:

  1. Detects your GPU VRAM via nvidia-smi
  2. Lists all Ollama models with VRAM estimates
  3. Presents an interactive multi-select menu (GPU-fitable models pre-selected)
  4. You pick models by number (e.g. 1,3,5 or 1-3 or all or gpu)
  5. Automatically matches each model to its applicable benchmarks (LLM/embedding/reranking)
  6. Runs all benchmarks and outputs results

Interactive selection options:

  • Enter numbers: 1,3,5 or 1-3 (range)
  • all - select all models
  • gpu or Enter - select only GPU-fitable models (default)
  • q - quit

Run Benchmarks

# Run specific benchmarks on specific models
ollama-bench run -m llama3 -b mmlu hellaswag

# Run all LLM benchmarks
ollama-bench run -m llama3 --category llm

# Run on multiple models
ollama-bench run -m llama3 mistral qwen2 --category llm

# Auto-select models that fit in GPU VRAM
ollama-bench run --gpu-only --category llm

# Limit sample count
ollama-bench run -m llama3 -b mmlu -n 50

# Use custom Ollama host
ollama-bench run -m llama3 -b mmlu --host http://192.168.1.100:11434

Run Full Suite

# Full suite on a model
ollama-bench suite -m llama3

# Auto-select GPU-fitable models and run full suite
ollama-bench suite --gpu-only

# Only embedding benchmarks
ollama-bench suite -m nomic-embed-text --category embedding

# Only reranking
ollama-bench suite -m jina-reranker-v2-small --category reranking

CLI Flags

Flag           Description
-V, --version  Show version
-v, --verbose  Verbose output (debug logging)
-o, --output   Output to file path (JSON)
--json         JSON output format
-q, --quiet    Suppress non-essential output

Python API

from ollama_benchmarker import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)    # True
print(result.data)       # List of benchmark info dicts

# Discover models
models = discover_models()
print(models.data)       # List of model info dicts

# Only models that fit in GPU VRAM
models = discover_models(gpu_only=True)
print(models.data)       # Only models that don't need CPU offload

# GPU information and VRAM analysis
info = gpu_info()
print(info.data["gpus"])                     # GPU details
print(info.data["gpu_fitable_models"])        # Models that fit in VRAM
print(info.data["offload_required_models"])   # Models needing CPU offload

# Auto-detect hardware, select GPU-fitable models, run all benchmarks
result = auto_run()
print(result.data["selected_models"])         # Number of models selected
print(result.data["results"])                 # Full benchmark results

# Run a benchmark
result = run_benchmark(benchmark_name="mmlu", model="llama3")
print(result.data["metric_value"])  # Accuracy score

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
print(result.data)       # Full results for all models

Agent Integration (OpenAI Function Calling)

from ollama_benchmarker.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup
# When you receive a tool call:
result = dispatch("ollama_bench_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3"
})

# One-command auto benchmark: detect hardware, select models, run all
result = dispatch("ollama_bench_auto_run", {"gpu_only": True})

# Check GPU VRAM and filter models
result = dispatch("ollama_bench_gpu_info", {})

# Discover only GPU-fitable models
result = dispatch("ollama_bench_discover_models", {"gpu_only": True})
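In a full chat loop, each tool call the model emits is routed through dispatch and the result goes back as a `tool` role message. A sketch, assuming dispatch returns a JSON-serializable dict and using a plain dict in place of the SDK's tool-call object:

```python
import json

def handle_tool_call(tool_call: dict, dispatch) -> dict:
    """Route one OpenAI-style tool call and package the reply message."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive JSON-encoded
    result = dispatch(name, args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(result)}

# With a stub dispatch, the round trip looks like:
stub = lambda name, args: {"success": True, "tool": name, "args": args}
msg = handle_tool_call(
    {"id": "call_1", "function": {"name": "ollama_bench_gpu_info", "arguments": "{}"}},
    stub,
)
print(msg["role"])  # tool
```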

Benchmark Categories

LLM Benchmarks

Benchmark       Reference                Metric    Description
MMLU            Hendrycks et al., 2020   Accuracy  57 subjects, multiple-choice
MMLU-Pro        Wang et al., 2024        Accuracy  Harder MMLU with 10 choices
HellaSwag       Zellers et al., 2019     Accuracy  Commonsense NLI sentence completion
ARC-Easy        Clark et al., 2018       Accuracy  Grade-school science (easy)
ARC-Challenge   Clark et al., 2018       Accuracy  Grade-school science (hard)
GSM8K           Cobbe et al., 2021       Accuracy  Math word problems
TruthfulQA MC1  Lin et al., 2021         Accuracy  Single-true truthfulness
TruthfulQA MC2  Lin et al., 2021         Accuracy  Multiple-true truthfulness
C-Eval          Huang et al., 2023       Accuracy  Chinese multi-discipline
CMMLU           Li et al., 2023          Accuracy  Chinese massive multitask
HumanEval       Chen et al., 2021        pass@1    Code generation
BBH             Suzgun et al., 2022      Accuracy  Hard reasoning tasks

Embedding Benchmarks

Benchmark             Reference                 Metric     Description
embed_classification  Muennighoff et al., 2022  Accuracy   k-NN classification
embed_clustering      Muennighoff et al., 2022  V-measure  k-Means clustering
embed_retrieval       Muennighoff et al., 2022  NDCG@10    Cosine-similarity retrieval
embed_sts             Muennighoff et al., 2022  Spearman   Semantic textual similarity
cmteb                 Xiao et al., 2023         Accuracy   Chinese MTEB
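embed_retrieval scores a cosine-similarity ranking of document embeddings with NDCG@10. The ranking step in isolation (a sketch; real vectors would come from the Ollama embedding API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec: list[float], doc_vecs: list[list[float]], top_k: int = 10) -> list[int]:
    """Indices of the top_k documents by cosine similarity to the query."""
    scored = sorted(enumerate(doc_vecs), key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [i for i, _ in scored[:top_k]]

print(retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]], top_k=2))  # [2, 1]
```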

Reranking Benchmarks

Benchmark               Reference                 Metric  Description
llm_reranking           Askari et al., 2023       NDCG@5  LLM pointwise reranking
embed_reranking         Muennighoff et al., 2022  NDCG@5  Embedding cosine reranking
llm_listwise_reranking  Askari et al., 2023       NDCG@5  LLM listwise reranking
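All three rerankers are scored with NDCG@5, which compares the discounted gain of the produced ordering against the ideal ordering of the same relevance grades. For reference:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """NDCG@k of a ranked list of graded relevances (as produced by a reranker)."""
    def dcg(rels: list[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ordering scores 1.0; demoting the best document lowers the score:
print(ndcg_at_k([3, 2, 1, 0, 0]))  # 1.0
```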

References

All benchmark papers are available in the pdf/ directory:

Paper         arXiv ID
MMLU          2009.03300
HellaSwag     1905.07830
ARC           1803.05457
GSM8K         2110.14168
TruthfulQA    2109.07958
CMMLU         2306.09212
C-Eval        2305.08322
HumanEval     2107.03374
MMLU-Pro      2406.01564
BBH           2206.04615
HELM          2211.09110
MTEB          2210.07316
C-MTEB        2307.09371
BEIR          2104.08663
RankLLM       2310.18548
Jina ColBERT  2402.14759
Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Lint
ruff format . && ruff check .

# Type check
mypy ollama_benchmarker/

License

GPL-3.0-or-later
