BiWu (比武)

Comprehensive benchmark suite for LLM, embedding, and reranking models across Ollama, GGUF, ModelScope, and HuggingFace backends

Features

  • 🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
  • 📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
  • 🔄 Reranking Benchmarks: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
  • 🔌 Multi-Backend: Ollama, GGUF (llama-cpp-python), ModelScope, HuggingFace
  • Interactive Selection: Choose which models and benchmarks to run
  • 📋 Structured Output: JSON output, rich tables, and ToolResult API pattern
  • 🔧 Agent Integration: OpenAI function-calling tools for AI agent use
  • 🎮 GPU VRAM-Aware: Auto-detects the GPU and keeps only the models that fit in VRAM

Requirements

  • Python 3.10+
  • For Ollama backend: Ollama running locally (default: http://localhost:11434)
  • For GGUF backend: llama-cpp-python with CUDA support
  • For ModelScope backend: modelscope + llama-cpp-python
  • For HuggingFace backend: huggingface_hub + llama-cpp-python
  • Internet connection (for dataset download, first run only)
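
Before using the Ollama backend, it can help to confirm that the server is reachable at the expected address. The sketch below is not part of BiWu: `tags_url` and `ollama_is_reachable` are hypothetical helper names, and the check assumes Ollama's standard `GET /api/tags` endpoint, which lists locally available models.

```python
import urllib.error
import urllib.request


def tags_url(host: str = "http://localhost:11434") -> str:
    """Build the URL of Ollama's model-listing endpoint for a given host."""
    return host.rstrip("/") + "/api/tags"


def ollama_is_reachable(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server answers /api/tags with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(tags_url(host), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `ollama_is_reachable()` with no arguments probes the default address listed above; pass a different host if Ollama runs on another machine.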

Installation

pip install -e .

With specific backends:

pip install -e ".[ollama]"      # Ollama backend
pip install -e ".[gguf]"        # GGUF backend
pip install -e ".[modelscope]"  # ModelScope backend
pip install -e ".[huggingface]" # HuggingFace backend
pip install -e ".[all]"         # All backends
pip install -e ".[dev]"         # Dev dependencies

Quick Start

# One-command auto benchmark: detect GPU, pick models, run all applicable tests
biwu auto

# Auto benchmark without the interactive prompt (only models that fit in GPU VRAM)
biwu auto --no-confirm

# Override VRAM for auto benchmark
biwu auto --vram 24 --no-confirm

# List available benchmarks
biwu list benchmarks

# List available models on Ollama
biwu list models

# List models that fit in GPU VRAM
biwu list models --gpu-only

# Show GPU info and model VRAM analysis
biwu gpu

# Run all LLM benchmarks on a model (Ollama)
biwu run -m llama3 --category llm

# Run benchmarks on a GGUF model
biwu run -m /path/to/model.gguf --backend gguf --category llm

# Run benchmarks on a ModelScope model
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm

# Run benchmarks on a HuggingFace model
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm

# Run specific benchmarks
biwu run -m llama3 -b mmlu hellaswag gsm8k

# Auto-select models that fit in GPU VRAM and run benchmarks
biwu run --gpu-only --category llm

# Run full suite (all categories)
biwu suite -m llama3

# Run suite only on models that fit in VRAM
biwu suite --gpu-only

# Run with limited samples for quick testing
biwu run -m llama3 -b mmlu -n 100

# Output results to JSON file
biwu run -m llama3 --category llm --json -o results.json

Usage

CLI Commands

List Benchmarks

biwu list benchmarks
biwu list benchmarks --category llm
biwu list benchmarks --category embedding
biwu list benchmarks --category reranking

List Models

# List models on Ollama (default backend)
biwu list models

# List models that fit in GPU VRAM
biwu list models --gpu-only

# List models for a specific backend
biwu list models --backend gguf
biwu list models --backend modelscope

GPU & VRAM

biwu gpu
biwu gpu --vram 24
biwu gpu --json

Auto Benchmark (Hardware-Aware)

# Interactive mode
biwu auto

# Non-interactive: auto-select all models that fit in GPU VRAM
biwu auto --no-confirm

# Override VRAM
biwu auto --vram 16 --no-confirm

# Use a specific backend
biwu auto --backend gguf --no-confirm
biwu auto --backend modelscope --no-confirm

The auto command:

  1. Detects your GPU VRAM via nvidia-smi
  2. Lists models with VRAM estimates
  3. Presents an interactive multi-select menu (models that fit in VRAM are pre-selected)
  4. Matches each model to its applicable benchmarks (LLM/embedding/reranking)
  5. Runs all benchmarks and outputs results
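
Steps 1 to 3 above can be sketched roughly as follows. This is an illustration, not BiWu's actual implementation: the function names, the choice of the largest single GPU, and the "model size plus fixed headroom" fit rule are all assumptions. The `nvidia-smi` query flags used here are standard and print one total-memory value in MiB per GPU.

```python
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def detect_vram_gib() -> float:
    """Return the largest single-GPU VRAM in GiB, or 0.0 if it cannot be determined."""
    cmd = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=10).stdout
        values = parse_vram_mib(out)
    except (OSError, ValueError, subprocess.SubprocessError):
        # nvidia-smi missing, failed, timed out, or printed something unexpected.
        return 0.0
    return max(values) / 1024 if values else 0.0


def split_by_fit(model_sizes_gib: dict[str, float], vram_gib: float, headroom_gib: float = 1.0):
    """Partition models into those that fit in VRAM and those that would need offload."""
    fits = [name for name, size in model_sizes_gib.items() if size + headroom_gib <= vram_gib]
    offload = [name for name in model_sizes_gib if name not in fits]
    return fits, offload
```

For example, `split_by_fit({"small": 4.0, "large": 40.0}, vram_gib=24.0)` puts `small` in the first list and `large` in the second. The `--vram` override shown earlier would replace the detected value in such a scheme.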

Interactive selection accepts a comma-separated list (1,3,5), a range (1-3), all, gpu (the default), or q.
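
To make the selection syntax concrete, here is a minimal parser sketch. It is not BiWu's code: `parse_selection` is a hypothetical name, and treating `q` as "quit" (returned as `None`) and an empty answer as the `gpu` default are assumptions about the prompt's behavior.

```python
from typing import Optional


def parse_selection(text: str, n_models: int, gpu_indices: list[int]) -> Optional[list[int]]:
    """Turn a selection string into sorted 1-based model indices.

    Returns None for "q". An empty string falls back to the "gpu" default.
    """
    text = text.strip().lower() or "gpu"
    if text == "q":
        return None
    if text == "all":
        return list(range(1, n_models + 1))
    if text == "gpu":
        return sorted(gpu_indices)

    chosen: set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # Inclusive range such as "1-3".
            low, high = (int(x) for x in part.split("-", 1))
            chosen.update(range(low, high + 1))
        else:
            chosen.add(int(part))

    out_of_range = sorted(i for i in chosen if not 1 <= i <= n_models)
    if out_of_range:
        raise ValueError(f"selection out of range: {out_of_range}")
    return sorted(chosen)
```

Lists and ranges can be mixed in this sketch, so `1-3,5` selects models 1, 2, 3, and 5.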

Run Benchmarks

# Ollama (default)
biwu run -m llama3 -b mmlu hellaswag

# GGUF
biwu run -m /path/to/model.gguf --backend gguf --category llm

# ModelScope
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm

# HuggingFace
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm

# Multiple models
biwu run -m llama3 mistral qwen2 --category llm

# GPU-only models
biwu run --gpu-only --category llm

# Custom Ollama host
biwu run -m llama3 -b mmlu --host http://192.168.1.100:11434

Run Full Suite

biwu suite -m llama3
biwu suite --gpu-only
biwu suite -m nomic-embed-text --category embedding
biwu suite -m jina-reranker-v2-small --category reranking
biwu suite -m model.gguf --backend gguf --category full

CLI Flags

Flag            Description
-V, --version   Show version
-v, --verbose   Verbose output (debug logging)
-o, --output    Output to file path (JSON)
--json          JSON output format
-q, --quiet     Suppress non-essential output

Python API

from biwu import (
    ToolResult,
    auto_run,
    discover_models,
    gpu_info,
    list_benchmarks,
    run_benchmark,
    run_suite,
)

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)    # True
print(result.data)       # List of benchmark info dicts

# Discover models (Ollama by default)
models = discover_models()
print(models.data)       # List of model info dicts

# Discover models on specific backend
models = discover_models(backend="gguf")
models = discover_models(backend="modelscope")
models = discover_models(backend="huggingface")

# GPU-only models
models = discover_models(gpu_only=True)

# GPU information
info = gpu_info()
print(info.data["gpus"])
print(info.data["gpu_fitable_models"])
print(info.data["offload_required_models"])

# Auto-detect hardware, select models that fit in GPU VRAM, run all benchmarks
result = auto_run(gpu_only=True)

# Run a benchmark (Ollama)
result = run_benchmark(benchmark_name="mmlu", model="llama3")

# Run a benchmark (GGUF)
result = run_benchmark(benchmark_name="mmlu", model="/path/to/model.gguf", backend="gguf")

# Run a benchmark (ModelScope)
result = run_benchmark(benchmark_name="mmlu", model="Qwen/Qwen2-7B-Instruct-GGUF", backend="modelscope")

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
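
Every call above returns an object with `success` and `data` attributes. The general "result envelope" pattern behind such an API can be sketched as below. This is an illustration only: `ResultSketch` and `safe_call` are hypothetical names, and the `error` field is an assumption that the examples above do not confirm for BiWu's own `ToolResult`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ResultSketch:
    """Illustrative envelope: a success flag plus either data or an error message."""

    success: bool
    data: Any = None
    error: Optional[str] = None  # assumed field, not confirmed for BiWu's ToolResult


def safe_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> ResultSketch:
    """Run fn and wrap its return value or its exception in a ResultSketch."""
    try:
        return ResultSketch(success=True, data=fn(*args, **kwargs))
    except Exception as exc:  # report failures as data instead of raising
        return ResultSketch(success=False, error=f"{type(exc).__name__}: {exc}")
```

With this shape, callers branch on `result.success` before reading `result.data`, which suits agent loops that should not crash on a single failed tool call.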

Agent Integration (OpenAI Function Calling)

from biwu.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup
result = dispatch("biwu_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3"
})

# Auto benchmark
result = dispatch("biwu_auto_run", {"gpu_only": True})

# GPU info
result = dispatch("biwu_gpu_info", {})

# Discover models (with backend)
result = dispatch("biwu_discover_models", {"backend": "ollama", "gpu_only": True})

# Discover GGUF models
result = dispatch("biwu_discover_models", {"backend": "gguf"})
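
For readers unfamiliar with the pattern, the sketch below shows what a tool list and a name-based dispatcher generally look like in the OpenAI function-calling format. It does not reproduce the contents of `biwu.tools`: every name here (`EXAMPLE_TOOLS`, `example_run_benchmark`, `example_dispatch`) is hypothetical, and the handler is a stand-in that runs nothing.

```python
import json
from typing import Any, Callable, Union

# One tool definition in the OpenAI function-calling format (JSON Schema parameters).
EXAMPLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "example_run_benchmark",
            "description": "Run one benchmark on one model.",
            "parameters": {
                "type": "object",
                "properties": {
                    "benchmark_name": {"type": "string"},
                    "model": {"type": "string"},
                },
                "required": ["benchmark_name", "model"],
            },
        },
    }
]


def example_run_benchmark(benchmark_name: str, model: str) -> dict[str, Any]:
    """Stand-in handler; a real one would execute the benchmark."""
    return {"benchmark": benchmark_name, "model": model, "status": "queued"}


# Map tool names to the Python callables that implement them.
HANDLERS: dict[str, Callable[..., dict[str, Any]]] = {
    "example_run_benchmark": example_run_benchmark,
}


def example_dispatch(name: str, arguments: Union[str, dict]) -> dict[str, Any]:
    """Route a tool call to its handler; arguments may be a dict or a JSON string."""
    if name not in HANDLERS:
        return {"success": False, "error": f"unknown tool: {name}"}
    kwargs = json.loads(arguments) if isinstance(arguments, str) else arguments
    return {"success": True, "data": HANDLERS[name](**kwargs)}
```

Accepting a JSON string is useful because chat-completion tool calls deliver their arguments as serialized JSON rather than as a Python dict.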

Benchmark Categories

LLM Benchmarks

Benchmark       Reference               Metric    Description
MMLU            Hendrycks et al., 2020  Accuracy  57 subjects, multiple-choice
MMLU-Pro        Wang et al., 2024       Accuracy  Harder MMLU with 10 choices
HellaSwag       Zellers et al., 2019    Accuracy  Commonsense NLI sentence completion
ARC-Easy        Clark et al., 2018      Accuracy  Grade-school science (easy)
ARC-Challenge   Clark et al., 2018      Accuracy  Grade-school science (hard)
GSM8K           Cobbe et al., 2021      Accuracy  Math word problems
TruthfulQA MC1  Lin et al., 2021        Accuracy  Single-true truthfulness
TruthfulQA MC2  Lin et al., 2021        Accuracy  Multiple-true truthfulness
C-Eval          Huang et al., 2023      Accuracy  Chinese multi-discipline
CMMLU           Li et al., 2023         Accuracy  Chinese massive multitask
HumanEval       Chen et al., 2021       pass@1    Code generation
BBH             Suzgun et al., 2022     Accuracy  Hard reasoning tasks
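
The two metrics in this table can be written down in a few lines. The sketch below uses the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), which reduces to the fraction of correct samples when k = 1. Whether BiWu draws more than one sample per problem is not stated here, so treat the function names and the sampling setup as assumptions.

```python
from math import comb


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k) for n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem pass@k values are averaged over all problems to give the benchmark score.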

Embedding Benchmarks

Benchmark             Reference                 Metric     Description
embed_classification  Muennighoff et al., 2022  Accuracy   k-NN classification
embed_clustering      Muennighoff et al., 2022  V-measure  k-Means clustering
embed_retrieval       Muennighoff et al., 2022  NDCG@10    Cosine-similarity retrieval
embed_sts             Muennighoff et al., 2022  Spearman   Semantic textual similarity
cmteb                 Xiao et al., 2023         Accuracy   Chinese MTEB
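
As an illustration of the STS row, the score is typically the Spearman rank correlation between the cosine similarity of each sentence pair's embeddings and the human similarity rating. The following is a dependency-free sketch under that assumption; the function names are hypothetical and this is not BiWu's implementation.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def sts_score(pairs: list[tuple[list[float], list[float]]], gold: list[float]) -> float:
    """Correlate embedding cosine similarities with gold similarity ratings."""
    return spearman([cosine(u, v) for u, v in pairs], gold)
```

Because only ranks matter, the gold ratings can be on any scale as long as higher means more similar.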

Reranking Benchmarks

Benchmark               Reference                 Metric  Description
llm_reranking           Askari et al., 2023       NDCG@5  LLM pointwise reranking
embed_reranking         Muennighoff et al., 2022  NDCG@5  Embedding cosine reranking
llm_listwise_reranking  Askari et al., 2023       NDCG@5  LLM listwise reranking
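
All three reranking benchmarks report NDCG@5. A minimal sketch of that metric follows, assuming the common linear-gain form (relevance divided by log2 of rank + 1) and an ideal ordering taken from the same candidate list. BiWu may use a different gain function or ideal set, so treat this as illustrative only.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items, ranks starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: list[float], k: int = 5) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

Here `ranked_relevances` holds the gold relevance labels in the order the reranker returned the candidates, so a perfect ordering scores 1.0.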

References

All benchmark papers are available in the pdf/ directory:

Paper arXiv ID
MMLU 2009.03300
HellaSwag 1905.07830
ARC 1803.05457
GSM8K 2110.14168
TruthfulQA 2109.07958
CMMLU 2306.09212
C-Eval 2305.08322
HumanEval 2107.03374
MMLU-Pro 2406.01574
BBH 2210.09261
HELM 2211.09110
MTEB 2210.07316
C-MTEB 2309.07597
BEIR 2104.08663
RankLLM 2310.18548
Jina ColBERT 2402.14759

Development

pip install -e ".[dev]"
pytest tests/ -v
ruff format . && ruff check .
mypy biwu/

License

GPL-3.0-or-later
