BiWu (比武)
Comprehensive benchmark suite for LLM, embedding, and reranking models across Ollama, GGUF, ModelScope, and HuggingFace backends
Features
- 🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
- 📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
- 🔄 Reranking Benchmarks: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
- 🔌 Multi-Backend: Ollama, GGUF (llama-cpp-python), ModelScope, HuggingFace
- ⚡ Interactive Selection: Choose which models and benchmarks to run
- 📋 Structured Output: JSON output, rich tables, and ToolResult API pattern
- 🔧 Agent Integration: OpenAI function-calling tools for AI agent use
- 🎮 GPU VRAM-Aware: Auto-detect GPU, filter models that fit in VRAM
Requirements
- Python 3.10+
- For Ollama backend: Ollama running locally (default: http://localhost:11434)
- For GGUF backend: `llama-cpp-python` with CUDA support
- For ModelScope backend: `modelscope` + `llama-cpp-python`
- For HuggingFace backend: `huggingface_hub` + `llama-cpp-python`
- Internet connection (for dataset download, first run only)
Installation
pip install -e .
With specific backends:
pip install -e ".[ollama]" # Ollama backend
pip install -e ".[gguf]" # GGUF backend
pip install -e ".[modelscope]" # ModelScope backend
pip install -e ".[huggingface]" # HuggingFace backend
pip install -e ".[all]" # All backends
pip install -e ".[dev]" # Dev dependencies
Quick Start
# One-command auto benchmark: detect GPU, pick models, run all applicable tests
biwu auto
# Auto benchmark without interactive prompt (GPU-fitable models only)
biwu auto --no-confirm
# Override VRAM for auto benchmark
biwu auto --vram 24 --no-confirm
# List available benchmarks
biwu list benchmarks
# List available models on Ollama
biwu list models
# List models that fit in GPU VRAM
biwu list models --gpu-only
# Show GPU info and model VRAM analysis
biwu gpu
# Run all LLM benchmarks on a model (Ollama)
biwu run -m llama3 --category llm
# Run benchmarks on a GGUF model
biwu run -m /path/to/model.gguf --backend gguf --category llm
# Run benchmarks on a ModelScope model
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm
# Run benchmarks on a HuggingFace model
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm
# Run specific benchmarks
biwu run -m llama3 -b mmlu hellaswag gsm8k
# Auto-select GPU-fitable models and run benchmarks
biwu run --gpu-only --category llm
# Run full suite (all categories)
biwu suite -m llama3
# Run suite only on models that fit in VRAM
biwu suite --gpu-only
# Run with limited samples for quick testing
biwu run -m llama3 -b mmlu -n 100
# Output results to JSON file
biwu run -m llama3 --category llm --json -o results.json
Usage
CLI Commands
List Benchmarks
biwu list benchmarks
biwu list benchmarks --category llm
biwu list benchmarks --category embedding
biwu list benchmarks --category reranking
List Models
# List models on Ollama (default backend)
biwu list models
# List models that fit in GPU VRAM
biwu list models --gpu-only
# List models for a specific backend
biwu list models --backend gguf
biwu list models --backend modelscope
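To check what an Ollama server exposes before running BiWu, you can query Ollama's `/api/tags` endpoint directly. The sketch below is independent of BiWu: `parse_ollama_tags` and `fetch_ollama_models` are hypothetical helper names, and nothing here claims to mirror how `biwu list models` discovers models internally.

```python
import json
import urllib.request


def parse_ollama_tags(payload: dict) -> list:
    """Extract the name and size (in GiB) of each model from an /api/tags body."""
    models = []
    for entry in payload.get("models", []):
        models.append(
            {
                "name": entry["name"],
                # Ollama reports size in bytes; a missing size is treated as 0.
                "size_gib": round(entry.get("size", 0) / 1024**3, 2),
            }
        )
    return models


def fetch_ollama_models(host: str = "http://localhost:11434") -> list:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as response:
        return parse_ollama_tags(json.load(response))


if __name__ == "__main__":
    # Offline demonstration with a hand-written payload.
    sample = {"models": [{"name": "llama3:latest", "size": 4 * 1024**3}]}
    print(parse_ollama_tags(sample))
```

Keeping the parsing separate from the network call makes the helper easy to test without a live server.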
GPU & VRAM
biwu gpu
biwu gpu --vram 24
biwu gpu --json
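For orientation, total VRAM can be read from `nvidia-smi` with its CSV query options. The sketch below is not BiWu's implementation: the function names are hypothetical, and the `fits_in_vram` rule (model size plus a fixed headroom) is an assumed heuristic used only to illustrate the idea of filtering models by VRAM.

```python
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list:
    """Parse one total-memory value (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def detect_vram_mib() -> list:
    """Return total VRAM per GPU in MiB, or an empty list if nvidia-smi is unavailable."""
    command = [
        "nvidia-smi",
        "--query-gpu=memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        completed = subprocess.run(command, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_vram_mib(completed.stdout)


def fits_in_vram(model_size_gib: float, vram_gib: float, headroom_gib: float = 1.5) -> bool:
    """Assumed rule: model weights plus a fixed headroom must fit in VRAM."""
    return model_size_gib + headroom_gib <= vram_gib


if __name__ == "__main__":
    for index, mib in enumerate(detect_vram_mib()):
        print(f"GPU {index}: {mib / 1024:.1f} GiB, fits a 4 GiB model: {fits_in_vram(4.0, mib / 1024)}")
```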
Auto Benchmark (Hardware-Aware)
# Interactive mode
biwu auto
# Non-interactive: auto-select all GPU-fitable models
biwu auto --no-confirm
# Override VRAM
biwu auto --vram 16 --no-confirm
# Use a specific backend
biwu auto --backend gguf --no-confirm
biwu auto --backend modelscope --no-confirm
The auto command:
- Detects your GPU VRAM via `nvidia-smi`
- Lists models with VRAM estimates
- Presents an interactive multi-select menu (GPU-fitable models pre-selected)
- Matches each model to its applicable benchmarks (LLM/embedding/reranking)
- Runs all benchmarks and outputs results
Interactive selection input: comma-separated indices (`1,3,5`), a range (`1-3`), `all`, `gpu` (the default, which keeps the GPU-fitable models), or `q`.
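A selection syntax of this kind can be parsed in a few lines. The following is a minimal sketch, not BiWu's parser: `parse_selection` is a hypothetical name, indices are assumed to be 1-based, an empty input is assumed to mean the `gpu` default, and `q` is assumed to mean "quit".

```python
from typing import Optional


def parse_selection(text: str, total: int, gpu_default: list) -> Optional[list]:
    """Turn a selection string into sorted 1-based indices; None means quit."""
    text = text.strip().lower()
    if text == "q":  # assumed meaning: quit without selecting anything
        return None
    if text in ("", "gpu"):  # assumed: empty input falls back to the default
        return sorted(gpu_default)
    if text == "all":
        return list(range(1, total + 1))

    chosen = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "1-3" is inclusive on both ends.
            start, end = (int(piece) for piece in part.split("-", 1))
            chosen.update(range(start, end + 1))
        else:
            chosen.add(int(part))

    invalid = [index for index in chosen if not 1 <= index <= total]
    if invalid:
        raise ValueError(f"selection out of range: {sorted(invalid)}")
    return sorted(chosen)


if __name__ == "__main__":
    print(parse_selection("1,3-4", total=5, gpu_default=[1, 2]))
```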
Run Benchmarks
# Ollama (default)
biwu run -m llama3 -b mmlu hellaswag
# GGUF
biwu run -m /path/to/model.gguf --backend gguf --category llm
# ModelScope
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm
# HuggingFace
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm
# Multiple models
biwu run -m llama3 mistral qwen2 --category llm
# GPU-only models
biwu run --gpu-only --category llm
# Custom Ollama host
biwu run -m llama3 -b mmlu --host http://192.168.1.100:11434
Run Full Suite
biwu suite -m llama3
biwu suite --gpu-only
biwu suite -m nomic-embed-text --category embedding
biwu suite -m jina-reranker-v2-small --category reranking
biwu suite -m model.gguf --backend gguf --category full
CLI Flags
| Flag | Description |
|---|---|
| `-V, --version` | Show version |
| `-v, --verbose` | Verbose output (debug logging) |
| `-o, --output` | Output to file path (JSON) |
| `--json` | JSON output format |
| `-q, --quiet` | Suppress non-essential output |
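When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below only combines flags that appear in this README (`-m`, `-b`, `--backend`, `-n`, `--json`, `-o`); `build_run_command` is a hypothetical helper, not part of BiWu.

```python
import shlex


def build_run_command(model, benchmarks, backend=None, samples=None, output=None):
    """Assemble an argument list for `biwu run` from flags shown in this README."""
    command = ["biwu", "run", "-m", model, "-b", *benchmarks]
    if backend:
        command += ["--backend", backend]
    if samples is not None:
        command += ["-n", str(samples)]
    if output:
        # Write machine-readable results to a file.
        command += ["--json", "-o", output]
    return command


if __name__ == "__main__":
    command = build_run_command("llama3", ["mmlu", "gsm8k"], samples=100, output="results.json")
    # Print a shell-safe rendering; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```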
Python API
from biwu import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run
# List benchmarks
result = list_benchmarks(category="llm")
print(result.success) # True
print(result.data) # List of benchmark info dicts
# Discover models (Ollama by default)
models = discover_models()
print(models.data) # List of model info dicts
# Discover models on specific backend
models = discover_models(backend="gguf")
models = discover_models(backend="modelscope")
models = discover_models(backend="huggingface")
# GPU-only models
models = discover_models(gpu_only=True)
# GPU information
info = gpu_info()
print(info.data["gpus"])
print(info.data["gpu_fitable_models"])
print(info.data["offload_required_models"])
# Auto-detect hardware, select GPU-fitable models, run all benchmarks
result = auto_run(gpu_only=True)
# Run a benchmark (Ollama)
result = run_benchmark(benchmark_name="mmlu", model="llama3")
# Run a benchmark (GGUF)
result = run_benchmark(benchmark_name="mmlu", model="/path/to/model.gguf", backend="gguf")
# Run a benchmark (ModelScope)
result = run_benchmark(benchmark_name="mmlu", model="Qwen/Qwen2-7B-Instruct-GGUF", backend="modelscope")
# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
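Because every API call returns a result object rather than raising, callers usually want one place that checks `success` before touching `data`. The sketch below uses a stand-in `ToolResult` dataclass so it runs without BiWu installed; only `success` and `data` appear in this README, so the `error` field and the `unwrap` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    """Stand-in for biwu.ToolResult; only success and data are documented above."""

    success: bool
    data: Any = None
    error: Optional[str] = None  # assumed field, not confirmed by the README


def unwrap(result: ToolResult) -> Any:
    """Return the payload of a successful call, or raise with the error text."""
    if not result.success:
        raise RuntimeError(result.error or "tool call failed")
    return result.data


if __name__ == "__main__":
    ok = ToolResult(success=True, data=[{"name": "mmlu"}])
    print(unwrap(ok))
    try:
        unwrap(ToolResult(success=False, error="model not found"))
    except RuntimeError as exc:
        print(f"failed: {exc}")
```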
Agent Integration (OpenAI Function Calling)
from biwu.tools import TOOLS, dispatch
# Use TOOLS in your OpenAI function-calling setup
result = dispatch("biwu_run_benchmark", {
"benchmark_name": "mmlu",
"model": "llama3"
})
# Auto benchmark
result = dispatch("biwu_auto_run", {"gpu_only": True})
# GPU info
result = dispatch("biwu_gpu_info", {})
# Discover models (with backend)
result = dispatch("biwu_discover_models", {"backend": "ollama", "gpu_only": True})
# Discover GGUF models
result = dispatch("biwu_discover_models", {"backend": "gguf"})
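In an OpenAI function-calling loop, the model returns a tool name and a JSON string of arguments, which must be decoded before dispatching. The sketch below shows that glue with a stand-in registry so it runs without BiWu installed: `REGISTRY`, `tool`, `fake_gpu_info`, and `handle_tool_call` are hypothetical, and the local `dispatch` only imitates the call shape of `biwu.tools.dispatch` shown above.

```python
import json
from typing import Any, Callable

# Stand-in registry; the real tool table lives in biwu.tools.
REGISTRY: dict = {}


def tool(name: str) -> Callable:
    """Register a function under a tool name."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = func
        return func
    return decorator


@tool("biwu_gpu_info")
def fake_gpu_info() -> dict:
    """Placeholder implementation that returns an empty GPU list."""
    return {"gpus": []}


def dispatch(name: str, arguments: dict) -> Any:
    """Look up a tool by name and call it with keyword arguments."""
    return REGISTRY[name](**arguments)


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Decode model-supplied arguments, dispatch, and return a JSON tool message."""
    try:
        arguments = json.loads(arguments_json or "{}")
        payload = {"success": True, "data": dispatch(name, arguments)}
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Unknown tool, bad argument names, or malformed JSON from the model.
        payload = {"success": False, "error": str(exc)}
    return json.dumps(payload)


if __name__ == "__main__":
    print(handle_tool_call("biwu_gpu_info", "{}"))
    print(handle_tool_call("unknown_tool", "{}"))
```

Returning errors as JSON instead of raising lets the model see the failure and retry with corrected arguments.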
Benchmark Categories
LLM Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| MMLU | Hendrycks et al., 2020 | Accuracy | 57 subjects, multiple-choice |
| MMLU-Pro | Wang et al., 2024 | Accuracy | Harder MMLU with 10 choices |
| HellaSwag | Zellers et al., 2019 | Accuracy | Commonsense NLI sentence completion |
| ARC-Easy | Clark et al., 2018 | Accuracy | Grade-school science (easy) |
| ARC-Challenge | Clark et al., 2018 | Accuracy | Grade-school science (hard) |
| GSM8K | Cobbe et al., 2021 | Accuracy | Math word problems |
| TruthfulQA MC1 | Lin et al., 2021 | Accuracy | Single-true truthfulness |
| TruthfulQA MC2 | Lin et al., 2021 | Accuracy | Multiple-true truthfulness |
| C-Eval | Huang et al., 2023 | Accuracy | Chinese multi-discipline |
| CMMLU | Li et al., 2023 | Accuracy | Chinese massive multitask |
| HumanEval | Chen et al., 2021 | pass@1 | Code generation |
| BBH | Suzgun et al., 2022 | Accuracy | Hard reasoning tasks |
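For readers unfamiliar with the pass@1 metric in the HumanEval row, the sketch below implements the unbiased pass@k estimator from Chen et al., 2021: with `n` generated samples per problem of which `c` pass the unit tests, pass@k is `1 - C(n-c, k) / C(n, k)`. Whether BiWu samples more than one completion per problem is not stated here, so treat this as background rather than a description of its scoring code; `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # With a single sample per problem, pass@1 reduces to plain accuracy.
    print(mean_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1))
```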
Embedding Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| embed_classification | Muennighoff et al., 2022 | Accuracy | k-NN classification |
| embed_clustering | Muennighoff et al., 2022 | V-measure | k-Means clustering |
| embed_retrieval | Muennighoff et al., 2022 | NDCG@10 | Cosine-similarity retrieval |
| embed_sts | Muennighoff et al., 2022 | Spearman | Semantic textual similarity |
| cmteb | Xiao et al., 2023 | Accuracy | Chinese MTEB |
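The NDCG@10 metric in the retrieval row rewards placing relevant documents near the top of the ranking. The sketch below uses the linear-gain variant (`rel / log2(rank + 1)` with 1-based ranks) and assumes `relevances` lists the graded relevance of every candidate in ranked order, so the ideal ordering can be derived from the same list. BiWu's exact variant is not documented here, and the function names are hypothetical.

```python
from math import log2


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # enumerate() is 0-based, so rank + 2 equals (1-based rank) + 1.
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list, k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # The only relevant document is ranked second instead of first.
    print(round(ndcg_at_k([0, 1, 0], k=10), 4))
```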
Reranking Benchmarks
| Benchmark | Reference | Metric | Description |
|---|---|---|---|
| llm_reranking | Askari et al., 2023 | NDCG@5 | LLM pointwise reranking |
| embed_reranking | Muennighoff et al., 2022 | NDCG@5 | Embedding cosine reranking |
| llm_listwise_reranking | Askari et al., 2023 | NDCG@5 | LLM listwise reranking |
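Embedding cosine reranking, as named in the `embed_reranking` row, amounts to sorting candidate documents by cosine similarity to the query embedding. The following is a minimal pure-Python sketch under that reading; `cosine` and `rerank` are hypothetical names, and the vectors would come from whichever embedding model is under test.

```python
from math import sqrt


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec: list, doc_vecs: list) -> list:
    """Return document indices ordered by descending similarity to the query."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda index: scores[index], reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]
    documents = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    print(rerank(query, documents))
```

The resulting order can then be scored against relevance labels with a ranking metric such as NDCG@5.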
References
All benchmark papers are available in the pdf/ directory:
| Paper | arXiv ID |
|---|---|
| MMLU | 2009.03300 |
| HellaSwag | 1905.07830 |
| ARC | 1803.05457 |
| GSM8K | 2110.14168 |
| TruthfulQA | 2109.07958 |
| CMMLU | 2306.09212 |
| C-Eval | 2305.08322 |
| HumanEval | 2107.03374 |
| MMLU-Pro | 2406.01574 |
| BBH | 2210.09261 |
| HELM | 2211.09110 |
| MTEB | 2210.07316 |
| C-MTEB | 2309.07597 |
| BEIR | 2104.08663 |
| RankLLM | 2310.18548 |
| Jina ColBERT | 2402.14759 |
Development
pip install -e ".[dev]"
pytest tests/ -v
ruff format . && ruff check .
mypy biwu/
License
GPL-3.0-or-later