BiWu (比武)

Comprehensive benchmark suite for LLM, embedding, and reranking models across Ollama, GGUF, ModelScope, and HuggingFace backends

Features

🧠 LLM Benchmarks: MMLU, MMLU-Pro, HellaSwag, ARC-Easy/Challenge, GSM8K, TruthfulQA (MC1/MC2), C-Eval, CMMLU, HumanEval, BBH
📊 Embedding Benchmarks: MTEB Classification, Clustering, Retrieval, STS; C-MTEB (Chinese)
🔄 Reranking Benchmarks: LLM Pointwise Reranking, Embedding Reranking, LLM Listwise Reranking
🔌 Multi-Backend: Ollama, GGUF (llama-cpp-python), ModelScope, HuggingFace
⚡ Interactive Selection: Choose which models and benchmarks to run
📋 Structured Output: JSON output, rich tables, and ToolResult API pattern
🔧 Agent Integration: OpenAI function-calling tools for AI agent use
🎮 GPU VRAM-Aware: Auto-detect GPU, filter models that fit in VRAM

Requirements

Python 3.10+
For Ollama backend: Ollama running locally (default: http://localhost:11434)
For GGUF backend: llama-cpp-python with CUDA support
For ModelScope backend: modelscope + llama-cpp-python
For HuggingFace backend: huggingface_hub + llama-cpp-python
Internet connection (for dataset download, first run only)
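
Before a run, it can help to confirm that the Ollama requirement is met. The following is a minimal sketch and not part of BiWu: it queries Ollama's `/api/tags` endpoint (which lists locally pulled models) with the standard library only. The function names `parse_model_names` and `ollama_models` are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

DEFAULT_HOST = "http://localhost:11434"  # default host from the requirements above


def parse_model_names(payload: str) -> List[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    data = json.loads(payload)
    return [m["name"] for m in data.get("models", []) if "name" in m]


def ollama_models(host: str = DEFAULT_HOST, timeout: float = 3.0) -> Optional[List[str]]:
    """Return the pulled model names, or None if Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None
```

Calling `ollama_models()` returns `None` when nothing is listening on the host, and a (possibly empty) list of names otherwise.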

Installation

pip install -e .

With specific backends:

pip install -e ".[ollama]"       # Ollama backend
pip install -e ".[gguf]"         # GGUF backend
pip install -e ".[modelscope]"   # ModelScope backend
pip install -e ".[huggingface]"  # HuggingFace backend
pip install -e ".[all]"          # All backends
pip install -e ".[dev]"          # Dev dependencies
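
After installing extras, a quick import check shows which backends are usable in the current environment. This sketch is an illustration, not BiWu code: the mapping from backend to module is inferred from the Requirements list (`llama-cpp-python` imports as `llama_cpp`), and the helper names are assumptions.

```python
from importlib.util import find_spec
from typing import Dict, List

# Assumed mapping, taken from the Requirements section of this README.
BACKEND_MODULES = {
    "gguf": ("llama_cpp",),
    "modelscope": ("modelscope", "llama_cpp"),
    "huggingface": ("huggingface_hub", "llama_cpp"),
}


def missing_modules(backend: str) -> List[str]:
    """Return the top-level modules a backend needs that cannot be imported."""
    return [name for name in BACKEND_MODULES[backend] if find_spec(name) is None]


def backend_status() -> Dict[str, List[str]]:
    """Map each backend to its missing modules (an empty list means ready)."""
    return {backend: missing_modules(backend) for backend in BACKEND_MODULES}


if __name__ == "__main__":
    for backend, missing in backend_status().items():
        state = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{backend:12s} {state}")
```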

Quick Start

# One-command auto benchmark: detect GPU, pick models, run all applicable tests
biwu auto

# Auto benchmark without interactive prompt (only models that fit in GPU VRAM)
biwu auto --no-confirm

# Override VRAM for auto benchmark
biwu auto --vram 24 --no-confirm

# List available benchmarks
biwu list benchmarks

# List available models on Ollama
biwu list models

# List models that fit in GPU VRAM
biwu list models --gpu-only

# Show GPU info and model VRAM analysis
biwu gpu

# Run all LLM benchmarks on a model (Ollama)
biwu run -m llama3 --category llm

# Run benchmarks on a GGUF model
biwu run -m /path/to/model.gguf --backend gguf --category llm

# Run benchmarks on a ModelScope model
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm

# Run benchmarks on a HuggingFace model
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm

# Run specific benchmarks
biwu run -m llama3 -b mmlu hellaswag gsm8k

# Auto-select models that fit in GPU VRAM and run benchmarks
biwu run --gpu-only --category llm

# Run full suite (all categories)
biwu suite -m llama3

# Run suite only on models that fit in VRAM
biwu suite --gpu-only

# Run with limited samples for quick testing
biwu run -m llama3 -b mmlu -n 100

# Output results to JSON file
biwu run -m llama3 --category llm --json -o results.json

Usage

CLI Commands

List Benchmarks

biwu list benchmarks
biwu list benchmarks --category llm
biwu list benchmarks --category embedding
biwu list benchmarks --category reranking

List Models

# List models on Ollama (default backend)
biwu list models

# List models that fit in GPU VRAM
biwu list models --gpu-only

# List models for a specific backend
biwu list models --backend gguf
biwu list models --backend modelscope
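
This README does not document how `--gpu-only` decides whether a model fits. One plausible heuristic, sketched here purely as an assumption, is to compare the size of the weights plus a fixed headroom (for the KV cache and runtime buffers) against the available VRAM. The 20% overhead and the `ModelInfo` type are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelInfo:
    name: str
    size_bytes: int  # size of the model weights on disk


def estimate_vram_gib(model: ModelInfo, overhead: float = 0.20) -> float:
    """Weights size in GiB plus a fractional headroom (assumed, not measured)."""
    return model.size_bytes / GIB * (1.0 + overhead)


def split_by_vram(
    models: List[ModelInfo], vram_gib: float, overhead: float = 0.20
) -> Tuple[List[ModelInfo], List[ModelInfo]]:
    """Partition models into (fits in VRAM, needs offload)."""
    fits: List[ModelInfo] = []
    offload: List[ModelInfo] = []
    for model in models:
        target = fits if estimate_vram_gib(model, overhead) <= vram_gib else offload
        target.append(model)
    return fits, offload


if __name__ == "__main__":
    catalog = [ModelInfo("small-q4", 4 * GIB), ModelInfo("large-q8", 40 * GIB)]
    fits, offload = split_by_vram(catalog, vram_gib=24)
    print("fits:", [m.name for m in fits])
    print("offload:", [m.name for m in offload])
```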

GPU & VRAM

biwu gpu
biwu gpu --vram 24
biwu gpu --json
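
For reference, GPU memory can be read from `nvidia-smi` with its CSV query mode. The sketch below is an independent illustration of that approach, not BiWu's implementation; `parse_gpu_csv` and `detect_gpus` are assumed names. `memory.total` is reported in MiB when `nounits` is set.

```python
import shutil
import subprocess
from typing import Dict, List

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> List[Dict[str, object]]:
    """Parse lines such as 'Tesla T4, 15360' into name and VRAM in GiB."""
    gpus: List[Dict[str, object]] = []
    for line in text.strip().splitlines():
        name, _, mem_mib = line.rpartition(",")
        try:
            vram_gib = round(int(mem_mib) / 1024, 1)
        except ValueError:
            continue  # skip lines without a numeric memory field, e.g. "[N/A]"
        if name.strip():
            gpus.append({"name": name.strip(), "vram_gib": vram_gib})
    return gpus


def detect_gpus() -> List[Dict[str, object]]:
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_gpu_csv(result.stdout)
```

On a machine without an NVIDIA driver, `detect_gpus()` returns an empty list, which is the situation where a manual override such as `--vram 24` is useful.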

Auto Benchmark (Hardware-Aware)

# Interactive mode
biwu auto

# Non-interactive: auto-select all models that fit in GPU VRAM
biwu auto --no-confirm

# Override VRAM
biwu auto --vram 16 --no-confirm

# Use a specific backend
biwu auto --backend gguf --no-confirm
biwu auto --backend modelscope --no-confirm

The auto command:

1. Detects your GPU VRAM via nvidia-smi
2. Lists models with VRAM estimates
3. Presents an interactive multi-select menu (models that fit in GPU VRAM are pre-selected)
4. Matches each model to its applicable benchmarks (LLM/embedding/reranking)
5. Runs all benchmarks and outputs results

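Interactive selection accepts comma-separated indices (1,3,5), a range (1-3), all, gpu (the default, which selects the models that fit in VRAM), or q to quit.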
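
The selection syntax above can be parsed in a few lines. This is a sketch of one way to do it, under the assumptions that indices are 1-based, that empty input means the `gpu` default, and that `q` aborts; `parse_selection` is a hypothetical helper, not a BiWu function.

```python
from typing import List, Optional, Set


def parse_selection(text: str, total: int, gpu_fit: Set[int]) -> Optional[List[int]]:
    """Turn a menu answer into sorted 1-based indices; None means quit."""
    choice = text.strip().lower()
    if choice == "q":
        return None
    if choice in ("", "gpu"):  # assumed: pressing Enter picks the default
        return sorted(gpu_fit)
    if choice == "all":
        return list(range(1, total + 1))

    picked: Set[int] = set()
    for part in choice.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            start, end = int(low), int(high)
        else:
            start = end = int(part)
        if start > end or start < 1 or end > total:
            raise ValueError(f"selection out of range: {part!r}")
        picked.update(range(start, end + 1))
    return sorted(picked)


if __name__ == "__main__":
    for answer in ("1,3,5", "1-3", "all", "gpu", "q"):
        print(repr(answer), "->", parse_selection(answer, total=5, gpu_fit={1, 2}))
```

Malformed input such as `a-b` raises `ValueError`, so a caller can re-prompt instead of silently running the wrong models.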

Run Benchmarks

# Ollama (default)
biwu run -m llama3 -b mmlu hellaswag

# GGUF
biwu run -m /path/to/model.gguf --backend gguf --category llm

# ModelScope
biwu run -m Qwen/Qwen2-7B-Instruct-GGUF --backend modelscope --category llm

# HuggingFace
biwu run -m TheBloke/Llama-2-7B-GGUF --backend huggingface --category llm

# Multiple models
biwu run -m llama3 mistral qwen2 --category llm

# GPU-only models
biwu run --gpu-only --category llm

# Custom Ollama host
biwu run -m llama3 -b mmlu --host http://192.168.1.100:11434

Run Full Suite

biwu suite -m llama3
biwu suite --gpu-only
biwu suite -m nomic-embed-text --category embedding
biwu suite -m jina-reranker-v2-small --category reranking
biwu suite -m model.gguf --backend gguf --category full

CLI Flags

Flag	Description
`-V`, `--version`	Show version
`-v`, `--verbose`	Verbose output (debug logging)
`-o`, `--output`	Output to file path (JSON)
`--json`	JSON output format
`-q`, `--quiet`	Suppress non-essential output

Python API

from biwu import ToolResult, list_benchmarks, run_benchmark, run_suite, discover_models, gpu_info, auto_run

# List benchmarks
result = list_benchmarks(category="llm")
print(result.success)    # True
print(result.data)       # List of benchmark info dicts

# Discover models (Ollama by default)
models = discover_models()
print(models.data)       # List of model info dicts

# Discover models on specific backend
models = discover_models(backend="gguf")
models = discover_models(backend="modelscope")
models = discover_models(backend="huggingface")

# GPU-only models
models = discover_models(gpu_only=True)

# GPU information
info = gpu_info()
print(info.data["gpus"])
print(info.data["gpu_fitable_models"])
print(info.data["offload_required_models"])

# Auto-detect hardware, select models that fit in GPU VRAM, run all benchmarks
result = auto_run(gpu_only=True)

# Run a benchmark (Ollama)
result = run_benchmark(benchmark_name="mmlu", model="llama3")

# Run a benchmark (GGUF)
result = run_benchmark(benchmark_name="mmlu", model="/path/to/model.gguf", backend="gguf")

# Run a benchmark (ModelScope)
result = run_benchmark(benchmark_name="mmlu", model="Qwen/Qwen2-7B-Instruct-GGUF", backend="modelscope")

# Run a suite
result = run_suite(models=["llama3", "mistral"], category="llm")
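
The calls above all return an object with `success` and `data` attributes. To show how that result pattern is typically consumed, here is a self-contained sketch with a stand-in class. It is not `biwu.ToolResult`: the `error` field and the `safe_call` and `unwrap` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    """Hypothetical stand-in for a result object with success/data fields."""
    success: bool
    data: Any = None
    error: Optional[str] = None  # assumed field, not confirmed by this README


def safe_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> ToolResult:
    """Run func and capture either its return value or its exception."""
    try:
        return ToolResult(success=True, data=func(*args, **kwargs))
    except Exception as exc:
        return ToolResult(success=False, error=f"{type(exc).__name__}: {exc}")


def unwrap(result: ToolResult) -> Any:
    """Return the payload, or raise if the call failed."""
    if not result.success:
        raise RuntimeError(result.error or "tool call failed")
    return result.data


if __name__ == "__main__":
    ok = safe_call(int, "42")
    bad = safe_call(int, "not a number")
    print(ok.success, unwrap(ok))
    print(bad.success, bad.error)
```

Returning a result object instead of raising keeps a long benchmark loop running when a single model or benchmark fails.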

Agent Integration (OpenAI Function Calling)

from biwu.tools import TOOLS, dispatch

# Use TOOLS in your OpenAI function-calling setup
result = dispatch("biwu_run_benchmark", {
    "benchmark_name": "mmlu",
    "model": "llama3"
})

# Auto benchmark
result = dispatch("biwu_auto_run", {"gpu_only": True})

# GPU info
result = dispatch("biwu_gpu_info", {})

# Discover models (with backend)
result = dispatch("biwu_discover_models", {"backend": "ollama", "gpu_only": True})

# Discover GGUF models
result = dispatch("biwu_discover_models", {"backend": "gguf"})
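
For readers wiring tools into their own agent loop, the sketch below shows the general shape of a tool list in OpenAI's function-calling format and a name-based dispatcher. It is a generic illustration with a placeholder handler; the tool name `demo_run_benchmark` and everything else in it are hypothetical and do not reflect the contents of `biwu.tools`.

```python
import json
from typing import Any, Callable, Dict, Union


def _demo_run_benchmark(benchmark_name: str, model: str) -> Dict[str, str]:
    """Placeholder handler; a real one would run the benchmark."""
    return {"benchmark": benchmark_name, "model": model, "status": "queued"}


HANDLERS: Dict[str, Callable[..., Any]] = {"demo_run_benchmark": _demo_run_benchmark}

# Tool schema in the shape expected by the `tools` parameter of chat completions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "demo_run_benchmark",
            "description": "Run one benchmark on one model.",
            "parameters": {
                "type": "object",
                "properties": {
                    "benchmark_name": {"type": "string"},
                    "model": {"type": "string"},
                },
                "required": ["benchmark_name", "model"],
            },
        },
    }
]


def dispatch(name: str, arguments: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Route a tool call to its handler; arguments may be a dict or a JSON string."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"success": False, "error": f"unknown tool: {name}"}
    kwargs = json.loads(arguments) if isinstance(arguments, str) else dict(arguments)
    try:
        return {"success": True, "data": handler(**kwargs)}
    except TypeError as exc:  # missing or unexpected arguments
        return {"success": False, "error": str(exc)}
```

The model returns tool-call arguments as a JSON string, which is why the dispatcher accepts both strings and dicts.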

Benchmark Categories

LLM Benchmarks

Benchmark	Reference	Metric	Description
MMLU	Hendrycks et al., 2020	Accuracy	57 subjects, multiple-choice
MMLU-Pro	Wang et al., 2024	Accuracy	Harder MMLU with 10 choices
HellaSwag	Zellers et al., 2019	Accuracy	Commonsense NLI sentence completion
ARC-Easy	Clark et al., 2018	Accuracy	Grade-school science (easy)
ARC-Challenge	Clark et al., 2018	Accuracy	Grade-school science (hard)
GSM8K	Cobbe et al., 2021	Accuracy	Math word problems
TruthfulQA MC1	Lin et al., 2021	Accuracy	Single-true truthfulness
TruthfulQA MC2	Lin et al., 2021	Accuracy	Multiple-true truthfulness
C-Eval	Huang et al., 2023	Accuracy	Chinese multi-discipline
CMMLU	Li et al., 2023	Accuracy	Chinese massive multitask
HumanEval	Chen et al., 2021	pass@1	Code generation
BBH	Suzgun et al., 2022	Accuracy	Hard reasoning tasks
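
The two metric types in this table are simple to compute. As a sketch, independent of BiWu's own scoring code, accuracy is the fraction of exact matches, and pass@k can be computed with the unbiased estimator from Chen et al., 2021: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The function names are illustrative.

```python
from math import comb
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
    print(pass_at_k(n=10, c=3, k=1))
```

With a single sample per problem (n = 1, k = 1), pass@1 reduces to whether that one sample passed, so the benchmark score is the plain fraction of solved problems.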

Embedding Benchmarks

Benchmark	Reference	Metric	Description
embed_classification	Muennighoff et al., 2022	Accuracy	k-NN classification
embed_clustering	Muennighoff et al., 2022	V-measure	k-Means clustering
embed_retrieval	Muennighoff et al., 2022	NDCG@10	Cosine-similarity retrieval
embed_sts	Muennighoff et al., 2022	Spearman	Semantic textual similarity
cmteb	Xiao et al., 2023	Accuracy	Chinese MTEB
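
As an illustration of the STS metric, the sketch below scores sentence pairs by cosine similarity of their embeddings and then computes the Spearman rank correlation against gold scores. It uses only the standard library and is not taken from BiWu; the helper names are assumptions, and ties receive average ranks.

```python
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (0.0 if either input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

In an STS evaluation, `x` would hold `cosine(emb_a, emb_b)` for each sentence pair and `y` the human similarity ratings.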

Reranking Benchmarks

Benchmark	Reference	Metric	Description
llm_reranking	Askari et al., 2023	NDCG@5	LLM pointwise reranking
embed_reranking	Muennighoff et al., 2022	NDCG@5	Embedding cosine reranking
llm_listwise_reranking	Askari et al., 2023	NDCG@5	LLM listwise reranking
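
To make the embedding reranking row concrete, the following sketch reorders candidate documents by cosine similarity to the query embedding and scores the resulting order with NDCG@5. It is an illustration under stated assumptions, not BiWu's implementation: it uses the linear-gain form of DCG (relevance divided by log2 of the position plus one), and the function names are hypothetical.

```python
from math import log2, sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain."""
    return sum(rel / log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg_at_k(ranking: Sequence[int], relevance: Sequence[float], k: int = 5) -> float:
    """NDCG@k of a ranking (list of doc indices) given per-document relevance."""
    gains = [relevance[i] for i in ranking[:k]]
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    labels = [0, 1, 0]  # only the second document is relevant
    order = rerank(query, docs)
    print(order, ndcg_at_k(order, labels))
```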

References

All benchmark papers are available in the pdf/ directory:

Paper	arXiv ID
MMLU	2009.03300
HellaSwag	1905.07830
ARC	1803.05457
GSM8K	2110.14168
TruthfulQA	2109.07958
CMMLU	2306.09212
C-Eval	2305.08322
HumanEval	2107.03374
MMLU-Pro	2406.01574
BBH	2210.09261
HELM	2211.09110
MTEB	2210.07316
C-MTEB	2309.07597
BEIR	2104.08663
RankLLM	2310.18548
Jina ColBERT	2402.14759

Development

pip install -e ".[dev]"
pytest tests/ -v
ruff format . && ruff check .
mypy biwu/

License

GPL-3.0-or-later

Current release: 0.2.1 (Apr 30, 2026)
