Pair-wise ELO evaluation arena for local LLMs.
Project description
ollama-arena
A pair-wise evaluation harness for locally hosted language models. Runs matches between two models on a shared task set, scores each response deterministically (or with an LLM judge), and maintains an ELO rating across runs.
pip install git+https://github.com/nazkari86-lab/ollama-arena.git
ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b -n 20
match 1/1 llama3.2:3b vs qwen2.5-coder:7b
code_001 1.00 vs 1.00 draw
code_002 0.00 vs 1.00 B
humaneval_3 1.00 vs 1.00 draw
...
rank model elo W L D matches win%
1 qwen2.5-coder:7b 1271 7 1 2 10 70%
2 llama3.2:3b 1129 1 7 2 10 10%
Why
When you have several local models, you want a quick answer to "which one is better at X?" — without renting GPUs or signing up for a judging API. Existing harnesses (lm-evaluation-harness, lighteval, simple-evals) are absolute-score frameworks designed for paper-grade reporting; they are overkill for the day-to-day "should I switch from llama3.2 to qwen2.5?" question. ollama-arena answers that question with pair-wise battles, a local SQLite ELO table, and built-in or HuggingFace task pools.
ELO rather than Glicko-2 because (a) the implementation is two lines, and (b) for a moderate number of models the difference is negligible.
Install
pip install git+https://github.com/nazkari86-lab/ollama-arena.git
Optional extras (append to the URL, or clone and pip install '.[extra]'):
| Extra | Adds |
|---|---|
[all] |
web dashboard, Plotly charts, HuggingFace datasets |
[hf] |
in-process TransformersBackend (torch, transformers) |
[finetune] |
Unsloth fine-tune pipeline — CUDA recommended |
# clone for extras
git clone https://github.com/nazkari86-lab/ollama-arena.git
cd ollama-arena
pip install '.[all]'
The HuggingFace and fine-tune extras pull large dependencies and are off by default.
Quick start
ollama serve
ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b
ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b --category coding -n 10
ollama-arena leaderboard
ELO state lives in arena.db in the working directory. Pass --db to
share a leaderboard between runs in different folders.
Backends
Anything that exposes Ollama's native API or the OpenAI
/v1/chat/completions shape works without code changes:
ollama-arena --backend ollama match ... # default, :11434
ollama-arena --backend vllm match ... # :8000
ollama-arena --backend lmstudio match ... # :1234
ollama-arena --backend llamacpp match ... # :8080
ollama-arena --backend openai --api-key sk-... match ...
ollama-arena --backend groq --api-key gsk-... match ...
ollama-arena --backend together --api-key tg-... match ...
ollama-arena --backend openrouter --api-key sk-or-... match ...
Or pass a full URL:
ollama-arena --backend http://192.168.1.50:8000/v1 match ...
A TransformersBackend is also available for in-process generation via
PyTorch; it is lazily imported so the dependency is optional.
Tasks
The package ships with about 100 hand-written tasks across five categories: coding (Python plus JS/TS/Rust/Go/C++), reasoning, security, inspection, and planning. They are intended as a smoke-test starter pack, not a definitive benchmark.
For serious work, load a HuggingFace dataset:
ollama-arena datasets # registered datasets
ollama-arena datasets --pull humaneval,gsm8k
ollama-arena match --dataset humaneval --models A,B -n 50
Registered loaders (more in ollama_arena/datasets/loader.py):
| name | source | reference |
|---|---|---|
| humaneval | openai_humaneval | Chen et al., 2021 |
| mbpp | mbpp | Austin et al., 2021 |
| mbpp_plus | evalplus/mbppplus | Liu et al., 2023 |
| gsm8k | gsm8k | Cobbe et al., 2021 |
| mmlu | cais/mmlu | Hendrycks et al., 2021 |
| bbh | lukaemon/bbh | Suzgun et al., 2022 |
| multipl_e | nuprl/MultiPL-E | Cassano et al., 2022 |
| hellaswag | hellaswag | Zellers et al., 2019 |
| truthfulqa | truthful_qa | Lin et al., 2022 |
| arc | ai2_arc | Clark et al., 2018 |
Downloads are cached in ~/.cache/ollama_arena/datasets/. Override with
OLLAMA_ARENA_CACHE.
Scoring
Each task carries its own scorer:
- coding — extract the code block, append the task's test cases, and execute in the matching language sandbox. Score is 1.0 on a clean exit, 0.0 otherwise.
- math, knowledge — numeric tolerance / multiple-choice letter match.
- reasoning — prefix or substring match against
expected_answer. - security, inspection, planning — keyword presence over an expected set of issues / key components.
- open-ended — when
task["use_judge"]is set and the arena is constructed withjudge_model=..., the LLMJudge grades each pair in both orderings (A then B, B then A) and averages, to suppress position bias. This is meaningfully more expensive — the judge is invoked twice per task, on top of the two model generations.
Code is executed in a subprocess with a hardened pattern filter
(rm -rf, shell=True, raw sockets, …) and a strict timeout. For
untrusted code, pass use_docker=True to run_in_language(); containers
run with --network=none --read-only --memory=512m --cpus=1.
Languages
The sandbox dispatches by the language field on each task. Detected at
runtime from $PATH:
| language | runtime needed |
|---|---|
| python | python3 |
| javascript | node |
| typescript | tsx, ts-node, or deno |
| rust | rustc (edition 2021) |
| go | go ≥ 1.20 |
| cpp | g++ or clang++ (-std=c++17) |
| bash | bash |
ollama-arena tasks shows which languages are currently runnable.
CLI
ollama-arena match --models A,B [--category C] [--dataset NAME] [--difficulty L]
ollama-arena tournament --models A,B,C,...
ollama-arena leaderboard
ollama-arena perf
ollama-arena list
ollama-arena tasks
ollama-arena datasets [--pull NAMES] [--refresh NAMES]
ollama-arena finetune --analyze | --generate | --train PATH
ollama-arena export --out report.html
ollama-arena web [--port 7860]
Global flags: --backend, --api-key, --db, --ollama.
Python
from ollama_arena import Arena
arena = Arena() # Ollama on :11434
# arena = Arena(backend="vllm")
# arena = Arena(backend="groq", api_key="gsk_...")
arena.load_hf_dataset("humaneval", limit=50)
result = arena.run_match(
"llama3.2:3b", "qwen2.5-coder:7b",
category="coding", n=20,
)
print(result.elo_a_after, result.elo_b_after)
Round-robin between several models:
arena.run_tournament(
["llama3.2:3b", "qwen2.5-coder:7b", "gemma2:9b"],
category="reasoning", n_per_match=10,
)
LLM judge for open-ended responses:
arena = Arena(judge_model="qwen2.5:32b-instruct")
# tasks marked {"use_judge": True} are graded by the judge in both orderings
Export a standalone HTML dashboard (Plotly):
from ollama_arena.visualize import export_dashboard
export_dashboard(
"report.html",
leaderboard=arena.leaderboard(),
matches=arena.match_history(limit=500),
categories=["coding", "reasoning", "security", "planning", "inspection"],
performance=arena.performance_stats(),
)
Performance metrics
Every generation logs prompt tokens, output tokens, latency, tokens/sec,
and time-to-first-token. ollama-arena perf prints per-model
aggregates:
model samples tps mean tps p95 lat mean lat p95 ttft
llama3.2:3b 120 48.2 52.1 4.2s 6.3s 0.3s
qwen2.5-coder:7b 120 31.7 34.0 8.1s 11.2s 0.5s
These numbers are backend numbers — they include HTTP overhead, the model server's scheduling, batching, and so on. They are useful as relative comparisons within one backend; treat absolute values with care.
Fine-tuning loop
A small pipeline turns arena failures into a teacher-distilled SFT
dataset, runs Unsloth LoRA on it, exports a GGUF and registers the
result as an Ollama model. End-to-end example:
examples/finetune_pipeline.py.
CUDA is required for the Unsloth step.
Limitations
- ELO updates per task, not per match. This converges faster but is noisier than the official chess formula for small sample sizes.
- The keyword-based scorers for security/inspection/planning are approximate. They reward mentioning the right thing, not necessarily understanding it. Use the LLM judge for higher-stakes scoring.
- Sandbox isolation without Docker relies on the subprocess timeout and the static pattern filter. Do not feed model output from untrusted sources to the host sandbox.
- HuggingFace dataset normalization is per-loader; some upstream schema
changes will require updates to
loader.py.
Contributing
See CONTRIBUTING.md. The most useful contributions are new dataset
loaders, new language sandboxes, and new backends; each takes only a few
dozen lines.
License
MIT. See LICENSE.
Logo
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⢄⡲⠖⠛⠉⠉⠉⠉⠉⠙⠛⠿⣿⣶⣦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣿⣿⣿⣿⣷⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡔⢡⣶⠏⠀⠀⠀⠀⠀⠀⣠⣴⣶⣶⣶⣶⣶⣶⣦⣄⣸⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠌⢀⣿⠏⠀⠀⠀⠀⠀⠀⠸⠿⠋⠙⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡞⠀⡼⢿⣦⣄⠠⠤⠐⠒⠒⠒⠢⠤⣄⣠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠀⠀⠀⣸⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⢠⠞⠁⠀⠀⠠⠇⣀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠈⠙⠛⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣴⣁⠀⣀⣤⣴⣾⣿⣿⣿⣿⡿⢿⣿⣶⣄⠀⠀⠀⠀⠀⣿⣷⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⡇⠀⢸⣿⣿⣿⡇⠘⠟⣻⣿⣧⠀⠀⠀⠀⢿⣿⣤⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⡿⠀⠀⠸⣿⠿⠋⠉⠁⠛⠻⠿⢿⣧⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⡿⠋⠁⠀⢀⣄⡀⠀⠀⠀⢀⣀⣤⣴⣿⣿⣧⠀⢀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⠏⢀⠀⢀⡴⠿⣿⣿⣷⣶⣾⣿⣿⣿⣿⣿⣿⣿⣇⠀⢷⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣤⣿⣷⡈⠀⠀⠀⠙⠻⣿⣿⣿⣿⠿⠛⠛⣻⣿⣿⡄⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣄⠀⠀⠀⠀⠈⠋⢉⣠⣴⣾⣿⣿⣿⣿⣷⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⢻⡏⢹⠙⡆⠀⠀⠀⠒⠚⢛⣉⣉⣿⣿⣿⣿⣿⣿⡇⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢀⡞⠁⠉⠀⠁⠀⣄⣀⣠⣴⣶⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣈⡛⢻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠛⠋⠉⠉⠉⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡷⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⣻⠿⠿⢿⣿⠿⠿⠋⠁⠀⠙⣿⡁⠈⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡟⠛⠋⠉⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠴⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣈⣹⣦⣴⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⣀⣀⣀⣼⣿⣄⣀⣀⡄⠀⣀⣀⣠⣤⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠀⠀
⠀⠀⠀⠀⠀⢰⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠟⠉⠀⠀⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀
⠀⠀⠀⢀⣤⣤⣤⣶⣿⣿⣿⣿⠿⠿⠟⠋⢹⠇⠀⠀⢀⣼⣿⣿⣿⣿⣿⡿⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⠀⢀⣴⣿⣿⣿⣿⣿⣿⣿⡟⠁⠀⠀⠀⢀⡏⠀⠀⢀⣾⠋⣹⣿⣿⣿⡟⠀⠀⣸⡟⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⢠⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⠀⠀⠀⡼⠀⠀⢀⣾⠏⢀⣿⣿⣿⠋⠀⠀⣰⣿⣧⡀⠹⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ollama_arena-2.1.2.tar.gz.
File metadata
- Download URL: ollama_arena-2.1.2.tar.gz
- Upload date:
- Size: 63.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0b7d5dcbe72282ba6cc3f08fa69c33de83800b8e4091e44f5b6e91365edc4fc2
|
|
| MD5 |
b201c9c1903acacc6393f7bfdb4b3f8d
|
|
| BLAKE2b-256 |
786935f37b20b3131bf6000c7d2b699f5bb3792de94addf1df9a92573e28cec4
|
File details
Details for the file ollama_arena-2.1.2-py3-none-any.whl.
File metadata
- Download URL: ollama_arena-2.1.2-py3-none-any.whl
- Upload date:
- Size: 68.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
86b3dcd56e627c6686c542502961ccf2aeec7f6b3ef6299c5641027dd6f23e7a
|
|
| MD5 |
d580c8fa5d8e1ddfd61c2d1f9c63733a
|
|
| BLAKE2b-256 |
5a8c5e2e11dcf82014986445c4618f12ef52030e9b3f5613552b5de52abb0270
|