Pair-wise ELO evaluation arena for local LLMs.

These details have not been verified by PyPI

Project links

Project description

ollama-arena

A pair-wise evaluation harness for locally hosted language models. Runs matches between two models on a shared task set, scores each response deterministically (or with an LLM judge), and maintains an ELO rating across runs.

pip install git+https://github.com/nazkari86-lab/ollama-arena.git
ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b -n 20

match 1/1   llama3.2:3b  vs  qwen2.5-coder:7b
  code_001   1.00  vs  1.00   draw
  code_002   0.00  vs  1.00   B
  humaneval_3 1.00 vs 1.00    draw
  ...

rank  model                elo    W   L   D   matches  win%
1     qwen2.5-coder:7b    1271    7   1   2     10     70%
2     llama3.2:3b         1129    1   7   2     10     10%

Why

When you have several local models, you want a quick answer to "which one is better at X?" — without renting GPUs or signing up for a judging API. Existing harnesses (lm-evaluation-harness, lighteval, simple-evals) are absolute-score frameworks designed for paper-grade reporting; they are overkill for the day-to-day "should I switch from llama3.2 to qwen2.5?" question. ollama-arena answers that question with pair-wise battles, a local SQLite ELO table, and built-in or HuggingFace task pools.

ELO rather than Glicko-2 because (a) the implementation is two lines, and (b) for a moderate number of models the difference is negligible.

Install

pip install git+https://github.com/nazkari86-lab/ollama-arena.git

Optional extras (append to the URL, or clone and pip install '.[extra]'):

Extra	Adds
`[all]`	web dashboard, Plotly charts, HuggingFace datasets
`[hf]`	in-process TransformersBackend (torch, transformers)
`[finetune]`	Unsloth fine-tune pipeline — CUDA recommended

# clone for extras
git clone https://github.com/nazkari86-lab/ollama-arena.git
cd ollama-arena
pip install '.[all]'

The HuggingFace and fine-tune extras pull large dependencies and are off by default.

Quick start

ollama serve
ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b

ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b --category coding -n 10
ollama-arena leaderboard

ELO state lives in arena.db in the working directory. Pass --db to share a leaderboard between runs in different folders.

Backends

Anything that exposes Ollama's native API or the OpenAI /v1/chat/completions shape works without code changes:

ollama-arena --backend ollama   match ...        # default, :11434
ollama-arena --backend vllm     match ...        # :8000
ollama-arena --backend lmstudio match ...        # :1234
ollama-arena --backend llamacpp match ...        # :8080
ollama-arena --backend openai     --api-key sk-... match ...
ollama-arena --backend groq       --api-key gsk-... match ...
ollama-arena --backend together   --api-key tg-... match ...
ollama-arena --backend openrouter --api-key sk-or-... match ...

Or pass a full URL:

ollama-arena --backend http://192.168.1.50:8000/v1 match ...

A TransformersBackend is also available for in-process generation via PyTorch; it is lazily imported so the dependency is optional.

Tasks

The package ships with about 100 hand-written tasks across five categories: coding (Python plus JS/TS/Rust/Go/C++), reasoning, security, inspection, and planning. They are intended as a smoke-test starter pack, not a definitive benchmark.

For serious work, load a HuggingFace dataset:

ollama-arena datasets                       # registered datasets
ollama-arena datasets --pull humaneval,gsm8k
ollama-arena match --dataset humaneval --models A,B -n 50

Registered loaders (more in ollama_arena/datasets/loader.py):

name	source	reference
humaneval	openai_humaneval	Chen et al., 2021
mbpp	mbpp	Austin et al., 2021
mbpp_plus	evalplus/mbppplus	Liu et al., 2023
gsm8k	gsm8k	Cobbe et al., 2021
mmlu	cais/mmlu	Hendrycks et al., 2021
bbh	lukaemon/bbh	Suzgun et al., 2022
multipl_e	nuprl/MultiPL-E	Cassano et al., 2022
hellaswag	hellaswag	Zellers et al., 2019
truthfulqa	truthful_qa	Lin et al., 2022
arc	ai2_arc	Clark et al., 2018

Downloads are cached in ~/.cache/ollama_arena/datasets/. Override with OLLAMA_ARENA_CACHE.

Scoring

Each task carries its own scorer:

coding — extract the code block, append the task's test cases, and execute in the matching language sandbox. Score is 1.0 on a clean exit, 0.0 otherwise.
math, knowledge — numeric tolerance / multiple-choice letter match.
reasoning — prefix or substring match against expected_answer.
security, inspection, planning — keyword presence over an expected set of issues / key components.
open-ended — when task["use_judge"] is set and the arena is constructed with judge_model=..., the LLMJudge grades each pair in both orderings (A then B, B then A) and averages, to suppress position bias. This is meaningfully more expensive — the judge is invoked twice per task, on top of the two model generations.

Code is executed in a subprocess with a hardened pattern filter (rm -rf, shell=True, raw sockets, …) and a strict timeout. For untrusted code, pass use_docker=True to run_in_language(); containers run with --network=none --read-only --memory=512m --cpus=1.

Languages

The sandbox dispatches by the language field on each task. Detected at runtime from $PATH:

language	runtime needed
python	python3
javascript	node
typescript	tsx, ts-node, or deno
rust	rustc (edition 2021)
go	go ≥ 1.20
cpp	g++ or clang++ (-std=c++17)
bash	bash

ollama-arena tasks shows which languages are currently runnable.

CLI

ollama-arena match        --models A,B [--category C] [--dataset NAME] [--difficulty L]
ollama-arena tournament   --models A,B,C,...
ollama-arena leaderboard
ollama-arena perf
ollama-arena list
ollama-arena tasks
ollama-arena datasets     [--pull NAMES] [--refresh NAMES]
ollama-arena finetune     --analyze | --generate | --train PATH
ollama-arena export       --out report.html
ollama-arena web          [--port 7860]

Global flags: --backend, --api-key, --db, --ollama.

Python

from ollama_arena import Arena

arena = Arena()                                      # Ollama on :11434
# arena = Arena(backend="vllm")
# arena = Arena(backend="groq", api_key="gsk_...")

arena.load_hf_dataset("humaneval", limit=50)

result = arena.run_match(
    "llama3.2:3b", "qwen2.5-coder:7b",
    category="coding", n=20,
)
print(result.elo_a_after, result.elo_b_after)

Round-robin between several models:

arena.run_tournament(
    ["llama3.2:3b", "qwen2.5-coder:7b", "gemma2:9b"],
    category="reasoning", n_per_match=10,
)

LLM judge for open-ended responses:

arena = Arena(judge_model="qwen2.5:32b-instruct")
# tasks marked {"use_judge": True} are graded by the judge in both orderings

Export a standalone HTML dashboard (Plotly):

from ollama_arena.visualize import export_dashboard

export_dashboard(
    "report.html",
    leaderboard=arena.leaderboard(),
    matches=arena.match_history(limit=500),
    categories=["coding", "reasoning", "security", "planning", "inspection"],
    performance=arena.performance_stats(),
)

Performance metrics

Every generation logs prompt tokens, output tokens, latency, tokens/sec, and time-to-first-token. ollama-arena perf prints per-model aggregates:

model              samples  tps mean  tps p95  lat mean  lat p95  ttft
llama3.2:3b           120     48.2     52.1     4.2s     6.3s    0.3s
qwen2.5-coder:7b      120     31.7     34.0     8.1s    11.2s    0.5s

These numbers are backend numbers — they include HTTP overhead, the model server's scheduling, batching, and so on. They are useful as relative comparisons within one backend; treat absolute values with care.

Fine-tuning loop

A small pipeline turns arena failures into a teacher-distilled SFT dataset, runs Unsloth LoRA on it, exports a GGUF and registers the result as an Ollama model. End-to-end example: examples/finetune_pipeline.py.

CUDA is required for the Unsloth step.

Limitations

ELO updates per task, not per match. This converges faster but is noisier than the official chess formula for small sample sizes.
The keyword-based scorers for security/inspection/planning are approximate. They reward mentioning the right thing, not necessarily understanding it. Use the LLM judge for higher-stakes scoring.
Sandbox isolation without Docker relies on the subprocess timeout and the static pattern filter. Do not feed model output from untrusted sources to the host sandbox.
HuggingFace dataset normalization is per-loader; some upstream schema changes will require updates to loader.py.

Contributing

See CONTRIBUTING.md. The most useful contributions are new dataset loaders, new language sandboxes, and new backends; each takes only a few dozen lines.

License

MIT. See LICENSE.

Logo

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⢄⡲⠖⠛⠉⠉⠉⠉⠉⠙⠛⠿⣿⣶⣦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣿⣿⣿⣿⣷⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡔⢡⣶⠏⠀⠀⠀⠀⠀⠀⣠⣴⣶⣶⣶⣶⣶⣶⣦⣄⣸⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠌⢀⣿⠏⠀⠀⠀⠀⠀⠀⠸⠿⠋⠙⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡞⠀⡼⢿⣦⣄⠠⠤⠐⠒⠒⠒⠢⠤⣄⣠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠀⠀⠀⣸⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⢠⠞⠁⠀⠀⠠⠇⣀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠈⠙⠛⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣴⣁⠀⣀⣤⣴⣾⣿⣿⣿⣿⡿⢿⣿⣶⣄⠀⠀⠀⠀⠀⣿⣷⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⡇⠀⢸⣿⣿⣿⡇⠘⠟⣻⣿⣧⠀⠀⠀⠀⢿⣿⣤⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⡿⠀⠀⠸⣿⠿⠋⠉⠁⠛⠻⠿⢿⣧⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⡿⠋⠁⠀⢀⣄⡀⠀⠀⠀⢀⣀⣤⣴⣿⣿⣧⠀⢀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⠏⢀⠀⢀⡴⠿⣿⣿⣷⣶⣾⣿⣿⣿⣿⣿⣿⣿⣇⠀⢷⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣤⣿⣷⡈⠀⠀⠀⠙⠻⣿⣿⣿⣿⠿⠛⠛⣻⣿⣿⡄⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣄⠀⠀⠀⠀⠈⠋⢉⣠⣴⣾⣿⣿⣿⣿⣷⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⢻⡏⢹⠙⡆⠀⠀⠀⠒⠚⢛⣉⣉⣿⣿⣿⣿⣿⣿⡇⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢀⡞⠁⠉⠀⠁⠀⣄⣀⣠⣴⣶⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣈⡛⢻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠛⠋⠉⠉⠉⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡷⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⣻⠿⠿⢿⣿⠿⠿⠋⠁⠀⠙⣿⡁⠈⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡟⠛⠋⠉⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠴⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣈⣹⣦⣴⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⣀⣀⣀⣼⣿⣄⣀⣀⡄⠀⣀⣀⣠⣤⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠀⠀
⠀⠀⠀⠀⠀⢰⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠟⠉⠀⠀⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀
⠀⠀⠀⢀⣤⣤⣤⣶⣿⣿⣿⣿⠿⠿⠟⠋⢹⠇⠀⠀⢀⣼⣿⣿⣿⣿⣿⡿⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⠀⢀⣴⣿⣿⣿⣿⣿⣿⣿⡟⠁⠀⠀⠀⢀⡏⠀⠀⢀⣾⠋⣹⣿⣿⣿⡟⠀⠀⣸⡟⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⢠⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⠀⠀⠀⡼⠀⠀⢀⣾⠏⢀⣿⣿⣿⠋⠀⠀⣰⣿⣧⡀⠹⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

2.1.2

Jun 15, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ollama_arena-2.1.2.tar.gz (63.2 kB view details)

Uploaded Jun 15, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ollama_arena-2.1.2-py3-none-any.whl (68.3 kB view details)

Uploaded Jun 15, 2026 Python 3

File details

Details for the file ollama_arena-2.1.2.tar.gz.

File metadata

Download URL: ollama_arena-2.1.2.tar.gz
Upload date: Jun 15, 2026
Size: 63.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for ollama_arena-2.1.2.tar.gz
Algorithm	Hash digest
SHA256	`0b7d5dcbe72282ba6cc3f08fa69c33de83800b8e4091e44f5b6e91365edc4fc2`
MD5	`b201c9c1903acacc6393f7bfdb4b3f8d`
BLAKE2b-256	`786935f37b20b3131bf6000c7d2b699f5bb3792de94addf1df9a92573e28cec4`

See more details on using hashes here.

File details

Details for the file ollama_arena-2.1.2-py3-none-any.whl.

File metadata

Download URL: ollama_arena-2.1.2-py3-none-any.whl
Upload date: Jun 15, 2026
Size: 68.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for ollama_arena-2.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`86b3dcd56e627c6686c542502961ccf2aeec7f6b3ef6299c5641027dd6f23e7a`
MD5	`d580c8fa5d8e1ddfd61c2d1f9c63733a`
BLAKE2b-256	`5a8c5e2e11dcf82014986445c4618f12ef52030e9b3f5613552b5de52abb0270`

See more details on using hashes here.

ollama-arena 2.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ollama-arena

Why

Install

Quick start

Backends

Tasks

Scoring

Languages

CLI

Python

Performance metrics

Fine-tuning loop

Limitations

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes