Skip to main content

Pair-wise ELO evaluation arena for local LLMs.

Project description

ollama-arena

A pair-wise evaluation harness for locally hosted language models. Runs matches between two models on a shared task set, scores each response deterministically (or with an LLM judge), and maintains an ELO rating across runs.

pip install git+https://github.com/nazkari86-lab/ollama-arena.git
ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b -n 20
match 1/1   llama3.2:3b  vs  qwen2.5-coder:7b
  code_001   1.00  vs  1.00   draw
  code_002   0.00  vs  1.00   B
  humaneval_3 1.00 vs 1.00    draw
  ...

rank  model                elo    W   L   D   matches  win%
1     qwen2.5-coder:7b    1271    7   1   2     10     70%
2     llama3.2:3b         1129    1   7   2     10     10%

Why

When you have several local models, you want a quick answer to "which one is better at X?" — without renting GPUs or signing up for a judging API. Existing harnesses (lm-evaluation-harness, lighteval, simple-evals) are absolute-score frameworks designed for paper-grade reporting; they are overkill for the day-to-day "should I switch from llama3.2 to qwen2.5?" question. ollama-arena answers that question with pair-wise battles, a local SQLite ELO table, and built-in or HuggingFace task pools.

ELO rather than Glicko-2 because (a) the implementation is two lines, and (b) for a moderate number of models the difference is negligible.

Install

pip install git+https://github.com/nazkari86-lab/ollama-arena.git

Optional extras (append to the URL, or clone and pip install '.[extra]'):

Extra Adds
[all] web dashboard, Plotly charts, HuggingFace datasets
[hf] in-process TransformersBackend (torch, transformers)
[finetune] Unsloth fine-tune pipeline — CUDA recommended
# clone for extras
git clone https://github.com/nazkari86-lab/ollama-arena.git
cd ollama-arena
pip install '.[all]'

The HuggingFace and fine-tune extras pull large dependencies and are off by default.

Quick start

ollama serve
ollama pull llama3.2:3b
ollama pull qwen2.5-coder:7b

ollama-arena match --models llama3.2:3b,qwen2.5-coder:7b --category coding -n 10
ollama-arena leaderboard

ELO state lives in arena.db in the working directory. Pass --db to share a leaderboard between runs in different folders.

Backends

Anything that exposes Ollama's native API or the OpenAI /v1/chat/completions shape works without code changes:

ollama-arena --backend ollama   match ...        # default, :11434
ollama-arena --backend vllm     match ...        # :8000
ollama-arena --backend lmstudio match ...        # :1234
ollama-arena --backend llamacpp match ...        # :8080
ollama-arena --backend openai     --api-key sk-... match ...
ollama-arena --backend groq       --api-key gsk-... match ...
ollama-arena --backend together   --api-key tg-... match ...
ollama-arena --backend openrouter --api-key sk-or-... match ...

Or pass a full URL:

ollama-arena --backend http://192.168.1.50:8000/v1 match ...

A TransformersBackend is also available for in-process generation via PyTorch; it is lazily imported so the dependency is optional.

Tasks

The package ships with about 100 hand-written tasks across five categories: coding (Python plus JS/TS/Rust/Go/C++), reasoning, security, inspection, and planning. They are intended as a smoke-test starter pack, not a definitive benchmark.

For serious work, load a HuggingFace dataset:

ollama-arena datasets                       # registered datasets
ollama-arena datasets --pull humaneval,gsm8k
ollama-arena match --dataset humaneval --models A,B -n 50

Registered loaders (more in ollama_arena/datasets/loader.py):

name source reference
humaneval openai_humaneval Chen et al., 2021
mbpp mbpp Austin et al., 2021
mbpp_plus evalplus/mbppplus Liu et al., 2023
gsm8k gsm8k Cobbe et al., 2021
mmlu cais/mmlu Hendrycks et al., 2021
bbh lukaemon/bbh Suzgun et al., 2022
multipl_e nuprl/MultiPL-E Cassano et al., 2022
hellaswag hellaswag Zellers et al., 2019
truthfulqa truthful_qa Lin et al., 2022
arc ai2_arc Clark et al., 2018

Downloads are cached in ~/.cache/ollama_arena/datasets/. Override with OLLAMA_ARENA_CACHE.

Scoring

Each task carries its own scorer:

  • coding — extract the code block, append the task's test cases, and execute in the matching language sandbox. Score is 1.0 on a clean exit, 0.0 otherwise.
  • math, knowledge — numeric tolerance / multiple-choice letter match.
  • reasoning — prefix or substring match against expected_answer.
  • security, inspection, planning — keyword presence over an expected set of issues / key components.
  • open-ended — when task["use_judge"] is set and the arena is constructed with judge_model=..., the LLMJudge grades each pair in both orderings (A then B, B then A) and averages, to suppress position bias. This is meaningfully more expensive — the judge is invoked twice per task, on top of the two model generations.

Code is executed in a subprocess with a hardened pattern filter (rm -rf, shell=True, raw sockets, …) and a strict timeout. For untrusted code, pass use_docker=True to run_in_language(); containers run with --network=none --read-only --memory=512m --cpus=1.

Languages

The sandbox dispatches by the language field on each task. Detected at runtime from $PATH:

language runtime needed
python python3
javascript node
typescript tsx, ts-node, or deno
rust rustc (edition 2021)
go go ≥ 1.20
cpp g++ or clang++ (-std=c++17)
bash bash

ollama-arena tasks shows which languages are currently runnable.

CLI

ollama-arena match        --models A,B [--category C] [--dataset NAME] [--difficulty L]
ollama-arena tournament   --models A,B,C,...
ollama-arena leaderboard
ollama-arena perf
ollama-arena list
ollama-arena tasks
ollama-arena datasets     [--pull NAMES] [--refresh NAMES]
ollama-arena finetune     --analyze | --generate | --train PATH
ollama-arena export       --out report.html
ollama-arena web          [--port 7860]

Global flags: --backend, --api-key, --db, --ollama.

Python

from ollama_arena import Arena

arena = Arena()                                      # Ollama on :11434
# arena = Arena(backend="vllm")
# arena = Arena(backend="groq", api_key="gsk_...")

arena.load_hf_dataset("humaneval", limit=50)

result = arena.run_match(
    "llama3.2:3b", "qwen2.5-coder:7b",
    category="coding", n=20,
)
print(result.elo_a_after, result.elo_b_after)

Round-robin between several models:

arena.run_tournament(
    ["llama3.2:3b", "qwen2.5-coder:7b", "gemma2:9b"],
    category="reasoning", n_per_match=10,
)

LLM judge for open-ended responses:

arena = Arena(judge_model="qwen2.5:32b-instruct")
# tasks marked {"use_judge": True} are graded by the judge in both orderings

Export a standalone HTML dashboard (Plotly):

from ollama_arena.visualize import export_dashboard

export_dashboard(
    "report.html",
    leaderboard=arena.leaderboard(),
    matches=arena.match_history(limit=500),
    categories=["coding", "reasoning", "security", "planning", "inspection"],
    performance=arena.performance_stats(),
)

Performance metrics

Every generation logs prompt tokens, output tokens, latency, tokens/sec, and time-to-first-token. ollama-arena perf prints per-model aggregates:

model              samples  tps mean  tps p95  lat mean  lat p95  ttft
llama3.2:3b           120     48.2     52.1     4.2s     6.3s    0.3s
qwen2.5-coder:7b      120     31.7     34.0     8.1s    11.2s    0.5s

These numbers are backend numbers — they include HTTP overhead, the model server's scheduling, batching, and so on. They are useful as relative comparisons within one backend; treat absolute values with care.

Fine-tuning loop

A small pipeline turns arena failures into a teacher-distilled SFT dataset, runs Unsloth LoRA on it, exports a GGUF and registers the result as an Ollama model. End-to-end example: examples/finetune_pipeline.py.

CUDA is required for the Unsloth step.

Limitations

  • ELO updates per task, not per match. This converges faster but is noisier than the official chess formula for small sample sizes.
  • The keyword-based scorers for security/inspection/planning are approximate. They reward mentioning the right thing, not necessarily understanding it. Use the LLM judge for higher-stakes scoring.
  • Sandbox isolation without Docker relies on the subprocess timeout and the static pattern filter. Do not feed model output from untrusted sources to the host sandbox.
  • HuggingFace dataset normalization is per-loader; some upstream schema changes will require updates to loader.py.

Contributing

See CONTRIBUTING.md. The most useful contributions are new dataset loaders, new language sandboxes, and new backends; each takes only a few dozen lines.

License

MIT. See LICENSE.

Logo
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⢄⡲⠖⠛⠉⠉⠉⠉⠉⠙⠛⠿⣿⣶⣦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠖⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣿⣿⣿⣿⣷⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠔⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡔⢡⣶⠏⠀⠀⠀⠀⠀⠀⣠⣴⣶⣶⣶⣶⣶⣶⣦⣄⣸⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠌⢀⣿⠏⠀⠀⠀⠀⠀⠀⠸⠿⠋⠙⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡞⠀⡼⢿⣦⣄⠠⠤⠐⠒⠒⠒⠢⠤⣄⣠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⠀⠀⠀⣸⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⢠⠞⠁⠀⠀⠠⠇⣀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠈⠙⠛⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣴⣁⠀⣀⣤⣴⣾⣿⣿⣿⣿⡿⢿⣿⣶⣄⠀⠀⠀⠀⠀⣿⣷⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⣿⡇⠀⢸⣿⣿⣿⡇⠘⠟⣻⣿⣧⠀⠀⠀⠀⢿⣿⣤⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⣿⣿⡿⠀⠀⠸⣿⠿⠋⠉⠁⠛⠻⠿⢿⣧⠀⠀⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣿⡿⠋⠁⠀⢀⣄⡀⠀⠀⠀⢀⣀⣤⣴⣿⣿⣧⠀⢀⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⠏⢀⠀⢀⡴⠿⣿⣿⣷⣶⣾⣿⣿⣿⣿⣿⣿⣿⣇⠀⢷⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣿⣿⣤⣿⣷⡈⠀⠀⠀⠙⠻⣿⣿⣿⣿⠿⠛⠛⣻⣿⣿⡄⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⣿⣿⣄⠀⠀⠀⠀⠈⠋⢉⣠⣴⣾⣿⣿⣿⣿⣷⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⣿⢻⡏⢹⠙⡆⠀⠀⠀⠒⠚⢛⣉⣉⣿⣿⣿⣿⣿⣿⡇⠀⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢀⡞⠁⠉⠀⠁⠀⣄⣀⣠⣴⣶⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⣈⡛⢻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠛⠋⠉⠉⠉⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⡀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡷⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⣻⠿⠿⢿⣿⠿⠿⠋⠁⠀⠙⣿⡁⠈⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡟⠛⠋⠉⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠴⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣈⣹⣦⣴⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣤⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⣀⣀⣀⣼⣿⣄⣀⣀⡄⠀⣀⣀⣠⣤⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠀⠀
⠀⠀⠀⠀⠀⢰⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠟⠉⠀⠀⣰⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀
⠀⠀⠀⢀⣤⣤⣤⣶⣿⣿⣿⣿⠿⠿⠟⠋⢹⠇⠀⠀⢀⣼⣿⣿⣿⣿⣿⡿⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⠀⢀⣴⣿⣿⣿⣿⣿⣿⣿⡟⠁⠀⠀⠀⢀⡏⠀⠀⢀⣾⠋⣹⣿⣿⣿⡟⠀⠀⣸⡟⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇
⢠⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⠀⠀⠀⡼⠀⠀⢀⣾⠏⢀⣿⣿⣿⠋⠀⠀⣰⣿⣧⡀⠹⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ollama_arena-2.1.2.tar.gz (63.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ollama_arena-2.1.2-py3-none-any.whl (68.3 kB view details)

Uploaded Python 3

File details

Details for the file ollama_arena-2.1.2.tar.gz.

File metadata

  • Download URL: ollama_arena-2.1.2.tar.gz
  • Upload date:
  • Size: 63.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for ollama_arena-2.1.2.tar.gz
Algorithm Hash digest
SHA256 0b7d5dcbe72282ba6cc3f08fa69c33de83800b8e4091e44f5b6e91365edc4fc2
MD5 b201c9c1903acacc6393f7bfdb4b3f8d
BLAKE2b-256 786935f37b20b3131bf6000c7d2b699f5bb3792de94addf1df9a92573e28cec4

See more details on using hashes here.

File details

Details for the file ollama_arena-2.1.2-py3-none-any.whl.

File metadata

  • Download URL: ollama_arena-2.1.2-py3-none-any.whl
  • Upload date:
  • Size: 68.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for ollama_arena-2.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 86b3dcd56e627c6686c542502961ccf2aeec7f6b3ef6299c5641027dd6f23e7a
MD5 d580c8fa5d8e1ddfd61c2d1f9c63733a
BLAKE2b-256 5a8c5e2e11dcf82014986445c4618f12ef52030e9b3f5613552b5de52abb0270

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page