
A toolkit for LLM-as-a-judge evaluation and arena benchmarks.


๐Ÿ›๏ธ JudgeArena: LLM Evaluation with Swappable Judges

JudgeArena makes it easy to benchmark language models against each other while giving you complete control over the evaluation process. Whether you're comparing proprietary models or testing your own fine-tuned creations, JudgeArena lets you choose your judge.

✨ Key Features

🎯 Flexible Benchmarking – Evaluate models on Alpaca-Eval, Arena-Hard, m-Arena-Hard, and others

🔄 Swappable Judges – Switch between self-hosted (vLLM) and remote judges (OpenAI, Together AI, OpenRouter)

🌍 Multilingual Support – Test models across multiple languages with m-Arena-Hard

🛠️ Provider Agnostic – Works with any model available in LangChain

Here is a feature comparison with other evaluation libraries:

| Framework       | MT-Bench | AlpacaEval | Arena-Hard | M-Arena-Hard | Tuned judge configuration | vLLM judges |
|-----------------|----------|------------|------------|--------------|---------------------------|-------------|
| FastChat        | ✅       | ❌         | ❌         | ❌           | ❌                        | ❌          |
| AlpacaEval      | ❌       | ✅         | ❌         | ❌           | ❌                        | ❌          |
| Arena-Hard-Auto | ❌       | ❌         | ✅         | ❌           | ❌                        | ❌          |
| Lighteval       | ✅       | ❌         | ❌         | ❌           | ❌                        | ❌          |
| Evalchemy       | ✅       | ✅         | ❌         | ❌           | ❌                        | ❌          |
| JudgeArena      | 🔜       | ✅         | ✅         | ✅           | ✅                        | ✅          |

This comparison was compiled in October 2025. If a library has since added a missing feature, please open an issue or send a PR and we will be happy to update the table.

🚀 Quick Start

Installation

git clone https://github.com/OpenEuroLLM/JudgeArena
cd JudgeArena
uv sync 
uv sync --extra vllm      # Optional: install vLLM support
uv sync --extra llamacpp   # Optional: install LlamaCpp support

Basic Evaluation

Compare two models head-to-head:

python judgearena/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A gpt4_1106_preview \
  --model_B VLLM/utter-project/EuroLLM-9B \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 10 

What happens here?

  • Uses the completions already available for gpt4_1106_preview in the Alpaca-Eval dataset
  • Generates completions for model_B with vLLM if they are not already cached
  • Compares the two models using deepseek-chat-v3.1, which is the cheapest option available on OpenRouter

It will then display the results of the battles:

============================================================
                  🏆 MODEL BATTLE RESULTS 🏆
📊 Dataset: alpaca-eval
🤖 Competitors: Model A: gpt4_1106_preview vs Model B: VLLM/utter-project/EuroLLM-9B
⚖️ Judge: OpenRouter/deepseek/deepseek-chat-v3.1
📈 Results Summary:
   Total Battles: 10
   Win Rate (A): 30.0%
   ✅ Wins:   3
   ❌ Losses: 6
   🤝 Ties:   1
============================================================
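For clarity, the win rate in this summary is wins divided by total battles, with ties counted in the denominator but not as wins. This is an assumption inferred from the numbers above, not from JudgeArena's source; the helper below is purely illustrative:

```python
def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate as a percentage of all battles.

    Assumption (inferred from the summary above): ties count toward the
    total number of battles but contribute nothing to the win rate.
    """
    total = wins + losses + ties
    return 100.0 * wins / total

# Matches the summary above: 3 wins out of 10 battles -> 30.0%
print(f"Win Rate (A): {win_rate(3, 6, 1):.1f}%")
```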

Length and Token Parameters

The evaluation scripts expose four different length controls with different roles:

  • --truncate_all_input_chars: character-level truncation applied to prompts before model generation and before judge evaluation.
  • --max_out_tokens_models: generation token budget for each answer from model_A and model_B.
  • --max_out_tokens_judge: generation token budget for the judge completion (reasoning + score output).
  • --max_model_len: optional vLLM context-window limit (prompt + generated tokens), applied to vLLM models; this should be greater than or equal to the two max_out_tokens_* values.
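To illustrate how these controls interact, here is a hypothetical sketch of the character-level truncation and the vLLM context-window constraint. The helper functions are ours, written only to restate the semantics above; they are not JudgeArena's implementation:

```python
def truncate_chars(prompt: str, truncate_all_input_chars: "int | None") -> str:
    """Character-level truncation applied to prompts before model
    generation and before judge evaluation (illustrative sketch)."""
    if truncate_all_input_chars is None:
        return prompt  # no truncation requested
    return prompt[:truncate_all_input_chars]

def max_model_len_ok(max_model_len: int,
                     max_out_tokens_models: int,
                     max_out_tokens_judge: int) -> bool:
    """--max_model_len bounds prompt + generated tokens, so it must be at
    least as large as either generation budget."""
    return max_model_len >= max(max_out_tokens_models, max_out_tokens_judge)

print(truncate_chars("a" * 100, 10))          # first 10 characters only
print(max_model_len_ok(8192, 4096, 2048))     # context window large enough
```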

Engine-Specific Configuration (--engine_kwargs)

Some providers expose additional engine-level knobs (for example, vLLM allows configuring tensor parallelism or GPU memory utilization).
JudgeArena lets you forward these options directly to the underlying engine via --engine_kwargs, which expects a JSON object.

For instance, to run vLLM with tensor parallelism across multiple GPUs:

python judgearena/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A VLLM/Qwen/Qwen2.5-0.5B-Instruct \
  --model_B VLLM/Qwen/Qwen2.5-1.5B-Instruct \
  --judge_model VLLM/Qwen/Qwen3.5-27B-FP8 \
  --n_instructions 10 \
  --engine_kwargs '{"tensor_parallel_size": 2}'

While any key in --engine_kwargs is forwarded to the underlying engine (e.g. vllm.LLM, LlamaCpp, ChatOpenAI), existing dedicated flags such as --max_model_len and --chat_template have higher precedence.
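To make the precedence rule concrete, here is a hypothetical sketch of merging a --engine_kwargs JSON object with a dedicated flag. The function `build_engine_config` is illustrative only, not JudgeArena's actual code:

```python
import json

def build_engine_config(engine_kwargs_json: str, max_model_len=None) -> dict:
    """Parse the --engine_kwargs JSON object and overlay dedicated flags.

    Per the text above, dedicated flags such as --max_model_len take
    precedence over any value supplied inside --engine_kwargs.
    """
    config = json.loads(engine_kwargs_json)
    if max_model_len is not None:
        config["max_model_len"] = max_model_len  # dedicated flag wins
    return config

# The dedicated flag value (8192) overrides the JSON value (2048):
print(build_engine_config('{"tensor_parallel_size": 2, "max_model_len": 2048}',
                          max_model_len=8192))
```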

🎨 Model Specification

Models are specified using the format: {LangChain Backend}/{Model Path}

Examples:

Together/meta-llama/Llama-3.3-70B-Instruct-Turbo
ChatOpenAI/gpt-4o
LlamaCpp/jwiggerthale_Llama-3.2-3B-Q8_0-GGUF_llama-3.2-3b-q8_0.gguf
VLLM/utter-project/EuroLLM-9B
OpenRouter/deepseek/deepseek-chat-v3.1
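Since a model path may itself contain slashes, the spec can be parsed by splitting on the first slash only. A minimal sketch, assuming this splitting behavior (the helper is hypothetical, not part of JudgeArena's API):

```python
def parse_model_spec(spec: str) -> "tuple[str, str]":
    """Split '{LangChain Backend}/{Model Path}' on the first slash only,
    so model paths may themselves contain slashes (illustrative helper)."""
    backend, model_path = spec.split("/", 1)
    return backend, model_path

print(parse_model_spec("VLLM/utter-project/EuroLLM-9B"))
# Absolute GGUF paths produce a double slash; the first split still works:
print(parse_model_spec("LlamaCpp//home/user/models/model.gguf"))
```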

For instance, to run everything locally with vLLM:

python judgearena/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A VLLM/Qwen/Qwen2.5-0.5B-Instruct \
  --model_B VLLM/Qwen/Qwen2.5-1.5B-Instruct \
  --judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
  --n_instructions 10 

Running locally with LlamaCpp

LlamaCpp allows you to run GGUF models locally with high efficiency across various hardware, including CPUs, Apple Silicon (Metal), and NVIDIA GPUs. This is ideal for testing your setup without relying on external API keys or high-end server GPUs.

Install the LlamaCpp extra:

uv sync --extra llamacpp

Download GGUF models using huggingface-cli (included via huggingface-hub):

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q8_0.gguf --local-dir ./models
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF qwen2.5-1.5b-instruct-q8_0.gguf --local-dir ./models

The LlamaCpp provider expects a file path to a .gguf model after the LlamaCpp/ prefix. For absolute paths, this results in a double slash (e.g., LlamaCpp//home/user/models/model.gguf).

Mixed example – local LlamaCpp model with a remote judge:

uv run python judgearena/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A LlamaCpp/./models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --model_B OpenRouter/qwen/qwen-2.5-7b-instruct \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 10 --max_out_tokens_models 16384

Fully local example – no API keys required (useful for verifying your setup):

uv run python judgearena/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A LlamaCpp/./models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --model_B LlamaCpp/./models/qwen2.5-1.5b-instruct-q8_0.gguf \
  --judge_model LlamaCpp/./models/qwen2.5-1.5b-instruct-q8_0.gguf \
  --n_instructions 5 --max_out_tokens_models 16384

Note: Ensure you have the required LangChain dependencies installed for your chosen provider. If you use a remote endpoint, you must also set the corresponding API credentials.

Chat Templates (vLLM)

When using vLLM, JudgeArena automatically picks the right inference method based on the model:

  • Instruct/chat models (e.g. swiss-ai/Apertus-8B-Instruct-2509): the tokenizer already defines a chat template, so JudgeArena uses vllm.LLM.chat() and the template is applied automatically.
  • Base/pretrained models (e.g. swiss-ai/Apertus-8B-2509): these typically don't ship a chat template. JudgeArena detects this and falls back to vllm.LLM.generate() (plain text, no chat formatting). A warning is printed when this happens.

If you need to force a specific chat template (for example, a base model that you know works with ChatML), pass it via --chat_template:

python judgearena/generate_and_evaluate.py \
  --dataset alpaca-eval \
  --model_A VLLM/swiss-ai/Apertus-8B-2509 \
  --model_B VLLM/swiss-ai/Apertus-8B-Instruct-2509 \
  --judge_model VLLM/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8 \
  --chat_template '{% for message in messages %}<|im_start|>{{ message["role"] }}\n{{ message["content"] }}<|im_end|>\n{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}'

This override applies to all vLLM models in the run. For remote providers (OpenAI, Together, OpenRouter), the flag is ignored since they handle templates server-side.
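To see what the ChatML template above actually produces, here is a plain-Python rendering of the same logic. This is illustrative only; in practice vLLM applies the Jinja template itself:

```python
def render_chatml(messages, add_generation_prompt=True) -> str:
    """Plain-Python equivalent of the ChatML Jinja template shown above."""
    out = ""
    for m in messages:
        # One <|im_start|>role ... <|im_end|> block per message
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn for the model to complete
        out += "<|im_start|>assistant\n"
    return out

print(render_chatml([{"role": "user", "content": "Hello!"}]))
```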

📊 Supported Datasets

| Dataset             | Description                                                                              |
|---------------------|------------------------------------------------------------------------------------------|
| alpaca-eval         | General instruction-following benchmark                                                  |
| arena-hard          | More challenging evaluation suite                                                        |
| m-arena-hard        | Translated version of Arena-Hard in 23 languages                                         |
| m-arena-hard-{lang} | Language-specific variants (e.g., ar, cs, de)                                            |
| m-arena-hard-EU     | All EU languages combined                                                                |
| fluency-{lang}      | Fluency evaluation for pretrained models (finnish, french, german, spanish, swedish)     |

Offline Setup (Slurm/Air-Gapped Environments)

Pre-download all datasets before running jobs:

python -c "from judgearena.utils import download_all; download_all()"  # Download all datasets (optional)

Datasets are stored in:

  • $JUDGEARENA_DATA if set; otherwise $OPENJURY_DATA if set (legacy)
  • ~/judgearena-data/ if neither variable is set
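The lookup order above can be sketched as follows. This is an illustrative restatement of the documented precedence, not JudgeArena's own code:

```python
import os
from pathlib import Path

def data_dir() -> Path:
    """Resolve the dataset cache directory using the documented precedence:
    $JUDGEARENA_DATA, then $OPENJURY_DATA (legacy), then ~/judgearena-data/."""
    if os.environ.get("JUDGEARENA_DATA"):
        return Path(os.environ["JUDGEARENA_DATA"])
    if os.environ.get("OPENJURY_DATA"):  # legacy variable
        return Path(os.environ["OPENJURY_DATA"])
    return Path.home() / "judgearena-data"

print(data_dir())
```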

๐Ÿ› ๏ธ Development

To maintain code quality, we use pre-commit hooks. Run this once to set them up:

uv run pre-commit install

Once installed, hooks will automatically check and format your code on every git commit. If a commit is blocked, simply git add the changes made by the hooks and commit again.

๐Ÿค Contributing

We welcome contributions! Whether it's bug fixes, new features, or additional benchmark support, feel free to open an issue or submit a pull request.

Citation

If you use this work in your research, please cite the following paper.

@inproceedings{salinas2025tuning,
  title={Tuning {LLM} Judge Design Decisions for 1/1000 of the Cost},
  author={David Salinas and Omar Swelam and Frank Hutter},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=cve4NOiyVp}
}

The judge configurations were tuned in that paper, and much of its code is reused in this package.

