judge-bench

Diagnostic probes for LLM-as-judge reliability.
judge-bench runs synthetic diagnostics for LLM-as-judge reliability: position bias, verbosity bias, self-preference, paraphrase stability, anchoring, and calibration. It emits JSON, Markdown, and plot-ready summaries.
Quickstart
```shell
pip install judge-bench
judge-bench run --backend openai --model gpt-4o --probes position_bias --dry-run
judge-bench run --backend local --probes all --pairs 20 --cache-dir .judge-bench-cache --output report.json
```
Provider backends call their public APIs directly with standard environment variables:
OPENAI_API_KEY for --backend openai, ANTHROPIC_API_KEY for --backend anthropic,
and GEMINI_API_KEY for --backend google. Non-dry runs require --confirm-cost.
Repeated judge calls are cached by (backend family, model, prompt, response_a, response_b) under .judge-bench-cache so paid backends do not re-run the same synthetic diagnostic pair. Use --cache-dir to isolate or share caches across runs.
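The cache-key scheme described above can be sketched roughly as follows. This is an illustrative assumption about how such a content-addressed cache might work, not judge-bench's actual implementation; `cache_key`, `cache_path`, and the one-file-per-pair layout are hypothetical:

```python
import hashlib
import json
from pathlib import Path

def cache_key(backend_family: str, model: str, prompt: str,
              response_a: str, response_b: str) -> str:
    """Derive a stable key from the five fields judge-bench says it caches on.
    JSON-encoding the list gives unambiguous field boundaries before hashing."""
    payload = json.dumps(
        [backend_family, model, prompt, response_a, response_b],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cache_path(cache_dir: str, key: str) -> Path:
    # One JSON file per judged pair under the cache directory.
    return Path(cache_dir) / f"{key}.json"
```

Note that swapping `response_a` and `response_b` produces a different key, which is what lets order-swapped probe calls (e.g. for position bias) be cached independently.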
Each run writes JSON, Markdown, and plot artifacts next to the requested output path: <name>.md, <name>.plots.json, <name>.svg, and <name>.png when matplotlib is installed.
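The sibling artifact names can be derived from the output path as sketched below; `artifact_paths` is a hypothetical helper for illustration, not part of the judge-bench API:

```python
from pathlib import Path

def artifact_paths(output: str) -> dict[str, Path]:
    """Sibling artifacts written next to the requested output path,
    following the <name>.md / <name>.plots.json / <name>.svg / <name>.png
    naming scheme described above."""
    base = Path(output).with_suffix("")  # strip the .json extension
    return {
        "markdown": base.with_suffix(".md"),
        "plots": Path(str(base) + ".plots.json"),
        "svg": base.with_suffix(".svg"),
        "png": base.with_suffix(".png"),  # only written when matplotlib is installed
    }
```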
The local backend can run against local model servers without API spend:
```shell
judge-bench run --backend local --model ollama:llama3.1 --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8000/v1 judge-bench run --backend local --model vllm:meta-llama/Llama-3.1-8B-Instruct --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8080 judge-bench run --backend local --model hf:mistral --probes position_bias --output report.json
```
Supported local modes are ollama:&lt;model&gt; (Ollama /api/generate), vllm:&lt;model&gt; (OpenAI-compatible /chat/completions), and hf:&lt;model&gt; or transformers:&lt;model&gt; (Hugging Face text generation). The environment variables JUDGE_BENCH_LOCAL_BACKEND, JUDGE_BENCH_LOCAL_URL, and JUDGE_BENCH_LOCAL_API_KEY override the mode, endpoint, and bearer token, respectively. If no local mode is selected, the local backend falls back to a deterministic lexical heuristic suitable for offline smoke tests.
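To illustrate what a probe such as position_bias measures, here is a minimal order-swap check. This is a sketch of the general technique, not judge-bench's internal API; the `judge` callable and its "first"/"second" return labels are assumptions:

```python
def position_bias_probe(judge, prompt: str, response_a: str, response_b: str) -> dict:
    """Present the same response pair in both orders and flag the judge
    if its preference tracks position rather than content.
    `judge(prompt, first, second)` returns "first" or "second"."""
    forward = judge(prompt, response_a, response_b)
    backward = judge(prompt, response_b, response_a)
    # A position-consistent judge picks the same underlying response both times,
    # so its label must flip when the order is swapped.
    consistent = (
        (forward == "first" and backward == "second")
        or (forward == "second" and backward == "first")
    )
    return {"forward": forward, "backward": backward, "consistent": consistent}
```

Aggregated over many synthetic pairs, the rate of inconsistent verdicts is the kind of signal a position-bias diagnostic reports.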
What This Is Not
This is not a benchmark, leaderboard, or claim of model superiority. All bundled pairs are synthetic and disclosed as such.
File details
Details for the file judge_bench-0.1.1.tar.gz.
File metadata
- Download URL: judge_bench-0.1.1.tar.gz
- Upload date:
- Size: 14.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 058426d850487fbe20c4e95570b3860621b9ebc6b39a0a15aa524a1f5c6c7d11 |
| MD5 | dfd7204ba5b14ae34bfc613f7afef86e |
| BLAKE2b-256 | b669f13f4f39c03c5a468b0ddea1d2f5fec8df57088c9533f43f6c183e5e73fb |
Provenance

The following attestation bundles were made for judge_bench-0.1.1.tar.gz:

Publisher: release-python.yml on auraoneai/judge-bench

- Permalink: auraoneai/judge-bench@55050c213fb9bfb771b6851d1019f3f42af7173f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/auraoneai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-python.yml@55050c213fb9bfb771b6851d1019f3f42af7173f
- Trigger Event: workflow_dispatch

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: judge_bench-0.1.1.tar.gz
- Subject digest: 058426d850487fbe20c4e95570b3860621b9ebc6b39a0a15aa524a1f5c6c7d11
- Sigstore transparency entry: 1519834895
- Sigstore integration time:
File details
Details for the file judge_bench-0.1.1-py3-none-any.whl.
File metadata
- Download URL: judge_bench-0.1.1-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a4bee411071376a789c28671399e56ffd4d61553529dbcd2275b59339d222a9e |
| MD5 | 32206496200c352e582eacb5023568c0 |
| BLAKE2b-256 | 8af1cf870706643864093db00dccae9e8936b160041eabe80bd7f8dd3475c520 |
Provenance

The following attestation bundles were made for judge_bench-0.1.1-py3-none-any.whl:

Publisher: release-python.yml on auraoneai/judge-bench

- Permalink: auraoneai/judge-bench@55050c213fb9bfb771b6851d1019f3f42af7173f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/auraoneai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-python.yml@55050c213fb9bfb771b6851d1019f3f42af7173f
- Trigger Event: workflow_dispatch

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: judge_bench-0.1.1-py3-none-any.whl
- Subject digest: a4bee411071376a789c28671399e56ffd4d61553529dbcd2275b59339d222a9e
- Sigstore transparency entry: 1519834924
- Sigstore integration time: