
Diagnostic probes for LLM-as-judge reliability.


judge-bench

judge-bench runs synthetic diagnostics for LLM-as-judge reliability: position bias, verbosity bias, self-preference, paraphrase stability, anchoring, and calibration. It emits JSON, Markdown, and plot-ready summaries.
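The position-bias probe idea can be sketched as: show the judge the same pair in both orders and flag verdicts that track the slot rather than the content. A minimal illustration, independent of judge-bench's internal API (the `judge` callable here is a stand-in, not the library's interface):

```python
def position_bias_probe(judge, pairs):
    """Run each (a, b) pair in both orders; return the fraction of
    verdicts that stick to a slot. `judge` returns "A" or "B" for
    the preferred slot."""
    flips = 0
    for a, b in pairs:
        first = judge(a, b)   # verdict with a in slot A
        second = judge(b, a)  # verdict with a in slot B
        # An unbiased judge prefers the same *response* both times,
        # so the slot letter should flip between the two runs.
        if first == second:
            flips += 1        # verdict stuck to a slot: position bias
    return flips / len(pairs)

# A degenerate judge that always prefers slot A shows maximal bias:
always_a = lambda a, b: "A"
rate = position_bias_probe(always_a, [("x", "y"), ("p", "q")])  # rate == 1.0
```

The other probes follow the same shape: perturb one nuisance variable (length, order, phrasing) while holding content fixed, and measure how much the verdict moves.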

Quickstart

pip install judge-bench
judge-bench run --backend openai --model gpt-4o --probes position_bias --dry-run
judge-bench run --backend local --probes all --pairs 20 --cache-dir .judge-bench-cache --output report.json

Provider backends call their public APIs directly with standard environment variables: OPENAI_API_KEY for --backend openai, ANTHROPIC_API_KEY for --backend anthropic, and GEMINI_API_KEY for --backend google. Non-dry runs require --confirm-cost.
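The environment-variable convention above can be checked before spending on a run. A hypothetical pre-flight helper (the variable names come from the text; the function itself is illustrative, not part of judge-bench):

```python
import os

# Backend -> required credential variable, per the convention above.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
}

def check_backend_key(backend, env=None):
    """Return the credential for a paid backend, or raise early with a
    clear message instead of failing mid-run. Local backends need none."""
    env = os.environ if env is None else env
    var = REQUIRED_KEYS.get(backend)
    if var is None:
        return None  # e.g. --backend local: no credential required
    value = env.get(var)
    if not value:
        raise RuntimeError(f"{var} is required for --backend {backend}")
    return value
```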

Repeated judge calls are cached by (backend family, model, prompt, response_a, response_b) under .judge-bench-cache so paid backends do not re-run the same synthetic diagnostic pair. Use --cache-dir to isolate or share caches across runs. Each run writes JSON, Markdown, and plot artifacts next to the requested output path: <name>.md, <name>.plots.json, <name>.svg, and <name>.png when matplotlib is installed.
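The caching behavior can be illustrated with a content-addressed key over the same tuple. This is a sketch of the idea only; the hash choice and on-disk layout are assumptions, not judge-bench's actual cache format:

```python
import hashlib
import json
import pathlib

def cache_key(backend_family, model, prompt, response_a, response_b):
    """Derive a stable filename from the judged tuple so identical
    diagnostic pairs hit the cache instead of a paid API."""
    payload = json.dumps(
        [backend_family, model, prompt, response_a, response_b],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_judge(cache_dir, key, call):
    """Return the cached verdict if present; otherwise call and store."""
    path = pathlib.Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    verdict = call()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(verdict))
    return verdict
```

Because the key covers the backend family and model, switching either forces a fresh judgment, while re-running the same pair against the same judge is free.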

The local backend can run against local model servers without API spend:

judge-bench run --backend local --model ollama:llama3.1 --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8000/v1 judge-bench run --backend local --model vllm:meta-llama/Llama-3.1-8B-Instruct --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8080 judge-bench run --backend local --model hf:mistral --probes position_bias --output report.json

Supported local modes are ollama:<model> for Ollama /api/generate, vllm:<model> for OpenAI-compatible /chat/completions, and hf:<model> or transformers:<model> for Hugging Face text generation. JUDGE_BENCH_LOCAL_BACKEND, JUDGE_BENCH_LOCAL_URL, and JUDGE_BENCH_LOCAL_API_KEY can override mode, endpoint, and bearer token. If no local mode is selected, local-judge uses a deterministic lexical heuristic for offline smoke tests.
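The mode-selection rules above can be sketched as a small resolver. The parsing and the default endpoints here are assumptions for illustration (Ollama's conventional port is 11434; 8000 is a common vLLM default), not judge-bench's internals:

```python
# Assumed default endpoints per mode; env vars override them.
DEFAULT_URLS = {
    "ollama": "http://localhost:11434/api/generate",
    "vllm": "http://localhost:8000/v1/chat/completions",
}

def resolve_local_mode(model_spec, env=None):
    """Split a '<mode>:<model>' spec into (mode, model, url).
    JUDGE_BENCH_LOCAL_BACKEND and JUDGE_BENCH_LOCAL_URL override the
    parsed mode and endpoint; with no mode at all, fall back to the
    deterministic lexical heuristic for offline smoke tests."""
    env = env or {}
    mode, _, model = model_spec.partition(":")
    if not model:
        mode, model = "", model_spec  # bare model name, no mode prefix
    mode = env.get("JUDGE_BENCH_LOCAL_BACKEND", mode)
    if not mode:
        return ("heuristic", model, "")
    url = env.get("JUDGE_BENCH_LOCAL_URL", DEFAULT_URLS.get(mode, ""))
    return (mode, model, url)
```

For example, `resolve_local_mode("ollama:llama3.1")` selects the Ollama generate endpoint, while a bare model name with no prefix routes to the offline heuristic.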

What This Is Not

This is not a benchmark, leaderboard, or claim of model superiority. All bundled pairs are synthetic and disclosed as such.

Download files

Download the file for your platform.

Source Distribution

judge_bench-0.1.1.tar.gz (14.5 kB)


Built Distribution


judge_bench-0.1.1-py3-none-any.whl (18.2 kB)


File details

Details for the file judge_bench-0.1.1.tar.gz.

File metadata

  • Download URL: judge_bench-0.1.1.tar.gz
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for judge_bench-0.1.1.tar.gz
Algorithm Hash digest
SHA256 058426d850487fbe20c4e95570b3860621b9ebc6b39a0a15aa524a1f5c6c7d11
MD5 dfd7204ba5b14ae34bfc613f7afef86e
BLAKE2b-256 b669f13f4f39c03c5a468b0ddea1d2f5fec8df57088c9533f43f6c183e5e73fb


Provenance

The following attestation bundles were made for judge_bench-0.1.1.tar.gz:

Publisher: release-python.yml on auraoneai/judge-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file judge_bench-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: judge_bench-0.1.1-py3-none-any.whl
  • Size: 18.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for judge_bench-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a4bee411071376a789c28671399e56ffd4d61553529dbcd2275b59339d222a9e
MD5 32206496200c352e582eacb5023568c0
BLAKE2b-256 8af1cf870706643864093db00dccae9e8936b160041eabe80bd7f8dd3475c520


Provenance

The following attestation bundles were made for judge_bench-0.1.1-py3-none-any.whl:

Publisher: release-python.yml on auraoneai/judge-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
