# Judging LLM-as-a-Judge

An LLM-as-a-Judge screening tool for bias and miscalibration.
## Install

```
pip install judicator
```
**Windows note:** the report uses Unicode box-drawing characters. If you see `UnicodeEncodeError` when calling `print(report.summary())`, run with `PYTHONUTF8=1` or `set PYTHONUTF8=1` (Windows shell) before launching Python.
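If setting the environment variable is awkward (e.g. inside a notebook), an equivalent in-process fix is to force UTF-8 on stdout before printing. This is a standard-library alternative, not part of Judicator:

```python
import sys

# Standard-library alternative to PYTHONUTF8=1: reconfigure stdout to UTF-8
# (available on Python 3.7+) before printing the report.
sys.stdout.reconfigure(encoding="utf-8")
print(report.summary())
```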
## Quickstart

```python
import openai
from judicator import Judge, JudgeAuditor

system_prompt = "You are an expert evaluator. Score responses objectively."
eval_template = "Question: {question}\nResponse: {response}\nScore 1-10."

def my_judge_call(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content

judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge",
)

# Shows cost estimate and prompts Y/n. Pass confirm=False to skip.
# max_workers=20 runs API calls in parallel — typically 10–15× faster.
report = JudgeAuditor(
    judge=judge,
    domain="qa",
    cost_per_call=0.0003,
    max_workers=20,
).audit()

print(report.summary())
report.save_json("my_audit.json")
```
## Speed (`max_workers`)

A full audit makes ~1,000 LLM calls. Sequential runs take 20–25 minutes.
Set `max_workers` to run calls in parallel via a thread pool:

```python
JudgeAuditor(judge=judge, domain="qa", max_workers=20).audit()
```
| `max_workers` | Wall time (~1k calls) | Speedup |
|---|---|---|
| 1 (default) | 20–25 min | 1× |
| 10 | 2.5 min | 8× |
| 20 | 1.5 min | 13× |
| 50+ | diminishing returns; rate-limit risk | — |
### Caveats

- **Rate limits.** Cost is unchanged but request rate is much higher. Lower `max_workers` if you see 429 errors — there is no auto-backoff (see the retry sketch below).
- **Thread-safe `llm_fn` required.** Stateless calls are safe (the OpenAI/Anthropic/OpenRouter clients are thread-safe). Don't share conversation state across calls.
- **Parallelism is per-test.** Within a single bias test, fixture items run concurrently; tests still execute one after another.
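Since there is no auto-backoff, one option is to wrap your `llm_fn` in a small retry helper before handing it to `Judge`. A minimal sketch using only the standard library; the bare `except Exception` is a placeholder and should be narrowed to whatever your client raises on HTTP 429 (for OpenAI, `openai.RateLimitError`):

```python
import time

def with_backoff(llm_fn, max_retries=5, base_delay=2.0):
    """Retry llm_fn with exponential backoff on transient errors.

    Illustrative sketch, not part of Judicator. Narrow the except
    clause to your client's rate-limit exception in real use.
    """
    def wrapped(prompt: str) -> str:
        for attempt in range(max_retries):
            try:
                return llm_fn(prompt)
            except Exception:  # e.g. openai.RateLimitError
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    return wrapped

# Drop-in: audit the wrapped function instead of the raw one.
judge = Judge(
    llm_fn=with_backoff(my_judge_call),
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge",
)
```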
## What it tests
| Bias | What it catches | Applies to |
|---|---|---|
| position | Judge picks slot A/B regardless of content | pairwise |
| verbosity | Judge inflates scores for longer responses | all types |
| self_consistency | Judge gives different scores to the same input | pointwise, binary |
| scale_anchoring | Judge compresses all scores into a narrow band | pointwise |
| authority | Judge inflates scores for fake credentials | all types |
| concreteness | Judge prefers fabricated specifics over accurate vague answers | pointwise, pairwise |
| yes_bias | Binary judge over-approves false statements | binary |
## Supported judge types

| Type | Template shape | Detected by |
|---|---|---|
| pointwise | `{question}` + `{response}` → numeric score | `{response}` placeholder |
| pairwise | `{question}` + `{response_a}` + `{response_b}` → A or B | `{response_a}` and `{response_b}` |
| binary | `{statement}` → Yes or No | yes/no keyword in template |
Judge type is auto-detected from your `eval_template`. Override with `judge_type="pointwise"` if detection fails.
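For illustration, here is one template per detected type; the placeholders match the table above. Passing `judge_type` to the `Judge` constructor is an assumption about where the override lives, based on the note above:

```python
# One eval_template per judge type; detection keys off the placeholders.
pointwise_template = "Question: {question}\nResponse: {response}\nScore 1-10."
pairwise_template = (
    "Question: {question}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Which response is better, A or B?"
)
binary_template = "Statement: {statement}\nIs the statement accurate? Answer Yes or No."

# Explicit override for when auto-detection guesses wrong:
judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template=pairwise_template,
    judge_name="my_pairwise_judge",
    judge_type="pairwise",
)
```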
## Works with any LLM

Judicator never touches your API keys or model configuration. You wrap your LLM call in a function — Judicator calls that function.

**Stateless calls required.** Each call to `llm_fn` must be independent, with no shared conversation context between calls. Judicator calls it multiple times per fixture item — if your judge accumulates history across calls, bias measurements will be invalid.
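To make the requirement concrete, a contrast sketch (the OpenAI client is used purely as an example):

```python
import openai

client = openai.OpenAI()

# BAD: shared history makes each call depend on all previous ones,
# which invalidates per-item bias measurements.
history = []
def stateful_fn(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# GOOD: every call builds its messages from scratch.
def stateless_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```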
### OpenAI

```python
import openai

def my_fn(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
```
### Anthropic

```python
import anthropic

client = anthropic.Anthropic()

def my_fn(prompt: str) -> str:
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=system_prompt,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
```
### OpenRouter (access 200+ models with one API key)

```python
import os

import openai

client = openai.OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct",
        max_tokens=256,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
```
### Ollama (local)

```python
import ollama

def my_fn(prompt: str) -> str:
    return ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}],
    )["message"]["content"]
```
Pass any of these as `llm_fn` to `Judge`. Judicator works identically with all four.
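Because the contract is just `str -> str`, any backend fits. A sketch for a self-hosted HTTP endpoint; the URL, payload shape, and response field are placeholder assumptions, not a real API:

```python
import requests

def my_fn(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/generate",            # placeholder URL
        json={"prompt": prompt, "max_tokens": 256},  # payload shape is assumed
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]                       # response field is assumed
```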
## Understanding the report

```
╔══════════════════════════════════════════════════════════════╗
║                   JUDICATOR — AUDIT REPORT                   ║
╠══════════════════════════════════════════════════════════════╣
║ Judge: my_qa_judge        Domain: qa        Type: pointwise  ║
╠════════════════════╦═══════╦═══════╦══════════╦══════════════╣
║ BIAS TEST          ║ SCORE ║ RANK  ║ VERDICT  ║ SEVERITY     ║
╠════════════════════╬═══════╬═══════╬══════════╬══════════════╣
║ scale_anchoring    ║ 0.312 ║  1/5  ║   FAIL   ║   CRITICAL   ║
║ verbosity          ║ 0.620 ║  2/5  ║   FAIL   ║ SIGNIFICANT  ║
║ concreteness       ║ 0.714 ║  3/5  ║   PASS   ║    MINOR     ║
║ authority          ║ 0.810 ║  4/5  ║   PASS   ║     NONE     ║
║ self_consistency   ║ 0.950 ║  5/5  ║   PASS   ║     NONE     ║
╚══════════════════════════════════════════════════════════════╝
```
**Score:** 0–1. Higher = more calibrated. No composite score — each test is independent.

**Rank:** 1 = worst bias. Address rank 1 first.
**Severity bands:**

- `CRITICAL` (< 0.50): strong bias, investigate immediately
- `SIGNIFICANT` (0.50–0.65): meaningful bias, likely affects production quality
- `MINOR` (0.65–0.80): borderline — PASS if ≥ 0.70, FAIL otherwise
- `NONE` (≥ 0.80): no detectable bias
N/A results mean the test does not apply to your judge type or domain, not that the judge passed the test.
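The bands above reduce to simple thresholds. This sketch mirrors the documented cutoffs; it re-states the README, not Judicator's internals:

```python
def band_and_verdict(score: float) -> tuple[str, str]:
    """Map a 0-1 test score to the documented severity band and verdict.

    Illustrative only: thresholds are copied from the README bands.
    """
    if score >= 0.80:
        band = "NONE"
    elif score >= 0.65:
        band = "MINOR"
    elif score >= 0.50:
        band = "SIGNIFICANT"
    else:
        band = "CRITICAL"
    verdict = "PASS" if score >= 0.70 else "FAIL"
    return band, verdict

assert band_and_verdict(0.312) == ("CRITICAL", "FAIL")  # scale_anchoring above
assert band_and_verdict(0.714) == ("MINOR", "PASS")     # concreteness above
```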
## What v0.2 does NOT cover
- Composite scoring / single overall grade
- Sycophancy, compassion fade, bandwagon, sentiment, fallacy oversight, and other biases beyond the 7 tested
- Listwise, reference-based, CoT, or multi-turn judge types
- Translation, medical, legal, financial, or creative writing domains
- Position pairs for summarization, safety, or dialogue (qa + code only)
- Custom bias tests or BYO-data mode
- Token-aware cost estimation (currently flat-per-call)
- GitHub Actions integration or SaaS dashboard
## Coming in future versions
- More statistically significant results — expanded fixture sets; concreteness is currently n=14 (coarse signal)
- Domain coverage expansion — position pairs for summarization, safety, and dialogue
- User-provided data — BYO-data mode to run bias tests on your own examples
- Labeling sheet output — export structured sheets for human annotation workflows
## Citation

If you use Judicator in your research, please cite:

```bibtex
@software{judicator2026,
  author  = {Pandey, Ankur},
  title   = {Judicator: An LLM-as-a-Judge Bias Auditing Library},
  year    = {2026},
  url     = {https://github.com/ankurpand3y/judicator},
  version = {0.2.2}
}
```
## Built on
Judicator ships with fixtures derived from the following datasets. All are used in accordance with their licenses.
| Dataset | Paper | License |
|---|---|---|
| OffsetBias | Park et al. 2024 | Apache 2.0 |
| JudgeBench | Tan et al. 2024 | MIT |
| MT-Bench | Zheng et al. 2023 | Apache 2.0 |
| BeaverTails | Ji et al. 2023 | CC-BY-NC-4.0 |
| SummEval | Fabbri et al. 2021 | MIT |
| DSTC11-Track4 | Rodriguez-Cantelar et al. 2023 | Apache 2.0 |
See ATTRIBUTION.md for full item counts.