
Judicator

Judging LLM-as-a-Judge

An LLM-as-a-Judge screening tool for bias and miscalibration.



Install

pip install judicator

Windows note: the report uses Unicode box-drawing characters. If you see UnicodeEncodeError when calling print(report.summary()), set the environment variable PYTHONUTF8=1 (e.g. `set PYTHONUTF8=1` in cmd, or `$env:PYTHONUTF8 = "1"` in PowerShell) before launching Python.


Quickstart

import openai
from judicator import Judge, JudgeAuditor

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = "You are an expert evaluator. Score responses objectively."
eval_template = "Question: {question}\nResponse: {response}\nScore 1-10."

def my_judge_call(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge"
)

# Shows cost estimate and prompts Y/n. Pass confirm=False to skip.
# max_workers=20 runs API calls in parallel — typically 10–15× faster.
report = JudgeAuditor(
    judge=judge,
    domain="qa",
    cost_per_call=0.0003,
    max_workers=20,
).audit()
print(report.summary())
report.save_json("my_audit.json")

Speed (max_workers)

A full audit makes ~1,000 LLM calls. Sequential runs take 20–25 minutes. Set max_workers to run calls in parallel via a thread pool:

JudgeAuditor(judge=judge, domain="qa", max_workers=20).audit()
max_workers   Wall time (~1k calls)                  Speedup
1 (default)   20–25 min                              1× (baseline)
10            2.5 min                                ~9×
20            1.5 min                                13×
50+           diminishing returns; rate-limit risk

Caveats

  • Rate limits. Cost is unchanged but request rate is much higher. Lower max_workers if you see 429 errors — there is no auto-backoff.
  • Thread-safe llm_fn required. Stateless calls are safe (OpenAI/Anthropic/OpenRouter clients are thread-safe). Don't share conversation state across calls.
  • Parallelism is per-test. Within a single bias test, fixture items run concurrently; tests still execute one after another.
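Since there is no built-in backoff, a minimal retry wrapper can guard against transient 429s. This is an illustrative sketch, not part of Judicator's API; it retries on any exception whose message looks like a rate-limit error:

```python
import random
import time

def with_backoff(llm_fn, max_retries=5, base_delay=1.0):
    """Wrap an llm_fn with exponential backoff on rate-limit errors.

    Illustrative sketch (not Judicator's API): sleeps
    base_delay * 2**attempt plus jitter between retries, and re-raises
    anything that doesn't look like a 429 / rate-limit error.
    """
    def wrapped(prompt: str) -> str:
        for attempt in range(max_retries):
            try:
                return llm_fn(prompt)
            except Exception as exc:
                msg = str(exc).lower()
                if "429" not in msg and "rate" not in msg:
                    raise  # not a rate-limit problem; fail fast
                time.sleep(base_delay * (2 ** attempt)
                           + random.random() * base_delay)
        return llm_fn(prompt)  # final attempt; let any error propagate
    return wrapped
```

You would then pass the wrapped function to Judge, e.g. `Judge(llm_fn=with_backoff(my_judge_call), ...)`, and keep max_workers high without babysitting 429s.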

What it tests

Bias              What it catches                                                  Applies to
position          Judge picks slot A/B regardless of content                       pairwise
verbosity         Judge inflates scores for longer responses                       all types
self_consistency  Judge gives different scores to the same input                   pointwise, binary
scale_anchoring   Judge compresses all scores into a narrow band                   pointwise
authority         Judge inflates scores for fake credentials                       all types
concreteness      Judge prefers fabricated specifics over accurate vague answers   pointwise, pairwise
yes_bias          Binary judge over-approves false statements                      binary

Supported judge types

Type        Template shape                                      Detected by
pointwise   {question} + {response} → numeric score             {response} placeholder
pairwise    {question} + {response_a} + {response_b} → A or B   {response_a} and {response_b}
binary      {statement} → Yes or No                             yes/no keyword in template

Judge type is auto-detected from your eval_template. Override with judge_type="pointwise" if detection fails.
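The placeholder-based detection described above can be sketched as follows. This is illustrative only, not the library's actual implementation:

```python
def guess_judge_type(eval_template: str) -> str:
    """Infer judge type from template placeholders (illustrative sketch)."""
    t = eval_template.lower()
    # Check pairwise before pointwise: a pairwise template has no bare
    # {response} placeholder, but order makes the intent explicit.
    if "{response_a}" in t and "{response_b}" in t:
        return "pairwise"
    if "{response}" in t:
        return "pointwise"
    if "yes" in t and "no" in t:
        return "binary"
    raise ValueError("Could not infer judge type; pass judge_type= explicitly.")
```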


Works with any LLM

Judicator never touches your API keys or model configuration. You wrap your LLM call in a function — Judicator calls that function.

Stateless calls required. Each call to llm_fn must be independent with no shared conversation context between calls. Judicator calls it multiple times per fixture item — if your judge accumulates history across calls, bias measurements will be invalid.
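A sketch of the contrast (call_model here is a hypothetical stand-in for your real API call, not a Judicator function):

```python
def call_model(messages):
    # Hypothetical stand-in for a real API call: echoes how much
    # conversation context it was given.
    return f"reply given {len(messages)} message(s) of context"

# Anti-pattern: history accumulates across calls, so each judgment is
# conditioned on previous fixture items and bias scores become meaningless.
history = []

def stateful_fn(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Correct: every call builds its messages from scratch.
def stateless_fn(prompt: str) -> str:
    return call_model([{"role": "user", "content": prompt}])
```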

OpenAI

import openai

client = openai.OpenAI()

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

Anthropic

import anthropic
client = anthropic.Anthropic()

def my_fn(prompt: str) -> str:
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=system_prompt,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

OpenRouter (access 200+ models with one API key)

import os
import openai

client = openai.OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct",
        max_tokens=256,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

Ollama (local)

import ollama

def my_fn(prompt: str) -> str:
    return ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]

Pass any of these as llm_fn to Judge. Judicator works identically with all four.


Understanding the report

╔══════════════════════════════════════════════════════════════╗
║  JUDICATOR — AUDIT REPORT                                    ║
╠══════════════════════════════════════════════════════════════╣
║  Judge:   my_qa_judge   Domain:  qa   Type:  pointwise       ║
╠════════════════════╦═══════╦═══════╦══════════╦══════════════╣
║   BIAS TEST        ║  SCORE║  RANK ║  VERDICT ║  SEVERITY    ║
╠════════════════════╬═══════╬═══════╬══════════╬══════════════╣
║   scale_anchoring  ║  0.312║  1/5  ║  FAIL    ║  CRITICAL    ║
║   verbosity        ║  0.620║  2/5  ║  FAIL    ║  SIGNIFICANT ║
║   concreteness     ║  0.714║  3/5  ║  PASS    ║  MINOR       ║
║   authority        ║  0.810║  4/5  ║  PASS    ║  NONE        ║
║   self_consistency ║  0.950║  5/5  ║  PASS    ║  NONE        ║
╚══════════════════════════════════════════════════════════════╝

Score: 0–1. Higher = more calibrated. No composite score — each test is independent.

Rank: 1 = worst bias. Address rank 1 first.

Severity bands:

  • CRITICAL (< 0.50): strong bias, investigate immediately
  • SIGNIFICANT (0.50–0.65): meaningful bias, likely affects production quality
  • MINOR (0.65–0.80): borderline — PASS if ≥ 0.70, FAIL otherwise
  • NONE (≥ 0.80): no detectable bias

N/A results mean the test does not apply to your judge type or domain, not that the judge passed the test.
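The documented thresholds can be expressed as a small helper. This is a sketch based on the bands listed above, not Judicator's own code, and the library may handle edge cases differently:

```python
def classify(score: float) -> tuple:
    """Map a per-test score in [0, 1] to (severity, verdict) per the bands above."""
    if score >= 0.80:
        severity = "NONE"
    elif score >= 0.65:
        severity = "MINOR"
    elif score >= 0.50:
        severity = "SIGNIFICANT"
    else:
        severity = "CRITICAL"
    # Pass/fail cutoff sits inside the MINOR band: PASS if >= 0.70.
    verdict = "PASS" if score >= 0.70 else "FAIL"
    return severity, verdict
```

Checked against the sample report: 0.312 is CRITICAL/FAIL, 0.620 is SIGNIFICANT/FAIL, 0.714 is MINOR/PASS, 0.810 and 0.950 are NONE/PASS.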


What v0.2 does NOT cover

  • Composite scoring / single overall grade
  • Sycophancy, compassion fade, bandwagon, sentiment, fallacy oversight, and other biases beyond the 7 tested
  • Listwise, reference-based, CoT, or multi-turn judge types
  • Translation, medical, legal, financial, or creative writing domains
  • Position pairs for summarization, safety, or dialogue (qa + code only)
  • Custom bias tests or BYO-data mode
  • Token-aware cost estimation (currently flat-per-call)
  • GitHub Actions integration or SaaS dashboard

Coming in future versions

  • More statistically significant results — expanded fixture sets; concreteness is currently n=14 (coarse signal)
  • Domain coverage expansion — position pairs for summarization, safety, and dialogue
  • User-provided data — BYO-data mode to run bias tests on your own examples
  • Labeling sheet output — export structured sheets for human annotation workflows

Citation

If you use Judicator in your research, please cite:

@software{judicator2026,
  author = {Pandey, Ankur},
  title  = {Judicator: An LLM-as-a-Judge Bias Auditing Library},
  year   = {2026},
  url    = {https://github.com/ankurpand3y/judicator},
  version = {0.2.2}
}

Built on

Judicator ships with fixtures derived from the following datasets. All are used in accordance with their licenses.

Dataset         Paper                            License
OffsetBias      Park et al. 2024                 Apache 2.0
JudgeBench      Tan et al. 2024                  MIT
MT-Bench        Zheng et al. 2023                Apache 2.0
BeaverTails     Ji et al. 2023                   CC-BY-NC-4.0
SummEval        Fabbri et al. 2021               MIT
DSTC11-Track4   Rodriguez-Cantelar et al. 2023   Apache 2.0

See ATTRIBUTION.md for full item counts.
