
Judicator

Judging LLM-as-a-Judge

An LLM-as-a-Judge screening tool for bias and miscalibration.



Install

pip install judicator

Windows note: the report uses Unicode box-drawing characters. If you see UnicodeEncodeError when calling print(report.summary()), set the environment variable PYTHONUTF8=1 (e.g. `set PYTHONUTF8=1` in cmd, or `$env:PYTHONUTF8 = "1"` in PowerShell) before launching Python.


Quickstart

import openai
from judicator import Judge, JudgeAuditor

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = "You are an expert evaluator. Score responses objectively."
eval_template = "Question: {question}\nResponse: {response}\nScore 1-10."

def my_judge_call(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge"
)

# Shows cost estimate and prompts Y/n. Pass confirm=False to skip.
# max_workers=20 runs API calls in parallel — typically 10–15× faster.
report = JudgeAuditor(
    judge=judge,
    domain="qa",
    cost_per_call=0.0003,
    max_workers=20,
).audit()
print(report.summary())
report.save_json("my_audit.json")

Speed (max_workers)

A full audit makes ~1,000 LLM calls. Sequential runs take 20–25 minutes. Set max_workers to run calls in parallel via a thread pool:

JudgeAuditor(judge=judge, domain="qa", max_workers=20).audit()
max_workers   Wall time (~1k calls)                  Speedup
1 (default)   20–25 min                              1× (baseline)
10            2.5 min                                ~9×
20            1.5 min                                13×
50+           diminishing returns; rate-limit risk

Caveats

  • Rate limits. Cost is unchanged but request rate is much higher. Lower max_workers if you see 429 errors — there is no auto-backoff.
  • Thread-safe llm_fn required. Stateless calls are safe (OpenAI/Anthropic/OpenRouter clients are thread-safe). Don't share conversation state across calls.
  • Parallelism is per-test. Within a single bias test, fixture items run concurrently; tests still execute one after another.
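Since there is no built-in backoff, a minimal retry wrapper can guard against transient 429s. This is an illustrative sketch, not part of Judicator's API; it retries on any exception whose message looks like a rate-limit error:

```python
import random
import time

def with_backoff(llm_fn, max_retries=5, base_delay=1.0):
    """Wrap an llm_fn with exponential backoff on rate-limit errors.

    Illustrative sketch (not Judicator's API): sleeps
    base_delay * 2**attempt plus jitter between retries, and re-raises
    anything that doesn't look like a 429 / rate-limit error.
    """
    def wrapped(prompt: str) -> str:
        for attempt in range(max_retries):
            try:
                return llm_fn(prompt)
            except Exception as exc:
                msg = str(exc).lower()
                if "429" not in msg and "rate" not in msg:
                    raise  # not a rate-limit problem; fail fast
                time.sleep(base_delay * (2 ** attempt)
                           + random.random() * base_delay)
        return llm_fn(prompt)  # final attempt; let any error propagate
    return wrapped
```

You would then pass the wrapped function to Judge, e.g. `Judge(llm_fn=with_backoff(my_judge_call), ...)`, and keep max_workers high without babysitting 429s.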

What it tests

Bias              What it catches                                                  Applies to
position          Judge picks slot A/B regardless of content                       pairwise
verbosity         Judge inflates scores for longer responses                       all types
self_consistency  Judge gives different scores to the same input                   pointwise, binary
scale_anchoring   Judge compresses all scores into a narrow band                   pointwise
authority         Judge inflates scores for fake credentials                       all types
concreteness      Judge prefers fabricated specifics over accurate vague answers   pointwise, pairwise
yes_bias          Binary judge over-approves false statements                      binary

Supported judge types

Type        Template shape                                      Detected by
pointwise   {question} + {response} → numeric score             {response} placeholder
pairwise    {question} + {response_a} + {response_b} → A or B   {response_a} and {response_b}
binary      {statement} → Yes or No                             yes/no keyword in template

Judge type is auto-detected from your eval_template. Override with judge_type="pointwise" if detection fails.
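The placeholder-based detection described above can be sketched as follows. This is illustrative only, not the library's actual implementation:

```python
def guess_judge_type(eval_template: str) -> str:
    """Infer judge type from template placeholders (illustrative sketch)."""
    t = eval_template.lower()
    # Check pairwise before pointwise: a pairwise template has no bare
    # {response} placeholder, but order makes the intent explicit.
    if "{response_a}" in t and "{response_b}" in t:
        return "pairwise"
    if "{response}" in t:
        return "pointwise"
    if "yes" in t and "no" in t:
        return "binary"
    raise ValueError("Could not infer judge type; pass judge_type= explicitly.")
```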


Works with any LLM

Judicator never touches your API keys or model configuration. You wrap your LLM call in a function — Judicator calls that function.

Stateless calls required. Each call to llm_fn must be independent with no shared conversation context between calls. Judicator calls it multiple times per fixture item — if your judge accumulates history across calls, bias measurements will be invalid.
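A sketch of the contrast (call_model here is a hypothetical stand-in for your real API call, not a Judicator function):

```python
def call_model(messages):
    # Hypothetical stand-in for a real API call: echoes how much
    # conversation context it was given.
    return f"reply given {len(messages)} message(s) of context"

# Anti-pattern: history accumulates across calls, so each judgment is
# conditioned on previous fixture items and bias scores become meaningless.
history = []

def stateful_fn(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Correct: every call builds its messages from scratch.
def stateless_fn(prompt: str) -> str:
    return call_model([{"role": "user", "content": prompt}])
```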

OpenAI

import openai

client = openai.OpenAI()

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

Anthropic

import anthropic
client = anthropic.Anthropic()

def my_fn(prompt: str) -> str:
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=system_prompt,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

OpenRouter (access 200+ models with one API key)

import os
import openai

client = openai.OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct",
        max_tokens=256,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

Ollama (local)

import ollama

def my_fn(prompt: str) -> str:
    return ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]

Pass any of these as llm_fn to Judge. Judicator works identically with all four.


Understanding the report

╔══════════════════════════════════════════════════════════════╗
║  JUDICATOR — AUDIT REPORT                                    ║
╠══════════════════════════════════════════════════════════════╣
║  Judge:   my_qa_judge   Domain:  qa   Type:  pointwise       ║
╠════════════════════╦═══════╦═══════╦══════════╦══════════════╣
║   BIAS TEST        ║  SCORE║  RANK ║  VERDICT ║  SEVERITY    ║
╠════════════════════╬═══════╬═══════╬══════════╬══════════════╣
║   scale_anchoring  ║  0.312║  1/5  ║  FAIL    ║  CRITICAL    ║
║   verbosity        ║  0.620║  2/5  ║  FAIL    ║  SIGNIFICANT ║
║   concreteness     ║  0.714║  3/5  ║  PASS    ║  MINOR       ║
║   authority        ║  0.810║  4/5  ║  PASS    ║  NONE        ║
║   self_consistency ║  0.950║  5/5  ║  PASS    ║  NONE        ║
╚══════════════════════════════════════════════════════════════╝

Score: 0–1. Higher = more calibrated. No composite score — each test is independent.

Rank: 1 = worst bias. Address rank 1 first.

Severity bands:

  • CRITICAL (< 0.50): strong bias, investigate immediately
  • SIGNIFICANT (0.50–0.65): meaningful bias, likely affects production quality
  • MINOR (0.65–0.80): borderline — PASS if ≥ 0.70, FAIL otherwise
  • NONE (≥ 0.80): no detectable bias

N/A results mean the test does not apply to your judge type or domain, not that the judge passed the test.
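The documented thresholds can be expressed as a small helper. This is a sketch based on the bands listed above, not Judicator's own code, and the library may handle edge cases differently:

```python
def classify(score: float) -> tuple:
    """Map a per-test score in [0, 1] to (severity, verdict) per the bands above."""
    if score >= 0.80:
        severity = "NONE"
    elif score >= 0.65:
        severity = "MINOR"
    elif score >= 0.50:
        severity = "SIGNIFICANT"
    else:
        severity = "CRITICAL"
    # Pass/fail cutoff sits inside the MINOR band: PASS if >= 0.70.
    verdict = "PASS" if score >= 0.70 else "FAIL"
    return severity, verdict
```

Checked against the sample report: 0.312 is CRITICAL/FAIL, 0.620 is SIGNIFICANT/FAIL, 0.714 is MINOR/PASS, 0.810 and 0.950 are NONE/PASS.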


What v0.2 does NOT cover

  • Composite scoring / single overall grade
  • Sycophancy, compassion fade, bandwagon, sentiment, fallacy oversight, and other biases beyond the 7 tested
  • Listwise, reference-based, CoT, or multi-turn judge types
  • Translation, medical, legal, financial, or creative writing domains
  • Position pairs for summarization, safety, or dialogue (qa + code only)
  • Custom bias tests or BYO-data mode
  • Token-aware cost estimation (currently flat-per-call)
  • GitHub Actions integration or SaaS dashboard

Coming in future versions

  • More statistically significant results — expanded fixture sets; concreteness is currently n=14 (coarse signal)
  • Domain coverage expansion — position pairs for summarization, safety, and dialogue
  • User-provided data — BYO-data mode to run bias tests on your own examples
  • Labeling sheet output — export structured sheets for human annotation workflows

Citation

If you use Judicator in your research, please cite:

@software{judicator2026,
  author = {Pandey, Ankur},
  title  = {Judicator: An LLM-as-a-Judge Bias Auditing Library},
  year   = {2026},
  url    = {https://github.com/ankurpand3y/judicator},
  version = {0.2.2}
}

Built on

Judicator ships with fixtures derived from the following datasets. All are used in accordance with their licenses.

Dataset         Paper                            License
OffsetBias      Park et al. 2024                 Apache 2.0
JudgeBench      Tan et al. 2024                  MIT
MT-Bench        Zheng et al. 2023                Apache 2.0
BeaverTails     Ji et al. 2023                   CC-BY-NC-4.0
SummEval        Fabbri et al. 2021               MIT
DSTC11-Track4   Rodriguez-Cantelar et al. 2023   Apache 2.0

See ATTRIBUTION.md for full item counts.
