
Judicator

Judging LLM-as-a-Judge

An LLM-as-a-Judge screening tool for bias and miscalibration.



Install

pip install judicator

Windows note: the report uses Unicode box-drawing characters. If you see UnicodeEncodeError when calling print(report.summary()), launch Python with PYTHONUTF8=1 in the environment (on the Windows shell: set PYTHONUTF8=1).
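
Alternatively, you can reconfigure stdout from inside Python (3.7+) before printing the report; this is standard-library behavior, not a Judicator feature:

import sys

# Force UTF-8 output so the box-drawing characters render on Windows consoles.
sys.stdout.reconfigure(encoding="utf-8")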


Quickstart

import openai
from judicator import Judge, JudgeAuditor

system_prompt = "You are an expert evaluator. Score responses objectively."
eval_template = "Question: {question}\nResponse: {response}\nScore 1-10."

def my_judge_call(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge"
)

# Shows cost estimate and prompts Y/n. Pass confirm=False to skip.
# max_workers=20 runs API calls in parallel — typically 10–15× faster.
report = JudgeAuditor(
    judge=judge,
    domain="qa",
    cost_per_call=0.0003,
    max_workers=20,
).audit()
print(report.summary())
report.save_json("my_audit.json")

Speed (max_workers)

A full audit makes ~1,000 LLM calls. Sequential runs take 20–25 minutes. Set max_workers to run calls in parallel via a thread pool:

JudgeAuditor(judge=judge, domain="qa", max_workers=20).audit()
| max_workers | Wall time (~1k calls) | Speedup |
|-------------|------------------------|---------|
| 1 (default) | 20–25 min | 1× |
| 10 | 2.5 min | 8–10× |
| 20 | 1.5 min | 13× |
| 50+ | diminishing returns; rate-limit risk | — |

Caveats

  • Rate limits. Cost is unchanged but the request rate is much higher. Lower max_workers if you see 429 errors — there is no auto-backoff (see the retry sketch after this list).
  • Thread-safe llm_fn required. Stateless calls are safe (OpenAI/Anthropic/OpenRouter clients are thread-safe). Don't share conversation state across calls.
  • Parallelism is per-test. Within a single bias test, fixture items run concurrently; tests still execute one after another.
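
Since there is no built-in retry, one way to absorb transient 429s is to wrap your llm_fn in an exponential-backoff wrapper before handing it to Judge. A minimal sketch — the retry count, sleep schedule, and broad except are illustrative assumptions; narrow the exception to your client's rate-limit error (e.g. openai.RateLimitError):

import random
import time

def with_backoff(fn, max_retries=5):
    """Retry fn with exponential backoff plus jitter. The wrapper is
    stateless, so it stays safe to call from Judicator's worker threads."""
    def wrapped(prompt: str) -> str:
        for attempt in range(max_retries):
            try:
                return fn(prompt)
            except Exception:  # assumption: narrow to your client's 429 error
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt + random.random())
    return wrapped

judge = Judge(
    llm_fn=with_backoff(my_judge_call),
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge",
)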

What it tests

| Bias | What it catches | Applies to |
|------|-----------------|------------|
| position | Judge picks slot A/B regardless of content | pairwise |
| verbosity | Judge inflates scores for longer responses | all types |
| self_consistency | Judge gives different scores to the same input | pointwise, binary |
| scale_anchoring | Judge compresses all scores into a narrow band | pointwise |
| authority | Judge inflates scores for fake credentials | all types |
| concreteness | Judge prefers fabricated specifics over accurate vague answers | pointwise, pairwise |
| yes_bias | Binary judge over-approves false statements | binary |

Supported judge types

| Type | Template shape | Detected by |
|------|----------------|-------------|
| pointwise | {question} + {response} → numeric score | {response} placeholder |
| pairwise | {question} + {response_a} + {response_b} → A or B | {response_a} and {response_b} |
| binary | {statement} → Yes or No | yes/no keyword in template |

Judge type is auto-detected from your eval_template. Override with judge_type="pointwise" if detection fails.
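
For instance, if auto-detection misses a pairwise template, the override could be passed alongside it — this assumes judge_type is a keyword argument on Judge, which the override string above implies but does not show:

judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template="Q: {question}\nA: {response_a}\nB: {response_b}\nAnswer A or B.",
    judge_name="pairwise_judge",
    judge_type="pairwise",  # assumed kwarg: skips auto-detection
)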


Works with any LLM

Judicator never touches your API keys or model configuration. You wrap your LLM call in a function — Judicator calls that function.

Stateless calls required. Each call to llm_fn must be independent with no shared conversation context between calls. Judicator calls it multiple times per fixture item — if your judge accumulates history across calls, bias measurements will be invalid.
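
As a hypothetical anti-pattern, a wrapper that accumulates a shared message list leaks context between fixture items (and the shared list is not thread-safe under max_workers > 1):

history = []  # shared mutable state across calls — this is the bug

def stateful_judge(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    # Each call now sees every previous fixture item, so repeated scores
    # are no longer independent and the bias measurements are invalid.
    return openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=history,
    ).choices[0].message.content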

OpenAI

import openai

def my_fn(prompt: str) -> str:
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

Anthropic

import anthropic
client = anthropic.Anthropic()

def my_fn(prompt: str) -> str:
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=system_prompt,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

OpenRouter (access 200+ models with one API key)

import os

import openai

client = openai.OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct",
        max_tokens=256,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content

Ollama (local)

import ollama

def my_fn(prompt: str) -> str:
    return ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]

Pass any of these as llm_fn to Judge. Judicator works identically with all four.
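
Because llm_fn is just a str → str callable, a deterministic stub also works for checking your wiring before spending API calls — a hypothetical example, not a Judicator feature:

def stub_judge(prompt: str) -> str:
    return "7"  # constant score: exercises the plumbing, says nothing about bias

smoke = Judge(
    llm_fn=stub_judge,
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="smoke_test",
)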


Understanding the report

╔══════════════════════════════════════════════════════════════╗
║  JUDICATOR — AUDIT REPORT                                    ║
╠══════════════════════════════════════════════════════════════╣
║  Judge:   my_qa_judge   Domain:  qa   Type:  pointwise       ║
╠════════════════════╦═══════╦═══════╦══════════╦══════════════╣
║   BIAS TEST        ║  SCORE║  RANK ║  VERDICT ║  SEVERITY    ║
╠════════════════════╬═══════╬═══════╬══════════╬══════════════╣
║   scale_anchoring  ║  0.312║  1/5  ║  FAIL    ║  CRITICAL    ║
║   verbosity        ║  0.620║  2/5  ║  FAIL    ║  SIGNIFICANT ║
║   concreteness     ║  0.714║  3/5  ║  PASS    ║  MINOR       ║
║   authority        ║  0.810║  4/5  ║  PASS    ║  NONE        ║
║   self_consistency ║  0.950║  5/5  ║  PASS    ║  NONE        ║
╚══════════════════════════════════════════════════════════════╝

Score: 0–1. Higher = more calibrated. No composite score — each test is independent.

Rank: 1 = worst bias. Address rank 1 first.

Severity bands:

  • CRITICAL (< 0.50): strong bias, investigate immediately
  • SIGNIFICANT (0.50–0.65): meaningful bias, likely affects production quality
  • MINOR (0.65–0.80): borderline — PASS if ≥ 0.70, FAIL otherwise
  • NONE (≥ 0.80): no detectable bias
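
The bands are a pure function of the score. Restated directly from the thresholds above (this mirrors the documented bands, not Judicator's internal code):

def severity(score: float) -> tuple[str, str]:
    """Map a 0–1 bias-test score to (severity, verdict)."""
    if score < 0.50:
        return "CRITICAL", "FAIL"
    if score < 0.65:
        return "SIGNIFICANT", "FAIL"
    if score < 0.80:
        # borderline band: PASS from 0.70 upward, FAIL below
        return "MINOR", "PASS" if score >= 0.70 else "FAIL"
    return "NONE", "PASS"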

N/A results mean the test does not apply to your judge type or domain, not that the judge passed the test.


What v0.2 does NOT cover

  • Composite scoring / single overall grade
  • Sycophancy, compassion fade, bandwagon, sentiment, fallacy oversight, and other biases beyond the 7 tested
  • Listwise, reference-based, CoT, or multi-turn judge types
  • Translation, medical, legal, financial, or creative writing domains
  • Position pairs for summarization, safety, or dialogue (qa + code only)
  • Custom bias tests or BYO-data mode
  • Token-aware cost estimation (currently flat-per-call)
  • GitHub Actions integration or SaaS dashboard

Coming in future versions

  • Stronger statistical power — expanded fixture sets; concreteness is currently n=14 (a coarse signal)
  • Domain coverage expansion — position pairs for summarization, safety, and dialogue
  • User-provided data — BYO-data mode to run bias tests on your own examples
  • Labeling sheet output — export structured sheets for human annotation workflows

Citation

If you use Judicator in your research, please cite:

@software{judicator2026,
  author = {Pandey, Ankur},
  title  = {Judicator: An LLM-as-a-Judge Bias Auditing Library},
  year   = {2026},
  url    = {https://github.com/ankurpand3y/judicator},
  version = {0.2.3}
}

Built on

Judicator ships with fixtures derived from the following datasets. All are used in accordance with their licenses.

| Dataset | Paper | License |
|---------|-------|---------|
| OffsetBias | Park et al. 2024 | Apache 2.0 |
| JudgeBench | Tan et al. 2024 | MIT |
| MT-Bench | Zheng et al. 2023 | Apache 2.0 |
| BeaverTails | Ji et al. 2023 | CC-BY-NC-4.0 |
| SummEval | Fabbri et al. 2021 | MIT |
| DSTC11-Track4 | Rodriguez-Cantelar et al. 2023 | Apache 2.0 |

See ATTRIBUTION.md for full item counts.
