Audit the reliability of an LLM-as-judge eval pipeline: agreement, bias, drift, calibration.

judge-audit

Your eval is a measurement instrument. Audit it like one.

LLM-as-judge has quietly become the backbone of model evaluation — but the judge itself is rarely audited. judge-audit treats the judge as an instrument and asks the questions you'd ask of any sensor before trusting its readout:

  1. Reliability — does it give the same answer twice? (self-consistency, judge↔judge)
  2. Validity — does it agree with humans? (Spearman, Krippendorff's α)
  3. Systematic bias — position bias, verbosity bias, measured by controlled perturbation, not correlation.
  4. Drift — does the scale move over time as the provider silently ships new model versions? (anchor-set EWMA / CUSUM control charts)
  5. Calibration — map raw judge scores onto the human scale (isotonic), with honest bootstrap error bars on every leaderboard gap.
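
As a rough illustration of axis 1, self-consistency can be summarised as the pooled within-item standard deviation over repeated scorings of the same items. The sketch below is plain Python under that assumption; `pooled_within_item_sd` is a hypothetical helper and the scores are made up. It is not judge-audit's `self_consistency`, whose exact statistic is not stated in this README.

```python
import math

def pooled_within_item_sd(repeats):
    """Pooled within-item SD.

    repeats: list of per-item lists, each holding R >= 2 scores
    given by the same judge to the same item.
    """
    sum_sq, dof = 0.0, 0
    for scores in repeats:
        mean = sum(scores) / len(scores)
        sum_sq += sum((s - mean) ** 2 for s in scores)
        dof += len(scores) - 1          # one degree of freedom lost per item mean
    return math.sqrt(sum_sq / dof)

# Made-up example: 4 items, each scored 3 times on a 1-5 scale.
repeats = [[4, 4, 5], [2, 3, 2], [5, 5, 5], [1, 2, 1]]
sigma = pooled_within_item_sd(repeats)
print(f"within-item sigma = {sigma:.2f} pts")
```

A sigma of this kind gives a noise floor: score gaps between models that are small relative to it are hard to distinguish from re-scoring noise.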

Pure NumPy: no required dependencies beyond numpy (matplotlib is optional, for plots).

Install

pip install -e .          # core
pip install -e ".[plot]"  # + matplotlib for PNG plots
pip install -e ".[dev]"   # + pytest

60-second demo

python examples/run_benchmark.py
# -> examples/out/report.md  (+ drift.png, calibration.png)

Or open the interactive walkthrough: examples/demo.ipynb.

The demo runs an inject → recover benchmark: a synthetic judge is built with known pathologies, then the audit recovers them — the same "plant the signal, prove the instrument finds it" design as model-collapse-testbed.
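
The inject → recover idea can be sketched for a single pathology. The example below plants a known verbosity coefficient in synthetic scores, then recovers it by partialling quality out of both score and length (residual-on-residual OLS). Everything here is an assumption made for illustration: the helper names, the data-generating process, and the noise levels are hypothetical and are not taken from `synthetic.py` or `bias.py`.

```python
import random

def slope(x, y):
    """OLS slope of y on x (intercept included)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after regressing it on x (intercept included)."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

def partialled_slope(score, length, control):
    """Slope of score on length with `control` partialled out of both."""
    return slope(residuals(control, length), residuals(control, score))

rng = random.Random(0)
n, injected = 2000, 0.45
quality = [rng.gauss(3.0, 1.0) for _ in range(n)]
# Longer answers tend to be better here, so length is confounded with quality.
length_z = [0.6 * (q - 3.0) + 0.8 * rng.gauss(0.0, 1.0) for q in quality]
# The synthetic judge rewards quality plus an injected verbosity term.
score = [q + injected * z + rng.gauss(0.0, 0.4) for q, z in zip(quality, length_z)]

naive = slope(length_z, score)                      # confounded by quality
recovered = partialled_slope(score, length_z, quality)
print(f"injected={injected}  naive={naive:.2f}  recovered={recovered:.2f}")
```

The naive slope absorbs the quality effect and overstates the bias; the partialled slope should land close to the injected value because quality has been removed from both sides.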

pathology                            injected   recovered
verbosity coef (pts / SD log-len)    0.45       0.37¹
self-consistency σ (pts)             0.40       0.32
first-slot preference                0.68       0.71
drift changepoint (batch)            7
calibration (ECE)                    0.177 → 0.042 after isotonic

¹ Integer-Likert rounding attenuates measured coefficients; this is an honest, expected effect.
² EWMA lags the changepoint by ~1 batch by design (that's the smoothing).
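
The EWMA lag mentioned in the footnotes can be seen in a small textbook-style control chart. This is a minimal sketch in plain Python, not the code in `drift.py`: the function name `ewma_alarms`, the baseline mean `mu0`, the batch-mean SD `sigma`, and the series itself are all assumptions chosen for illustration.

```python
import math

def ewma_alarms(x, mu0, sigma, lam=0.3, L=3.0):
    """Return the EWMA series and the batch indices outside the control limits.

    z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = mu0.
    Limits use the exact (time-varying) EWMA variance.
    """
    z, series, alarms = mu0, [], []
    for t, xt in enumerate(x, start=1):
        z = lam * xt + (1 - lam) * z
        half_width = L * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        )
        series.append(z)
        if abs(z - mu0) > half_width:
            alarms.append(t - 1)        # report 0-based batch index
    return series, alarms

# Made-up anchor-set means per batch: stable around 3.0, then a +0.15 shift
# starting at batch index 5.
anchor_means = [3.02, 2.97, 3.01, 2.99, 3.03, 3.15, 3.16, 3.14, 3.15]
series, alarms = ewma_alarms(anchor_means, mu0=3.0, sigma=0.05)
print("alarm batches:", alarms)
```

In this series the shifted batch itself stays inside the limits, because the smoothed statistic only moves a fraction `lam` of the way toward the new level; the first alarm comes one batch later.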

[Figures: drift, calibration]

Use on your own data

import numpy as np
from judge_audit import (
    self_consistency, krippendorff_alpha, spearman,
    position_bias, length_bias, ewma_chart,
    IsotonicCalibrator, calibration_error, bootstrap_diff,
)

# 1. reliability ceiling: same judge, same items, scored R times
self_consistency(repeats)                      # repeats: (n_items, R)

# 2. validity vs human labels
krippendorff_alpha(np.vstack([judge, human]), level="interval")
spearman(judge, human)

# 3. systematic bias (controlled-perturbation inputs)
position_bias(winner_ab, winner_ba)            # order-swapped pairwise verdicts
length_bias(scores, lengths, control=human)    # verbosity, quality partialled out

# 4. drift on a fixed anchor set, one stat per batch
ewma_chart(anchor_mean_per_batch, lam=0.3, L=3.0)

# 5. calibrate + error bars
cal = IsotonicCalibrator().fit(judge_cal, human_cal)
corrected = cal.transform(judge_test)
calibration_error(corrected, human_test, scale=(1, 5))
bootstrap_diff(model_a_scores, model_b_scores) # CI straddles 0 -> no winner
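
The reading rule in the last comment (a CI that straddles 0 means no winner) can be illustrated with a percentile bootstrap on the mean score gap. The sketch below assumes both models were scored on the same items and resamples items in pairs; `paired_bootstrap_ci` is a hypothetical helper, and whether `bootstrap_diff` is paired or unpaired is not stated in this README.

```python
import random

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a - b), resampling items with replacement."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(diffs) / n, lo, hi

# Made-up scores: model A beats model B on every item.
model_a = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]
model_b = [3, 3, 2, 3, 4, 2, 2, 4, 3, 2]
gap, lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"A - B = {gap:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")

# Made-up scores: models C and D trade wins item by item.
model_c = [4, 3, 5, 2, 4, 3, 5, 2, 4, 3]
model_d = [3, 4, 4, 3, 3, 4, 4, 3, 3, 4]
gap2, lo2, hi2 = paired_bootstrap_ci(model_c, model_d)
verdict = "no significant difference" if lo2 <= 0 <= hi2 else "winner"
print(f"C - D = {gap2:.2f}, 95% CI [{lo2:.2f}, {hi2:.2f}] -> {verdict}")
```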

Or run everything and render a report:

from judge_audit import make_benchmark, audit, render_markdown, save_plots
a = audit(make_benchmark())          # or pass your own dict (see synthetic.py keys)
with open("report.md", "w", encoding="utf-8") as f:   # close the file; report may contain non-ASCII (α, σ)
    f.write(render_markdown(a, save_plots(a, "out")))

On your own eval logs (CSV / JSONL)

io.assemble turns a long-format log into the same dict; audit() then runs only the axes your columns support (no human labels ⇒ no calibration; no anchor set ⇒ no drift; etc.).

from judge_audit import read_jsonl, assemble, pairwise, audit, render_markdown

recs  = read_jsonl("eval_log.jsonl")              # rows: item_id, score, human?, length?, batch?, repeat?
bench = assemble(recs, scale=(1, 5), anchor_ids=my_anchor_ids)
bench.update(pairwise(read_jsonl("pairwise_log.jsonl")))   # optional: pair_id, order, winner
print(render_markdown(audit(bench)))

Columns named differently? Pass a mapping: assemble(recs, fields={"score": "rating", "human": "gold"}). A runnable end-to-end example: python examples/make_sample_log.py && python examples/audit_log.py.

The five corrections

problem              correction
position bias        score both orders; only count when they agree, else tie/uncertain
miscalibration       ship the isotonic map; report corrected scores + ECE, not raw
drift                keep the anchor-set control chart in CI; block/re-baseline on alarm
low reliability      raise repeats / lower temperature before trusting any ranking
noisy leaderboards   bootstrap CI on every gap; straddles 0 ⇒ "no significant difference"
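
The position-bias correction in the first row can be sketched as follows. The input encoding is an assumption: both lists name the winning model ("A" or "B"), not the winning slot, with `winner_ab` collected when A is shown first and `winner_ba` when B is shown first. `reconcile` is a hypothetical helper, not the library's `position_bias`.

```python
def reconcile(winner_ab, winner_ba):
    """Keep a verdict only when both presentation orders agree; otherwise call a tie.

    Also returns the share of verdicts won by whichever model sat in the first slot.
    """
    final = [ab if ab == ba else "tie" for ab, ba in zip(winner_ab, winner_ba)]
    first_slot_wins = (
        sum(w == "A" for w in winner_ab)    # A is in the first slot in AB order
        + sum(w == "B" for w in winner_ba)  # B is in the first slot in BA order
    )
    first_slot_rate = first_slot_wins / (len(winner_ab) + len(winner_ba))
    return final, first_slot_rate

# Made-up verdicts for six pairs, judged in both orders.
winner_ab = ["A", "A", "B", "A", "A", "B"]
winner_ba = ["A", "B", "B", "B", "A", "B"]
final, first_slot_rate = reconcile(winner_ab, winner_ba)
print(final)
print(f"first-slot preference = {first_slot_rate:.2f}")
```

A first-slot rate near 0.5 suggests little position bias; the further above 0.5 it sits, the more verdicts flip with presentation order and end up as ties after reconciliation.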

Layout

judge_audit/
  agreement.py   Krippendorff α (nominal/ordinal/interval), Spearman, self-consistency
  bias.py        position bias (order-swap), verbosity bias (partialled OLS)
  drift.py       EWMA + CUSUM control charts
  calibrate.py   isotonic (PAVA), ECE / reliability curve, bootstrap CIs
  synthetic.py   judge with injected pathologies + ground truth
  io.py          CSV/JSONL adapters for real eval logs -> audit-ready dict
  report.py      run the full audit (tolerant of missing axes), render MD + plots
examples/
  run_benchmark.py   synthetic inject->recover report
  make_sample_log.py + audit_log.py   real-log adapter path
  demo.ipynb         interactive walkthrough
tests/           Krippendorff cross-checked against the reference package
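
For readers unfamiliar with the PAVA step named under `calibrate.py`, here is a minimal pool-adjacent-violators sketch in plain Python. It is an illustration with unit weights and made-up numbers, not the library's `IsotonicCalibrator`; `pava` is a hypothetical helper and the inputs are assumed to be already sorted by judge score.

```python
def pava(y):
    """Least-squares non-decreasing fit to y (unit weights).

    y must be ordered by the predictor (here: by judge score).
    """
    blocks = []                          # each block is [sum, count]
    for value in y:
        blocks.append([value, 1])
        # Pool while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)       # every member of a block gets the block mean
    return fitted

# Made-up calibration data: mean human score observed at each judge score.
judge_levels = [1, 2, 3, 4, 5]
human_means = [1.5, 1.2, 2.8, 4.0, 3.6]  # not monotone in the judge score
fitted = pava(human_means)
calibration_map = dict(zip(judge_levels, fitted))
print(calibration_map)
```

The two order violations (levels 1-2 and 4-5) are each pooled to their average, which yields a monotone map from raw judge score to the human scale.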

See ARTICLE.md (Vietnamese) / ARTICLE.en.md (English) for the write-up.

License

MIT
