Audit the reliability of an LLM-as-judge eval pipeline: agreement, bias, drift, calibration.

judge-audit

Your eval is a measurement instrument. Audit it like one.

LLM-as-judge has quietly become the backbone of model evaluation — but the judge itself is rarely audited. judge-audit treats the judge as an instrument and asks the questions you'd ask of any sensor before trusting its readout:

Reliability — does it give the same answer twice? (self-consistency, judge↔judge)
Validity — does it agree with humans? (Spearman, Krippendorff's α)
Systematic bias — position bias, verbosity bias, measured by controlled perturbation, not correlation.
Drift — does the scale move over time as the provider silently ships new model versions? (anchor-set EWMA / CUSUM control charts)
Calibration — map raw judge scores onto the human scale (isotonic), with honest bootstrap error bars on every leaderboard gap.
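As a rough illustration of the "bootstrap error bars on every leaderboard gap" idea, here is a minimal NumPy sketch of a percentile bootstrap for the difference in mean score between two models. The function name `bootstrap_gap_ci` and its parameters are hypothetical and are not the package's `bootstrap_diff`; it also assumes the two models' items are resampled independently (with shared items, a paired resample would likely be tighter).

```python
import numpy as np

def bootstrap_gap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Resample item indices with replacement, independently for each model.
    idx_a = rng.integers(0, a.size, size=(n_boot, a.size))
    idx_b = rng.integers(0, b.size, size=(n_boot, b.size))
    gaps = a[idx_a].mean(axis=1) - b[idx_b].mean(axis=1)
    lo, hi = np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
    return float(a.mean() - b.mean()), float(lo), float(hi)

# Toy data: two models whose true means differ by only 0.1 points.
rng = np.random.default_rng(1)
model_a = rng.normal(3.6, 1.0, size=200)
model_b = rng.normal(3.5, 1.0, size=200)
gap, lo, hi = bootstrap_gap_ci(model_a, model_b)
print(f"gap={gap:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
print("no significant difference" if lo <= 0 <= hi else "gap is distinguishable from 0")
```

If the interval straddles 0, the leaderboard gap should not be read as a win for either model.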

Pure NumPy: the only required dependency is numpy (matplotlib is optional, for plots).

Install

pip install -e .          # core (editable install; run from a source checkout)
pip install -e ".[plot]"  # + matplotlib for PNG plots
pip install -e ".[dev]"   # + pytest

60-second demo

python examples/run_benchmark.py
# -> examples/out/report.md  (+ drift.png, calibration.png)

Or open the interactive walkthrough: examples/demo.ipynb.

The demo runs an inject → recover benchmark: a synthetic judge is built with known pathologies, then the audit recovers them — the same "plant the signal, prove the instrument finds it" design as model-collapse-testbed.

pathology	injected	recovered
verbosity coef (pts / SD log-len)	0.45	0.37¹
self-consistency σ (pts)	0.40	0.32
first-slot preference	0.68	0.71
drift changepoint (batch)	7	8²
calibration (ECE)	—	0.177 → 0.042 after isotonic

¹ Integer-Likert rounding attenuates measured coefficients: an honest, expected effect.
² EWMA lags the changepoint by ~1 batch by design (that's the smoothing).
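To make the EWMA lag in the second footnote concrete, the sketch below implements a textbook EWMA control chart in plain NumPy and runs it on a noise-free toy series. The names `lam` and `L` mirror the keyword arguments shown in this README; the function `ewma_alarms`, the baseline arguments `mu0` and `sigma`, and the toy numbers are assumptions for illustration, not the package's implementation.

```python
import numpy as np

def ewma_alarms(x, mu0, sigma, lam=0.3, L=3.0):
    """Textbook EWMA chart (hypothetical helper): returns z, lower, upper, alarm."""
    x = np.asarray(x, dtype=float)
    z = np.empty_like(x)
    prev = mu0                      # the EWMA starts at the in-control mean
    for t, xt in enumerate(x):
        prev = lam * xt + (1 - lam) * prev
        z[t] = prev
    # Time-varying control limits that widen toward their asymptote.
    t = np.arange(1, x.size + 1)
    half = L * sigma * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
    return z, mu0 - half, mu0 + half, np.abs(z - mu0) > half

# Anchor-set mean per batch: steps from 3.0 to 3.3 at batch index 7.
x = np.array([3.0] * 7 + [3.3] * 8)
z, lower, upper, alarm = ewma_alarms(x, mu0=3.0, sigma=0.1)
first_alarm = int(np.argmax(alarm))   # index of the first batch outside the limits
print(first_alarm)                    # 8: one batch after the shift at index 7
```

In this toy case the smoothed statistic needs one extra batch to cross the limit. A smaller shift or a smaller `lam` would lengthen the lag.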

[Figures: drift control chart (drift.png) and calibration curve (calibration.png)]

Use on your own data

import numpy as np
from judge_audit import (
    self_consistency, krippendorff_alpha, spearman,
    position_bias, length_bias, ewma_chart,
    IsotonicCalibrator, calibration_error, bootstrap_diff,
)

# 1. reliability ceiling: same judge, same items, scored R times
self_consistency(repeats)                      # repeats: (n_items, R)

# 2. validity vs human labels
krippendorff_alpha(np.vstack([judge, human]), level="interval")
spearman(judge, human)

# 3. systematic bias (controlled-perturbation inputs)
position_bias(winner_ab, winner_ba)            # order-swapped pairwise verdicts
length_bias(scores, lengths, control=human)    # verbosity, quality partialled out

# 4. drift on a fixed anchor set, one stat per batch
ewma_chart(anchor_mean_per_batch, lam=0.3, L=3.0)

# 5. calibrate + error bars
cal = IsotonicCalibrator().fit(judge_cal, human_cal)
corrected = cal.transform(judge_test)
calibration_error(corrected, human_test, scale=(1, 5))
bootstrap_diff(model_a_scores, model_b_scores) # CI straddles 0 -> no winner
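For readers curious what an isotonic (PAVA) calibration map does under the hood, here is a small self-contained sketch in NumPy. It is not the package's `IsotonicCalibrator`; the helper names `pava`, `fit_isotonic`, and `apply_isotonic` are hypothetical, and the choice to interpolate linearly between knots and clamp at the ends is an assumption.

```python
import numpy as np

def pava(y, w=None):
    """Pool-adjacent-violators: weighted least-squares non-decreasing fit to y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block is [mean, weight, number_of_points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

def fit_isotonic(judge, human):
    """Return (x_knots, y_knots) mapping judge scores onto the human scale."""
    judge = np.asarray(judge, dtype=float)
    human = np.asarray(human, dtype=float)
    xs, inv, counts = np.unique(judge, return_inverse=True, return_counts=True)
    means = np.bincount(inv, weights=human) / counts  # mean human score per judge value
    return xs, pava(means, counts)

def apply_isotonic(knots, judge):
    xs, ys = knots
    return np.interp(judge, xs, ys)  # linear between knots, clamped outside them

# Toy calibration set: the judge's "2" is rated higher by humans than its "3".
judge_cal = [1, 2, 2, 3, 4, 5]
human_cal = [1, 3, 3, 2, 4, 4]
knots = fit_isotonic(judge_cal, human_cal)
print(knots[1])                          # monotone human-scale value per judge score
print(apply_isotonic(knots, [2.5, 5]))   # corrected scores for new judge outputs
```

The violating pair (judge scores 2 and 3) is pooled into a single level, so the corrected scale never ranks a higher judge score below a lower one.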

Or run everything and render a report:

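from pathlib import Path
from judge_audit import make_benchmark, audit, render_markdown, save_plots

a = audit(make_benchmark())          # or pass your own dict (see synthetic.py keys)
# write_text closes the file for us; UTF-8 keeps symbols such as α intact
Path("report.md").write_text(render_markdown(a, save_plots(a, "out")), encoding="utf-8")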

On your own eval logs (CSV / JSONL)

io.assemble turns a long-format log into the same dict that audit() accepts; audit() then runs only the axes your columns support (no human labels ⇒ no calibration; no anchor set ⇒ no drift; etc.).

from judge_audit import read_jsonl, assemble, pairwise, audit, render_markdown

recs  = read_jsonl("eval_log.jsonl")              # rows: item_id, score, human?, length?, batch?, repeat?
bench = assemble(recs, scale=(1, 5), anchor_ids=my_anchor_ids)
bench.update(pairwise(read_jsonl("pairwise_log.jsonl")))   # optional: pair_id, order, winner
print(render_markdown(audit(bench)))

Columns named differently? Pass a mapping: assemble(recs, fields={"score": "rating", "human": "gold"}). A runnable end-to-end example: python examples/make_sample_log.py && python examples/audit_log.py.
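To show what a long-format log with these columns might look like, the sketch below writes a tiny JSONL file using only the standard library and pivots it into an (n_items, R) score matrix, the shape the self-consistency step expects. The field names come from the row comment above; the values are made-up toy data, and the pivot is a plain-Python illustration, not how `assemble` is implemented.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# One record per (item, repeat): the "long" format.
rows = [
    {"item_id": "q1", "score": 4, "human": 4, "length": 120, "batch": 0, "repeat": 0},
    {"item_id": "q1", "score": 5, "human": 4, "length": 120, "batch": 0, "repeat": 1},
    {"item_id": "q2", "score": 2, "human": 3, "length": 45, "batch": 0, "repeat": 0},
    {"item_id": "q2", "score": 2, "human": 3, "length": 45, "batch": 0, "repeat": 1},
]

# JSONL is one JSON object per line.
path = Path(tempfile.mkdtemp()) / "eval_log.jsonl"
path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")

# Read it back and pivot into an (n_items, R) matrix of judge scores.
recs = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line]
item_ids = sorted({r["item_id"] for r in recs})
n_rep = 1 + max(r["repeat"] for r in recs)
repeats = np.full((len(item_ids), n_rep), np.nan)  # NaN marks a missing repeat
for r in recs:
    repeats[item_ids.index(r["item_id"]), r["repeat"]] = r["score"]
print(repeats)
```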

The five corrections

problem	correction
position bias	score both orders; only count when they agree, else tie/uncertain
miscalibration	ship the isotonic map; report corrected scores + ECE, not raw
drift	keep the anchor-set control chart in CI; block/re-baseline on alarm
low reliability	raise repeats / lower temperature before trusting any ranking
noisy leaderboards	bootstrap CI on every gap; straddles 0 ⇒ "no significant difference"
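The position-bias correction in the table can be sketched as follows. This assumes each verdict names the winning model ("A" or "B") rather than the slot, with `winner_ab` collected when A was shown first and `winner_ba` when B was shown first; that encoding and the helper name `reconcile` are assumptions for illustration and may differ from what `position_bias` expects.

```python
import numpy as np

def reconcile(winner_ab, winner_ba):
    """Keep a verdict only when both presentation orders agree (hypothetical helper)."""
    winner_ab = np.asarray(winner_ab)
    winner_ba = np.asarray(winner_ba)
    agree = winner_ab == winner_ba
    final = np.where(agree, winner_ab, "tie")   # disagreement -> tie/uncertain
    # How often the judge picked whichever answer was shown first.
    first_slot = np.concatenate([winner_ab == "A", winner_ba == "B"]).mean()
    return final, float(agree.mean()), float(first_slot)

winner_ab = ["A", "A", "B", "A"]   # verdicts with A shown first
winner_ba = ["A", "B", "B", "B"]   # same pairs with B shown first
final, agreement, first_slot = reconcile(winner_ab, winner_ba)
print(final.tolist())   # ['A', 'tie', 'B', 'tie']
print(agreement)        # 0.5
print(first_slot)       # 0.75
```

With evenly matched answers, a first-slot rate well above 0.5 suggests the judge favors the first position rather than the content.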

Layout

judge_audit/
  agreement.py   Krippendorff α (nominal/ordinal/interval), Spearman, self-consistency
  bias.py        position bias (order-swap), verbosity bias (partialled OLS)
  drift.py       EWMA + CUSUM control charts
  calibrate.py   isotonic (PAVA), ECE / reliability curve, bootstrap CIs
  synthetic.py   judge with injected pathologies + ground truth
  io.py          CSV/JSONL adapters for real eval logs -> audit-ready dict
  report.py      run the full audit (tolerant of missing axes), render MD + plots
examples/
  run_benchmark.py   synthetic inject->recover report
  make_sample_log.py + audit_log.py   real-log adapter path
  demo.ipynb         interactive walkthrough
tests/           Krippendorff cross-checked against the reference package

See ARTICLE.md (Vietnamese) / ARTICLE.en.md (English) for the write-up.

License

MIT

Project details

Release history

This version

0.1.0

Jun 23, 2026

Download files

Source Distribution

judge_audit-0.1.0.tar.gz (23.8 kB view details)

Uploaded Jun 23, 2026 Source

Built Distribution

judge_audit-0.1.0-py3-none-any.whl (22.4 kB view details)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file judge_audit-0.1.0.tar.gz.

File metadata

Download URL: judge_audit-0.1.0.tar.gz
Upload date: Jun 23, 2026
Size: 23.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for judge_audit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f5be2b244db9800b6f9cd529e1fe818cb6fc6fcdc30e4f8dcd3fe98693384447`
MD5	`59902e1cec5c7694669466651b2f67bb`
BLAKE2b-256	`95896bd8a6f8f5166de76024086e13d069a0ace8b30905b5d122edcaff7f227a`

File details

Details for the file judge_audit-0.1.0-py3-none-any.whl.

File metadata

Download URL: judge_audit-0.1.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for judge_audit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b2d5c54a043834aade4ead3d4e4a4f9697ffc2ff6cf4abb044244abb463446e4`
MD5	`a45723fc9b3b85d271362b0078a30a69`
BLAKE2b-256	`9b8309146e03f5b5a6d293c2f73d07ca40fb2adcf4d5570523b67f699c3f932c`
