Measure LLM-judge verdict drift across model versions by re-grading a stored Inspect eval log with two graders over the same samples.
inspect-judge-drift
Measure LLM-judge verdict drift across model versions for Inspect AI.
When the model you use as a grader is upgraded (say claude-opus-4-7 →
claude-opus-4-8), do its verdicts change on identical inputs? inspect-judge-drift
answers that by re-grading a stored .eval log with two grader
configurations over the same logged samples and reporting how often they
disagree.
Why re-grade from a stored log (and not run two live evals)
Drift measurement requires holding everything except the judge constant. Running
two live evals introduces two fresh variance sources — the model-under-test's
own stochasticity between runs and grader sampling — that contaminate the
exact signal you are trying to isolate. Re-grading a single stored log fixes the
inputs byte-for-byte: every grader sees the same output.completion, target,
and input that were already recorded. The judge model is the only variable.
No re-run of the model under test (MUT) is needed, and the input is reproducible.
This is the input a credible power-dimensioned drift study has to stand on. Live dual-grading measures something weaker and costs more.
How it isolates the judge
For each sample, the same grading prompt — built from the stored
input / output.completion / target — is sent to both graders. The only
difference between the two verdicts is the grader model.
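The idea can be sketched as follows. This is a minimal illustration, not the package's code: `TEMPLATE`, `StoredSample`, `build_prompt`, and `grade_with_both` are hypothetical names, the template text is invented for the example, and the graders are plain callables standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical template for illustration only; it is NOT the template
# used by the package or by Inspect's model_graded_qa.
TEMPLATE = (
    "Question: {input}\n"
    "Submission: {completion}\n"
    "Criterion: {target}\n"
    "Reply with GRADE: C, P, or I."
)


@dataclass(frozen=True)
class StoredSample:
    """Stand-in for the fields read back from a stored log (assumed shape)."""
    sample_id: str
    input: str
    completion: str
    target: str


def build_prompt(sample: StoredSample) -> str:
    """Build the grading prompt from stored fields only."""
    return TEMPLATE.format(
        input=sample.input,
        completion=sample.completion,
        target=sample.target,
    )


def grade_with_both(
    sample: StoredSample,
    grader_a: Callable[[str], str],
    grader_b: Callable[[str], str],
) -> Tuple[str, str]:
    """Build the prompt once and hand the same string to both graders."""
    prompt = build_prompt(sample)
    return grader_a(prompt), grader_b(prompt)


# Fake graders that record what they were sent, so the invariant is checkable.
seen: List[str] = []


def fake_grader_a(prompt: str) -> str:
    seen.append(prompt)
    return "GRADE: C"


def fake_grader_b(prompt: str) -> str:
    seen.append(prompt)
    return "GRADE: I"


sample = StoredSample("s1", "What is 2 + 2?", "4", "4")
raw_a, raw_b = grade_with_both(sample, fake_grader_a, fake_grader_b)
print(seen[0] == seen[1])  # both graders received the identical prompt
```

Building the prompt once, outside either grader call, is what keeps the grader model as the only difference between the two verdicts in this sketch.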
A grade-parse failure (the grader not emitting a parseable verdict) is treated
as a scoring-instrument failure: that sample is left Score.unscored(),
excluded from the drift-rate and kappa denominators, and reported separately as
unscored_rate — never fabricated into a verdict. This mirrors the position
argued upstream in inspect_ai#4026 / #4048: an instrument
failure must stay visible, not masquerade as a model result.
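A small sketch of that policy, under stated assumptions: `SIMPLE_GRADE_PATTERN` is a simplified, hypothetical pattern (it is not the package's DEFAULT_GRADE_PATTERN), and `parse_verdict` and `split_comparable` are illustrative helpers rather than the package's API. A failed parse returns `None` and the sample is routed to an unscored list instead of being assigned a verdict.

```python
import re
from typing import List, Optional, Tuple

# Simplified, hypothetical pattern for illustration; NOT the package's
# DEFAULT_GRADE_PATTERN. The trailing \b rejects "GRADE: Correct".
SIMPLE_GRADE_PATTERN = re.compile(r"GRADE\s*:\s*([CPI])\b", re.IGNORECASE)


def parse_verdict(raw: str) -> Optional[str]:
    """Return 'C', 'P', or 'I', or None when no verdict can be parsed."""
    match = SIMPLE_GRADE_PATTERN.search(raw)
    return match.group(1).upper() if match else None


def split_comparable(
    raw_pairs: List[Tuple[str, str, str]],
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Separate samples both graders scored from samples left unscored.

    raw_pairs holds (sample_id, raw_output_a, raw_output_b).
    """
    comparable: List[Tuple[str, str, str]] = []
    unscored: List[str] = []
    for sample_id, raw_a, raw_b in raw_pairs:
        verdict_a = parse_verdict(raw_a)
        verdict_b = parse_verdict(raw_b)
        if verdict_a is None or verdict_b is None:
            # Instrument failure: keep it visible, never invent a verdict.
            unscored.append(sample_id)
        else:
            comparable.append((sample_id, verdict_a, verdict_b))
    return comparable, unscored


comparable, unscored = split_comparable(
    [
        ("s1", "Reasoning... GRADE: C", "Reasoning... GRADE: I"),
        ("s2", "GRADE: C", "I could not decide."),
    ]
)
print(comparable)  # [('s1', 'C', 'I')]
print(unscored)    # ['s2']
```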
Install
pip install inspect-judge-drift
Use
from inspect_judge_drift import regrade_eval_log, GraderSpec

report = regrade_eval_log(
    "logs/2026-06-25T03-57-26-00-00_task_RXYpgDBbXfECH6EoqaX5w9.eval",
    GraderSpec("anthropic/claude-opus-4-7", label="opus-4.7"),
    GraderSpec("anthropic/claude-opus-4-8", label="opus-4.8"),
)
print(report.to_dict())
# Illustrative output (hypothetical numbers, not a measured result):
# {
#     'grader_a': 'opus-4.7', 'grader_b': 'opus-4.8',
#     'n_total': 100, 'n_comparable': 98, 'n_unscored': 2,
#     'drift_rate': 0.061,    # 6.1% of comparable samples flipped verdict
#     'cohens_kappa': 0.83,   # grader-vs-grader agreement
#     'unscored_rate': 0.02,  # 2% lost to grade-parse failure (reported, not hidden)
# }

# List the samples whose verdict flipped between the two graders.
for s in report.samples:
    if s.agree is False:
        print(s.sample_id, s.value_a, "->", s.value_b)
Inside an existing event loop (a notebook, an async app), use the async form:
from inspect_judge_drift import regrade_eval_log_async
report = await regrade_eval_log_async(log, grader_a, grader_b)
What it reports
| Field | Meaning |
|---|---|
| drift_rate | fraction of comparable samples where the two graders disagree (None if none are comparable) |
| cohens_kappa | Cohen's κ between grader A and grader B over comparable samples |
| unscored_rate | fraction of samples where at least one grader failed to parse |
| n_total / n_comparable / n_unscored | sample counts |
Parse failures are excluded from drift_rate / cohens_kappa and surfaced in
unscored_rate instead — the denominator is never silently inflated.
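How these fields relate can be sketched in a few lines. This is an illustration of the definitions in the table, not the package's implementation: `drift_metrics` is a hypothetical helper, `None` marks a parse failure, and returning `None` for κ when chance agreement equals 1 is an assumption about an edge case the table does not specify.

```python
from collections import Counter
from typing import Optional, Sequence


def drift_metrics(
    verdicts_a: Sequence[Optional[str]],
    verdicts_b: Sequence[Optional[str]],
) -> dict:
    """Compute drift_rate, cohens_kappa, and unscored_rate from paired verdicts."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both graders must cover the same samples")

    n_total = len(verdicts_a)
    # A sample is comparable only if both graders produced a parseable verdict.
    pairs = [
        (a, b)
        for a, b in zip(verdicts_a, verdicts_b)
        if a is not None and b is not None
    ]
    n_comparable = len(pairs)
    n_unscored = n_total - n_comparable

    drift_rate = None
    kappa = None
    if n_comparable > 0:
        disagreements = sum(a != b for a, b in pairs)
        drift_rate = disagreements / n_comparable
        observed = 1 - drift_rate
        # Chance agreement from each grader's marginal label frequencies.
        count_a = Counter(a for a, _ in pairs)
        count_b = Counter(b for _, b in pairs)
        expected = sum(count_a[k] * count_b[k] for k in count_a) / n_comparable**2
        # Assumption: kappa is undefined (None) when chance agreement is 1.
        if expected != 1:
            kappa = (observed - expected) / (1 - expected)

    return {
        "n_total": n_total,
        "n_comparable": n_comparable,
        "n_unscored": n_unscored,
        "drift_rate": drift_rate,
        "cohens_kappa": kappa,
        "unscored_rate": n_unscored / n_total if n_total else None,
    }


metrics = drift_metrics(
    ["C", "C", "I", "I", "C", None],
    ["C", "I", "I", "I", "C", "C"],
)
print(metrics)
```

In this toy input, one of six samples is unscored, one of the five comparable samples flips, and κ works out to (0.8 - 0.48) / (1 - 0.48), roughly 0.615. The unscored sample affects only unscored_rate, never the drift or κ denominators.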
Fidelity boundary (honest limit)
The grade-extraction pattern (DEFAULT_GRADE_PATTERN) is reproduced verbatim
from Inspect's internal scorer/_model.py, including its word-boundary guard and
zero-width-unicode tolerance, so verdict parsing matches upstream character for
character. But the package depends on no internal Inspect module, which means
it cannot call upstream's internal neutralize_structural_delimiters prompt
preparation (it is private). So the grading prompt is a faithful reimplementation,
not byte-identical to what model_graded_qa builds at eval time.
The consequence is bounded and disclosed:
- For A-vs-B drift — the package's actual job — both graders receive the exact same reimplemented prompt, so any prompt-prep difference from upstream is held constant and cancels. The drift signal is internally consistent.
- For the stronger claim "this reproduces Inspect's grading byte-for-byte": it does not, and does not pretend to. Drift is measured under this faithful-but-not-identical reimplementation.
This is the same kind of fidelity limit inspect-claim-support flagged for its
grader path — stated plainly rather than left implicit.
Scope (0.1.0)
- Log mode only: re-grade from a stored .eval. Live dual-grading is a possible later option, deliberately not in 0.1.0; it measures a weaker, noisier signal.
- Default rubric is the C/P/I model-graded family; pass template= and grade_pattern= for a custom rubric.
- Built on Inspect's public API only (read_eval_log, EvalSample, Score, get_model). The internal grading template and pattern are reimplemented locally rather than imported, so the package depends on no internal Inspect module.
Where it sits next to related tools
The tool enables a drift study; it is not the study. A credible study still needs a power-dimensioned N, a drift-benchmark dataset, and published findings. Its axis is distinct from the neighbouring eval-reliability work: this measures drift in the judge across versions, with the MUT held constant — not confidence intervals on a single mean, not deltas between models-under-test, not prompt-output snapshot regression.
Citation
See CITATION.cff.
File details
Details for the file inspect_judge_drift-0.1.0.tar.gz.
File metadata
- Download URL: inspect_judge_drift-0.1.0.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b643f0bf51f9bfb65d2eef58536ded1faf254d57a5d1f6f506b13b6a8caf9d59 |
| MD5 | 08ccbf075a4845eed90488d567aa2150 |
| BLAKE2b-256 | b5ccee50ed0fa2abfdb344dbdf7173a82fc04481cc2445ebf298d5cb32235bc0 |
File details
Details for the file inspect_judge_drift-0.1.0-py3-none-any.whl.
File metadata
- Download URL: inspect_judge_drift-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9cef1aad1e65e6e30ae28cbb32a129b21936843b1a1b8e48da8b38462eb5ef6f |
| MD5 | e2035bd06e7ea7c0c052bb080e0a0bfa |
| BLAKE2b-256 | b851fadabbe87b35d37ca6f287661452cd3f600be402e6ac4a967c23a0dd878f |