Measure LLM-judge verdict drift across model versions by re-grading a stored Inspect eval log with two graders over the same samples.

inspect-judge-drift

Measure LLM-judge verdict drift across model versions for Inspect AI.

When the model you use as a grader is upgraded (say claude-opus-4-7 → claude-opus-4-8), do its verdicts change on identical inputs? inspect-judge-drift answers that by re-grading a stored .eval log with two grader configurations over the same logged samples and reporting how often they disagree.

Why re-grade from a stored log (and not run two live evals)

Drift measurement requires holding everything except the judge constant. Running two live evals introduces two fresh variance sources that contaminate the exact signal you are trying to isolate: the run-to-run stochasticity of the model under test (MUT), and grader sampling. Re-grading a single stored log fixes the inputs byte-for-byte: every grader sees the same output.completion, target, and input that were already recorded. The MUT is never re-run, the input is reproducible, and the judge model is the only variable.

This is the input a credible power-dimensioned drift study has to stand on. Live dual-grading measures something weaker and costs more.

How it isolates the judge

For each sample, the same grading prompt — built from the stored input / output.completion / target — is sent to both graders. The only difference between the two verdicts is the grader model.

A grade-parse failure (the grader not emitting a parseable verdict) is treated as a scoring-instrument failure: that sample is left Score.unscored(), excluded from the drift-rate and kappa denominators, and reported separately as unscored_rate — never fabricated into a verdict. This mirrors the position argued upstream in inspect_ai#4026 / #4048: an instrument failure must stay visible, not masquerade as a model result.
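The flow described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `PROMPT_TEMPLATE`, `GRADE_PATTERN`, `parse_grade`, and `regrade_sample` are hypothetical names, the pattern is a simplified stand-in for DEFAULT_GRADE_PATTERN, and the two graders are stub callables rather than real model calls.

```python
import re
from typing import Callable, Optional, Tuple

# Simplified, hypothetical pattern (NOT the package's DEFAULT_GRADE_PATTERN).
# The trailing \b rejects e.g. "GRADE: Correct", where "C" starts a longer word.
GRADE_PATTERN = re.compile(r"GRADE\s*:\s*([CPI])\b", re.IGNORECASE)

# Hypothetical template, used only to make the sketch runnable.
PROMPT_TEMPLATE = (
    "Question: {input}\n"
    "Submission: {completion}\n"
    "Criterion: {target}\n"
    "End your reply with 'GRADE: C', 'GRADE: P' or 'GRADE: I'."
)

Grader = Callable[[str], str]  # stand-in for a model call: prompt in, reply out


def parse_grade(reply: str) -> Optional[str]:
    """Return 'C', 'P' or 'I', or None when no verdict can be parsed."""
    match = GRADE_PATTERN.search(reply)
    return match.group(1).upper() if match else None


def regrade_sample(
    sample: dict, grader_a: Grader, grader_b: Grader
) -> Tuple[Optional[str], Optional[str]]:
    """Build the prompt once from stored fields and send it to both graders."""
    prompt = PROMPT_TEMPLATE.format(
        input=sample["input"],
        completion=sample["completion"],
        target=sample["target"],
    )
    # None means "unscored": the instrument failed, so no verdict is invented.
    return parse_grade(grader_a(prompt)), parse_grade(grader_b(prompt))


seen_prompts = []  # records what each stub grader actually received


def stub_a(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "The submission matches the criterion. GRADE: C"


def stub_b(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "I am unable to decide."


stored = {"input": "2 + 2 = ?", "completion": "4", "target": "4"}
verdicts = regrade_sample(stored, stub_a, stub_b)
print(verdicts)  # ('C', None)
```

Because the prompt is formatted once and passed to both callables, the only thing that can differ between the two verdicts is the grader itself. The second stub never emits a verdict, so its result stays `None` instead of being coerced into a grade.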

Install

```shell
pip install inspect-judge-drift
```

Use

```python
from inspect_judge_drift import regrade_eval_log, GraderSpec

report = regrade_eval_log(
    "logs/2026-06-25T03-57-26-00-00_task_RXYpgDBbXfECH6EoqaX5w9.eval",
    GraderSpec("anthropic/claude-opus-4-7", label="opus-4.7"),
    GraderSpec("anthropic/claude-opus-4-8", label="opus-4.8"),
)

print(report.to_dict())
# Illustrative output (hypothetical numbers, not a measured result):
# {
#   'grader_a': 'opus-4.7', 'grader_b': 'opus-4.8',
#   'n_total': 100, 'n_comparable': 98, 'n_unscored': 2,
#   'drift_rate': 0.061,        # 6.1% of comparable samples flipped verdict
#   'cohens_kappa': 0.83,       # grader-vs-grader agreement
#   'unscored_rate': 0.02,      # 2% lost to grade-parse failure (reported, not hidden)
# }

for s in report.samples:
    if s.agree is False:
        print(s.sample_id, s.value_a, "->", s.value_b)
```

Inside an existing event loop (a notebook, an async app), use the async form:

```python
from inspect_judge_drift import regrade_eval_log_async
report = await regrade_eval_log_async(log, grader_a, grader_b)
```
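One plausible reason for the separate async entry point is that a synchronous wrapper built on `asyncio.run` cannot be called while an event loop is already running. Whether the package's sync function is implemented that way is an assumption, not something this page confirms. The sketch below uses hypothetical stand-ins (`fake_regrade_async`, `fake_regrade_sync`) to show the standard-library behaviour only.

```python
import asyncio


async def fake_regrade_async() -> str:
    """Hypothetical stand-in for an async re-grade call."""
    await asyncio.sleep(0)
    return "report"


def fake_regrade_sync() -> str:
    """Hypothetical sync wrapper that starts its own event loop."""
    coro = fake_regrade_async()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def inside_running_loop() -> tuple:
    """Simulate a notebook cell or async app where a loop is already running."""
    try:
        fake_regrade_sync()
        sync_failed = False
    except RuntimeError:
        # asyncio.run() refuses to start a second loop in the same thread.
        sync_failed = True
    report = await fake_regrade_async()  # awaiting directly works
    return sync_failed, report


result = asyncio.run(inside_running_loop())
print(result)  # (True, 'report')
```

Outside a running loop the sync wrapper works as expected; inside one, awaiting the coroutine directly is the form that succeeds.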

What it reports

| Field | Meaning |
| --- | --- |
| `drift_rate` | fraction of comparable samples where the two graders disagree (`None` if none are comparable) |
| `cohens_kappa` | Cohen's κ between grader A and grader B over comparable samples |
| `unscored_rate` | fraction of samples where at least one grader failed to parse |
| `n_total` / `n_comparable` / `n_unscored` | sample counts |

Parse failures are excluded from drift_rate / cohens_kappa and surfaced in unscored_rate instead — the denominator is never silently inflated.
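To make the definitions concrete, here is a minimal sketch of how these fields could be computed from two aligned lists of verdicts, with `None` marking a parse failure. The function name `drift_summary` is hypothetical, and returning `None` for κ when chance agreement equals 1 is an assumption of this sketch; the package's own computation may differ in detail.

```python
from collections import Counter
from typing import Optional, Sequence


def drift_summary(
    verdicts_a: Sequence[Optional[str]], verdicts_b: Sequence[Optional[str]]
) -> dict:
    """Summarise agreement between two graders over the same samples."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both graders must cover the same samples")

    n_total = len(verdicts_a)
    # A sample is comparable only if BOTH graders produced a parseable verdict.
    pairs = [
        (a, b) for a, b in zip(verdicts_a, verdicts_b)
        if a is not None and b is not None
    ]
    n_comparable = len(pairs)
    n_unscored = n_total - n_comparable

    drift_rate = kappa = None
    if n_comparable:
        drift_rate = sum(a != b for a, b in pairs) / n_comparable
        p_observed = 1 - drift_rate
        # Chance agreement from each grader's marginal label frequencies.
        count_a = Counter(a for a, _ in pairs)
        count_b = Counter(b for _, b in pairs)
        p_expected = sum(count_a[k] * count_b[k] for k in count_a) / n_comparable**2
        if p_expected < 1:  # kappa is undefined when chance agreement is 1
            kappa = (p_observed - p_expected) / (1 - p_expected)

    return {
        "n_total": n_total,
        "n_comparable": n_comparable,
        "n_unscored": n_unscored,
        "drift_rate": drift_rate,
        "cohens_kappa": kappa,
        "unscored_rate": n_unscored / n_total if n_total else None,
    }


# Toy verdicts for illustration only (not measured data).
summary = drift_summary(["C", "C", "I", "I", None], ["C", "I", "I", "I", "C"])
print(summary)
```

In the toy input, four of the five samples are comparable and one of those four flips, so the drift rate is computed over 4 rather than 5. The unscored sample affects only `unscored_rate`.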

Fidelity boundary (honest limit)

The grade-extraction pattern (DEFAULT_GRADE_PATTERN) is reproduced verbatim from Inspect's internal scorer/_model.py, including its word-boundary guard and zero-width-unicode tolerance, so verdict parsing matches upstream character for character. However, the package imports no internal Inspect module, so it cannot call upstream's private neutralize_structural_delimiters prompt-preparation step. The grading prompt is therefore a faithful reimplementation, not byte-identical to what model_graded_qa builds at eval time.

The consequence is bounded and disclosed:

- For A-vs-B drift (the package's actual job), both graders receive the exact same reimplemented prompt, so any prompt-prep difference from upstream is held constant and cancels. The drift signal is internally consistent.
- For the stronger claim "this reproduces Inspect's grading byte-for-byte": it does not, and does not pretend to. Drift is measured under this faithful-but-not-identical reimplementation.

This is the same kind of fidelity limit inspect-claim-support flagged for its grader path — stated plainly rather than left implicit.

Scope (0.1.0)

- Log mode only: re-grade from a stored .eval. Live dual-grading is a possible later option, deliberately left out of 0.1.0 because it measures a weaker, noisier signal.
- Default rubric is the C/P/I model-graded family; pass template= and grade_pattern= for a custom rubric.
- Built on Inspect's public API only (read_eval_log, EvalSample, Score, get_model). The internal grading template and pattern are reimplemented locally rather than imported, so the package depends on no internal Inspect module.
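As an illustration of what a custom rubric might look like, the sketch below defines a hypothetical yes/no template and a matching pattern, then checks the pattern with the standard `re` module. The placeholder names (`{question}`, `{answer}`, `{criterion}`), the constant names, and the convention that the verdict sits in capture group 1 are all assumptions; check the package's default template and pattern for what it actually expects before passing your own via template= and grade_pattern=.

```python
import re
from typing import Optional

# Hypothetical custom rubric: a binary YES/NO verdict instead of C/P/I.
CUSTOM_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Criterion: {criterion}\n"
    "Finish with 'VERDICT: YES' or 'VERDICT: NO'."
)

# The \b guard keeps "VERDICT: NOT SURE" from being read as "NO".
CUSTOM_GRADE_PATTERN = r"(?i)VERDICT\s*:\s*(YES|NO)\b"


def extract_verdict(reply: str) -> Optional[str]:
    """Return 'YES' or 'NO', or None when the reply has no parseable verdict."""
    match = re.search(CUSTOM_GRADE_PATTERN, reply)
    return match.group(1).upper() if match else None


prompt = CUSTOM_TEMPLATE.format(
    question="Is 7 prime?", answer="Yes, 7 is prime.", criterion="Answer is yes."
)
print(extract_verdict("The answer is right. verdict: yes"))  # YES
print(extract_verdict("VERDICT: NOT SURE"))                  # None
```

Testing a pattern against a few representative grader replies before a full re-grade helps keep the unscored rate from being dominated by a pattern that is too strict or too loose.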

Where it sits next to related tools

The tool enables a drift study; it is not the study. A credible study still needs a power-dimensioned N, a drift-benchmark dataset, and published findings. Its axis is distinct from the neighbouring eval-reliability work: this measures drift in the judge across versions, with the MUT held constant — not confidence intervals on a single mean, not deltas between models-under-test, not prompt-output snapshot regression.
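As a rough guide to what sizing N can involve, the sketch below estimates how many samples are needed to pin down a drift rate to within a chosen margin, using the normal approximation for a proportion. This is a precision-based calculation, simpler than a formal power analysis, and the function name `samples_for_drift_rate` and the example rates are hypothetical.

```python
import math
from statistics import NormalDist


def samples_for_drift_rate(
    expected_rate: float,
    half_width: float,
    confidence: float = 0.95,
    unscored_rate: float = 0.0,
) -> int:
    """Samples needed so the CI on the drift rate is about +/- half_width.

    Uses n = z^2 * p * (1 - p) / h^2, then inflates n to allow for samples
    lost to grade-parse failures.
    """
    if not 0 < expected_rate < 1:
        raise ValueError("expected_rate must be in (0, 1)")
    if half_width <= 0 or not 0 <= unscored_rate < 1:
        raise ValueError("half_width must be > 0 and unscored_rate in [0, 1)")

    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    n_comparable = math.ceil(
        z**2 * expected_rate * (1 - expected_rate) / half_width**2
    )
    return math.ceil(n_comparable / (1 - unscored_rate))


# Hypothetical planning inputs, not measured results.
print(samples_for_drift_rate(0.06, 0.02))                      # 542
print(samples_for_drift_rate(0.06, 0.02, unscored_rate=0.02))  # 554
```

The second call shows why the unscored rate matters for planning: samples that fail to parse do not count toward the comparable denominator, so the stored log needs to be correspondingly larger.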

Citation

See CITATION.cff.

Project details

Release history

This version

0.1.0

Jun 30, 2026

Download files

Source Distribution

inspect_judge_drift-0.1.0.tar.gz (11.4 kB)

Uploaded Jun 30, 2026 Source

Built Distribution

inspect_judge_drift-0.1.0-py3-none-any.whl (11.5 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file inspect_judge_drift-0.1.0.tar.gz.

File metadata

Download URL: inspect_judge_drift-0.1.0.tar.gz
Upload date: Jun 30, 2026
Size: 11.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for inspect_judge_drift-0.1.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `b643f0bf51f9bfb65d2eef58536ded1faf254d57a5d1f6f506b13b6a8caf9d59` |
| MD5 | `08ccbf075a4845eed90488d567aa2150` |
| BLAKE2b-256 | `b5ccee50ed0fa2abfdb344dbdf7173a82fc04481cc2445ebf298d5cb32235bc0` |

File details

Details for the file inspect_judge_drift-0.1.0-py3-none-any.whl.

File metadata

Download URL: inspect_judge_drift-0.1.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 11.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for inspect_judge_drift-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `9cef1aad1e65e6e30ae28cbb32a129b21936843b1a1b8e48da8b38462eb5ef6f` |
| MD5 | `e2035bd06e7ea7c0c052bb080e0a0bfa` |
| BLAKE2b-256 | `b851fadabbe87b35d37ca6f287661452cd3f600be402e6ac4a967c23a0dd878f` |
