Causal Judge Evaluation - Calibrated LLM-judge evaluation with honest confidence intervals
Project description
CJE — Causal Judge Evaluation
LLM-judge scores are cheap, plentiful, and miscalibrated. In our benchmark, naive 95% confidence intervals built on raw judge scores contained the true value 0% of the time. CJE calibrates your judge against a small slice of ground truth (5–25% of samples), evaluates your policies at scale, and reports uncertainty you can defend — including telling you when not to trust the result.
60 seconds
pip install cje-eval
Generate responses from each candidate policy on a shared prompt set, judge everything, and attach ground-truth labels (oracle_label) to the slice you can afford — human raters, expert review, a downstream KPI. Any bounded scale works (0–1, 0–100, Likert). CJE needs at least 10 labeled rows pooled across policies:
from cje import analyze_dataset
results = analyze_dataset(
fresh_draws_data={
"gpt-4o": [
{"prompt_id": "q01", "judge_score": 0.85, "oracle_label": 0.90},
{"prompt_id": "q02", "judge_score": 0.72, "oracle_label": 0.70},
{"prompt_id": "q03", "judge_score": 0.91, "oracle_label": 0.88},
{"prompt_id": "q04", "judge_score": 0.64, "oracle_label": 0.55},
{"prompt_id": "q05", "judge_score": 0.77, "oracle_label": 0.74},
{"prompt_id": "q06", "judge_score": 0.88, "oracle_label": 0.92},
{"prompt_id": "q07", "judge_score": 0.68},
{"prompt_id": "q08", "judge_score": 0.79},
],
"claude-sonnet": [
{"prompt_id": "q01", "judge_score": 0.78, "oracle_label": 0.82},
{"prompt_id": "q02", "judge_score": 0.81, "oracle_label": 0.79},
{"prompt_id": "q03", "judge_score": 0.86, "oracle_label": 0.84},
{"prompt_id": "q04", "judge_score": 0.70, "oracle_label": 0.66},
{"prompt_id": "q05", "judge_score": 0.74, "oracle_label": 0.71},
{"prompt_id": "q06", "judge_score": 0.93, "oracle_label": 0.90},
{"prompt_id": "q07", "judge_score": 0.75},
{"prompt_id": "q08", "judge_score": 0.83},
],
}
)
for policy, estimate, (lo, hi) in zip(
results.metadata["target_policies"], results.estimates, results.ci()
):
print(f"{policy:15s} {estimate:.3f} 95% CI [{lo:.3f}, {hi:.3f}]")
claude-sonnet 0.773 95% CI [0.686, 0.884]
gpt-4o 0.745 95% CI [0.595, 0.909]
And when the data can't support an answer, CJE says so instead of handing you a confident number. Here a candidate policy's judge scores land mostly outside the range where the calibrator saw oracle labels — the run emits the paper's coverage badge and refuses level claims for that policy:
REFUSE-LEVEL for policy 'candidate': 88.3% of fresh-draw judge scores fall
outside the oracle calibration range [0.161, 0.595]. Do not ship level
(absolute) claims for this policy; rankings may stand. Collect oracle labels
covering the missing score range.
The cje analyze CLI takes the same gates into account before crowning a winner:
⚠️ Best by point estimate: candidate (UNRELIABLE — see diagnostics)
🏆 Best reliable policy: baseline
→ Runnable Colab with real data · Full docs
Is CJE the right tool?
| Your situation | Use |
|---|---|
| Rank/compare policies using an LLM judge, with some ground-truth labels | CJE |
| One dataset, labels sampled from it, want a CI on its mean | PPI works; CJE's calibrated_mean_ci is the same primitive and adds the transport audit + coverage badge |
| Evaluate many policies without labeling under each — and know when calibration reuse breaks | CJE (the transport audit is the point) |
| Predict how a specific response will score | Not CJE — per-item prediction (conformal methods) |
| Off-policy estimates from logs only (importance weighting / doubly robust) | pip install "cje-eval==0.3.*" — the frozen OPE line; 0.4.x is Direct-mode only (see Notes on 0.4.0) |
How it works
- Calibrate: learn the judge → oracle mapping on the labeled slice (isotonic, two-stage when needed; mean-preserving by construction; cross-fitted).
- Evaluate: score every policy's fresh responses through the calibrated judge and compare policies on the same prompts.
- Audit & refuse: a transport audit per policy (does the calibration still hold on this policy's outputs?) and a coverage badge for level claims (were there oracle labels where this policy's scores live?). Failing gates change the output — they are not footnotes.
Confidence intervals account for the judge being learned from a finite label budget (calibration-aware inference), not just sampling noise.
Calibrated estimates with 95% CIs (valid under the calibration and transport checks CJE runs by default)
Validation on real ground truth
- HealthBench (physician labels, n=29,511): two LLM judges were overconfident by 24.5 and 13.0 points and disagreed with each other by up to 73 points on specific criteria categories. Calibrated on 5% physician labels (~1,400 records), both converged to the physician ground truth. Read the full audit →
- Chatbot Arena (4,961 prompts, 5 policies): 99% pairwise ranking accuracy at a 5% oracle fraction — 14× cheaper than labeling everything, with ~95% CI coverage vs 0% for naive judge-score CIs. An adversarial policy that fools the judge is correctly flagged by the transport audit. Paper →
The array API
calibrated_mean_ci is the library's bottom layer — a ppi_py-style primitive that takes plain numpy arrays and returns a calibrated mean with an honest CI. Use it when you have one sample of judge scores and a partial oracle slice; use analyze_dataset for multi-policy comparisons.
import numpy as np
from cje import calibrated_mean_ci
rng = np.random.default_rng(0)
scores = rng.uniform(size=400) # judge scores for every sample
labels = np.full(400, np.nan) # NaN = unlabeled
labeled = rng.choice(400, size=100, replace=False) # oracle slice (25%)
labels[labeled] = np.clip(scores[labeled] + rng.normal(0, 0.1, size=100), 0, 1)
result = calibrated_mean_ci(scores, labels)
print(result.summary())
Calibrated mean: 0.5316 (SE 0.0175, CI [0.4965, 0.5649], n=400, n_oracle=100, bootstrap)
result.calibrator is reusable: transport_audit(probe_scores, probe_labels, result.calibrator) checks whether the calibration still holds on a new slice. result.diagnostics["boundary_card"] carries the coverage badge.
Documentation
| Resource | Description |
|---|---|
| Interactive Tutorial | Walk through a complete example in Colab — no setup required |
| CJE in 3 Minutes | Video: why raw judge scores mislead and how CJE fixes it |
| Technical Walkthrough | Video: calibration, evaluation, and transport auditing pipeline |
| Operational Playbook | End-to-end runbook: audits, drift correction, label budgeting |
| Planning Notebook | Optimize your evaluation budget with pilot data |
| Full Docs | Installation, assumptions, API reference, research notes |
Bridges: Already running evals in Promptfoo, TruLens, LangSmith, or OpenCompass? Convert those outputs into CJE format with one command.
Module deep dives: Calibration · Diagnostics · Estimators · Interface/API · Data formats
Notes on 0.4.0
0.4.0 is a breaking release: CJE is now Direct-mode only. The off-policy machinery — importance-sampling and doubly-robust estimators (calibrated-ips, dr-cpo, mrdr, tmle, stacked-dr), teacher forcing, SIMCal weight stabilization, and the overlap diagnostics — has been removed. Our own paper's results drove the cut: for realistic LLM policy pairs, importance weighting failed even when ESS looked healthy (target-typicality coverage 0.19–0.49, far below the 0.70 gate), and the best DR stack merely matched Direct mode's accuracy at ~12× the compute. Direct mode — fresh draws, calibrated judge, audits — is what the evidence supports, so it is now the whole product.
- Need IPS/DR from logged propensities? Pin the frozen OPE line:
pip install "cje-eval==0.3.*"(maintained on the0.3.xbranch; docs at thev0.3.0tag). - Have old logged data with
judge_score+oracle_label? It still works as the calibration source:analyze_dataset(fresh_draws_dir=..., calibration_data_path="logged.jsonl"). - Removed entry points raise migration errors that say exactly this.
Full details in the CHANGELOG.
Development
git clone https://github.com/cimo-labs/cje.git
cd cje && poetry install && make test
Citation
If you use CJE in your research, please cite:
@misc{landesberg2025causaljudgeevaluationcalibrated,
title={Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems},
author={Eddie Landesberg},
year={2025},
eprint={2512.11150},
archivePrefix={arXiv},
primaryClass={stat.ME},
url={https://arxiv.org/abs/2512.11150},
}
License
MIT — See LICENSE for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cje_eval-0.4.0.tar.gz.
File metadata
- Download URL: cje_eval-0.4.0.tar.gz
- Upload date:
- Size: 236.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eba7e0008320648eb2ee0a9441b31c36191e6132103239bc89ed236cb5d8d31d
|
|
| MD5 |
631be4b95ac177bb3fd9527ad392a680
|
|
| BLAKE2b-256 |
856e0a9470cec70765b18a7c9eaaacaf48aefb6b50fd8c2eda5a898b2db1ba79
|
Provenance
The following attestation bundles were made for cje_eval-0.4.0.tar.gz:
Publisher:
publish.yml on cimo-labs/cje
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
cje_eval-0.4.0.tar.gz -
Subject digest:
eba7e0008320648eb2ee0a9441b31c36191e6132103239bc89ed236cb5d8d31d - Sigstore transparency entry: 2055751723
- Sigstore integration time:
-
Permalink:
cimo-labs/cje@a8e9d85a7b89ac25bf251fd138991fbbd59caf40 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/cimo-labs
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@a8e9d85a7b89ac25bf251fd138991fbbd59caf40 -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file cje_eval-0.4.0-py3-none-any.whl.
File metadata
- Download URL: cje_eval-0.4.0-py3-none-any.whl
- Upload date:
- Size: 280.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f1c429350746adefa098574a7b7a4fecd6990610a842005baae151d6cafcc7f6
|
|
| MD5 |
561945d8e527365017fc2a457f9ce405
|
|
| BLAKE2b-256 |
8e079604ff43b149b85023570b2fd9f4b3ced90a0fff6d0b05c28c75fa4824ca
|
Provenance
The following attestation bundles were made for cje_eval-0.4.0-py3-none-any.whl:
Publisher:
publish.yml on cimo-labs/cje
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
cje_eval-0.4.0-py3-none-any.whl -
Subject digest:
f1c429350746adefa098574a7b7a4fecd6990610a842005baae151d6cafcc7f6 - Sigstore transparency entry: 2055752111
- Sigstore integration time:
-
Permalink:
cimo-labs/cje@a8e9d85a7b89ac25bf251fd138991fbbd59caf40 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/cimo-labs
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@a8e9d85a7b89ac25bf251fd138991fbbd59caf40 -
Trigger Event:
workflow_dispatch
-
Statement type: