
Causal Judge Evaluation - Calibrated LLM-judge evaluation with honest confidence intervals

CJE — Causal Judge Evaluation

LLM-judge scores are cheap, plentiful, and miscalibrated. In our benchmark, naive 95% confidence intervals built on raw judge scores contained the true value 0% of the time. CJE calibrates your judge against a small slice of ground truth (5–25% of samples), evaluates your policies at scale, and reports uncertainty you can defend — including telling you when not to trust the result.

60 seconds

pip install cje-eval

Generate responses from each candidate policy on a shared prompt set, judge everything, and attach ground-truth labels (oracle_label) to the slice you can afford — human raters, expert review, a downstream KPI. Any bounded scale works (0–1, 0–100, Likert). CJE needs at least 10 labeled rows pooled across policies:

from cje import analyze_dataset

results = analyze_dataset(
    fresh_draws_data={
        "gpt-4o": [
            {"prompt_id": "q01", "judge_score": 0.85, "oracle_label": 0.90},
            {"prompt_id": "q02", "judge_score": 0.72, "oracle_label": 0.70},
            {"prompt_id": "q03", "judge_score": 0.91, "oracle_label": 0.88},
            {"prompt_id": "q04", "judge_score": 0.64, "oracle_label": 0.55},
            {"prompt_id": "q05", "judge_score": 0.77, "oracle_label": 0.74},
            {"prompt_id": "q06", "judge_score": 0.88, "oracle_label": 0.92},
            {"prompt_id": "q07", "judge_score": 0.68},
            {"prompt_id": "q08", "judge_score": 0.79},
        ],
        "claude-sonnet": [
            {"prompt_id": "q01", "judge_score": 0.78, "oracle_label": 0.82},
            {"prompt_id": "q02", "judge_score": 0.81, "oracle_label": 0.79},
            {"prompt_id": "q03", "judge_score": 0.86, "oracle_label": 0.84},
            {"prompt_id": "q04", "judge_score": 0.70, "oracle_label": 0.66},
            {"prompt_id": "q05", "judge_score": 0.74, "oracle_label": 0.71},
            {"prompt_id": "q06", "judge_score": 0.93, "oracle_label": 0.90},
            {"prompt_id": "q07", "judge_score": 0.75},
            {"prompt_id": "q08", "judge_score": 0.83},
        ],
    }
)

for policy, estimate, (lo, hi) in zip(
    results.metadata["target_policies"], results.estimates, results.ci()
):
    print(f"{policy:15s} {estimate:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")

Output:

claude-sonnet   0.773  95% CI [0.686, 0.884]
gpt-4o          0.745  95% CI [0.595, 0.909]
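
Before calling analyze_dataset it can help to confirm that the pooled label count clears the 10-row minimum mentioned above. The following is a minimal sketch in plain Python, not part of the cje API: count_labeled_rows and MIN_LABELED_ROWS are hypothetical names, and the toy data only mirrors the shape of the example.

```python
from typing import Dict, List

MIN_LABELED_ROWS = 10  # minimum quoted in the text; the constant name is hypothetical


def count_labeled_rows(fresh_draws_data: Dict[str, List[dict]]) -> int:
    """Count rows that carry an oracle_label, pooled across all policies."""
    return sum(
        1
        for rows in fresh_draws_data.values()
        for row in rows
        if row.get("oracle_label") is not None
    )


# Toy data in the same shape as the example above (values are illustrative).
data = {
    "policy_a": [
        {"prompt_id": "q01", "judge_score": 0.85, "oracle_label": 0.90},
        {"prompt_id": "q02", "judge_score": 0.72},
    ],
    "policy_b": [
        {"prompt_id": "q01", "judge_score": 0.78, "oracle_label": 0.82},
        {"prompt_id": "q02", "judge_score": 0.81},
    ],
}

n_labeled = count_labeled_rows(data)
enough_labels = n_labeled >= MIN_LABELED_ROWS
print(f"{n_labeled} labeled rows; enough for calibration: {enough_labels}")
```

With only two labeled rows this toy set falls short, so more oracle labels would be needed before running the analysis.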

And when the data can't support an answer, CJE says so instead of handing you a confident number. Here a candidate policy's judge scores land mostly outside the range where the calibrator saw oracle labels — the run emits the paper's coverage badge and refuses level claims for that policy:

REFUSE-LEVEL for policy 'candidate': 88.3% of fresh-draw judge scores fall
outside the oracle calibration range [0.161, 0.595]. Do not ship level
(absolute) claims for this policy; rankings may stand. Collect oracle labels
covering the missing score range.
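
The percentage in that message can be reproduced in spirit with a few lines of plain Python. This is a sketch under assumptions, not CJE's implementation: it takes the calibration range to be the minimum and maximum judge score among labeled rows, and fraction_outside_range is a hypothetical helper. The threshold at which CJE refuses is not stated here, so none is applied.

```python
from typing import Sequence, Tuple


def fraction_outside_range(
    fresh_scores: Sequence[float], labeled_scores: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (fraction of fresh scores outside [lo, hi], lo, hi).

    Assumption: [lo, hi] is the min/max judge score among rows that have
    an oracle label.
    """
    lo, hi = min(labeled_scores), max(labeled_scores)
    outside = sum(1 for s in fresh_scores if s < lo or s > hi)
    return outside / len(fresh_scores), lo, hi


# Illustrative numbers: labels cover low scores, the candidate scores high.
labeled_scores = [0.20, 0.35, 0.41, 0.55]
fresh_scores = [0.30, 0.62, 0.71, 0.80, 0.90]

frac, lo, hi = fraction_outside_range(fresh_scores, labeled_scores)
print(f"{frac:.1%} of fresh scores fall outside [{lo:.3f}, {hi:.3f}]")
```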

The cje analyze CLI takes the same gates into account before crowning a winner:

⚠️ Best by point estimate: candidate (UNRELIABLE — see diagnostics)
🏆 Best reliable policy: baseline

Runnable Colab with real data · Full docs

Is CJE the right tool?

Your situation | Use
Rank/compare policies using an LLM judge, with some ground-truth labels | CJE
One dataset, labels sampled from it, want a CI on its mean | PPI works; CJE's calibrated_mean_ci is the same primitive and adds the transport audit + coverage badge
Evaluate many policies without labeling under each, and know when calibration reuse breaks | CJE (the transport audit is the point)
Predict how a specific response will score | Not CJE; this is per-item prediction (conformal methods)
Off-policy estimates from logs only (importance weighting / doubly robust) | pip install "cje-eval==0.3.*" (the frozen OPE line); 0.4.x is Direct-mode only (see Notes on 0.4.0)

How it works

  1. Calibrate: learn the judge → oracle mapping on the labeled slice (isotonic, two-stage when needed; mean-preserving by construction; cross-fitted).
  2. Evaluate: score every policy's fresh responses through the calibrated judge and compare policies on the same prompts.
  3. Audit & refuse: a transport audit per policy (does the calibration still hold on this policy's outputs?) and a coverage badge for level claims (were there oracle labels where this policy's scores live?). Failing gates change the output — they are not footnotes.
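
The "mean-preserving by construction" property of step 1 can be seen in a bare-bones isotonic fit. The sketch below implements pool-adjacent-violators in plain Python; it is a single-stage illustration without cross-fitting, it assumes no tied judge scores, and isotonic_fit is a hypothetical name rather than CJE's calibrator.

```python
from typing import List, Sequence


def isotonic_fit(scores: Sequence[float], labels: Sequence[float]) -> List[float]:
    """Pool-adjacent-violators: nondecreasing fit of labels ordered by score.

    Returns fitted values in the original row order. Assumes distinct scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks: List[List[float]] = []  # each block is [sum of labels, row count]
    for i in order:
        blocks.append([labels[i], 1])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while (
            len(blocks) > 1
            and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted_sorted: List[float] = []
    for total, count in blocks:
        fitted_sorted.extend([total / count] * int(count))
    fitted = [0.0] * len(scores)
    for rank, i in enumerate(order):
        fitted[i] = fitted_sorted[rank]
    return fitted


judge = [0.1, 0.3, 0.5, 0.7, 0.9]
oracle = [0.2, 0.1, 0.6, 0.5, 0.9]  # two order violations to pool
calibrated = isotonic_fit(judge, oracle)

print([round(v, 3) for v in calibrated])
print(round(sum(calibrated) / len(calibrated), 3), round(sum(oracle) / len(oracle), 3))
```

Each pooled block replaces its labels with their average, so the fitted values keep the same overall mean as the oracle labels on the labeled slice.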

Confidence intervals account for the judge being learned from a finite label budget (calibration-aware inference), not just sampling noise.
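
One way to picture calibration-aware inference is a bootstrap that refits the calibrator inside every resample, so label scarcity widens the interval. The sketch below is an assumption-laden illustration, not CJE's procedure: it swaps the isotonic calibrator for a least-squares line to stay short, and fit_line, calibrated_mean, and bootstrap_ci are hypothetical helpers.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for the real calibrator)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def calibrated_mean(rows):
    """Fit on labeled rows, then average calibrated scores over all rows."""
    labeled = [(s, y) for s, y in rows if y is not None]
    slope, intercept = fit_line([s for s, _ in labeled], [y for _, y in labeled])
    return mean(slope * s + intercept for s, _ in rows)


def bootstrap_ci(rows, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap; the calibrator is refit in every replicate."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        if sum(1 for _, y in sample if y is not None) < 2:
            continue  # too few labels in this replicate to fit a line
        stats.append(calibrated_mean(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Synthetic data: 200 judge scores, oracle labels on the first 50 rows only.
rng = random.Random(1)
rows = []
for i in range(200):
    score = rng.random()
    label = min(1.0, max(0.0, score + rng.gauss(0, 0.1))) if i < 50 else None
    rows.append((score, label))

point = calibrated_mean(rows)
lo, hi = bootstrap_ci(rows)
print(f"calibrated mean {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A bootstrap that held the calibrator fixed would capture only sampling noise in the judge scores; refitting per replicate is what lets the finite label budget show up in the interval width.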

[Figure: CJE forest plot of calibrated policy estimates with 95% CIs (valid under the calibration and transport checks CJE runs by default).]

Validation on real ground truth

  • HealthBench (physician labels, n=29,511): two LLM judges were overconfident by 24.5 and 13.0 points and disagreed with each other by up to 73 points on specific criteria categories. Calibrated on 5% physician labels (~1,400 records), both converged to the physician ground truth. Read the full audit →
  • Chatbot Arena (4,961 prompts, 5 policies): 99% pairwise ranking accuracy at a 5% oracle fraction — 14× cheaper than labeling everything, with ~95% CI coverage vs 0% for naive judge-score CIs. An adversarial policy that fools the judge is correctly flagged by the transport audit. Paper →
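
For readers unfamiliar with the metric, pairwise ranking accuracy can be sketched as the share of policy pairs whose estimated order matches the ground-truth order. The snippet below uses made-up numbers and a hypothetical helper name; the paper's exact definition (for example, how ties or repeated runs are handled) may differ.

```python
from itertools import combinations
from typing import Dict


def pairwise_ranking_accuracy(estimated: Dict[str, float], truth: Dict[str, float]) -> float:
    """Fraction of policy pairs ordered the same way by estimate and truth."""
    pairs = list(combinations(sorted(truth), 2))
    correct = sum(
        1
        for a, b in pairs
        if (estimated[a] - estimated[b]) * (truth[a] - truth[b]) > 0
    )
    return correct / len(pairs)


# Illustrative values only: the estimate swaps p1 and p2, everything else agrees.
truth = {"p1": 0.80, "p2": 0.75, "p3": 0.60, "p4": 0.40}
estimated = {"p1": 0.78, "p2": 0.79, "p3": 0.62, "p4": 0.45}

accuracy = pairwise_ranking_accuracy(estimated, truth)
print(f"pairwise ranking accuracy: {accuracy:.3f}")
```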

The array API

calibrated_mean_ci is the library's bottom layer — a ppi_py-style primitive that takes plain numpy arrays and returns a calibrated mean with an honest CI. Use it when you have one sample of judge scores and a partial oracle slice; use analyze_dataset for multi-policy comparisons.

import numpy as np
from cje import calibrated_mean_ci

rng = np.random.default_rng(0)
scores = rng.uniform(size=400)                      # judge scores for every sample
labels = np.full(400, np.nan)                       # NaN = unlabeled
labeled = rng.choice(400, size=100, replace=False)  # oracle slice (25%)
labels[labeled] = np.clip(scores[labeled] + rng.normal(0, 0.1, size=100), 0, 1)

result = calibrated_mean_ci(scores, labels)
print(result.summary())

Output:

Calibrated mean: 0.5316 (SE 0.0175, CI [0.4965, 0.5649], n=400, n_oracle=100, bootstrap)

result.calibrator is reusable: transport_audit(probe_scores, probe_labels, result.calibrator) checks whether the calibration still holds on a new slice. result.diagnostics["boundary_card"] carries the coverage badge.
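
The idea behind a transport check can be sketched without the library: on a labeled probe slice, compare oracle labels with the calibrator's predictions and ask whether the mean residual is consistent with zero. This is one common formulation and an assumption here, since this page does not spell out how transport_audit decides; residual_mean_ci and transport_holds are hypothetical names.

```python
from math import sqrt
from statistics import mean, stdev


def residual_mean_ci(probe_labels, calibrated_preds, z=1.96):
    """Mean of (oracle label - calibrated prediction) with a normal-approx CI."""
    residuals = [y - p for y, p in zip(probe_labels, calibrated_preds)]
    m = mean(residuals)
    se = stdev(residuals) / sqrt(len(residuals))
    return m, (m - z * se, m + z * se)


def transport_holds(probe_labels, calibrated_preds):
    """True when the residual CI covers zero (no detectable systematic bias)."""
    _, (lo, hi) = residual_mean_ci(probe_labels, calibrated_preds)
    return lo <= 0.0 <= hi


# Illustrative probe slice: calibrated predictions plus small symmetric noise.
preds = [0.50, 0.60, 0.70, 0.80, 0.40, 0.55]
noise = [0.02, -0.03, 0.01, -0.02, 0.03, -0.01]
labels_ok = [p + e for p, e in zip(preds, noise)]
labels_shifted = [p - 0.20 + e for p, e in zip(preds, noise)]  # oracle scores 0.2 lower

holds_ok = transport_holds(labels_ok, preds)
holds_shifted = transport_holds(labels_shifted, preds)
print(f"same distribution: {holds_ok}; shifted policy: {holds_shifted}")
```

A real probe slice would need far more than six rows for the normal approximation to be reasonable; the tiny sample only keeps the arithmetic easy to follow.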

Documentation

Resource | Description
Interactive Tutorial | Walk through a complete example in Colab; no setup required
CJE in 3 Minutes | Video: why raw judge scores mislead and how CJE fixes it
Technical Walkthrough | Video: calibration, evaluation, and transport auditing pipeline
Operational Playbook | End-to-end runbook: audits, drift correction, label budgeting
Planning Notebook | Optimize your evaluation budget with pilot data
Full Docs | Installation, assumptions, API reference, research notes

Bridges: Already running evals in Promptfoo, TruLens, LangSmith, or OpenCompass? Convert those outputs into CJE format with one command.

Module deep dives: Calibration · Diagnostics · Estimators · Interface/API · Data formats

Notes on 0.4.0

0.4.0 is a breaking release: CJE is now Direct-mode only. The off-policy machinery — importance-sampling and doubly-robust estimators (calibrated-ips, dr-cpo, mrdr, tmle, stacked-dr), teacher forcing, SIMCal weight stabilization, and the overlap diagnostics — has been removed. Our own paper's results drove the cut: for realistic LLM policy pairs, importance weighting failed even when ESS looked healthy (target-typicality coverage 0.19–0.49, far below the 0.70 gate), and the best DR stack merely matched Direct mode's accuracy at ~12× the compute. Direct mode — fresh draws, calibrated judge, audits — is what the evidence supports, so it is now the whole product.
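
For context on the ESS diagnostic mentioned above: the usual (Kish) effective sample size of a set of importance weights is the squared sum of weights divided by the sum of squared weights. The sketch below uses that standard formula with made-up weights; it is an assumption that the 0.3.x line computed ESS this way, and effective_sample_size is a hypothetical helper.

```python
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq


ess_uniform = effective_sample_size([1.0, 1.0, 1.0, 1.0])  # equal weights
ess_skewed = effective_sample_size([10.0, 0.1, 0.1, 0.1])  # one weight dominates

print(f"uniform weights: ESS = {ess_uniform:.2f} of 4")
print(f"skewed weights:  ESS = {ess_skewed:.2f} of 4")
```

ESS summarizes only how concentrated the weights are. It does not say whether the heavily weighted samples resemble what the target policy would typically produce, which is the gap the paragraph above describes.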

  • Need IPS/DR from logged propensities? Pin the frozen OPE line: pip install "cje-eval==0.3.*" (maintained on the 0.3.x branch; docs at the v0.3.0 tag).
  • Have old logged data with judge_score + oracle_label? It still works as the calibration source: analyze_dataset(fresh_draws_dir=..., calibration_data_path="logged.jsonl").
  • Removed entry points raise migration errors that say exactly this.

Full details in the CHANGELOG.

Development

git clone https://github.com/cimo-labs/cje.git
cd cje && poetry install && make test

Citation

If you use CJE in your research, please cite:

@misc{landesberg2025causaljudgeevaluationcalibrated,
  title={Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems},
  author={Eddie Landesberg},
  year={2025},
  eprint={2512.11150},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2512.11150},
}

License

MIT — See LICENSE for details.
