Causal Judge Evaluation - Calibrated LLM-judge evaluation with honest confidence intervals

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

elandesberg

These details have not been verified by PyPI

Project links

Project description

CJE — Causal Judge Evaluation

LLM-judge scores are cheap, plentiful, and miscalibrated. In our benchmark, naive 95% confidence intervals built on raw judge scores contained the true value 0% of the time. CJE calibrates your judge against a small slice of ground truth (5–25% of samples), evaluates your policies at scale, and reports uncertainty you can defend — including telling you when not to trust the result.

60 seconds

pip install cje-eval

Generate responses from each candidate policy on a shared prompt set, judge everything, and attach ground-truth labels (oracle_label) to the slice you can afford — human raters, expert review, a downstream KPI. Any bounded scale works (0–1, 0–100, Likert). CJE needs at least 10 labeled rows pooled across policies — they can all come from one policy, and even that policy doesn't need labels on every row:

from cje import analyze_dataset

# One policy carries the oracle slice — CJE learns the judge→oracle mapping
# once and applies it to every unlabeled response, in any policy. Coverage
# is never required to be complete: gpt-5.6 has 20 responses with labels on
# only 10 (oracle_label=None means unlabeled); fable-5 has no labels at all.
labeled = [(0.62, 0.55), (0.68, 0.60), (0.72, 0.70), (0.76, 0.74), (0.79, 0.75),
           (0.83, 0.80), (0.85, 0.90), (0.88, 0.92), (0.91, 0.88), (0.95, 0.97)]
unlabeled = [0.64, 0.69, 0.73, 0.77, 0.80, 0.84, 0.87, 0.89, 0.92, 0.94]
judge_only = [0.70, 0.74, 0.75, 0.78, 0.81, 0.83, 0.86, 0.90, 0.93, 0.94]

results = analyze_dataset(
    fresh_draws_data={
        "gpt-5.6": [
            {"prompt_id": f"q{i:02d}", "judge_score": s, "oracle_label": y}
            for i, (s, y) in enumerate(labeled + [(u, None) for u in unlabeled])
        ],
        "fable-5": [
            {"prompt_id": f"q{i:02d}", "judge_score": s}
            for i, s in enumerate(judge_only)
        ],
    }
)

for policy, estimate, (lo, hi) in zip(
    results.metadata["target_policies"], results.estimates, results.ci()
):
    print(f"{policy:15s} {estimate:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")

fable-5         0.811  95% CI [0.751, 0.873]
gpt-5.6         0.784  95% CI [0.589, 0.844]

fable-5 gets a calibrated estimate and an honest CI without a single label of its own, and gpt-5.6's ten unlabeled rows are covered by the same transfer — labels are a pooled budget, not a per-row requirement. That transfer is what the labels-under-every-policy workflow wastes money re-buying.

And when the data can't support an answer, CJE says so instead of handing you a confident number. Here a candidate policy's judge scores land mostly outside the range where the calibrator saw oracle labels — the run emits the paper's coverage badge and refuses level claims for that policy:

REFUSE-LEVEL for policy 'candidate': 88.3% of fresh-draw judge scores fall
outside the oracle calibration range [0.161, 0.595]. Do not ship level
(absolute) claims for this policy; rankings may stand. Collect oracle labels
covering the missing score range.

The cje analyze CLI takes the same gates into account before crowning a winner:

⚠️ Best by point estimate: candidate (UNRELIABLE — see diagnostics)
🏆 Best reliable policy: baseline

→ Runnable Colab with real data · Full docs

Is CJE the right tool?

Your situation	Use
Rank/compare policies using an LLM judge, with some ground-truth labels	CJE
One dataset, labels sampled from it, want a CI on its mean	PPI works; CJE's `calibrated_mean_ci` is the same primitive and adds the transport audit + coverage badge
Evaluate many policies without labeling under each — and know when calibration reuse breaks	CJE (the transport audit is the point)
Predict how a specific response will score	Not CJE — per-item prediction (conformal methods)
Off-policy estimates from logs only (importance weighting / doubly robust)	`pip install "cje-eval==0.3.*"` — the frozen OPE line; 0.4.x is Direct-mode only (see Notes on 0.4.0)

How it works

Calibrate: learn the judge → oracle mapping on the labeled slice (isotonic, two-stage when needed; mean-preserving by construction; cross-fitted).
Evaluate: score every policy's fresh responses through the calibrated judge and compare policies on the same prompts.
Audit & refuse: a transport audit per policy (does the calibration still hold on this policy's outputs?) and a coverage badge for level claims (were there oracle labels where this policy's scores live?). Failing gates change the output — they are not footnotes.

Confidence intervals account for the judge being learned from a finite label budget (calibration-aware inference), not just sampling noise.

CJE forest plot showing calibrated policy estimates with confidence intervals

Calibrated estimates with 95% CIs (valid under the calibration and transport checks CJE runs by default)

Validation on real ground truth

HealthBench (physician labels, n=29,511): two LLM judges were overconfident by 24.5 and 13.0 points and disagreed with each other by up to 73 points on specific criteria categories. Calibrated on 5% physician labels (~1,400 records), both converged to the physician ground truth. Read the full audit →
Chatbot Arena (4,961 prompts, 5 policies): 99% pairwise ranking accuracy at a 5% oracle fraction — 14× cheaper than labeling everything, with ~95% CI coverage vs 0% for naive judge-score CIs. An adversarial policy that fools the judge is correctly flagged by the transport audit. Paper →

The array API

calibrated_mean_ci is the library's bottom layer — a ppi_py-style primitive that takes plain numpy arrays and returns a calibrated mean with an honest CI. Use it when you have one sample of judge scores and a partial oracle slice; use analyze_dataset for multi-policy comparisons.

import numpy as np
from cje import calibrated_mean_ci

rng = np.random.default_rng(0)
scores = rng.uniform(size=400)                      # judge scores for every sample
labels = np.full(400, np.nan)                       # NaN = unlabeled
labeled = rng.choice(400, size=100, replace=False)  # oracle slice (25%)
labels[labeled] = np.clip(scores[labeled] + rng.normal(0, 0.1, size=100), 0, 1)

result = calibrated_mean_ci(scores, labels)
print(result.summary())

Calibrated mean: 0.5316 (SE 0.0175, CI [0.4965, 0.5649], n=400, n_oracle=100, bootstrap)

result.calibrator is reusable: transport_audit(probe_scores, probe_labels, result.calibrator) checks whether the calibration still holds on a new slice. result.diagnostics["boundary_card"] carries the coverage badge.

Documentation

Resource	Description
Interactive Tutorial	Walk through a complete example in Colab — no setup required
CJE in 3 Minutes	Video: why raw judge scores mislead and how CJE fixes it
Technical Walkthrough	Video: calibration, evaluation, and transport auditing pipeline
Operational Playbook	End-to-end runbook: audits, drift correction, label budgeting
Planning Notebook	Optimize your evaluation budget with pilot data
Full Docs	Installation, assumptions, API reference, research notes

Bridges: Already running evals in Promptfoo, TruLens, LangSmith, or OpenCompass? Convert those outputs into CJE format with one command.

Module deep dives: Calibration · Diagnostics · Estimators · Interface/API · Data formats

Notes on 0.4.0

0.4.0 is a breaking release: CJE is now Direct-mode only. The off-policy machinery — importance-sampling and doubly-robust estimators (calibrated-ips, dr-cpo, mrdr, tmle, stacked-dr), teacher forcing, SIMCal weight stabilization, and the overlap diagnostics — has been removed. Our own paper's results drove the cut: for realistic LLM policy pairs, importance weighting failed even when ESS looked healthy (target-typicality coverage 0.19–0.49, far below the 0.70 gate), and the best DR stack merely matched Direct mode's accuracy at ~12× the compute. Direct mode — fresh draws, calibrated judge, audits — is what the evidence supports, so it is now the whole product.

Need IPS/DR from logged propensities? Pin the frozen OPE line: pip install "cje-eval==0.3.*" (maintained on the 0.3.x branch; docs at the v0.3.0 tag; requires Python <=3.12 — on 3.13 use a 3.12 env for OPE).
Have old logged data with judge_score + oracle_label? It still works as the calibration source: analyze_dataset(fresh_draws_dir=..., calibration_data_path="logged.jsonl").
Removed entry points raise migration errors that say exactly this.

Full details in the CHANGELOG.

Development

git clone https://github.com/cimo-labs/cje.git
cd cje && poetry install && make test

Citation

If you use CJE in your research, please cite:

@misc{landesberg2025causaljudgeevaluationcalibrated,
  title={Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems},
  author={Eddie Landesberg},
  year={2025},
  eprint={2512.11150},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2512.11150},
}

License

MIT — See LICENSE for details.

Project details

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

elandesberg

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.4.3

Jul 4, 2026

This version

0.4.2

Jul 4, 2026

0.4.1

Jul 4, 2026

0.4.0

Jul 3, 2026

0.3.0

Jul 2, 2026

0.2.25

Apr 2, 2026

0.2.24

Mar 24, 2026

0.2.23

Mar 1, 2026

0.2.22

Feb 3, 2026

0.2.21

Feb 3, 2026

0.2.20

Feb 2, 2026

0.2.19

Jan 27, 2026

0.2.18

Jan 21, 2026

0.2.17

Jan 20, 2026

0.2.16

Jan 20, 2026

0.2.13

Jan 15, 2026

0.2.12

Jan 11, 2026

0.2.10

Jan 2, 2026

0.2.9

Dec 14, 2025

0.2.8

Dec 14, 2025

0.2.7

Dec 14, 2025

0.2.6

Dec 13, 2025

0.2.5

Oct 30, 2025

0.2.4

Oct 30, 2025

0.2.3

Oct 27, 2025

0.2.2

Oct 9, 2025

0.2.1

Oct 9, 2025

0.2.0

Oct 8, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cje_eval-0.4.2.tar.gz (233.1 kB view details)

Uploaded Jul 4, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cje_eval-0.4.2-py3-none-any.whl (275.9 kB view details)

Uploaded Jul 4, 2026 Python 3

File details

Details for the file cje_eval-0.4.2.tar.gz.

File metadata

Download URL: cje_eval-0.4.2.tar.gz
Upload date: Jul 4, 2026
Size: 233.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cje_eval-0.4.2.tar.gz
Algorithm	Hash digest
SHA256	`fd11460421211b6398c5b0f711ff1684d9a475e9fd76b6c158c5dad4a4b9d030`
MD5	`6754706f77c120392095238fbe48a1ad`
BLAKE2b-256	`a3c75fb7461c01e9e29ec0ad3609b3442da73f61661d55c9363922eaf0ddc35e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for cje_eval-0.4.2.tar.gz:

Publisher: publish.yml on cimo-labs/cje

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cje_eval-0.4.2.tar.gz
- Subject digest: fd11460421211b6398c5b0f711ff1684d9a475e9fd76b6c158c5dad4a4b9d030
- Sigstore transparency entry: 2065344530
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: cimo-labs/cje@04a9fd23cdee9822e619ec5195307362d51b010e
- Branch / Tag: refs/heads/main
- Owner: https://github.com/cimo-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@04a9fd23cdee9822e619ec5195307362d51b010e
- Trigger Event: workflow_dispatch

File details

Details for the file cje_eval-0.4.2-py3-none-any.whl.

File metadata

Download URL: cje_eval-0.4.2-py3-none-any.whl
Upload date: Jul 4, 2026
Size: 275.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cje_eval-0.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1761d95863bbee52157c701e08ba58c33a73f86ecdeb0c334d9283f777b1bf06`
MD5	`3f7f454404d40a70b9f4284466e765c8`
BLAKE2b-256	`35ac7194122dd66e006b750b18c981057437c980164d7d063823539ca1182710`

See more details on using hashes here.

Provenance

The following attestation bundles were made for cje_eval-0.4.2-py3-none-any.whl:

Publisher: publish.yml on cimo-labs/cje

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cje_eval-0.4.2-py3-none-any.whl
- Subject digest: 1761d95863bbee52157c701e08ba58c33a73f86ecdeb0c334d9283f777b1bf06
- Sigstore transparency entry: 2065344631
- Sigstore integration time: Jul 4, 2026
Source repository:
- Permalink: cimo-labs/cje@04a9fd23cdee9822e619ec5195307362d51b010e
- Branch / Tag: refs/heads/main
- Owner: https://github.com/cimo-labs
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@04a9fd23cdee9822e619ec5195307362d51b010e
- Trigger Event: workflow_dispatch

cje-eval 0.4.2

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

CJE — Causal Judge Evaluation

60 seconds

Is CJE the right tool?

How it works

Validation on real ground truth

The array API

Documentation

Notes on 0.4.0

Development

Citation

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance