CJE - Causal Judge Evaluation


Turn noisy LLM-judge scores into precise, unbiased estimates of the outcomes you care about.

CJE calibrates judge scores using a small oracle slice (5-10% coverage), then delivers statistically rigorous estimates.

How It Works

CJE follows a simple three-step workflow:

┌─────────────────────────────────┐
│           Data                  │
│  LLM-judge scores +             │
│  oracle slice (5-50%)           │
└─────────────────────────────────┘
              ↓
┌─────────────────────────────────┐
│         Calibrate               │
│  Learn judge → oracle mapping   │
└─────────────────────────────────┘
              ↓
┌─────────────────────────────────┐
│          Estimate               │
│  Estimates with honest          │
│  uncertainty                    │
└─────────────────────────────────┘

Key benefits:

  • Small label budget: 5-10% oracle coverage often sufficient
  • Unbiased estimates: Judge scores (+ optional covariates) mapped to oracle scale
  • Rigorous inference: CIs account for both sampling and calibration uncertainty

See cje/calibration/README.md for technical details.

📊 Performance

Arena Experiment: 5k Real Evaluations - Comprehensive benchmarking on LMSYS Chatbot Arena data:

  • 94% pairwise ranking accuracy with Direct Model + covariates
  • 158× ESS improvement with SIMCal-W vs raw SNIPS
  • Kendall τ = 0.837 vs -0.235 for uncalibrated methods
  • Validates AutoCal-R calibration and doubly-robust estimation on real data

Calibration Methods

CJE provides two calibration modes for mapping judge scores to oracle outcomes (i.e., the KPI you care about):

Monotone

Standard isotonic regression enforces: higher judge score → no worse expected outcome.

Why isotonic? It encodes the right structural prior: it assumes only monotonicity (which you actually believe), preserves oracle KPI levels by construction (mean-preserving by the KKT conditions), and is highly efficient with small label budgets. See technical rationale.

Simple, stable, works well when the judge-oracle relationship is already monotone.
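To build intuition for what the monotone mode does, here is a minimal sketch on synthetic data using scikit-learn's IsotonicRegression. It is illustrative only, not CJE's internal implementation (CJE adds cross-fitting and honest uncertainty on top); the variable names and the 10% oracle slice are assumptions for the example.

# Minimal sketch of monotone judge -> oracle calibration (illustrative, not CJE internals).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
judge = rng.uniform(0, 1, 1000)                       # judge scores for all samples
labeled = rng.choice(1000, size=100, replace=False)   # ~10% oracle slice
oracle = np.clip(judge[labeled] + rng.normal(0, 0.1, 100), 0, 1)

# Fit the monotone judge -> oracle mapping on the oracle slice only.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(judge[labeled], oracle)

# Apply it to every judge score to put estimates on the oracle scale.
calibrated = iso.predict(judge)
print(f"mean judge score: {judge.mean():.3f}  mean calibrated: {calibrated.mean():.3f}")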

Two-Stage (Default with Covariates)

Learns a smooth index g(S, X), ranks it, then applies isotonic regression to the rank. This handles non-monotone patterns and incorporates additional covariates (e.g., response length, domain metadata) while preserving the final monotonicity guarantee.

Figure: Two-stage calibration with covariates. Stage 1 learns flexible relationships between covariates (judge score, response length) and oracle outcomes; Stage 2 enforces monotonicity via isotonic regression. Left/middle panels: partial dependence plots showing how each covariate relates to the oracle score while holding the others at their mean values. Right panel: the final monotone mapping from the Stage-1 risk index to the calibrated oracle score. Full benchmarking results: Arena Experiment. Data from LMSYS Chatbot Arena.
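As a rough sketch of the two-stage idea (illustrative, not the library's code): Stage 1 fits a flexible regression of oracle labels on the judge score and covariates to produce a risk index; Stage 2 applies isotonic regression to that index so the final mapping is monotone. The gradient-boosting Stage-1 model and the synthetic length effect below are assumptions for the example.

# Illustrative two-stage calibration sketch (not CJE internals).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n, n_oracle = 1000, 100
judge = rng.uniform(0, 1, n)
length = rng.uniform(50, 500, n)                      # covariate: response length
X = np.column_stack([judge, length])

idx = rng.choice(n, size=n_oracle, replace=False)     # oracle slice
oracle = np.clip(judge[idx] - 0.0005 * length[idx] + rng.normal(0, 0.1, n_oracle), 0, 1)

# Stage 1: flexible model g(S, X) fit on the oracle slice.
stage1 = GradientBoostingRegressor(max_depth=2, n_estimators=200).fit(X[idx], oracle)
risk_index = stage1.predict(X)

# Stage 2: isotonic regression on the Stage-1 index enforces monotonicity.
stage2 = IsotonicRegression(out_of_bounds="clip").fit(risk_index[idx], oracle)
calibrated = stage2.predict(risk_index)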

When to use two-stage:

  • You have covariates (response length, domain, etc.) → two-stage is default and recommended
  • Judge shows non-monotone empirical E[Oracle|Judge] relationship
  • Regional miscalibration (monotone calibration fits well at low/high scores but poorly in the mid-range)
  • Length bias (judge gives different scores to same-quality responses based on length)

Auto mode:

  • With covariates: Two-stage is automatically used (can incorporate additional features)
  • Judge score only: CJE automatically selects monotone vs two-stage via cross-validation (1-SE rule)
Example usage:

# Default: Judge score only (no covariates, auto-selects monotone/two-stage via CV)
result = analyze_dataset(fresh_draws_dir="responses/")

# Include response_length covariate for two-stage calibration
result = analyze_dataset(
    fresh_draws_dir="responses/",
    include_response_length=True
)

# Add domain as additional covariate (combine with response_length)
result = analyze_dataset(
    fresh_draws_dir="responses/",
    include_response_length=True,
    calibration_covariates=["domain"]
)

# Force a specific mode
result = analyze_dataset(
    fresh_draws_dir="responses/",
    calibration_mode="monotone"  # or "two_stage"
)

Installation

pip install cje-eval

🚀 Try it Now: Interactive Demo (Open in Colab)

Quick Start

Minimal Example

from cje import analyze_dataset

# Compare policies on an eval set
result = analyze_dataset(fresh_draws_dir="responses/")

# Get estimates with confidence intervals
for policy, est, se in zip(
    result.metadata["target_policies"],
    result.estimates,
    result.standard_errors
):
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

Data Format

Directory structure: One JSONL file per policy

responses/
├── model_a_responses.jsonl
└── model_b_responses.jsonl

Minimal record (inside each file):

{"prompt_id": "eval_0", "judge_score": 0.85}
{"prompt_id": "eval_1", "judge_score": 0.72}

With calibration (add oracle labels to 5-10% of samples):

{"prompt_id": "eval_0", "judge_score": 0.85, "oracle_label": 0.86}
{"prompt_id": "eval_1", "judge_score": 0.72}

CJE automatically:

  • Discovers policies from filenames (model_a_responses.jsonl → policy "model_a")
  • Applies AutoCal-R when oracle labels are present
  • Uses cluster-robust SEs for paired comparisons (when same prompts across policies)
  • Returns unbiased estimates with valid 95% CIs
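If you are generating these files yourself, a small script along the following lines (illustrative; the scores, prompt IDs, and oracle labels are placeholders) produces the layout above: one JSONL file per policy, with oracle_label attached to a small slice of prompts.

# Illustrative script for writing CJE-ready JSONL files (placeholder values).
import json
from pathlib import Path

out_dir = Path("responses")
out_dir.mkdir(exist_ok=True)

judge_scores = {
    "model_a": [0.85, 0.72, 0.91],   # one judge score per prompt
    "model_b": [0.80, 0.75, 0.88],
}
oracle_labels = {"eval_0": 0.86}     # oracle labels for a small slice of prompts

for policy, scores in judge_scores.items():
    with open(out_dir / f"{policy}_responses.jsonl", "w") as f:
        for i, score in enumerate(scores):
            record = {"prompt_id": f"eval_{i}", "judge_score": score}
            if record["prompt_id"] in oracle_labels:
                record["oracle_label"] = oracle_labels[record["prompt_id"]]
            f.write(json.dumps(record) + "\n")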

Paired Comparisons

When comparing policies on the same prompts (paired design), CJE automatically uses cluster-robust standard errors:

# Both files must have matching prompt_ids for pairing
result = analyze_dataset(fresh_draws_dir="responses/")

# CJE automatically clusters by prompt for valid inference
if result.metadata.get("prompts_aligned"):
    print("✓ Paired comparison - using cluster-robust SEs")

Why it matters: In a paired design, outcomes are correlated across policies (the same prompt is evaluated by multiple models), so standard errors that ignore this correlation are invalid for policy comparisons. CJE accounts for it automatically by clustering on prompt_id.
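As a toy illustration of the clustering idea (standard paired statistics on synthetic scores, not CJE's estimator): each prompt contributes one paired difference, so the shared per-prompt variation is handled correctly when computing the SE of the A-vs-B contrast.

# Toy sketch of clustering by prompt for a paired policy contrast (not CJE internals).
import numpy as np

rng = np.random.default_rng(0)
n_prompts = 500
prompt_effect = rng.normal(0, 0.2, n_prompts)              # shared difficulty per prompt
score_a = 0.75 + prompt_effect + rng.normal(0, 0.05, n_prompts)
score_b = 0.70 + prompt_effect + rng.normal(0, 0.05, n_prompts)

# Cluster by prompt: one paired difference per prompt handles the
# within-prompt correlation between policies.
diff = score_a - score_b
se = diff.std(ddof=1) / np.sqrt(n_prompts)
print(f"A - B = {diff.mean():.3f} ± {1.96 * se:.3f}")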

Beyond Direct Mode

CJE also supports IPS (counterfactual inference from logs) and DR (doubly robust with fresh draws). These require log probabilities from your models.

# IPS: Estimate "what if we deployed policy X?" from existing logs
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR: Combine logged data + fresh draws for maximum accuracy
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/"
)

For IPS/DR data formats and API details: Run help(analyze_dataset) or see cje/interface/ module docs.

Visualization

CJE provides diagnostic plots for understanding and validating results:

from cje import analyze_dataset, plot_policy_estimates

# Run analysis
result = analyze_dataset(fresh_draws_dir="responses/")

# Quick plot with convenience method
result.plot_estimates(save_path="estimates.png")

# Or use visualization functions directly for more control
plot_policy_estimates(
    estimates={"policy_a": 0.75, "policy_b": 0.68},
    standard_errors={"policy_a": 0.02, "policy_b": 0.03},
    oracle_values={"policy_a": 0.74, "policy_b": 0.69}  # Optional
)

Available visualizations:

  • plot_policy_estimates - Forest plots with confidence intervals
  • plot_calibration_comparison - Judge→oracle calibration curves
  • plot_weight_dashboard_summary - Weight diagnostics for IPS/DR
  • plot_weight_dashboard_detailed - Per-policy weight analysis
  • plot_dr_dashboard - Doubly robust diagnostics

Jupyter notebooks: Results automatically display as formatted tables when evaluated in a cell.

See cje/visualization/README.md for complete guide.

Documentation

📚 Getting Started

🔧 For Engineers

Development

git clone https://github.com/cimo-labs/cje.git
cd cje
poetry install
make test

Support

License

MIT - See LICENSE for details.
