
CJE - Causal Judge Evaluation


Turn noisy LLM-judge scores into precise, unbiased estimates of the outcomes you care about.

CJE calibrates judge scores using a small oracle slice (5-10% coverage), then delivers statistically rigorous estimates.

How It Works

CJE follows a simple three-step workflow:

┌─────────────────────────────────┐
│           Data                  │
│  LLM-judge scores +             │
│  oracle slice (5-50%)           │
└─────────────────────────────────┘
              ↓
┌─────────────────────────────────┐
│         Calibrate               │
│  Learn judge → oracle mapping   │
└─────────────────────────────────┘
              ↓
┌─────────────────────────────────┐
│          Estimate               │
│  Estimates with honest          │
│  uncertainty                    │
└─────────────────────────────────┘

Key benefits:

  • Small label budget: 5-10% oracle coverage often sufficient
  • Unbiased estimates: Judge scores (+ optional covariates) mapped to oracle scale
  • Rigorous inference: CIs account for both sampling and calibration uncertainty

See cje/calibration/README.md for technical details.

📊 Performance

Arena Experiment: 5k Real Evaluations - Comprehensive benchmarking on Chatbot Arena data:

  • 94% pairwise ranking accuracy with Direct Model + covariates
  • 158× ESS improvement with SIMCal-W vs raw SNIPS
  • Kendall τ = 0.837 vs -0.235 for uncalibrated methods
  • Validates AutoCal-R calibration and doubly-robust estimation on real data

Reproduction code: Full experimental pipeline available at cje-arena-experiments

Calibration Methods

CJE provides two calibration modes for mapping judge scores to oracle outcomes (i.e., the KPI you care about):

Monotone

Standard isotonic regression enforces: higher judge score → no worse expected outcome.

Why isotonic? It's the right structural prior—assumes only monotonicity (which you actually believe), preserves oracle KPI levels by construction (mean-preserving by KKT conditions), and is highly efficient with small label budgets. See technical rationale.

Simple, stable, works well when the judge-oracle relationship is already monotone.
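The core idea of the monotone mode can be sketched with scikit-learn's IsotonicRegression on synthetic data (an illustration of the technique, not CJE's internal code): fit on the small oracle slice, then map every judge score onto the oracle scale.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic data: judge scores on [0, 1] with a noisy but monotone
# relationship to the oracle outcome.
judge = rng.uniform(0, 1, 1000)
oracle = np.clip(judge + rng.normal(0, 0.1, 1000), 0, 1)

# Fit only on the labeled oracle slice (here, 10% of samples).
labeled = rng.choice(1000, size=100, replace=False)
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge[labeled], oracle[labeled])

# Map all judge scores onto the oracle scale.
calibrated = iso.predict(judge)
```

By construction the calibrated scores never decrease as the judge score increases, which is the only structural assumption the method makes.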

Two-Stage (Default with Covariates)

Learns smooth transformation g(S, X) → rank → isotonic. Handles non-monotone patterns and incorporates additional covariates (e.g., response length, domain metadata) while maintaining final monotonicity guarantee.

[Figure: Two-Stage Calibration with Covariates]

Two-stage calibration learns flexible relationships between covariates (judge score, response length) and oracle outcomes in Stage 1, then enforces monotonicity via isotonic regression in Stage 2. Left/Middle: Partial dependence plots show how each covariate relates to the oracle score while holding the others at their mean values. Right: The final monotone mapping from the Stage-1 risk index to the calibrated oracle score. Full benchmarking results: Arena Experiment. Data from LMSYS Chatbot Arena.

When to use two-stage:

  • You have covariates (response length, domain, etc.) → two-stage is default and recommended
  • Judge shows non-monotone empirical E[Oracle|Judge] relationship
  • Regional miscalibration (monotone works well at low/high but poorly at mid-range)
  • Length bias (judge gives different scores to same-quality responses based on length)
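The two-stage idea can be sketched as follows (synthetic data; an illustration of the technique, not CJE's internal implementation): Stage 1 fits a flexible model g(S, X) on the oracle slice, Stage 2 runs isotonic regression from the resulting risk index to the oracle, so the final mapping is guaranteed monotone in the index.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
n = 1000
judge = rng.uniform(0, 1, n)
length = rng.uniform(0, 1, n)  # e.g. normalized response length
# Synthetic length bias: the judge over-rewards long responses.
oracle = np.clip(judge - 0.3 * length + rng.normal(0, 0.05, n), 0, 1)

labeled = rng.choice(n, size=100, replace=False)  # 10% oracle slice
X = np.column_stack([judge, length])

# Stage 1: flexible g(S, X) learned on the labeled slice.
g = GradientBoostingRegressor(random_state=0).fit(X[labeled], oracle[labeled])
risk = g.predict(X)

# Stage 2: isotonic regression from the risk index to the oracle,
# enforcing the final monotonicity guarantee.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(risk[labeled], oracle[labeled])
calibrated = iso.predict(risk)
```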

Auto mode:

  • With covariates: Two-stage is automatically used (can incorporate additional features)
  • Judge score only: CJE automatically selects monotone vs two-stage via cross-validation (1-SE rule)

# Default: Judge score only (no covariates, auto-selects monotone/two-stage via CV)
result = analyze_dataset(fresh_draws_dir="responses/")

# Include response_length covariate for two-stage calibration
result = analyze_dataset(
    fresh_draws_dir="responses/",
    include_response_length=True
)

# Add domain as additional covariate (combine with response_length)
result = analyze_dataset(
    fresh_draws_dir="responses/",
    include_response_length=True,
    calibration_covariates=["domain"]
)

# Force a specific mode
result = analyze_dataset(
    fresh_draws_dir="responses/",
    calibration_mode="monotone"  # or "two_stage"
)
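The cross-validation selection can be sketched with the 1-SE rule: prefer the simpler candidate whenever its mean CV error falls within one standard error of the best candidate's mean. The fold errors below are hypothetical, and `one_se_select` is an illustration, not CJE's actual selection code.

```python
import numpy as np

def one_se_select(fold_errors):
    """Return the simplest model whose mean CV error is within one
    standard error of the best model's mean error.

    fold_errors: dict of model name -> per-fold errors, ordered from
    simplest to most complex (relies on dict insertion order).
    """
    stats = {
        name: (np.mean(errs), np.std(errs, ddof=1) / np.sqrt(len(errs)))
        for name, errs in fold_errors.items()
    }
    best_mean, best_se = min(stats.values(), key=lambda s: s[0])
    threshold = best_mean + best_se
    for name, (mean, _) in stats.items():  # simplest first
        if mean <= threshold:
            return name

# Hypothetical per-fold CV errors: two-stage is slightly better on
# average, but monotone is within one SE, so the simpler model wins.
choice = one_se_select({
    "monotone": [0.051, 0.049, 0.050, 0.052, 0.048],
    "two_stage": [0.050, 0.046, 0.052, 0.047, 0.050],
})
print(choice)  # monotone
```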

Installation

pip install cje-eval

🚀 Try it Now - Interactive Demo

Open In Colab

Quick Start

Minimal Example

from cje import analyze_dataset

# Compare policies on an eval set
result = analyze_dataset(fresh_draws_dir="responses/")

# Get estimates with confidence intervals
for policy, est, se in zip(
    result.metadata["target_policies"],
    result.estimates,
    result.standard_errors
):
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

Data Format

Directory structure: One JSONL file per policy

responses/
├── model_a_responses.jsonl
└── model_b_responses.jsonl

Minimal record (inside each file):

{"prompt_id": "eval_0", "judge_score": 0.85}
{"prompt_id": "eval_1", "judge_score": 0.72}

With calibration (add oracle labels to 5-10% of samples):

{"prompt_id": "eval_0", "judge_score": 0.85, "oracle_label": 0.86}
{"prompt_id": "eval_1", "judge_score": 0.72}
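Files in this format can be generated with a few lines of Python. `write_responses` below is a hypothetical helper (not part of CJE) that attaches an oracle label to every tenth record, forming a ~10% oracle slice:

```python
import json
import tempfile
from pathlib import Path

def write_responses(path, records, oracle_every=10):
    """Write one JSONL record per line; keep oracle_label only on every
    oracle_every-th record to form the oracle slice."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for i, rec in enumerate(records):
            row = {"prompt_id": rec["prompt_id"],
                   "judge_score": rec["judge_score"]}
            if i % oracle_every == 0:
                row["oracle_label"] = rec["oracle_label"]
            f.write(json.dumps(row) + "\n")

# Hypothetical scores; in practice these come from your judge and oracle.
records = [
    {"prompt_id": f"eval_{i}", "judge_score": 0.5, "oracle_label": 0.5}
    for i in range(20)
]
out_dir = Path(tempfile.mkdtemp())  # stand-in for "responses/"
write_responses(out_dir / "model_a_responses.jsonl", records)
```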

CJE automatically:

  • Discovers policies from filenames (model_a_responses.jsonl → policy "model_a")
  • Applies AutoCal-R when oracle labels are present
  • Uses cluster-robust SEs for paired comparisons (when same prompts across policies)
  • Returns unbiased estimates with valid 95% CIs
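The filename-discovery rule can be sketched as follows (`discover_policies` is a hypothetical helper for illustration, not CJE's API):

```python
import tempfile
from pathlib import Path

def discover_policies(directory):
    """Infer policy names from '<policy>_responses.jsonl' filenames,
    mirroring the discovery rule described above."""
    return sorted(
        p.name[: -len("_responses.jsonl")]
        for p in Path(directory).glob("*_responses.jsonl")
    )

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("model_a_responses.jsonl", "model_b_responses.jsonl"):
        (Path(d) / name).touch()
    print(discover_policies(d))  # ['model_a', 'model_b']
```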

Paired Comparisons

When comparing policies on the same prompts (paired design), CJE automatically uses cluster-robust standard errors:

# Both files must have matching prompt_ids for pairing
result = analyze_dataset(fresh_draws_dir="responses/")

# CJE automatically clusters by prompt for valid inference
if result.metadata.get("prompts_aligned"):
    print("✓ Paired comparison - using cluster-robust SEs")

Why it matters: Paired designs have correlated outcomes across policies (the same prompt is evaluated by multiple models). Standard errors that ignore this correlation are invalid. CJE accounts for it automatically by clustering on prompt_id.
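A minimal sketch of why pairing changes the standard error (synthetic data; CJE's cluster-robust machinery is more general than this): when both policies share a per-prompt difficulty component, taking per-prompt differences removes it, and the paired SE of the policy gap differs from the naive independent-samples SE.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
prompt_difficulty = rng.normal(0, 0.15, n)  # shared per-prompt effect
score_a = 0.70 + prompt_difficulty + rng.normal(0, 0.05, n)
score_b = 0.65 + prompt_difficulty + rng.normal(0, 0.05, n)

# Naive SE treats the two policies' scores as independent samples.
naive_se = np.sqrt(score_a.var(ddof=1) / n + score_b.var(ddof=1) / n)

# Clustering by prompt: per-prompt paired differences first, which
# removes the shared difficulty component from the comparison.
diff = score_a - score_b
paired_se = diff.std(ddof=1) / np.sqrt(n)

print(paired_se < naive_se)  # True here: pairing removes the shared effect
```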

Beyond Direct Mode

CJE also supports IPS (counterfactual inference from logs) and DR (doubly robust with fresh draws). These require log probabilities from your models.

# IPS: Estimate "what if we deployed policy X?" from existing logs
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR: Combine logged data + fresh draws for maximum accuracy
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/"
)

For IPS/DR data formats and API details: Run help(analyze_dataset) or see cje/interface/ module docs.
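For intuition, here is a minimal sketch of the SNIPS estimator and the ESS diagnostic referenced in the Performance section, on synthetic log-probabilities (an illustration of the textbook estimators, not CJE's implementation):

```python
import numpy as np

def snips(rewards, logp_target, logp_logging):
    """Self-normalized IPS: importance-weight logged rewards by the
    target/logging probability ratio, then normalize by the weight sum."""
    w = np.exp(logp_target - logp_logging)
    return float(np.sum(w * rewards) / np.sum(w))

def effective_sample_size(logp_target, logp_logging):
    """ESS = (sum w)^2 / sum w^2: how many 'effective' samples survive
    the importance weighting (the quantity SIMCal-W improves)."""
    w = np.exp(logp_target - logp_logging)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

# Toy logged data with hypothetical log-probabilities.
rng = np.random.default_rng(3)
rewards = rng.uniform(0, 1, 1000)
logp_log = rng.normal(-10, 1, 1000)
logp_tgt = logp_log + rng.normal(0, 0.5, 1000)  # moderately shifted policy
print(round(snips(rewards, logp_tgt, logp_log), 3))
```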

Visualization

CJE provides diagnostic plots for understanding and validating results:

from cje import analyze_dataset, plot_policy_estimates

# Run analysis
result = analyze_dataset(fresh_draws_dir="responses/")

# Quick plot with convenience method
result.plot_estimates(save_path="estimates.png")

# Or use visualization functions directly for more control
plot_policy_estimates(
    estimates={"policy_a": 0.75, "policy_b": 0.68},
    standard_errors={"policy_a": 0.02, "policy_b": 0.03},
    oracle_values={"policy_a": 0.74, "policy_b": 0.69}  # Optional
)

Available visualizations:

  • plot_policy_estimates - Forest plots with confidence intervals
  • plot_calibration_comparison - Judge→oracle calibration curves
  • plot_weight_dashboard_summary - Weight diagnostics for IPS/DR
  • plot_weight_dashboard_detailed - Per-policy weight analysis
  • plot_dr_dashboard - Doubly robust diagnostics

Jupyter notebooks: Results automatically display as formatted tables when evaluated in a cell.

See cje/visualization/README.md for complete guide.

Documentation

📚 Getting Started

🔧 For Engineers

Development

git clone https://github.com/cimo-labs/cje.git
cd cje
poetry install
make test

Support

License

MIT - See LICENSE for details.

