# CJE - Causal Judge Evaluation
Turn noisy LLM-judge scores into precise, unbiased estimates of the outcomes you care about.
CJE calibrates judge scores using a small oracle slice (5-10% coverage), then delivers statistically rigorous estimates.
## How It Works
CJE follows a simple three-step workflow:
```
┌─────────────────────────────────┐
│              Data               │
│       LLM-judge scores +        │
│      oracle slice (5-50%)       │
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│            Calibrate            │
│  Learn judge → oracle mapping   │
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│            Estimate             │
│      Estimates with honest      │
│           uncertainty           │
└─────────────────────────────────┘
```
Key benefits:
- Small label budget: 5-10% oracle coverage often sufficient
- Unbiased estimates: Judge scores (+ optional covariates) mapped to oracle scale
- Rigorous inference: CIs account for both sampling and calibration uncertainty
See `cje/calibration/README.md` for technical details.
## 📊 Performance
**Arena Experiment: 5k Real Evaluations** - Comprehensive benchmarking on Chatbot Arena data:
- 94% pairwise ranking accuracy with Direct Model + covariates
- 158× ESS improvement with SIMCal-W vs raw SNIPS
- Kendall τ = 0.837 vs -0.235 for uncalibrated methods
- Validates AutoCal-R calibration and doubly-robust estimation on real data
Reproduction code: Full experimental pipeline available at cje-arena-experiments
## Calibration Methods
CJE provides two calibration modes for mapping judge scores to oracle outcomes (i.e., the KPI you care about):
### Monotone
Standard isotonic regression enforces: higher judge score → no worse expected outcome.
Why isotonic? It's the right structural prior: it assumes only monotonicity (which you actually believe), preserves oracle KPI levels by construction (mean-preserving by the KKT conditions), and is highly efficient with small label budgets. See the technical rationale.
Simple, stable, works well when the judge-oracle relationship is already monotone.
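For intuition, here is a minimal sketch of monotone calibration using scikit-learn's `IsotonicRegression` on synthetic data. It illustrates the idea only and is not CJE's internal implementation:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic example: judge scores for 1,000 samples, oracle labels for a 10% slice.
rng = np.random.default_rng(0)
judge = rng.uniform(0, 1, 1000)
labeled = rng.choice(1000, size=100, replace=False)           # 10% oracle coverage
oracle = np.clip(judge[labeled] + rng.normal(0, 0.1, 100), 0, 1)

# Learn a monotone judge -> oracle mapping from the labeled slice...
iso = IsotonicRegression(out_of_bounds="clip").fit(judge[labeled], oracle)

# ...then score every sample on the oracle scale.
calibrated = iso.predict(judge)
print(f"Mean calibrated outcome: {calibrated.mean():.3f}")
```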
### Two-Stage (Default with Covariates)
Learns a smooth transformation g(S, X) → rank → isotonic. This handles non-monotone patterns and incorporates additional covariates (e.g., response length, domain metadata) while maintaining the final monotonicity guarantee.
*Figure: Two-stage calibration learns flexible relationships between covariates (judge score, response length) and oracle outcomes in Stage 1, then enforces monotonicity via isotonic regression in Stage 2. Left/middle panels: partial dependence plots showing how each covariate relates to the oracle score while the others are held at their means. Right panel: the final monotone mapping from the Stage-1 risk index to the calibrated oracle score. Full benchmarking results: Arena Experiment. Data from LMSYS Chatbot Arena.*
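A minimal sketch of the two-stage idea on synthetic data. This is illustrative only: the flexible Stage-1 model here is a scikit-learn gradient-boosted regressor, which may differ from CJE's actual choice of learner:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

# Synthetic oracle slice with one extra covariate (response length).
rng = np.random.default_rng(0)
S = rng.uniform(0, 1, 200)                                     # judge scores
X = rng.integers(20, 500, 200).astype(float)                   # response lengths
y = np.clip(S - 0.0005 * X + rng.normal(0, 0.05, 200), 0, 1)   # oracle labels

# Stage 1: learn a flexible (possibly non-monotone) risk index g(S, X).
g = GradientBoostingRegressor(random_state=0).fit(np.column_stack([S, X]), y)
index = g.predict(np.column_stack([S, X]))

# Stage 2: isotonic regression on the index enforces final monotonicity.
iso = IsotonicRegression(out_of_bounds="clip").fit(index, y)
calibrated = iso.predict(index)
```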
When to use two-stage:
- You have covariates (response length, domain, etc.) → two-stage is default and recommended
- Judge shows a non-monotone empirical E[Oracle|Judge] relationship (a quick check is sketched after this list)
- Regional miscalibration (monotone works well at low/high but poorly at mid-range)
- Length bias (judge gives different scores to same-quality responses based on length)
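One rough way to eyeball the non-monotonicity condition above on your own oracle slice (a diagnostic sketch with synthetic data, not part of the CJE API):

```python
import numpy as np

# Bin judge scores on the oracle slice and inspect the mean oracle label
# per bin. A clearly non-monotone pattern favors two-stage calibration.
rng = np.random.default_rng(0)
judge = rng.uniform(0, 1, 500)
oracle = np.clip(judge + rng.normal(0, 0.1, 500), 0, 1)

edges = np.quantile(judge, np.linspace(0, 1, 11))              # decile edges
bins = np.clip(np.digitize(judge, edges) - 1, 0, 9)
means = [oracle[bins == b].mean() for b in range(10)]
print(np.round(means, 3))  # roughly increasing -> monotone likely suffices
```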
Auto mode:
- With covariates: Two-stage is automatically used (can incorporate additional features)
- Judge score only: CJE automatically selects monotone vs two-stage via cross-validation (1-SE rule)
```python
# Default: judge score only (no covariates; auto-selects monotone/two-stage via CV)
result = analyze_dataset(fresh_draws_dir="responses/")

# Include response_length covariate for two-stage calibration
result = analyze_dataset(
    fresh_draws_dir="responses/",
    include_response_length=True,
)

# Add domain as an additional covariate (combined with response_length)
result = analyze_dataset(
    fresh_draws_dir="responses/",
    include_response_length=True,
    calibration_covariates=["domain"],
)

# Force a specific mode
result = analyze_dataset(
    fresh_draws_dir="responses/",
    calibration_mode="monotone",  # or "two_stage"
)
```
## Installation
```bash
pip install cje-eval
```
**🚀 Try it Now** - Interactive Demo
## Quick Start

### Minimal Example
```python
from cje import analyze_dataset

# Compare policies on an eval set
result = analyze_dataset(fresh_draws_dir="responses/")

# Get estimates with confidence intervals
for policy, est, se in zip(
    result.metadata["target_policies"],
    result.estimates,
    result.standard_errors,
):
    print(f"{policy}: {est:.3f} ± {1.96 * se:.3f}")
```
### Data Format
Directory structure: one JSONL file per policy

```
responses/
├── model_a_responses.jsonl
└── model_b_responses.jsonl
```
Minimal record (inside each file):

```jsonl
{"prompt_id": "eval_0", "judge_score": 0.85}
{"prompt_id": "eval_1", "judge_score": 0.72}
```
With calibration (add oracle labels to 5-10% of samples):

```jsonl
{"prompt_id": "eval_0", "judge_score": 0.85, "oracle_label": 0.86}
{"prompt_id": "eval_1", "judge_score": 0.72}
```
CJE automatically:
- Discovers policies from filenames (`model_a_responses.jsonl` → policy `"model_a"`)
- Applies AutoCal-R when oracle labels are present
- Uses cluster-robust SEs for paired comparisons (when same prompts across policies)
- Returns unbiased estimates with valid 95% CIs
### Paired Comparisons
When comparing policies on the same prompts (paired design), CJE automatically uses cluster-robust standard errors:
```python
# Both files must have matching prompt_ids for pairing
result = analyze_dataset(fresh_draws_dir="responses/")

# CJE automatically clusters by prompt for valid inference
if result.metadata.get("prompts_aligned"):
    print("✓ Paired comparison - using cluster-robust SEs")
```
**Why it matters:** Paired designs have correlated outcomes across policies (the same prompt is evaluated by multiple models), so standard SEs would understate uncertainty. CJE accounts for this automatically by clustering on `prompt_id`.
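A toy illustration of why pairing helps (pure NumPy, independent of the CJE API): differencing within prompts removes the shared per-prompt variation before the SE is computed.

```python
import numpy as np

# Same four prompts scored under two policies (a paired design).
scores_a = np.array([0.90, 0.70, 0.80, 0.60])
scores_b = np.array([0.80, 0.70, 0.60, 0.50])

# Cluster by prompt: one difference per prompt, then a standard SE on those.
diff = scores_a - scores_b
se = diff.std(ddof=1) / np.sqrt(len(diff))
print(f"A - B: {diff.mean():.3f} ± {1.96 * se:.3f}")
```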
## Beyond Direct Mode
CJE also supports IPS (counterfactual inference from logs) and DR (doubly robust with fresh draws). These require log probabilities from your models.
```python
# IPS: estimate "what if we deployed policy X?" from existing logs
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR: combine logged data + fresh draws for maximum accuracy
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
)
```
For IPS/DR data formats and API details, run `help(analyze_dataset)` or see the `cje/interface/` module docs.
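For intuition about what the log probabilities buy you, here is a toy self-normalized importance sampling (SNIPS) estimate on synthetic numbers. This is illustrative only; CJE's estimators add calibration and diagnostics on top of this idea:

```python
import numpy as np

# Per-response log-probs under the logging policy and a target policy,
# plus calibrated rewards for the logged responses (all synthetic).
logp_logging = np.array([-3.9, -3.5, -4.2])
logp_target = np.array([-4.1, -3.2, -5.0])
rewards = np.array([0.8, 0.6, 0.9])

w = np.exp(logp_target - logp_logging)      # importance weights
snips = np.sum(w * rewards) / np.sum(w)     # self-normalized IPS estimate
print(f"Estimated value of target policy: {snips:.3f}")
```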
## Visualization
CJE provides diagnostic plots for understanding and validating results:
```python
from cje import analyze_dataset, plot_policy_estimates

# Run analysis
result = analyze_dataset(fresh_draws_dir="responses/")

# Quick plot with the convenience method
result.plot_estimates(save_path="estimates.png")

# Or use visualization functions directly for more control
plot_policy_estimates(
    estimates={"policy_a": 0.75, "policy_b": 0.68},
    standard_errors={"policy_a": 0.02, "policy_b": 0.03},
    oracle_values={"policy_a": 0.74, "policy_b": 0.69},  # optional
)
```
Available visualizations:
- `plot_policy_estimates` - Forest plots with confidence intervals
- `plot_calibration_comparison` - Judge → oracle calibration curves
- `plot_weight_dashboard_summary` - Weight diagnostics for IPS/DR
- `plot_weight_dashboard_detailed` - Per-policy weight analysis
- `plot_dr_dashboard` - Doubly robust diagnostics
Jupyter notebooks: Results automatically display as formatted tables when evaluated in a cell.
See `cje/visualization/README.md` for the complete guide.
## Documentation
### 📚 Getting Started
- Interactive Demo - Try in your browser
- Examples - Working code samples
### 🔧 For Engineers
- Calibration Methods - AutoCal-R, isotonic regression, two-stage fallback
- Diagnostics System - Uncertainty quantification, OUA, transportability tests
- Estimators - Direct, IPS, DR implementations
- Interface/API - `analyze_dataset` implementation and mode selection
## Development
```bash
git clone https://github.com/cimo-labs/cje.git
cd cje
poetry install
make test
```
## Support
- 🐛 Issues
- 💬 Discussions
## License
MIT - See LICENSE for details.