CJE - Causal Judge Evaluation
Unbiased LLM evaluation framework
What if your AI evals looked like A/B tests with reliable confidence intervals and causal guarantees?
CJE makes it possible. Get unbiased estimates of how your new model will perform before deployment, with the statistical rigor you'd expect from production experimentation.
Why CJE?
🎯 Problem: Your LLM-judge scores are noisy, biased, and untrustworthy.
✅ Solution: CJE uses AutoCal-R (Automatic Calibration for Rewards) and causal inference to debias them, giving you reliable estimates with confidence intervals without compromising on judge flexibility.
Installation
pip install cje-eval
For development:
git clone https://github.com/fondutech/causal-judge-evaluation.git
cd causal-judge-evaluation
poetry install # or pip install -e .
Quick Start
CJE automatically selects the right mode based on your data:
from cje import analyze_dataset
# Mode 1: Direct (simplest - just fresh draws)
result = analyze_dataset(fresh_draws_dir="responses/")
print(f"Policy value: {result.estimates[0]:.3f} ± {result.standard_errors[0]:.3f}")
# Mode 2: IPS (logged data with logprobs)
result = analyze_dataset(logged_data_path="logs.jsonl") # Auto-selects IPS mode
# Mode 3: DR (logged data + fresh draws - most accurate)
result = analyze_dataset(
logged_data_path="logs.jsonl",
fresh_draws_dir="responses/" # Auto-selects DR mode
)
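The estimates and standard_errors arrays above can be turned into Wald-style confidence intervals directly. A minimal sketch with made-up numbers (in practice, read them from result.estimates and result.standard_errors):

```python
# Sketch: turning a point estimate and standard error into a 95% CI.
# The numbers below are made up; in practice they come from
# result.estimates[0] and result.standard_errors[0].
estimate = 0.712
standard_error = 0.014

z = 1.96  # normal critical value for a 95% interval
lower, upper = estimate - z * standard_error, estimate + z * standard_error
print(f"Policy value: {estimate:.3f} (95% CI: [{lower:.3f}, {upper:.3f}])")
```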
CLI usage:
# Direct mode (fresh draws only)
python -m cje analyze --fresh-draws-dir responses/
# IPS mode (logged data) - auto-selects calibrated-ips
python -m cje analyze logs.jsonl
# DR mode (both) - auto-selects stacked-dr
python -m cje analyze logs.jsonl --fresh-draws-dir responses/
Three Analysis Modes
CJE automatically selects the best mode based on your data. Mode selection follows a simple 4-rule system based on logprob coverage (fraction of samples with complete logprobs) and data sources:
- fresh_draws + coverage ≥50% → DR mode (doubly robust - most accurate)
- no fresh_draws + coverage ≥50% → IPS mode (importance sampling - counterfactual)
- fresh_draws + coverage <50% → Direct mode (on-policy comparison)
- no fresh_draws + coverage <50% → Error (insufficient data)
See Interface README for details.
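The four rules above can be sketched as a small helper. This is a hypothetical illustration of the decision table, not CJE's actual selection code (which presumably lives behind the interface module):

```python
# Hypothetical sketch of the 4-rule mode selection described above.
# logprob_coverage is the fraction of samples with complete logprobs.
def select_mode(has_fresh_draws: bool, logprob_coverage: float) -> str:
    if logprob_coverage >= 0.5:
        return "DR" if has_fresh_draws else "IPS"
    if has_fresh_draws:
        return "Direct"
    raise ValueError("insufficient data: no fresh draws and <50% logprob coverage")

print(select_mode(True, 0.9))
```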
1. Direct Mode (Fresh draws only)
- Use when: You have responses from target policies, no logprobs needed
- Estimand: "Which policy performs best on this eval set?" (on-policy comparison)
- Data needed: Fresh responses with judge scores
- Example: Comparing 3 model variants on 1000 prompts
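The flavor of the Direct-mode estimand is a per-policy mean judge score on a shared eval set. A toy sketch with made-up scores (the real Direct mode also applies AutoCal-R calibration when oracle labels are available, and its standard errors may be computed differently):

```python
import statistics

# Toy on-policy comparison: mean judge score per policy on the same
# prompts, with a naive standard error. Scores are made up.
scores = {
    "variant_a": [0.8, 0.7, 0.9, 0.75],
    "variant_b": [0.6, 0.65, 0.7, 0.6],
}
for policy, s in scores.items():
    mean = statistics.mean(s)
    se = statistics.stdev(s) / len(s) ** 0.5
    print(f"{policy}: {mean:.3f} ± {se:.3f}")
```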
2. IPS Mode (Logged data with logprobs)
- Use when: You have logged data with base and target policy logprobs, no fresh draws
- Estimand: "What would happen if we deployed this policy?" (counterfactual)
- Data needed: Logged responses with base/target logprobs
- Example: Evaluating a new model on production traffic logs
3. DR Mode (Both logged data and fresh draws)
- Use when: You want maximum accuracy and have both
- Estimand: Counterfactual deployment value (most accurate)
- Data needed: Both logged data and fresh draws
- Default estimator: stacked-dr (an ensemble of DR-CPO, TMLE, MRDR, OC-DR-CPO, and TR-CPO-E)
- Example: High-stakes A/B decision for model deployment
Note: The paper's "Calibrated DR" refers to DR mode, which defaults to stacked-dr - an optimal convex combination of multiple DR estimators for robustness and tighter intervals.
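One way to read "optimal convex combination": given the component estimators' values and the covariance of their errors, choose simplex weights that minimize the variance of the combined estimate. A toy sketch with made-up numbers, not CJE's actual stacking code (which may estimate the covariance from influence functions and enforce nonnegativity differently):

```python
import numpy as np

# Toy minimum-variance convex combination of K estimators.
# For error covariance Sigma, the weights minimizing Var(w @ est)
# subject to sum(w) = 1 are Sigma^{-1} 1 / (1' Sigma^{-1} 1)
# (nonnegativity constraint omitted; numbers are made up).
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
ones = np.ones(3)
w = np.linalg.solve(sigma, ones)
w /= w.sum()

estimates = np.array([0.71, 0.69, 0.74])  # hypothetical per-estimator values
combined = float(w @ estimates)
print(w.round(3), round(combined, 3))
```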
When to Use CJE
✅ Perfect for:
- Comparing LLM policies before deployment
- Evaluating multiple model variants
- Reusing existing data for new evaluations
- High-stakes decisions needing confidence intervals
❌ Not for:
- Online learning (CJE is offline)
- Real-time scoring (CJE is batch)
- Very small samples (<100 examples)
Data Requirements
Requirements depend on which mode you're using:
For Direct Mode (fresh draws only):
{
"prompt_id": "arena_0",
"prompt": "What is 2+2?",
"response": "4",
"policy": "clone",
"judge_score": 0.85, // Required: judge evaluation
"oracle_label": 0.86 // Optional: ground truth (50% coverage enables calibration)
}
AutoCal-R: If 50%+ of fresh draws have oracle_label, Direct mode automatically applies AutoCal-R to learn judge→oracle calibration and uses calibrated rewards.
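A monotone judge→oracle map is the standard shape for this kind of calibration. The sketch below uses pool-adjacent-violators (isotonic regression) as a stand-in; AutoCal-R's actual procedure is not documented here and may differ, and the data are made up:

```python
import bisect

# Hypothetical stand-in for the judge→oracle calibration step: fit a
# monotone step function from judge scores to oracle labels with
# pool-adjacent-violators, then apply it to all judge scores.
def fit_isotonic(scores, labels):
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [pooled mean, weight, point count]
    for _, y in pairs:
        blocks.append([y, 1.0, 1])
        # Pool adjacent blocks whenever monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, n1 + n2])
    xs = [x for x, _ in pairs]
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)

    def predict(score):
        # Step-function lookup: fitted value of the rightmost point <= score.
        i = bisect.bisect_right(xs, score) - 1
        return fitted[max(i, 0)]

    return predict

# Toy labeled slice (judge score, oracle label):
calibrate = fit_isotonic([0.2, 0.4, 0.6, 0.8], [0.30, 0.25, 0.70, 0.90])
print(calibrate(0.3))  # first two points pool to 0.275
```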
For IPS/DR Modes (logged data):
{
"prompt": "What is 2+2?",
"response": "4",
"base_policy_logprob": -14.7, // Required: log P(response|prompt) for logging policy
"target_policy_logprobs": { // Required: same for policies to evaluate
"clone": -14.7,
"parallel_universe_prompt": -18.3,
"unhelpful": -42.1
},
"judge_score": 0.85, // Required: judge evaluation
"oracle_label": 0.86 // Optional: ground truth (5-10% is enough for calibration)
}
Key difference: Direct mode doesn't need logprobs! Just responses from each policy with judge scores (and optionally oracle labels for AutoCal-R calibration).
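The importance weight implied by these fields is exp(target_logprob - base_logprob). A toy self-normalized IPS estimate over made-up logged samples (illustrative only; CJE's calibrated-ips additionally stabilizes the weights with SIMCal):

```python
import math

# Toy self-normalized IPS: weight each logged sample by
# exp(target_logprob - base_logprob), then take a weighted mean of rewards.
# All numbers are made up.
logged = [
    {"base_lp": -14.7, "target_lp": -14.7, "reward": 0.85},
    {"base_lp": -20.1, "target_lp": -18.3, "reward": 0.60},
    {"base_lp": -12.0, "target_lp": -13.5, "reward": 0.90},
]

weights = [math.exp(s["target_lp"] - s["base_lp"]) for s in logged]
estimate = sum(w * s["reward"] for w, s in zip(weights, logged)) / sum(weights)
print(round(estimate, 3))
```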
Working example: See examples/arena_sample/ for complete dataset examples with logged data and fresh draws.
Generating Log Probabilities
CJE includes built-in Fireworks API integration for computing teacher-forced log probabilities:
from cje.teacher_forcing import compute_teacher_forced_logprob
# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
prompt="What is 2+2?",
response="4",
model="accounts/fireworks/models/llama-v3p2-3b-instruct"
)
if result.status == "success":
logprob = result.value # e.g., -2.3
This handles chat templates, tokenization, and API calls automatically. See cje/teacher_forcing/ for details.
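What "teacher-forced log P(response|prompt)" means: the sum of per-token conditional log-probabilities of the response tokens, with the true preceding tokens fed in at each step. A toy illustration with made-up token probabilities:

```python
import math

# Illustration of teacher-forced scoring: log P(response|prompt) is the
# sum of log P(token_i | prompt, tokens<i) over the response tokens.
# The per-token probabilities below are made up.
token_probs = [0.9, 0.5, 0.8]
logprob = sum(math.log(p) for p in token_probs)
print(round(logprob, 3))
```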
Choosing an Estimator
Most users should use estimator="auto" (the default) - CJE will automatically select:
- direct when you only provide fresh_draws_dir
- calibrated-ips when you only provide logged_data_path
- stacked-dr when you provide both
You can override automatic selection by specifying an estimator explicitly:
# Use IPS even with fresh draws available
analyze_dataset("logs.jsonl", fresh_draws_dir="responses/", estimator="calibrated-ips")
# Use Direct mode for on-policy comparison instead of DR
analyze_dataset("logs.jsonl", fresh_draws_dir="responses/", estimator="direct")
Manual estimator options:
- direct: On-policy comparison (Direct mode - no counterfactual inference)
- calibrated-ips: IPS with SIMCal weight stabilization (IPS mode default)
- stacked-dr: Ensemble of DR estimators (DR mode default - recommended for production)
  - Optimally combines DR-CPO, TMLE, MRDR, OC-DR-CPO, and TR-CPO-E
  - Provides robustness and tighter confidence intervals
- Individual DR estimators: dr-cpo, tmle, mrdr, oc-dr-cpo, tr-cpo, tr-cpo-e (for research)
Paper terminology: "Calibrated DR" in the paper = DR mode with stacked-dr estimator in the code.
See the examples for mode-specific workflows.
Documentation
📚 Getting Started
- 5-Minute Quickstart - A step-by-step first analysis
- Examples - Working code samples
- Full documentation coming soon on cimo-labs.com
🔧 For Engineers
- Engineering Guide - Interface specs and patterns
- Arena Experiment - Production pipeline example
- Module READMEs - Each subdirectory in cje/ contains a developer-oriented README:
  - cje/estimators/README.md - Estimator implementations and hierarchy
  - cje/diagnostics/README.md - Diagnostic system architecture
  - cje/data/README.md - Data models and validation
  - cje/calibration/README.md - Calibration methods
  - cje/interface/README.md - High-level API details
📊 Additional Resources
- API Reference - Coming soon
- Mathematical Foundations - Coming soon
- Troubleshooting Guide - Coming soon
Development
make install # Install with Poetry
make test # Run tests
make format # Auto-format code
make lint # Check code quality
Support
- 🐛 Issues
- 💬 Discussions
License
MIT - See LICENSE for details.
Ready to start? → 5-Minute Quickstart