CJE - Causal Judge Evaluation
Evaluate LLM policies with statistical rigor - as simple as comparing responses, as powerful as A/B testing.
CJE turns your LLM-judge evaluations into reliable estimates with confidence intervals. Compare policies head-to-head, or estimate counterfactual deployment value from logged data.
Why CJE?
🎯 Problem: LLM-judge scores are noisy and biased.
✅ Solution: Automatic calibration (AutoCal-R) learns a judge→oracle mapping to debias scores and provides reliable estimates with confidence intervals.
Three modes, one interface:
- Direct mode: Compare policies on an eval set (simplest - no logprobs needed)
- IPS mode: Estimate counterfactual value from logged data (reuse existing logs)
- DR mode: Combine both for maximum accuracy (doubly robust)
Installation
```bash
pip install cje-eval
```
For development:
```bash
git clone https://github.com/fondutech/causal-judge-evaluation.git
cd causal-judge-evaluation
poetry install  # or pip install -e .
```
🚀 Try it Now - Interactive Demo
No installation required - try CJE in your browser with real Arena data. The notebook demonstrates all three analysis modes (Direct, IPS, DR) with step-by-step explanations and guidance on interpreting the diagnostics.
Quick Start
Simplest workflow - Direct mode (no logprobs needed):
```python
from cje import analyze_dataset

# Compare policies on an eval set
result = analyze_dataset(fresh_draws_dir="responses/")

# Get estimates with confidence intervals
for i, policy in enumerate(result.metadata["target_policies"]):
    est = result.estimates[i]
    se = result.standard_errors[i]
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")
```
Your responses/ directory just needs JSONL files like:
{"prompt_id": "eval_0", "policy": "model_a", "judge_score": 0.85}
{"prompt_id": "eval_0", "policy": "model_b", "judge_score": 0.72}
That's it! CJE handles the rest - auto-discovers policies, applies calibration if oracle labels are present, and returns reliable estimates.
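If your eval harness produces scores in memory, writing them out in this layout takes a few lines. A minimal sketch using only the standard library (the one-file-per-policy split is an illustrative choice, not a CJE requirement - policies are auto-discovered from the records):

```python
import json
from pathlib import Path

# Hypothetical in-memory eval results.
rows = [
    {"prompt_id": "eval_0", "policy": "model_a", "judge_score": 0.85},
    {"prompt_id": "eval_0", "policy": "model_b", "judge_score": 0.72},
]

out_dir = Path("responses")
out_dir.mkdir(exist_ok=True)

# Write one JSONL file per policy.
for policy in {r["policy"] for r in rows}:
    with open(out_dir / f"{policy}.jsonl", "w") as f:
        for r in rows:
            if r["policy"] == policy:
                f.write(json.dumps(r) + "\n")
```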
Advanced: Reuse logged data (IPS/DR modes)
If you have production logs with log probabilities, CJE can estimate counterfactual deployment value:
```python
# IPS mode: use logged data only
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR mode: combine logged data + fresh draws (most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
)
```
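With estimates and standard errors in hand, a head-to-head comparison reduces to a two-sample z-test. A sketch (it treats the two estimates as independent, ignoring any covariance the estimator may account for internally):

```python
import math

# Indices of two policies in result.metadata["target_policies"].
i, j = 0, 1
diff = result.estimates[i] - result.estimates[j]

# Under the independence assumption, the SE of the difference
# combines the two marginal standard errors.
se_diff = math.sqrt(result.standard_errors[i] ** 2 + result.standard_errors[j] ** 2)
print(f"difference = {diff:.3f} ± {1.96 * se_diff:.3f} (z = {diff / se_diff:.2f})")
```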
See Data Requirements below for the IPS/DR data format and Teacher Forcing for computing logprobs.
Three Analysis Modes
CJE automatically selects the best mode based on your data:
| Mode | Data | What it tells you | Best for |
|---|---|---|---|
| Direct | Responses from each policy | Which policy is best on this eval set? | Quick comparisons, A/B testing |
| IPS | Logged data with logprobs | What if we deployed policy X? (counterfactual) | Reusing existing logs, fast iteration |
| DR | Both logged + responses | Counterfactual value (most accurate) | High-stakes decisions, maximum accuracy |
Automatic mode selection:
- `fresh_draws_dir` only → Direct mode
- `logged_data_path` only → IPS mode (importance sampling)
- Both → DR mode (doubly robust)
Direct mode is the simplest - just provide responses from each policy with judge scores. No logprobs needed!
IPS/DR modes enable counterfactual inference: "What would happen if we deployed this policy?" This requires log probabilities from your models. See Generating Log Probabilities below for Fireworks API integration.
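To see why logprobs are needed, here is the vanilla importance-sampling estimate in miniature (an illustration only - CJE's calibrated-ips adds calibrated, variance-reduced weights on top of this idea; the records are toy values in the logged-data format shown below):

```python
import math

logs = [
    {"base_policy_logprob": -14.7,
     "target_policy_logprobs": {"clone": -14.7, "unhelpful": -42.1},
     "judge_score": 0.85},
    {"base_policy_logprob": -9.2,
     "target_policy_logprobs": {"clone": -10.1, "unhelpful": -30.5},
     "judge_score": 0.60},
]

def vanilla_ips(records, policy):
    # Weight each logged reward by w = exp(log p_target - log p_base),
    # then average: an unbiased estimate of the target policy's value.
    return sum(
        math.exp(r["target_policy_logprobs"][policy] - r["base_policy_logprob"])
        * r["judge_score"]
        for r in records
    ) / len(records)

print(vanilla_ips(logs, "clone"))      # weights near 1 → close to logged average
print(vanilla_ips(logs, "unhelpful"))  # vanishing weights → estimate near 0
```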
When to Use CJE
✅ Use CJE when you need:
- Statistical rigor (confidence intervals, p-values)
- Debiased judge scores (automatic calibration)
- Policy comparisons or counterfactual estimates
- To reuse logged data for new evaluations
❌ Don't use CJE for:
- Online learning (CJE is offline/batch)
- Real-time scoring (use raw judge for that)
- Very small samples (<100 examples)
Data Requirements
Requirements depend on which mode you're using:
For Direct Mode (fresh draws only):
```jsonc
{
  "prompt_id": "arena_0",
  "prompt": "What is 2+2?",
  "response": "4",
  "policy": "clone",
  "judge_score": 0.85,   // Required: judge evaluation
  "oracle_label": 0.86   // Optional: ground truth (enables AutoCal-R)
}
```
AutoCal-R: If any fresh draws have oracle_label, Direct mode automatically applies AutoCal-R to learn judge→oracle calibration and uses calibrated rewards. More oracle labels = better calibration (5-10% is often sufficient).
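For intuition about what such a calibration step does, here is a generic monotone (isotonic) judge→oracle fit - a conceptual sketch, not CJE's AutoCal-R implementation, which runs automatically and needs no code from you:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical labeled subset: judge scores with matching oracle labels.
judge = np.array([0.20, 0.35, 0.50, 0.60, 0.80, 0.90])
oracle = np.array([0.10, 0.30, 0.45, 0.55, 0.85, 0.95])

# Fit a monotone judge→oracle mapping on the labeled subset...
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge, oracle)

# ...then apply it to all judge scores to get calibrated rewards.
calibrated = iso.predict(np.array([0.25, 0.70]))
```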
For IPS/DR Modes (logged data):
```jsonc
{
  "prompt": "What is 2+2?",
  "response": "4",
  "base_policy_logprob": -14.7,  // Required: log P(response|prompt) under the logging policy
  "target_policy_logprobs": {    // Required: same, for each policy to evaluate
    "clone": -14.7,
    "parallel_universe_prompt": -18.3,
    "unhelpful": -42.1
  },
  "judge_score": 0.85,           // Required: judge evaluation
  "oracle_label": 0.86           // Optional: ground truth (5-10% is enough for calibration)
}
```
Key difference: Direct mode doesn't need logprobs! Just responses from each policy with judge scores (and optionally oracle labels for AutoCal-R calibration).
Working example: See examples/arena_sample/ for complete dataset examples with logged data and fresh draws.
Generating Log Probabilities
For IPS/DR modes, you need log probabilities. CJE includes built-in Fireworks API integration:
```python
from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
)
if result.status == "success":
    logprob = result.value  # e.g., -2.3
```
This handles chat templates, tokenization, and API calls automatically. Supports all Fireworks models.
Don't have Fireworks access? Direct mode doesn't need logprobs - just use fresh_draws_dir with judge scores.
See cje/teacher_forcing/README.md for batch processing and advanced options.
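As a flavor of what batch processing looks like, a minimal loop that fills in target_policy_logprobs for logged records (only compute_teacher_forced_logprob is CJE's API here; the policy names, model IDs, and record shape are illustrative):

```python
from cje.teacher_forcing import compute_teacher_forced_logprob

# Illustrative target policies mapped to Fireworks model IDs.
targets = {
    "clone": "accounts/fireworks/models/llama-v3p2-3b-instruct",
    "unhelpful": "accounts/fireworks/models/llama-v3p2-1b-instruct",
}

records = [{"prompt": "What is 2+2?", "response": "4"}]  # your logged data

for rec in records:
    rec["target_policy_logprobs"] = {}
    for policy, model in targets.items():
        result = compute_teacher_forced_logprob(
            prompt=rec["prompt"], response=rec["response"], model=model
        )
        if result.status == "success":
            rec["target_policy_logprobs"][policy] = result.value
```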
Advanced: Choosing an Estimator
Most users: Use estimator="auto" (the default). CJE auto-selects the best estimator for your mode.
For researchers: You can specify estimators explicitly (see the example after this list):
- `direct`: On-policy comparison (no counterfactual inference)
- `calibrated-ips`: IPS with variance-reduced weights (SIMCal)
- `stacked-dr`: Ensemble of DR estimators (recommended for production)
- Individual DR variants: `dr-cpo`, `tmle`, `mrdr`, `oc-dr-cpo`, `tr-cpo-e`
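Passing one of these names to analyze_dataset overrides auto-selection (a sketch; the estimator keyword is the same one whose default is "auto"):

```python
from cje import analyze_dataset

# Force the stacked DR ensemble instead of letting CJE auto-select.
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
    estimator="stacked-dr",
)
```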
See cje/estimators/README.md for technical details on each estimator.
Documentation
📚 Getting Started
- 5-Minute Quickstart - First analysis step-by-step
- Examples - Working code samples
- Full documentation coming soon on cimo-labs.com
🔧 For Engineers
- Engineering Guide - Interface specs and patterns
- Arena Experiment - Production pipeline example
- Module READMEs - Each subdirectory in `cje/` contains a developer-oriented README:
  - `cje/estimators/README.md` - Estimator implementations and hierarchy
  - `cje/diagnostics/README.md` - Diagnostic system architecture
  - `cje/data/README.md` - Data models and validation
  - `cje/calibration/README.md` - Calibration methods
  - `cje/interface/README.md` - High-level API details
📊 Additional Resources
- API Reference - Coming soon
- Mathematical Foundations - Coming soon
- Troubleshooting Guide - Coming soon
Development
```bash
make install  # Install with Poetry
make test     # Run tests
make format   # Auto-format code
make lint     # Check code quality
```
Support
- 🐛 Issues
- 💬 Discussions
License
MIT - See LICENSE for details.
Ready to start? → 5-Minute Quickstart