
CJE - Causal Judge Evaluation


What if your AI evals looked like A/B tests with reliable confidence intervals and causal guarantees?

CJE makes it possible. Get unbiased estimates of how your new model will perform before deployment, with the statistical rigor you'd expect from production experimentation.

Why CJE?

🎯 Problem: Your LLM-judge scores are noisy, biased, and untrustworthy.

✅ Solution: CJE uses AutoCal-R (Automatic Calibration for Rewards) and causal inference to debias them, giving you reliable estimates with confidence intervals without compromising on judge flexibility.

Installation

pip install cje-eval

For development:

git clone https://github.com/fondutech/causal-judge-evaluation.git
cd causal-judge-evaluation
poetry install  # or pip install -e .

Quick Start

CJE automatically selects the right mode based on your data:

from cje import analyze_dataset

# Mode 1: Direct (simplest - just fresh draws)
result = analyze_dataset(fresh_draws_dir="responses/")
print(f"Policy value: {result.estimates[0]:.3f} ± {result.standard_errors[0]:.3f}")

# Mode 2: IPS (logged data with logprobs)
result = analyze_dataset(logged_data_path="logs.jsonl")  # Auto-selects IPS mode

# Mode 3: DR (logged data + fresh draws - most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/"  # Auto-selects DR mode
)

CLI usage:

# Direct mode (fresh draws only)
python -m cje analyze --fresh-draws-dir responses/

# IPS mode (logged data) - auto-selects calibrated-ips
python -m cje analyze logs.jsonl

# DR mode (both) - auto-selects stacked-dr
python -m cje analyze logs.jsonl --fresh-draws-dir responses/

Three Analysis Modes

CJE automatically selects the best mode based on your data. Mode selection follows a simple 4-rule system based on logprob coverage (fraction of samples with complete logprobs) and data sources:

  1. fresh_draws + coverage ≥50% → DR mode (doubly robust - most accurate)
  2. no fresh_draws + coverage ≥50% → IPS mode (importance sampling - counterfactual)
  3. fresh_draws + coverage <50% → Direct mode (on-policy comparison)
  4. no fresh_draws + coverage <50% → Error (insufficient data)

See Interface README for details.
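
For intuition, the selection logic boils down to something like the sketch below. The function name and structure are illustrative only, not CJE's actual internals:

# Illustrative sketch of the 4-rule mode selection (not CJE's actual code).
def select_mode(has_fresh_draws: bool, logprob_coverage: float) -> str:
    """Pick an analysis mode from data availability.

    logprob_coverage is the fraction of logged samples with
    complete base/target logprobs, in [0, 1].
    """
    if logprob_coverage >= 0.5:
        # Enough logprobs for counterfactual inference (rules 1 and 2).
        return "dr" if has_fresh_draws else "ips"
    if has_fresh_draws:
        # Too few logprobs: fall back to on-policy comparison (rule 3).
        return "direct"
    # Rule 4: no usable data source.
    raise ValueError("Insufficient data: need fresh draws or >=50% logprob coverage")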

1. Direct Mode (Fresh draws only)

  • Use when: You have responses from target policies, no logprobs needed
  • Estimand: "Which policy performs best on this eval set?" (on-policy comparison)
  • Data needed: Fresh responses with judge scores
  • Example: Comparing 3 model variants on 1000 prompts

2. IPS Mode (Logged data with logprobs)

  • Use when: You have logged data with importance weights, no fresh draws
  • Estimand: "What would happen if we deployed this policy?" (counterfactual)
  • Data needed: Logged responses with base/target logprobs
  • Example: Evaluating a new model on production traffic logs

3. DR Mode (Both logged data and fresh draws)

  • Use when: You want maximum accuracy and have both
  • Estimand: Counterfactual deployment value (most accurate)
  • Data needed: Both logged data and fresh draws
  • Default estimator: stacked-dr (ensemble of DR-CPO, TMLE, MRDR, OC-DR-CPO, TR-CPO-E)
  • Example: High-stakes A/B decision for model deployment

Note: The paper's "Calibrated DR" refers to DR mode, which defaults to stacked-dr - an optimal convex combination of multiple DR estimators for robustness and tighter intervals.

When to Use CJE

Perfect for:

  • Comparing LLM policies before deployment
  • Evaluating multiple model variants
  • Reusing existing data for new evaluations
  • High-stakes decisions needing confidence intervals

Not for:

  • Online learning (CJE is offline)
  • Real-time scoring (CJE is batch)
  • Very small samples (<100 examples)

Data Requirements

Requirements depend on which mode you're using:

For Direct Mode (fresh draws only):

{
  "prompt_id": "arena_0",
  "prompt": "What is 2+2?",
  "response": "4",
  "policy": "clone",
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (50% coverage enables calibration)
}

AutoCal-R: If 50%+ of fresh draws have oracle_label, Direct mode automatically applies AutoCal-R to learn judge→oracle calibration and uses calibrated rewards.
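
To make the format concrete, here is a minimal sketch that writes Direct-mode records as JSONL. The on-disk layout (one file per policy under responses/) is an assumption for illustration; see examples/arena_sample/ for the authoritative format.

import json
from pathlib import Path

# One fresh draw per record; oracle_label is optional, but >=50%
# coverage unlocks AutoCal-R calibration.
records = [
    {
        "prompt_id": "arena_0",
        "prompt": "What is 2+2?",
        "response": "4",
        "policy": "clone",
        "judge_score": 0.85,
        "oracle_label": 0.86,
    },
]

out_dir = Path("responses")
out_dir.mkdir(exist_ok=True)
with open(out_dir / "clone.jsonl", "w") as f:  # hypothetical per-policy file
    for rec in records:
        f.write(json.dumps(rec) + "\n")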

For IPS/DR Modes (logged data):

{
  "prompt": "What is 2+2?",
  "response": "4",
  "base_policy_logprob": -14.7,              // Required: log P(response|prompt) for logging policy
  "target_policy_logprobs": {                // Required: same for policies to evaluate
    "clone": -14.7,
    "parallel_universe_prompt": -18.3,
    "unhelpful": -42.1
  },
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (5-10% is enough for calibration)
}

Key difference: Direct mode doesn't need logprobs! Just responses from each policy with judge scores (and optionally oracle labels for AutoCal-R calibration).

Working example: See examples/arena_sample/ for complete dataset examples with logged data and fresh draws.

Generating Log Probabilities

CJE includes built-in Fireworks API integration for computing teacher-forced log probabilities:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct"
)
if result.status == "success":
    logprob = result.value  # e.g., -2.3

This handles chat templates, tokenization, and API calls automatically. See cje/teacher_forcing/ for details.
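
Putting this together with the logged-data schema above, a sketch like the following could populate the logprob fields for one record. The model IDs and target policies are placeholders, and error handling is kept minimal:

import json
from cje.teacher_forcing import compute_teacher_forced_logprob

prompt, response = "What is 2+2?", "4"

# Placeholder Fireworks model IDs; substitute the policies you evaluate.
base_model = "accounts/fireworks/models/llama-v3p2-3b-instruct"
target_models = {
    "clone": base_model,
    "larger": "accounts/fireworks/models/llama-v3p1-70b-instruct",
}

record = {"prompt": prompt, "response": response, "judge_score": 0.85}

# log P(response|prompt) under the logging policy.
base = compute_teacher_forced_logprob(prompt=prompt, response=response, model=base_model)
if base.status == "success":
    record["base_policy_logprob"] = base.value

# Same quantity under each target policy.
record["target_policy_logprobs"] = {}
for policy, model in target_models.items():
    r = compute_teacher_forced_logprob(prompt=prompt, response=response, model=model)
    if r.status == "success":
        record["target_policy_logprobs"][policy] = r.value

with open("logs.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")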

Choosing an Estimator

Most users should use estimator="auto" (the default) - CJE will automatically select:

  • direct when you only provide fresh_draws_dir
  • calibrated-ips when you only provide logged_data_path
  • stacked-dr when you provide both

You can override automatic selection by specifying an estimator explicitly:

# Use IPS even with fresh draws available
analyze_dataset("logs.jsonl", fresh_draws_dir="responses/", estimator="calibrated-ips")

# Use Direct mode for on-policy comparison instead of DR
analyze_dataset("logs.jsonl", fresh_draws_dir="responses/", estimator="direct")

Manual estimator options:

  • direct: On-policy comparison (Direct mode - no counterfactual inference)
  • calibrated-ips: IPS with SIMCal weight stabilization (IPS mode default)
  • stacked-dr: Ensemble of DR estimators (DR mode default - recommended for production)
    • Optimally combines: DR-CPO, TMLE, MRDR, OC-DR-CPO, TR-CPO-E
    • Provides robustness and tighter confidence intervals
  • Individual DR estimators: dr-cpo, tmle, mrdr, oc-dr-cpo, tr-cpo, tr-cpo-e (for research)

Paper terminology: "Calibrated DR" in the paper = DR mode with stacked-dr estimator in the code.

See the examples for mode-specific workflows.

Documentation

📚 Getting Started

🔧 For Engineers

  • Engineering Guide - Interface specs and patterns
  • Arena Experiment - Production pipeline example
  • Module READMEs - Each subdirectory in cje/ contains a developer-oriented README:
    • cje/estimators/README.md - Estimator implementations and hierarchy
    • cje/diagnostics/README.md - Diagnostic system architecture
    • cje/data/README.md - Data models and validation
    • cje/calibration/README.md - Calibration methods
    • cje/interface/README.md - High-level API details

📊 Additional Resources

  • API Reference - Coming soon
  • Mathematical Foundations - Coming soon
  • Troubleshooting Guide - Coming soon

Development

make install  # Install with Poetry
make test     # Run tests
make format   # Auto-format code
make lint     # Check code quality

License

MIT - See LICENSE for details.


Ready to start? See the 5-Minute Quickstart.

