CJE - Causal Judge Evaluation

Evaluate LLM policies with statistical rigor - as simple as comparing responses, as powerful as A/B testing.

CJE turns your LLM-judge evaluations into reliable estimates with confidence intervals. Compare policies head-to-head, or estimate counterfactual deployment value from logged data.

Why CJE?

🎯 Problem: LLM-judge scores are noisy and biased.
✅ Solution: Automatic calibration (AutoCal-R) learns a judge→oracle mapping to debias scores and provide reliable estimates with confidence intervals.

Three modes, one interface:

  • Direct mode: Compare policies on an eval set (simplest - no logprobs needed)
  • IPS mode: Estimate counterfactual value from logged data (reuse existing logs)
  • DR mode: Combine both for maximum accuracy (doubly robust)

Installation

pip install cje-eval

For development:

git clone https://github.com/fondutech/causal-judge-evaluation.git
cd causal-judge-evaluation
poetry install  # or pip install -e .

🚀 Try it Now - Interactive Demo

No installation required! Try CJE in your browser with real Arena data:

Open In Colab

The notebook demonstrates all three analysis modes (IPS, DR, Direct) with step-by-step explanations and guidance on interpreting diagnostics.

Quick Start

Simplest workflow - Direct mode (no logprobs needed):

from cje import analyze_dataset

# Compare policies on an eval set
result = analyze_dataset(fresh_draws_dir="responses/")

# Get estimates with confidence intervals
for i, policy in enumerate(result.metadata["target_policies"]):
    est = result.estimates[i]
    se = result.standard_errors[i]
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

Your responses/ directory just needs JSONL files like:

{"prompt_id": "eval_0", "policy": "model_a", "judge_score": 0.85}
{"prompt_id": "eval_0", "policy": "model_b", "judge_score": 0.72}

That's it! CJE handles the rest - auto-discovers policies, applies calibration if oracle labels are present, and returns reliable estimates.
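
If your judge results live in Python rather than in files, a few lines of standard-library code produce this layout. A minimal sketch: the file name and policy names are placeholders, and putting all policies in one JSONL file is an assumption (split per policy if your setup requires it).

import json
from pathlib import Path

# Hypothetical judge results: one record per (prompt, policy) response.
rows = [
    {"prompt_id": "eval_0", "policy": "model_a", "judge_score": 0.85},
    {"prompt_id": "eval_0", "policy": "model_b", "judge_score": 0.72},
]

out_dir = Path("responses")
out_dir.mkdir(exist_ok=True)

# Write one record per line (JSONL); placeholder file name.
with open(out_dir / "eval_set.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")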

Advanced: Reuse logged data (IPS/DR modes)

If you have production logs with log probabilities, CJE can estimate counterfactual deployment value:

# IPS mode: Use logged data only
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR mode: Combine logged data + fresh draws (most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/"
)

See Data Requirements for IPS/DR data format and Teacher Forcing for computing logprobs.

Three Analysis Modes

CJE automatically selects the best mode based on your data:

Mode   | Data                       | What it tells you                              | Best for
-------|----------------------------|------------------------------------------------|-----------------------------------------
Direct | Responses from each policy | Which policy is best on this eval set?         | Quick comparisons, A/B testing
IPS    | Logged data with logprobs  | What if we deployed policy X? (counterfactual) | Reusing existing logs, fast iteration
DR     | Both logged + responses    | Counterfactual value (most accurate)           | High-stakes decisions, maximum accuracy

Automatic mode selection:

  • fresh_draws_dir only → Direct mode
  • logged_data_path only → IPS mode (importance sampling)
  • Both → DR mode (doubly robust)

Direct mode is the simplest - just provide responses from each policy with judge scores. No logprobs needed!

IPS/DR modes enable counterfactual inference: "What would happen if we deployed this policy?" This requires log probabilities from your models. See Generating Log Probabilities below for Fireworks API integration.
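
Putting the selection rules together, the same entry point covers all three modes. These are the same calls shown in Quick Start and the advanced example above; the paths are placeholders.

from cje import analyze_dataset

# Direct mode: fresh draws only (no logprobs needed)
result = analyze_dataset(fresh_draws_dir="responses/")

# IPS mode: logged data only (importance sampling)
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR mode: both (doubly robust, most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
)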

When to Use CJE

Use CJE when you need:

  • Statistical rigor (confidence intervals, p-values)
  • Debiased judge scores (automatic calibration)
  • Policy comparisons or counterfactual estimates
  • To reuse logged data for new evaluations

Don't use CJE for:

  • Online learning (CJE is offline/batch)
  • Real-time scoring (use raw judge for that)
  • Very small samples (<100 examples)

Data Requirements

Requirements depend on which mode you're using:

For Direct Mode (fresh draws only):

{
  "prompt_id": "arena_0",
  "prompt": "What is 2+2?",
  "response": "4",
  "policy": "clone",
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (enables AutoCal-R)
}

AutoCal-R: If any fresh draws have oracle_label, Direct mode automatically applies AutoCal-R to learn judge→oracle calibration and uses calibrated rewards. More oracle labels = better calibration (5-10% is often sufficient).
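
Because a small oracle slice is enough, it can be worth checking how many of your fresh draws actually carry oracle_label before analyzing. A generic standard-library sketch (not a CJE API); the path is a placeholder.

import json

n_total, n_oracle = 0, 0
with open("responses/eval_set.jsonl") as f:
    for line in f:
        record = json.loads(line)
        n_total += 1
        if record.get("oracle_label") is not None:
            n_oracle += 1

# AutoCal-R activates when any oracle labels are present;
# roughly 5-10% coverage is often sufficient.
pct = 100 * n_oracle / n_total if n_total else 0.0
print(f"Oracle coverage: {n_oracle}/{n_total} ({pct:.1f}%)")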

For IPS/DR Modes (logged data):

{
  "prompt": "What is 2+2?",
  "response": "4",
  "base_policy_logprob": -14.7,              // Required: log P(response|prompt) for logging policy
  "target_policy_logprobs": {                // Required: same for policies to evaluate
    "clone": -14.7,
    "parallel_universe_prompt": -18.3,
    "unhelpful": -42.1
  },
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (5-10% is enough for calibration)
}

Key difference: Direct mode doesn't need logprobs! Just responses from each policy with judge scores (and optionally oracle labels for AutoCal-R calibration).
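
Before running IPS/DR, a quick sanity check that every logged record carries the required fields can save a failed run. Again a generic standard-library sketch (not a CJE API); the path is a placeholder.

import json

required = ("prompt", "response", "base_policy_logprob",
            "target_policy_logprobs", "judge_score")

with open("logs.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        missing = [k for k in required if k not in record]
        if missing:
            print(f"line {i}: missing fields {missing}")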

Working example: See examples/arena_sample/ for complete dataset examples with logged data and fresh draws.

Generating Log Probabilities

For IPS/DR modes, you need log probabilities. CJE includes built-in Fireworks API integration:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct"
)
if result.status == "success":
    logprob = result.value  # e.g., -2.3

This handles chat templates, tokenization, and API calls automatically. Supports all Fireworks models.
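
To fill in the target_policy_logprobs field of a logged record, the same call can be looped over each candidate policy. A sketch using compute_teacher_forced_logprob as shown above; the policy names and model identifiers are placeholders.

from cje.teacher_forcing import compute_teacher_forced_logprob

# Hypothetical mapping from policy name to Fireworks model identifier.
target_models = {
    "clone": "accounts/fireworks/models/llama-v3p2-3b-instruct",
    "bigger": "accounts/fireworks/models/llama-v3p1-70b-instruct",
}

record = {"prompt": "What is 2+2?", "response": "4"}
record["target_policy_logprobs"] = {}

for policy, model in target_models.items():
    r = compute_teacher_forced_logprob(
        prompt=record["prompt"],
        response=record["response"],
        model=model,
    )
    if r.status == "success":
        record["target_policy_logprobs"][policy] = r.value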

Don't have Fireworks access? Direct mode doesn't need logprobs - just use fresh_draws_dir with judge scores.

See cje/teacher_forcing/README.md for batch processing and advanced options.

Advanced: Choosing an Estimator

Most users: Use estimator="auto" (the default). CJE auto-selects the best estimator for your mode.

For researchers: You can specify estimators explicitly:

  • direct: On-policy comparison (no counterfactual inference)
  • calibrated-ips: IPS with variance-reduced weights (SIMCal)
  • stacked-dr: Ensemble of DR estimators (recommended for production)
  • Individual DR variants: dr-cpo, tmle, mrdr, oc-dr-cpo, tr-cpo-e
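
For example, to force a specific estimator instead of the automatic choice (a sketch, assuming analyze_dataset accepts the estimator keyword described above; paths are placeholders):

from cje import analyze_dataset

# DR mode with the stacked-dr ensemble (recommended for production).
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
    estimator="stacked-dr",
)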

See cje/estimators/README.md for technical details on each estimator.

Documentation

📚 Getting Started

🔧 For Engineers

  • Engineering Guide - Interface specs and patterns
  • Arena Experiment - Production pipeline example
  • Module READMEs - Each subdirectory in cje/ contains a developer-oriented README:
    • cje/estimators/README.md - Estimator implementations and hierarchy
    • cje/diagnostics/README.md - Diagnostic system architecture
    • cje/data/README.md - Data models and validation
    • cje/calibration/README.md - Calibration methods
    • cje/interface/README.md - High-level API details

📊 Additional Resources

  • API Reference - Coming soon
  • Mathematical Foundations - Coming soon
  • Troubleshooting Guide - Coming soon

Development

make install  # Install with Poetry
make test     # Run tests
make format   # Auto-format code
make lint     # Check code quality

Support

License

MIT - See LICENSE for details.


Ready to start? See the 5-Minute Quickstart.
