
CJE - Causal Judge Evaluation


What if your AI evals looked like A/B tests with reliable confidence intervals and causal guarantees?

CJE makes it possible. Get unbiased estimates of how your new model will perform before deployment, with the statistical rigor you'd expect from production experimentation.

Why CJE?

🎯 Problem: Your LLM-judge scores are noisy, biased, and untrustworthy.

✅ Solution: CJE uses AutoCal-R (Automatic Calibration for Rewards) and causal inference to debias them, giving you reliable estimates with confidence intervals without compromising on judge flexibility.

Installation

pip install cje-eval

For development:

git clone https://github.com/fondutech/causal-judge-evaluation.git
cd causal-judge-evaluation
poetry install  # or pip install -e .

Quick Start

CJE automatically selects the right mode based on your data:

from cje import analyze_dataset

# Mode 1: Direct (simplest - just fresh draws)
result = analyze_dataset(fresh_draws_dir="responses/")
print(f"Policy value: {result.estimates[0]:.3f} ± {result.standard_errors[0]:.3f}")

# Mode 2: IPS (logged data with logprobs)
result = analyze_dataset(logged_data_path="logs.jsonl")  # Auto-selects IPS mode

# Mode 3: DR (logged data + fresh draws - most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/"  # Auto-selects DR mode
)

CLI usage:

# Direct mode (fresh draws only)
python -m cje analyze --fresh-draws-dir responses/

# IPS mode (logged data) - auto-selects calibrated-ips
python -m cje analyze logs.jsonl

# DR mode (both) - auto-selects stacked-dr
python -m cje analyze logs.jsonl --fresh-draws-dir responses/

Three Analysis Modes

CJE automatically selects the best mode based on your data. Mode selection follows a simple 4-rule system based on logprob coverage (fraction of samples with complete logprobs) and data sources:

  1. fresh_draws + coverage ≥50% → DR mode (doubly robust - most accurate)
  2. no fresh_draws + coverage ≥50% → IPS mode (importance sampling - counterfactual)
  3. fresh_draws + coverage <50% → Direct mode (on-policy comparison)
  4. no fresh_draws + coverage <50% → Error (insufficient data)

See Interface README for details.
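
For intuition, the selection logic boils down to something like the sketch below. The function name and structure are illustrative only, not CJE's actual internals:

# Illustrative sketch of the 4-rule mode selection (not CJE's actual code).
def select_mode(has_fresh_draws: bool, logprob_coverage: float) -> str:
    """Pick an analysis mode from data availability.

    logprob_coverage is the fraction of logged samples with
    complete base/target logprobs, in [0, 1].
    """
    if logprob_coverage >= 0.5:
        # Enough logprobs for counterfactual inference (rules 1 and 2).
        return "dr" if has_fresh_draws else "ips"
    if has_fresh_draws:
        # Too few logprobs: fall back to on-policy comparison (rule 3).
        return "direct"
    # Rule 4: no usable data source.
    raise ValueError("Insufficient data: need fresh draws or >=50% logprob coverage")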

1. Direct Mode (Fresh draws only)

  • Use when: You have responses from target policies, no logprobs needed
  • Estimand: "Which policy performs best on this eval set?" (on-policy comparison)
  • Data needed: Fresh responses with judge scores
  • Example: Comparing 3 model variants on 1000 prompts

2. IPS Mode (Logged data with logprobs)

  • Use when: You have logged data with importance weights, no fresh draws
  • Estimand: "What would happen if we deployed this policy?" (counterfactual)
  • Data needed: Logged responses with base/target logprobs
  • Example: Evaluating a new model on production traffic logs

3. DR Mode (Both logged data and fresh draws)

  • Use when: You want maximum accuracy and have both
  • Estimand: Counterfactual deployment value (most accurate)
  • Data needed: Both logged data and fresh draws
  • Default estimator: stacked-dr (ensemble of DR-CPO, TMLE, MRDR, OC-DR-CPO, TR-CPO-E)
  • Example: High-stakes A/B decision for model deployment

Note: The paper's "Calibrated DR" refers to DR mode, which defaults to stacked-dr - an optimal convex combination of multiple DR estimators for robustness and tighter intervals.

When to Use CJE

Perfect for:

  • Comparing LLM policies before deployment
  • Evaluating multiple model variants
  • Reusing existing data for new evaluations
  • High-stakes decisions needing confidence intervals

Not for:

  • Online learning (CJE is offline)
  • Real-time scoring (CJE is batch)
  • Very small samples (<100 examples)

Data Requirements

Requirements depend on which mode you're using:

For Direct Mode (fresh draws only):

{
  "prompt_id": "arena_0",
  "prompt": "What is 2+2?",
  "response": "4",
  "policy": "clone",
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (50% coverage enables calibration)
}

AutoCal-R: If 50%+ of fresh draws have oracle_label, Direct mode automatically applies AutoCal-R to learn judge→oracle calibration and uses calibrated rewards.
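
To make the format concrete, here is a minimal sketch that writes Direct-mode records as JSONL. The on-disk layout (one file per policy under responses/) is an assumption for illustration; see examples/arena_sample/ for the authoritative format.

import json
from pathlib import Path

# One fresh draw per record; oracle_label is optional, but >=50%
# coverage unlocks AutoCal-R calibration.
records = [
    {
        "prompt_id": "arena_0",
        "prompt": "What is 2+2?",
        "response": "4",
        "policy": "clone",
        "judge_score": 0.85,
        "oracle_label": 0.86,
    },
]

out_dir = Path("responses")
out_dir.mkdir(exist_ok=True)
with open(out_dir / "clone.jsonl", "w") as f:  # hypothetical per-policy file
    for rec in records:
        f.write(json.dumps(rec) + "\n")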

For IPS/DR Modes (logged data):

{
  "prompt": "What is 2+2?",
  "response": "4",
  "base_policy_logprob": -14.7,              // Required: log P(response|prompt) for logging policy
  "target_policy_logprobs": {                // Required: same for policies to evaluate
    "clone": -14.7,
    "parallel_universe_prompt": -18.3,
    "unhelpful": -42.1
  },
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (5-10% is enough for calibration)
}

Key difference: Direct mode doesn't need logprobs! Just responses from each policy with judge scores (and optionally oracle labels for AutoCal-R calibration).

Working example: See examples/arena_sample/ for complete dataset examples with logged data and fresh draws.

Generating Log Probabilities

CJE includes built-in Fireworks API integration for computing teacher-forced log probabilities:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct"
)
if result.status == "success":
    logprob = result.value  # e.g., -2.3

This handles chat templates, tokenization, and API calls automatically. See cje/teacher_forcing/ for details.
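
Putting this together with the logged-data schema above, a sketch like the following could populate the logprob fields for one record. The model IDs and target policies are placeholders, and error handling is kept minimal:

import json
from cje.teacher_forcing import compute_teacher_forced_logprob

prompt, response = "What is 2+2?", "4"

# Placeholder Fireworks model IDs; substitute the policies you evaluate.
base_model = "accounts/fireworks/models/llama-v3p2-3b-instruct"
target_models = {
    "clone": base_model,
    "larger": "accounts/fireworks/models/llama-v3p1-70b-instruct",
}

record = {"prompt": prompt, "response": response, "judge_score": 0.85}

# log P(response|prompt) under the logging policy.
base = compute_teacher_forced_logprob(prompt=prompt, response=response, model=base_model)
if base.status == "success":
    record["base_policy_logprob"] = base.value

# Same quantity under each target policy.
record["target_policy_logprobs"] = {}
for policy, model in target_models.items():
    r = compute_teacher_forced_logprob(prompt=prompt, response=response, model=model)
    if r.status == "success":
        record["target_policy_logprobs"][policy] = r.value

with open("logs.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")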

Choosing an Estimator

Most users should use estimator="auto" (the default) - CJE will automatically select:

  • direct when you only provide fresh_draws_dir
  • calibrated-ips when you only provide logged_data_path
  • stacked-dr when you provide both

You can override automatic selection by specifying an estimator explicitly:

# Use IPS even with fresh draws available
analyze_dataset("logs.jsonl", fresh_draws_dir="responses/", estimator="calibrated-ips")

# Use Direct mode for on-policy comparison instead of DR
analyze_dataset("logs.jsonl", fresh_draws_dir="responses/", estimator="direct")

Manual estimator options:

  • direct: On-policy comparison (Direct mode - no counterfactual inference)
  • calibrated-ips: IPS with SIMCal weight stabilization (IPS mode default)
  • stacked-dr: Ensemble of DR estimators (DR mode default - recommended for production)
    • Optimally combines: DR-CPO, TMLE, MRDR, OC-DR-CPO, TR-CPO-E
    • Provides robustness and tighter confidence intervals
  • Individual DR estimators: dr-cpo, tmle, mrdr, oc-dr-cpo, tr-cpo, tr-cpo-e (for research)

Paper terminology: "Calibrated DR" in the paper = DR mode with stacked-dr estimator in the code.

See the examples for mode-specific workflows.

Documentation

📚 Getting Started

🔧 For Engineers

  • Engineering Guide - Interface specs and patterns
  • Arena Experiment - Production pipeline example
  • Module READMEs - Each subdirectory in cje/ contains a developer-oriented README:
    • cje/estimators/README.md - Estimator implementations and hierarchy
    • cje/diagnostics/README.md - Diagnostic system architecture
    • cje/data/README.md - Data models and validation
    • cje/calibration/README.md - Calibration methods
    • cje/interface/README.md - High-level API details

📊 Additional Resources

  • API Reference - Coming soon
  • Mathematical Foundations - Coming soon
  • Troubleshooting Guide - Coming soon

Development

make install  # Install with Poetry
make test     # Run tests
make format   # Auto-format code
make lint     # Check code quality

License

MIT - See LICENSE for details.


Ready to start? See the 5-Minute Quickstart.

