CJE - Causal Judge Evaluation

Evaluate LLM policies with statistical rigor - as simple as comparing responses, as powerful as A/B testing.

CJE turns your LLM-judge evaluations into reliable estimates with confidence intervals. Compare policies head-to-head, or estimate counterfactual deployment value from logged data.

Why CJE?

🎯 Problem: LLM-judge scores are noisy and biased.
✅ Solution: Automatic calibration (AutoCal-R) learns a judge→oracle mapping to debias scores and provide reliable estimates with confidence intervals.

Three modes, one interface:

  • Direct mode: Compare policies on an eval set (simplest - no logprobs needed)
  • IPS mode: Estimate counterfactual value from logged data (reuse existing logs)
  • DR mode: Combine both for maximum accuracy (doubly robust)

Installation

pip install cje-eval

For development:

git clone https://github.com/fondutech/causal-judge-evaluation.git
cd causal-judge-evaluation
poetry install  # or pip install -e .

🚀 Try it Now - Interactive Demo

No installation required! Try CJE in your browser with real Arena data:

Open In Colab

The notebook demonstrates all three analysis modes (IPS, DR, Direct) with step-by-step explanations and guidance on interpreting diagnostics.

Quick Start

Simplest workflow - Direct mode (no logprobs needed):

from cje import analyze_dataset

# Compare policies on an eval set
result = analyze_dataset(fresh_draws_dir="responses/")

# Get estimates with confidence intervals
for i, policy in enumerate(result.metadata["target_policies"]):
    est = result.estimates[i]
    se = result.standard_errors[i]
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

Your responses/ directory just needs JSONL files like:

{"prompt_id": "eval_0", "policy": "model_a", "judge_score": 0.85}
{"prompt_id": "eval_0", "policy": "model_b", "judge_score": 0.72}

That's it! CJE handles the rest - auto-discovers policies, applies calibration if oracle labels are present, and returns reliable estimates.
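
If your judge results live in Python rather than in files, a few lines of standard-library code produce this layout. A minimal sketch: the file name and policy names are placeholders, and putting all policies in one JSONL file is an assumption (split per policy if your setup requires it).

import json
from pathlib import Path

# Hypothetical judge results: one record per (prompt, policy) response.
rows = [
    {"prompt_id": "eval_0", "policy": "model_a", "judge_score": 0.85},
    {"prompt_id": "eval_0", "policy": "model_b", "judge_score": 0.72},
]

out_dir = Path("responses")
out_dir.mkdir(exist_ok=True)

# Write one record per line (JSONL); placeholder file name.
with open(out_dir / "eval_set.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")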

Advanced: Reuse logged data (IPS/DR modes)

If you have production logs with log probabilities, CJE can estimate counterfactual deployment value:

# IPS mode: Use logged data only
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR mode: Combine logged data + fresh draws (most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/"
)

See Data Requirements for IPS/DR data format and Teacher Forcing for computing logprobs.

Three Analysis Modes

CJE automatically selects the best mode based on your data:

Mode   | Data                       | What it tells you                              | Best for
-------|----------------------------|------------------------------------------------|-----------------------------------------
Direct | Responses from each policy | Which policy is best on this eval set?         | Quick comparisons, A/B testing
IPS    | Logged data with logprobs  | What if we deployed policy X? (counterfactual) | Reusing existing logs, fast iteration
DR     | Both logged + responses    | Counterfactual value (most accurate)           | High-stakes decisions, maximum accuracy

Automatic mode selection:

  • fresh_draws_dir only → Direct mode
  • logged_data_path only → IPS mode (importance sampling)
  • Both → DR mode (doubly robust)

Direct mode is the simplest - just provide responses from each policy with judge scores. No logprobs needed!

IPS/DR modes enable counterfactual inference: "What would happen if we deployed this policy?" This requires log probabilities from your models. See Generating Log Probabilities below for Fireworks API integration.
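
Putting the selection rules together, the same entry point covers all three modes. These are the same calls shown in Quick Start and the advanced example above; the paths are placeholders.

from cje import analyze_dataset

# Direct mode: fresh draws only (no logprobs needed)
result = analyze_dataset(fresh_draws_dir="responses/")

# IPS mode: logged data only (importance sampling)
result = analyze_dataset(logged_data_path="logs.jsonl")

# DR mode: both (doubly robust, most accurate)
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
)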

When to Use CJE

Use CJE when you need:

  • Statistical rigor (confidence intervals, p-values)
  • Debiased judge scores (automatic calibration)
  • Policy comparisons or counterfactual estimates
  • To reuse logged data for new evaluations

Don't use CJE for:

  • Online learning (CJE is offline/batch)
  • Real-time scoring (use raw judge for that)
  • Very small samples (<100 examples)

Data Requirements

Requirements depend on which mode you're using:

For Direct Mode (fresh draws only):

{
  "prompt_id": "arena_0",
  "prompt": "What is 2+2?",
  "response": "4",
  "policy": "clone",
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (enables AutoCal-R)
}

AutoCal-R: If any fresh draws have oracle_label, Direct mode automatically applies AutoCal-R to learn judge→oracle calibration and uses calibrated rewards. More oracle labels = better calibration (5-10% is often sufficient).
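
Because a small oracle slice is enough, it can be worth checking how many of your fresh draws actually carry oracle_label before analyzing. A generic standard-library sketch (not a CJE API); the path is a placeholder.

import json

n_total, n_oracle = 0, 0
with open("responses/eval_set.jsonl") as f:
    for line in f:
        record = json.loads(line)
        n_total += 1
        if record.get("oracle_label") is not None:
            n_oracle += 1

# AutoCal-R activates when any oracle labels are present;
# roughly 5-10% coverage is often sufficient.
pct = 100 * n_oracle / n_total if n_total else 0.0
print(f"Oracle coverage: {n_oracle}/{n_total} ({pct:.1f}%)")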

For IPS/DR Modes (logged data):

{
  "prompt": "What is 2+2?",
  "response": "4",
  "base_policy_logprob": -14.7,              // Required: log P(response|prompt) for logging policy
  "target_policy_logprobs": {                // Required: same for policies to evaluate
    "clone": -14.7,
    "parallel_universe_prompt": -18.3,
    "unhelpful": -42.1
  },
  "judge_score": 0.85,                       // Required: judge evaluation
  "oracle_label": 0.86                       // Optional: ground truth (5-10% is enough for calibration)
}

Key difference: Direct mode doesn't need logprobs! Just responses from each policy with judge scores (and optionally oracle labels for AutoCal-R calibration).
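
Before running IPS/DR, a quick sanity check that every logged record carries the required fields can save a failed run. Again a generic standard-library sketch (not a CJE API); the path is a placeholder.

import json

required = ("prompt", "response", "base_policy_logprob",
            "target_policy_logprobs", "judge_score")

with open("logs.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        missing = [k for k in required if k not in record]
        if missing:
            print(f"line {i}: missing fields {missing}")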

Working example: See examples/arena_sample/ for complete dataset examples with logged data and fresh draws.

Generating Log Probabilities

For IPS/DR modes, you need log probabilities. CJE includes built-in Fireworks API integration:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct"
)
if result.status == "success":
    logprob = result.value  # e.g., -2.3

This handles chat templates, tokenization, and API calls automatically. Supports all Fireworks models.
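
To fill in the target_policy_logprobs field of a logged record, the same call can be looped over each candidate policy. A sketch using compute_teacher_forced_logprob as shown above; the policy names and model identifiers are placeholders.

from cje.teacher_forcing import compute_teacher_forced_logprob

# Hypothetical mapping from policy name to Fireworks model identifier.
target_models = {
    "clone": "accounts/fireworks/models/llama-v3p2-3b-instruct",
    "bigger": "accounts/fireworks/models/llama-v3p1-70b-instruct",
}

record = {"prompt": "What is 2+2?", "response": "4"}
record["target_policy_logprobs"] = {}

for policy, model in target_models.items():
    r = compute_teacher_forced_logprob(
        prompt=record["prompt"],
        response=record["response"],
        model=model,
    )
    if r.status == "success":
        record["target_policy_logprobs"][policy] = r.value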

Don't have Fireworks access? Direct mode doesn't need logprobs - just use fresh_draws_dir with judge scores.

See cje/teacher_forcing/README.md for batch processing and advanced options.

Advanced: Choosing an Estimator

Most users: Use estimator="auto" (the default). CJE auto-selects the best estimator for your mode.

For researchers: You can specify estimators explicitly:

  • direct: On-policy comparison (no counterfactual inference)
  • calibrated-ips: IPS with variance-reduced weights (SIMCal)
  • stacked-dr: Ensemble of DR estimators (recommended for production)
  • Individual DR variants: dr-cpo, tmle, mrdr, oc-dr-cpo, tr-cpo-e
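
For example, to force a specific estimator instead of the automatic choice (a sketch, assuming analyze_dataset accepts the estimator keyword described above; paths are placeholders):

from cje import analyze_dataset

# DR mode with the stacked-dr ensemble (recommended for production).
result = analyze_dataset(
    logged_data_path="logs.jsonl",
    fresh_draws_dir="responses/",
    estimator="stacked-dr",
)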

See cje/estimators/README.md for technical details on each estimator.

Documentation

📚 Getting Started

🔧 For Engineers

  • Engineering Guide - Interface specs and patterns
  • Arena Experiment - Production pipeline example
  • Module READMEs - Each subdirectory in cje/ contains a developer-oriented README:
    • cje/estimators/README.md - Estimator implementations and hierarchy
    • cje/diagnostics/README.md - Diagnostic system architecture
    • cje/data/README.md - Data models and validation
    • cje/calibration/README.md - Calibration methods
    • cje/interface/README.md - High-level API details

📊 Additional Resources

  • API Reference - Coming soon
  • Mathematical Foundations - Coming soon
  • Troubleshooting Guide - Coming soon

Development

make install  # Install with Poetry
make test     # Run tests
make format   # Auto-format code
make lint     # Check code quality

Support

License

MIT - See LICENSE for details.


Ready to start? See the 5-Minute Quickstart.
