Multi-vendor LLM ensemble-judge framework with KEEP/REJECT/SPLIT/MERGE verdicts

These details have not been verified by PyPI

Project links

Project description

cross-judge

Multi-vendor LLM ensemble-judge framework with KEEP / REJECT / SPLIT / MERGE verdicts and Krippendorff α disagreement metrics.

cross-judge wires up a panel of LLM critics — across vendors, models, temperatures, and prompts — and aggregates their verdicts into a single consensus label, plus tells you how much they disagreed. It's a thin, focused library for the LLM-as-judge methodology with a deliberately small public surface (Critic, Ensemble, Verdict).

5-second pitch

from cross_judge import Critic, Ensemble

critics = [
    Critic(name="claude-strict",  model="anthropic/claude-sonnet-4.5", vendor="openrouter"),
    Critic(name="ds-pro-creative", model="deepseek-v4-pro",            vendor="deepseek",   temperature=0.7),
    Critic(name="kimi-rigor",     model="moonshot/kimi-k2",            vendor="openrouter"),
]
result = Ensemble(critics, voting="majority").judge(
    "Is this candidate a valid universality class?"
)
print(result.consensus)          # "KEEP"
print(result.agreement_pct)      # 0.67
print(result.krippendorff_alpha) # 0.0  (2 vs 1 = chance-level)

Why

LLM-as-judge results from a single model inherit that model's biases — anchoring, alignment, vendor-specific quirks. Running the same judgment task across multiple vendors / temperatures / prompts and aggregating verdicts is a cheap way to:

Mitigate vendor-specific bias — one model's blind spot ≠ all models'.
Surface contested items — high-disagreement items deserve human review.
Quantify confidence — Krippendorff α gives you a defensible number to put in a methods section.

This package was extracted from the structural-isomorphism project's B3 / B4 ensemble review pipeline (multi-vendor LLM review of candidate cross-domain universality classes).

Install

pip install cross-judge
# or, with the openai-python client as a convenience:
pip install 'cross-judge[openai]'

Dependencies: pydantic>=2, httpx>=0.27, pyyaml>=6. No openai-python required at v0.1 — we ship a minimal httpx-based POST to /v1/chat/completions to avoid version-coupling. Documented choice: standalone, not depending on guarded-llm at v0.1, so users can adopt cross-judge independently. (guarded-llm is a sibling package for single-call safety / cost guards.)

Quickstart: 3-model judge

from cross_judge import Critic, Ensemble

# Each critic: own model, own temperature, own prompt template.
# Templates use str.format with {query} and optional context keys.

claude = Critic(
    name="claude-strict",
    model="anthropic/claude-sonnet-4.5",
    vendor="openrouter",
    temperature=0.0,
    system_prompt="You are a strict reviewer. Output JSON only.",
    prompt_template=(
        "Judge: {query}\n\n"
        "Output JSON: {{"
        '"kind": "<KEEP|REJECT|SPLIT|MERGE|UNCLEAR>", '
        '"confidence": <0-1>, '
        '"reasoning": "<short>"'
        "}}"
    ),
)

deepseek = Critic(
    name="ds-pro-creative",
    model="deepseek-v4-pro",
    vendor="deepseek",
    temperature=0.7,
    system_prompt="You are a creative dissenter. Output JSON only.",
    prompt_template=claude.prompt_template,
)

kimi = Critic(
    name="kimi-rigor",
    model="moonshot/kimi-k2",
    vendor="openrouter",
    temperature=0.0,
    system_prompt="You are a rigorous reviewer. Output JSON only.",
    prompt_template=claude.prompt_template,
)

ensemble = Ensemble(
    critics=[claude, deepseek, kimi],
    voting="majority",
)

result = ensemble.judge(
    query="Bank failures and earthquake aftershocks both show power-law size distributions — same universality class?",
    query_id="candidate-q-001",
)

print(f"Consensus: {result.consensus}")
print(f"Agreement: {result.agreement_pct:.0%}")
print(f"Krippendorff α: {result.krippendorff_alpha:.3f}")
for v in result.verdicts:
    print(f"  {v.critic_id:20s} kind={v.kind:8s} conf={v.confidence:.2f}  {v.reasoning[:60]}")

Set API keys:

export OPENROUTER_API_KEY='sk-or-...'
export DEEPSEEK_API_KEY='sk-...'

Versioned prompts (recommended pattern)

Prompts have semantics, so they deserve semver. Ship YAML prompts under git-tracked files and pin to specific versions in production:

# prompts/b3_universality_critic.yaml
version: "0.1"
system_prompt: "You are a rigorous universality-class critic ..."
user_prompt_template: |
  Judge the following candidate ...
  {query}
  Output JSON: { "kind": ..., "confidence": ..., "reasoning": ... }

critic = Critic.from_yaml_prompt(
    name="b3-rigor",
    model="deepseek-v4-pro",
    yaml_path="prompts/b3_universality_critic.yaml",  # git-tagged
)

Bump version whenever you change wording or vocabulary. Pin in production to a git tag like prompts/b3_universality_critic.yaml@v0.1. Bundled default prompts ship under cross_judge/prompts/:

b3_universality_critic.yaml — research-grade universality-class critic
generic_universality_judge.yaml — domain-agnostic version

VerdictKind vocabulary

The default vocabulary is Literal["KEEP", "REJECT", "SPLIT", "MERGE", "UNCLEAR", "ERROR", "PARSE_FAIL"]:

kind	meaning
`KEEP`	accept candidate as-is
`REJECT`	discard (fails criteria)
`SPLIT`	accept but split into sub-classes (composite candidate)
`MERGE`	accept but merge with existing class (duplicate / overlap)
`UNCLEAR`	reviewer cannot decide
`ERROR`	LLM call failed (network / 5xx)
`PARSE_FAIL`	LLM responded but JSON couldn't be parsed

Free-form labels (PASS, FAIL, etc.) are also accepted — pass them as plain strings.

Voting strategies

name	rule	use when
`majority`	most common label wins; tie-break via `priority=[...]`	default, most ensembles
`unanimous`	only return label if all critics agree; else `fallback`	high-stakes, low false-positive
(custom)	pass any `Callable[[list[Verdict]], (str, bool)]`	weighted / domain-specific

# Conservative: only accept items all 3 critics endorsed
ensemble = Ensemble(critics, voting="unanimous", voting_kwargs={"fallback": "NEEDS_REVIEW"})

# Tie-break toward rejection (recall-leaning)
ensemble = Ensemble(critics, voting="majority", voting_kwargs={"priority": ["REJECT", "KEEP"]})

Disagreement metrics primer

cross-judge reports two metrics per ensemble call:

agreement_pct — fraction of critics that match the consensus. Simple, intuitive: "2 out of 3 critics said KEEP" → 0.67. Not adjusted for chance — 50% on a binary choice is no better than coin-flip.
krippendorff_alpha — chance-corrected inter-rater reliability (Krippendorff 2011). Values:
- α = 1.0 → perfect agreement (all critics identical)
- α = 0.0 → agreement equal to chance given the marginal distribution
- α = <0.0 → systematic disagreement (worse than random)
α > 0.667 is a common acceptance threshold for "substantial agreement" in content analysis. For LLM-as-judge ensembles, treat α < 0.4 as "the panel can't agree — surface to human review", and α > 0.8 as "ensemble is converged, ship it".

The metric is computed via the coincidence-matrix formulation for nominal data (Krippendorff 2011 eq. 4), with the small-sample (N-1) correction. Errored verdicts are excluded from the denominator.

Reproducibility

temperature=0.0 gives near-deterministic LLM behavior per vendor (cache behavior varies — DeepSeek and OpenAI are usually deterministic at temp=0; OpenRouter passes through). For paper-grade reproducibility:

critic = Critic(name="..., model="...", temperature=0.0)
# Pin the prompt YAML version:
critic = Critic.from_yaml_prompt(..., yaml_path="prompts/b3_universality_critic@v0.1.yaml")

Combine with deterministic aggregation (voting="unanimous" or voting="majority" with explicit priority) for end-to-end reproducible runs.

Error handling

Critic calls catch all exceptions and return Verdict(kind="ERROR", error="...") rather than raising. Aggregation strategies skip errored verdicts — they don't tank consensus. Inspect result.verdicts for per-critic error strings. JSON parse failures produce Verdict(kind="PARSE_FAIL", error="parse_fail") with the raw response captured in raw_response for audit.

Legacy API

The original Reviewer / JudgePanel / aggregation surface is preserved for backward compatibility with the structural-isomorphism v4/scripts/b3_ensemble.py pipeline. New code should prefer Critic / Ensemble / Verdict. See cross_judge/reviewer.py and cross_judge/panel.py for legacy docstrings.

License

MIT. See LICENSE.

Citation

If cross-judge contributes to a paper, please cite the structural-isomorphism project where the multi-vendor ensemble judging pattern + Krippendorff α reporting were developed:

dada8899. Cross-domain structural isomorphism: a universality-class taxonomy via multi-vendor LLM ensemble review. 2026. github.com/dada8899/structural-isomorphism

See also the C1 preprint (linked from the repo README) for the methodology section's full description of the ensemble + Krippendorff α reporting protocol.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

May 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cross_judge-0.1.0.tar.gz (35.5 kB view details)

Uploaded May 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cross_judge-0.1.0-py3-none-any.whl (27.9 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file cross_judge-0.1.0.tar.gz.

File metadata

Download URL: cross_judge-0.1.0.tar.gz
Upload date: May 24, 2026
Size: 35.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for cross_judge-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4f236279bbc98bcafcfe2089553c685affd81b3daa0e22034b726175766b2cdb`
MD5	`1d05123b3b14fcb4a13dfa734585824d`
BLAKE2b-256	`d7435fb3be31f47e88bd0f1aeebc18436809c87ca4b4e4eaadcce0de8040e5b6`

See more details on using hashes here.

File details

Details for the file cross_judge-0.1.0-py3-none-any.whl.

File metadata

Download URL: cross_judge-0.1.0-py3-none-any.whl
Upload date: May 24, 2026
Size: 27.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for cross_judge-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`82e7c20c3c07302f4145d78b781adc679752eb523d9c8ac77444badb4135d60e`
MD5	`e7478d7362bb4242cc8dff609ffc581d`
BLAKE2b-256	`33f8c5c07e8dbe54875bb096f516f35dfcaecbd55ed0d905e4d4549af524aa04`

See more details on using hashes here.

cross-judge 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

cross-judge

5-second pitch

Why

Install

Quickstart: 3-model judge

Versioned prompts (recommended pattern)

VerdictKind vocabulary

Voting strategies

Disagreement metrics primer

Reproducibility

Error handling

Legacy API

License

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes