winference

Win rate calibration under non-transitivity via Hodge decomposition and heterogeneous group testing

Maintainers

soodoku

winference: Win rate calibration under non-transitivity.

When you run an LLM arena and report "Model A beats Model B 62% of the time," is that number calibrated? And does it still hold when your users ask different questions than your evaluation set?

If model strengths vary across task types, aggregate win rates can exhibit non-transitive preferences: A beats B, B beats C, but C beats A. Standard Bradley-Terry / Elo gives each model a single scalar strength, which rules such cycles out; when they occur anyway, your calibration breaks, especially under distribution shift.
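
To see how such a cycle can arise, here is a minimal sketch with made-up numbers (the strengths `theta` and the uniform prompt mix `pi` are illustrative assumptions, not winference output). Each category is perfectly transitive on its own, yet the aggregate win rates form a cycle.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-category strengths theta[k, i]: rows are categories,
# columns are models A, B, C. Each row is a transitive Bradley-Terry
# ranking, but the order is rotated from one category to the next.
theta = np.array([
    [ 2.0,  0.0, -2.0],   # category 0: A > B > C
    [-2.0,  2.0,  0.0],   # category 1: B > C > A
    [ 0.0, -2.0,  2.0],   # category 2: C > A > B
])
pi = np.array([1 / 3, 1 / 3, 1 / 3])  # assumed prompt mix over categories

# Aggregate win matrix: P[i, j] = sum_k pi_k * sigmoid(theta[k, i] - theta[k, j])
diff = theta[:, :, None] - theta[:, None, :]
P = np.einsum("k,kij->ij", pi, sigmoid(diff))

A, B, C = 0, 1, 2
# All three aggregate win rates are above 0.5 (about 0.593 each): a cycle.
print(f"P(A>B) = {P[A, B]:.3f}, P(B>C) = {P[B, C]:.3f}, P(C>A) = {P[C, A]:.3f}")
```

No single scalar score per model can reproduce all three of these aggregate win rates at once, which is the situation the rest of this README addresses.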

winference provides two approaches to calibrating win rates in the presence of non-transitivity, plus diagnostics to decide which one you need.

The two approaches

A) Hodge decomposition → calibrate the transitive signal

Decomposes the pairwise comparison matrix into:

- Gradient (transitive): a potential s_i per model such that the log-odds ≈ s_i − s_j. This part can be calibrated to a scalar ranking.
- Curl (cyclic): rock-paper-scissors structure that cannot be represented by any linear ranking.

Calibrate win rates from the gradient component. Report the curl fraction as the share of variance your calibration ignores.

Use when: Cycles persist even after conditioning on task category — the non-transitivity is irreducible.
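
As a rough illustration of the idea (not the library's implementation), the sketch below splits a win-rate matrix into a gradient part and a residual. It assumes every pair has been compared and that all pairs carry equal weight, in which case the least-squares potential is simply the row mean of the log-odds matrix. The Quickstart passes `weights=tg.counts`, so winference itself presumably weights by comparison counts; the function names here are hypothetical.

```python
import numpy as np

def hodge_split(P):
    """Split pairwise log-odds into a gradient part and a residual.

    Assumes a complete comparison graph with equal weights (a simplification).
    Returns the potentials s and the transitive / cyclic shares of the
    squared log-odds.
    """
    P = np.asarray(P, dtype=float)
    Y = np.log(P) - np.log1p(-P)      # log-odds; skew-symmetric since P[j, i] = 1 - P[i, j]
    s = Y.mean(axis=1)                # least-squares potential, sums to zero
    grad = s[:, None] - s[None, :]    # transitive component s_i - s_j
    resid = Y - grad                  # what no scalar ranking can explain
    total = np.sum(Y ** 2)
    return s, np.sum(grad ** 2) / total, np.sum(resid ** 2) / total

def transitive_win_probability(s, i, j):
    """Win probability implied by the gradient component alone."""
    return 1.0 / (1.0 + np.exp(-(s[i] - s[j])))

# Illustrative win-rate matrix with both a ranking and a cycle in it.
P = np.array([
    [0.5, 0.7, 0.4],
    [0.3, 0.5, 0.8],
    [0.6, 0.2, 0.5],
])
s, transitive_share, cyclic_share = hodge_split(P)
print(f"Transitive: {transitive_share:.0%}, cyclic: {cyclic_share:.0%}")
print(f"P(0 beats 1), gradient only: {transitive_win_probability(s, 0, 1):.3f}")
```

The two shares sum to one because the gradient part and the residual are orthogonal under this equal-weight setup.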

B) Heterogeneous groups → calibrate per category, compose

Test whether model strengths differ across prompt categories (math, creative, coding, ...) using a likelihood-ratio test. If so, fit Bradley-Terry per category. Win rates for any target distribution are then:

P(A > B | π*) = Σ_k  π*_k · σ(θ_{A,k} − θ_{B,k})

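Here π*_k is the share of category k in the target prompt distribution, θ_{A,k} is model A's Bradley-Terry strength in category k, and σ is the logistic function. This gives you composable calibration: swap in any target distribution without refitting.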

Use when: Non-transitivity dissolves when you condition on prompt category.
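
The composition formula above can be sketched in a few lines. The per-category strengths below are invented for illustration, and `composite_win_probability` is a hypothetical helper rather than part of the winference API.

```python
import math

def composite_win_probability(theta_a, theta_b, target_distribution):
    """P(A > B | pi*) = sum_k pi*_k * sigmoid(theta_{A,k} - theta_{B,k})."""
    total = sum(target_distribution.values())
    if not math.isclose(total, 1.0):
        raise ValueError("target_distribution must sum to 1")
    return sum(
        weight / (1.0 + math.exp(-(theta_a[k] - theta_b[k])))
        for k, weight in target_distribution.items()
    )

# Hypothetical fitted strengths per category (not real winference output).
theta_a = {"reasoning": 1.5, "creative_writing": -1.0, "coding": 0.5}
theta_b = {"reasoning": 0.0, "creative_writing": 1.0, "coding": 0.0}

p_math_heavy = composite_win_probability(
    theta_a, theta_b,
    {"reasoning": 0.7, "creative_writing": 0.15, "coding": 0.15},
)
p_creative_heavy = composite_win_probability(
    theta_a, theta_b,
    {"reasoning": 0.15, "creative_writing": 0.7, "coding": 0.15},
)
print(f"math-heavy: {p_math_heavy:.3f}, creative-heavy: {p_creative_heavy:.3f}")
```

The same fitted strengths give a win rate above one half under the first mix and below one half under the second, which is why a single reported win rate can mislead under distribution shift.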

Quickstart

pip install numpy scipy scikit-learn matplotlib
# Then from the repo root:
pip install -e .

from winference import (
    TournamentGraph, BradleyTerry, HodgeDecomposition,
    GroupTest, GroupCalibrator, expected_calibration_error,
)
from winference.simulate import simulate_llm_arena

# 1. Simulate (or load) arena data
data = simulate_llm_arena()
comparisons = data["comparisons"]   # list of (model_a, model_b, a_wins)
categories  = data["categories"]    # list of category labels per comparison
models      = data["models"]

# 2. Graph triage: is non-transitivity a problem?
tg = TournamentGraph(models)
for a, b, w in comparisons:
    tg.add_result(a, b, w)

print(tg.summary())
# → {'nontransitivity_index': 0.83, 'cyclic_triples': 7, ...}

# 3a. Hodge decomposition
hd = HodgeDecomposition(models)
result = hd.fit(tg.win_rate_matrix(), weights=tg.counts)
print(f"Transitive: {result.transitive_variance:.0%}")
print(f"Cyclic:     {result.cyclic_variance:.0%}")

# Calibrated win probability (transitive component only)
p = hd.transitive_win_probability("ZetaMath", "DeltaWrite")

# 3b. Group heterogeneity test
groups = sorted(set(categories))
gt = GroupTest(models, groups)
gt.fit(comparisons, categories)
print(gt.test_result())
# → {'statistic': 342.1, 'p_value': 1.2e-63, 'reject_at_05': True}

# Composable win rates
gc = GroupCalibrator(gt)
p_math_heavy = gc.win_probability(
    "ZetaMath", "DeltaWrite",
    target_distribution={"reasoning": 0.7, "creative_writing": 0.15, "coding": 0.15},
)
p_creative_heavy = gc.win_probability(
    "ZetaMath", "DeltaWrite",
    target_distribution={"reasoning": 0.15, "creative_writing": 0.7, "coding": 0.15},
)

See examples/quickstart.py for the full pipeline with calibration comparison and reliability diagrams.

The diagnostic pipeline

┌─────────────────────────────┐
│  Build tournament graph     │
│  Run Tarjan's SCC           │
└──────────┬──────────────────┘
           │
     All SCCs size 1?
      ╱           ╲
    YES            NO
     │              │
  Standard BT    ┌──┴───────────────────────┐
  is fine        │  Condition on categories  │
                 │  Check: do SCCs shrink?   │
                 └──────┬───────────────┬────┘
                    YES                  NO
                     │                    │
              ┌──────┴──────┐    ┌───────┴────────┐
              │ Per-group   │    │ Hodge decomp   │
              │ BT + LRT    │    │ calibrate grad │
              │ + compose   │    │ report curl    │
              └─────────────┘    └────────────────┘
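
The decision flow above can be sketched as follows. This is an illustrative reading, not the package's code: it draws an edge i → j when the win rate exceeds 0.5, finds strongly connected components with a simple transitive closure instead of Tarjan's algorithm, and interprets "do SCCs shrink?" as "every category's largest SCC is smaller than the pooled one". The names `triage` and `P_by_category` are hypothetical.

```python
import numpy as np

def strongly_connected_components(adj):
    """SCCs of a small digraph via Warshall transitive closure (not Tarjan)."""
    adj = np.asarray(adj, dtype=bool)
    n = adj.shape[0]
    reach = adj | np.eye(n, dtype=bool)
    for k in range(n):
        reach |= reach[:, [k]] & reach[[k], :]
    mutual = reach & reach.T          # i and j can each reach the other
    seen, components = set(), []
    for i in range(n):
        if i not in seen:
            members = [int(j) for j in np.flatnonzero(mutual[i])]
            seen.update(members)
            components.append(members)
    return components

def nontransitivity_index(P):
    """Fraction of models that sit in an SCC of size > 1."""
    comps = strongly_connected_components(np.asarray(P) > 0.5)
    return sum(len(c) for c in comps if len(c) > 1) / len(P)

def triage(P_all, P_by_category):
    """Pick a branch of the diagnostic pipeline (assumed decision rule)."""
    def largest(P):
        return max(len(c) for c in strongly_connected_components(np.asarray(P) > 0.5))
    if largest(P_all) == 1:
        return "standard BT"
    if all(largest(P) < largest(P_all) for P in P_by_category.values()):
        return "per-group BT"
    return "Hodge decomposition"

def bt_matrix(theta):
    theta = np.asarray(theta, dtype=float)
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - theta[None, :])))

# Illustrative data: transitive within each category, cyclic when pooled.
P_by_category = {
    "reasoning": bt_matrix([2.0, 0.0, -2.0]),
    "creative_writing": bt_matrix([-2.0, 2.0, 0.0]),
    "coding": bt_matrix([0.0, -2.0, 2.0]),
}
P_all = sum(P_by_category.values()) / len(P_by_category)
print(nontransitivity_index(P_all), triage(P_all, P_by_category))
```

With real data the win rates are noisy, so a threshold at exactly 0.5 will flag cycles that are not statistically meaningful; treat this as a first look rather than a test.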

Other sources of non-transitivity

Non-transitivity in pairwise comparisons doesn't always come from heterogeneous model strengths. Before reaching for Hodge or group decomposition, rule out simpler causes:

| # | Source | What it is | How to address |
|---|--------|------------|----------------|
| 1 | Judge noise | LLM judge gives inconsistent verdicts on the same pair | Bayesian calibration (Dawid-Skene, BWRS) |
| 2 | Position bias | Judge prefers whichever response appears first/second | Randomise presentation order, average both orderings |
| 3 | Style/length bias | Judge rewards verbosity rather than quality | Regress out length/style features (cf. AlpacaEval 2.0) |
| 4 | Evaluator disagreement | Individual annotators are transitive, but they disagree with each other (Condorcet cycles) | Stratify by evaluator type, or accept the Hodge curl as measuring genuine disagreement |
| 5 | Fine-grained interaction | Strengths differ at sub-category level (algebra vs geometry within "math") | Per-prompt routing rather than aggregate ranking |
| 6 | Context effects | Evaluation of A vs B depends on what other models were seen | Session-aware experimental design |

winference targets the modelling layer (rows 4–5 in the table above: evaluator disagreement and fine-grained interaction). It assumes you've already addressed measurement issues at the design/preprocessing level.
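
As one example of such preprocessing, the position-bias remedy of averaging both orderings can be sketched like this. The counts are invented and the helper names are hypothetical; winference does not provide this step.

```python
def pooled_win_rate(wins_first, n_first, wins_second, n_second):
    """Naive win rate for A that ignores which position A was shown in."""
    return (wins_first + wins_second) / (n_first + n_second)

def order_balanced_win_rate(wins_first, n_first, wins_second, n_second):
    """Average A's win rate when shown first and when shown second."""
    return 0.5 * (wins_first / n_first + wins_second / n_second)

# Illustrative counts: A was shown first far more often than second,
# and the judge favours whichever response comes first.
wins_first, n_first = 56, 80     # A shown first: wins 70% of the time
wins_second, n_second = 8, 20    # A shown second: wins 40% of the time

print(pooled_win_rate(wins_first, n_first, wins_second, n_second))          # 0.64
print(order_balanced_win_rate(wins_first, n_first, wins_second, n_second))  # about 0.55
```

When presentation order is unbalanced, the pooled rate inherits the judge's position preference, while the order-balanced rate gives each ordering equal weight.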

API reference

`TournamentGraph`

Build a directed tournament graph and compute SCC structure.

- `.add_result(a, b, win)`: record one comparison
- `.strongly_connected_components()`: Tarjan's algorithm
- `.nontransitivity_index()`: fraction of models in non-trivial SCCs
- `.count_cyclic_triples()`: count A>B>C>A cycles
- `.summary()`: quick diagnostic dict
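
For intuition about the cyclic-triple count, here is a brute-force sketch over a boolean "beats" matrix. It is an assumption about what such a count looks like, not the method's actual implementation, and `beats` is a hypothetical input format.

```python
from itertools import combinations
from math import comb

import numpy as np

def count_cyclic_triples(beats):
    """Count unordered triples {i, j, k} whose majority edges form a 3-cycle."""
    beats = np.asarray(beats, dtype=bool)
    count = 0
    for i, j, k in combinations(range(len(beats)), 3):
        forward = beats[i, j] and beats[j, k] and beats[k, i]
        backward = beats[j, i] and beats[k, j] and beats[i, k]
        if forward or backward:
            count += 1
    return count

# Illustrative tournament: models 0, 1, 2 form a cycle; model 3 beats everyone.
beats = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
], dtype=bool)

n_cycles = count_cyclic_triples(beats)

# Cross-check: in a tournament with no ties, the number of cyclic triples
# equals C(n, 3) minus the sum over models of C(out_degree, 2).
out_degree = beats.sum(axis=1)
closed_form = comb(len(beats), 3) - sum(comb(int(d), 2) for d in out_degree)
print(n_cycles, closed_form)
```

The closed form only applies when every pair has a strict winner; the brute-force loop also handles missing or tied pairs, which simply never complete a cycle.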

`BradleyTerry`

Standard BT model via maximum likelihood.

- `.fit(comparisons)`: fit from (a, b, a_wins) triples
- `.win_probability(a, b)`: predicted P(a beats b)
- `.strengths()`: {model: θ}
- `.rank()`: models sorted by strength
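
For readers who want to see what a Bradley-Terry maximum-likelihood fit involves, the sketch below uses the classical minorisation-maximisation (Zermelo) updates on `(a, b, a_wins)` triples. Whether winference uses this optimiser is not stated here, so treat it as one possible approach. It assumes `a_wins` is truthy when `a` won and that every model has at least one win and one loss, otherwise the estimate diverges.

```python
import numpy as np

def fit_bradley_terry(comparisons, models, n_iter=500, tol=1e-12):
    """Return {model: theta} with thetas centred to sum to zero."""
    index = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros((n, n))
    for a, b, a_wins in comparisons:
        i, j = index[a], index[b]
        if a_wins:
            wins[i, j] += 1
        else:
            wins[j, i] += 1
    games = wins + wins.T              # games[i, j] = comparisons between i and j
    total_wins = wins.sum(axis=1)
    p = np.ones(n)
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p_new = total_wins / denom
        p_new /= np.exp(np.mean(np.log(p_new)))   # fix the scale: geometric mean 1
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return dict(zip(models, np.log(p)))

def win_probability(strengths, a, b):
    return 1.0 / (1.0 + np.exp(-(strengths[a] - strengths[b])))

# Illustrative data: A beats B 7/10, B beats C 6/10, A beats C 8/10.
comparisons = (
    [("A", "B", 1)] * 7 + [("A", "B", 0)] * 3
    + [("B", "C", 1)] * 6 + [("B", "C", 0)] * 4
    + [("A", "C", 1)] * 8 + [("A", "C", 0)] * 2
)
strengths = fit_bradley_terry(comparisons, ["A", "B", "C"])
print(sorted(strengths, key=strengths.get, reverse=True))
print(f"P(A beats C) = {win_probability(strengths, 'A', 'C'):.3f}")
```

The strengths are only identified up to an additive constant, so the sketch centres them; only differences between strengths affect predicted win probabilities.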

`HodgeDecomposition`

Hodge decomposition of the pairwise log-odds matrix.

- `.fit(W, weights)`: decompose win-rate matrix
- `.transitive_win_probability(a, b)`: P(a>b) from gradient only
- `.worst_pairs(k)`: pairs with largest cyclic residual
- `.summary()`: variance fractions (transitive / cyclic / harmonic)

`GroupTest`

Likelihood-ratio test for heterogeneity across prompt groups.

- `.fit(comparisons, group_labels)`: fit null + per-group BT
- `.test_result()`: {statistic, df, p_value, reject_at_05}
- `.per_group_strengths()`: {group: {model: θ}}
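
The likelihood-ratio test compares one pooled Bradley-Terry fit (the null) against separate fits per group. A minimal sketch follows, assuming win-count matrices per group, MM updates for the fits, and the usual chi-squared reference distribution with (G − 1)(M − 1) degrees of freedom for G groups and M models. The helper names and data are hypothetical, and the library's own test may differ in detail.

```python
import numpy as np
from scipy.stats import chi2

def bt_max_loglik(wins, n_iter=500):
    """Maximised Bradley-Terry log-likelihood for a win-count matrix."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T
    p = np.ones(len(wins))
    for _ in range(n_iter):                      # MM (Zermelo) updates
        p = wins.sum(axis=1) / (games / (p[:, None] + p[None, :])).sum(axis=1)
        p /= p.sum()
    prob = p[:, None] / (p[:, None] + p[None, :])
    played = wins > 0
    return float(np.sum(wins[played] * np.log(prob[played])))

def heterogeneity_lrt(wins_by_group):
    """LRT of 'one strength vector for all groups' against per-group strengths."""
    groups = list(wins_by_group.values())
    pooled = sum(np.asarray(w, dtype=float) for w in groups)
    ll_null = bt_max_loglik(pooled)
    ll_alt = sum(bt_max_loglik(w) for w in groups)
    statistic = max(2.0 * (ll_alt - ll_null), 0.0)
    df = (len(groups) - 1) * (pooled.shape[0] - 1)
    p_value = float(chi2.sf(statistic, df))
    return {"statistic": statistic, "df": df, "p_value": p_value,
            "reject_at_05": p_value < 0.05}

# Illustrative counts: wins[i, j] = times model i beat model j, 20 games per pair.
wins_by_group = {
    "reasoning": np.array([[0, 16, 18], [4, 0, 12], [2, 8, 0]]),
    "creative_writing": np.array([[0, 5, 3], [15, 0, 9], [17, 11, 0]]),
}
result = heterogeneity_lrt(wins_by_group)
print(result)
```

The chi-squared approximation relies on reasonably large counts in every group; with sparse groups a permutation or bootstrap reference distribution would be a safer choice.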

`GroupCalibrator`

Composable win rates from per-group BT.

- `.win_probability(a, b, target_distribution)`: composite P(a>b)
- `.sensitivity_analysis(a, b)`: how much does P(a>b) vary with π*?
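
One way to think about sensitivity to π*: the composite win probability is linear in π*, so over all possible target distributions its minimum and maximum are reached by putting all weight on a single category. The sketch below computes that range from hypothetical per-category strengths; it illustrates the idea and is not necessarily what `.sensitivity_analysis` returns.

```python
import numpy as np

def per_category_win_probability(theta_a, theta_b):
    """sigmoid(theta_{A,k} - theta_{B,k}) for each category k."""
    return {k: 1.0 / (1.0 + np.exp(-(theta_a[k] - theta_b[k]))) for k in theta_a}

def sensitivity_range(theta_a, theta_b):
    """Smallest and largest composite P(A > B) over all target distributions."""
    per_cat = per_category_win_probability(theta_a, theta_b)
    worst = min(per_cat, key=per_cat.get)
    best = max(per_cat, key=per_cat.get)
    return (worst, per_cat[worst]), (best, per_cat[best])

# Hypothetical fitted strengths per category.
theta_a = {"reasoning": 1.5, "creative_writing": -1.0, "coding": 0.5}
theta_b = {"reasoning": 0.0, "creative_writing": 1.0, "coding": 0.0}

(worst_cat, p_min), (best_cat, p_max) = sensitivity_range(theta_a, theta_b)
print(f"P(A>B) ranges from {p_min:.3f} ({worst_cat}) to {p_max:.3f} ({best_cat})")
```

A wide range means the headline win rate depends heavily on the prompt mix; a narrow one means the choice of target distribution matters little for this pair.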

Calibration utilities

- `expected_calibration_error(predicted, observed)`: ECE
- `brier_score(predicted, observed)`: Brier score
- `reliability_diagram(predicted, observed, ax=...)`: reliability plot
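
For reference, the two scalar metrics can be written in a few lines of NumPy. The sketch assumes `observed` holds binary outcomes and uses ten equal-width bins for ECE; the binning in winference's own `expected_calibration_error` may differ, and the `_sketch` names are chosen to avoid confusion with the library functions.

```python
import numpy as np

def brier_score_sketch(predicted, observed):
    """Mean squared difference between predicted probability and outcome."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.mean((predicted - observed) ** 2))

def ece_sketch(predicted, observed, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Bin index per prediction; a prediction of exactly 1.0 goes in the last bin.
    bins = np.minimum((predicted * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            gap = abs(predicted[in_bin].mean() - observed[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by share of samples in the bin
    return float(ece)

# Perfectly calibrated on average: predicts 0.25, one win in four.
print(ece_sketch([0.25] * 4, [1, 0, 0, 0]), brier_score_sketch([0.25] * 4, [1, 0, 0, 0]))
# Overconfident: predicts 0.9, but still only one win in four.
print(ece_sketch([0.9] * 4, [1, 0, 0, 0]))
```

ECE measures only the gap between predicted and observed frequencies within bins, while the Brier score also penalises predictions that are calibrated but uninformative.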

Simulators

- `simulate_transitive(...)`: pure BT (no cycles)
- `simulate_heterogeneous(...)`: per-category strengths
- `simulate_rock_paper_scissors(...)`: irreducible cyclic structure
- `simulate_llm_arena(...)`: realistic six-model arena
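
To give a sense of what simulated data in the `(model_a, model_b, a_wins)` format looks like, here is a stand-alone sketch of a rock-paper-scissors generator. It is not the package's `simulate_rock_paper_scissors`; the function name, parameters and model names are all assumptions made for illustration.

```python
import numpy as np

def simulate_rps_sketch(n_per_pair=2000, p_win=0.7, seed=0):
    """Generate comparisons where each model beats exactly one other with prob p_win."""
    rng = np.random.default_rng(seed)
    models = ["Rock", "Paper", "Scissors"]
    favoured_pairs = [("Paper", "Rock"), ("Scissors", "Paper"), ("Rock", "Scissors")]
    comparisons = []
    for favoured, other in favoured_pairs:
        outcomes = rng.random(n_per_pair) < p_win   # True when the favoured model wins
        comparisons.extend((favoured, other, bool(w)) for w in outcomes)
    return models, comparisons

models, comparisons = simulate_rps_sketch()

# Empirical win rate of the first-listed model in each pair.
rates = {}
for a, b, a_wins in comparisons:
    wins, games = rates.get((a, b), (0, 0))
    rates[(a, b)] = (wins + int(a_wins), games + 1)
for (a, b), (wins, games) in rates.items():
    print(f"{a} beats {b}: {wins / games:.2f}")
```

Data like this has no transitive signal at all, so it is a useful stress test: a Bradley-Terry fit should return nearly equal strengths, and a Hodge decomposition should attribute almost everything to the cyclic part.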

References

Jiang, X., Lim, L.-H., Yao, Y., & Ye, Y. (2011). Statistical ranking and combinatorial Hodge theory. Mathematical Programming, 127(1), 203–244.
Dittrich, R., Hatzinger, R., & Katzenbeisser, W. (1998). Modelling the effect of subject-specific covariates in paired comparison studies. Applied Statistics, 47(4), 511–525.
Xu, Y., Ruis, L., Rocktäschel, T., & Kirk, R. (2025). Investigating non-transitivity in LLM-as-a-Judge. ICML 2025.
Li, X. & Li, S. (2025). Efficient inference for covariate-adjusted Bradley-Terry model with covariate shift. arXiv:2503.18256.
Balduzzi, D., Tuyls, K., Perolat, J., & Graepel, T. (2018). Re-evaluating evaluation. NeurIPS.

License

MIT

Release history

This version

0.1.0

Mar 8, 2026

Download files

Source Distribution

winference-0.1.0.tar.gz (16.3 kB view details)

Uploaded Mar 8, 2026 Source

Built Distribution

winference-0.1.0-py3-none-any.whl (19.8 kB view details)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file winference-0.1.0.tar.gz.

File metadata

Download URL: winference-0.1.0.tar.gz
Upload date: Mar 8, 2026
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for winference-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3cb646987a696213b63572f00474340d36afd755560c661bace26ace6d1c74b7`
MD5	`3dcf0a130092832b2e05dffdfd99e7d7`
BLAKE2b-256	`1dc8106ac872f3689d0a672148132cfa62b09875104ab55d1704402f35794570`

Provenance

The following attestation bundles were made for winference-0.1.0.tar.gz:

Publisher: python-publish.yml on finite-sample/winference

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: winference-0.1.0.tar.gz
- Subject digest: 3cb646987a696213b63572f00474340d36afd755560c661bace26ace6d1c74b7
- Sigstore transparency entry: 1059862263
- Sigstore integration time: Mar 8, 2026
Source repository:
- Permalink: finite-sample/winference@6d34eab24be8f7e0bbaf744211e58d46ff99dac7
- Branch / Tag: refs/heads/main
- Owner: https://github.com/finite-sample
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@6d34eab24be8f7e0bbaf744211e58d46ff99dac7
- Trigger Event: workflow_dispatch

File details

Details for the file winference-0.1.0-py3-none-any.whl.

File metadata

Download URL: winference-0.1.0-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 19.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for winference-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b9f606af4005868e2309056a0a6b7d61e5d6d91d3ee9e365a64ce1d905ee7737`
MD5	`8c8b92980fa26a7d333ad755cc1b3e71`
BLAKE2b-256	`c1878dcd2d221063f183d3814203ec21033790d9b30c460bcb8fcdeefdec3351`

Provenance

The following attestation bundles were made for winference-0.1.0-py3-none-any.whl:

Publisher: python-publish.yml on finite-sample/winference

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: winference-0.1.0-py3-none-any.whl
- Subject digest: b9f606af4005868e2309056a0a6b7d61e5d6d91d3ee9e365a64ce1d905ee7737
- Sigstore transparency entry: 1059862266
- Sigstore integration time: Mar 8, 2026
Source repository:
- Permalink: finite-sample/winference@6d34eab24be8f7e0bbaf744211e58d46ff99dac7
- Branch / Tag: refs/heads/main
- Owner: https://github.com/finite-sample
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@6d34eab24be8f7e0bbaf744211e58d46ff99dac7
- Trigger Event: workflow_dispatch
