pairjudge

Train and serve pairwise LLM judges (A/B/tie) with budget-aware multi-turn packing and position-bias correction

Project description

pairjudge — train and serve pairwise LLM judges (A wins / B wins / tie): budget-aware multi-turn packing, position-bias correction, pseudo-label distillation. Kaggle Gold, 4th of 1,849 teams.

pairjudge is the generalized core of the 4th-place (gold medal) solution to Kaggle's LMSYS — Chatbot Arena Human Preference Predictions (1,849 teams), extracted into a small, tested library you can run on your own preference data with any Hugging Face backbone. The exact competition artifacts are preserved untouched in competition/, and a golden test pins the library's default behavior to the medal-winning code byte for byte.

Use it when you need a model that answers: given a prompt and two candidate responses, which one would a human prefer — or is it a tie? That model is the engine behind response reranking, A/B evaluation of fine-tunes, RLHF/RLAIF reward signals, and arena-style leaderboards.

Why not just an off-the-shelf reward model?

Three problems show up the moment you train a pairwise judge on real conversations, and they are exactly what this library packages:

1. Truncation silently destroys the comparison. A judge input holds a multi-turn conversation plus two responses per turn. With naive left- or right-truncation, long inputs routinely lose response B (or the prompt) entirely — the judge then learns position artifacts instead of preferences. PairPacker packs rounds greedily and, when the budget runs out, truncates the final round proportionally (default 20% prompt / 40% response A / 40% response B), marks every cut with an explicit ellipsis, and drops rounds that can't be shown honestly. Guarantee: never exceeds max_length, and every retained round shows all three fields.

One packed example (fixed `max_length` token budget):

Segment	Contents
`BOS`	start of sequence
Round 1 (fits in full)	prompt, response A, response B
Round 2 (over budget, proportional truncation)	prompt `……` (20% of remainder), response A `……` (40% of remainder), response B `……` (40% of remainder)
Tail	verdict prompt + `EOS`

A round that would get fewer than min_tail_budget (default 80) content tokens is dropped entirely, along with every later round; `……` marks each cut. Response B can never be silently pushed out of the sequence.

2. Pairwise judges have position bias. Swap A and B and a naive judge changes its verdict on a measurable fraction of pairs. PairwiseJudge.predict_proba(swap_debias=True) scores each pair in both orders and averages in the original frame — order-invariant by construction. position_flip_rate() measures how biased your judge is before you decide to pay the 2x compute.

3. Human preference labels are scarce and noisy. The medal recipe is a two-phase semi-supervised loop: train on human labels → pseudo-label a large unlabeled pool with full probability distributions → retrain with soft-label KL distillation (label_mode: soft). Ties are a first-class third category throughout — real human preference data is full of them, and scalar Bradley–Terry reward models (e.g. TRL's RewardTrainer, num_labels=1) cannot represent them.
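The proportional split described in item 1 can be sketched in a few lines. This is a minimal illustration and not PairPacker's actual code: the helper names `split_tail_budget` and `truncate_with_ellipsis`, the integer rounding rule, and the choice to count the ellipsis marker as one token are all assumptions made for the example.

```python
def split_tail_budget(remaining, percents=(20, 40, 40), min_tail_budget=80):
    """Split the leftover token budget across prompt / response A / response B.

    Returns None when the round should be dropped instead of shown.
    """
    if remaining < min_tail_budget:
        return None
    prompt = remaining * percents[0] // 100
    resp_a = remaining * percents[1] // 100
    # Rounding leftovers go to response B (an assumption of this sketch).
    resp_b = remaining - prompt - resp_a
    return prompt, resp_a, resp_b


def truncate_with_ellipsis(tokens, budget, marker="……"):
    """Cut `tokens` to at most `budget` items, marking the cut explicitly."""
    if budget <= 0:
        return []
    if len(tokens) <= budget:
        return list(tokens)
    # Reserve one slot for the marker so the result never exceeds the budget.
    return list(tokens[: budget - 1]) + [marker]


budgets = split_tail_budget(1000)
print(budgets)
print(len(truncate_with_ellipsis(["tok"] * 500, budgets[0])))
```

The point of the design is that all three fields share the shortfall, so a long prompt or a long response A cannot consume the whole budget and hide response B.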

Install

pip install pairjudge             # core: packing + data loaders (no torch)
pip install "pairjudge[judge]"    # + inference (torch, transformers)
pip install "pairjudge[train]"    # + LoRA training (peft, datasets, accelerate)

For an editable source checkout:

pip install -e ".[train]"

60 seconds

from pairjudge import PairPacker, PackerConfig, from_pairs

# 1. Pack pairwise conversations into a token budget — any HF tokenizer.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
packer = PairPacker(tok, PackerConfig(max_length=2048))
packed = packer.pack(
    prompts=["Explain quantum entanglement to a 10-year-old."],
    responses_a=["Imagine two magic coins..."],
    responses_b=["Quantum entanglement is a physical phenomenon..."],
)
packed.input_ids      # <= 2048 tokens, prompt + BOTH responses guaranteed visible
packed.truncated      # False — everything fit

# 2. Judge a pair with a trained model, position-bias-free.
from pairjudge import PairwiseJudge
judge = PairwiseJudge.from_pretrained("path/to/your/judge")
df = from_pairs(
    prompts=["Explain quantum entanglement to a 10-year-old."],
    responses_a=["Imagine two magic coins..."],
    responses_b=["Quantum entanglement is a physical phenomenon..."],
)
judge.predict_proba(df, swap_debias=True)   # [[p_a_wins, p_b_wins, p_tie]]
judge.position_flip_rate(df)                # how order-sensitive is my judge?
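To make the two calls above less of a black box, here is a rough sketch of what swap-debias averaging and a flip-rate measurement amount to. It is not the library's implementation: `score` stands in for any function returning `[p_a_wins, p_b_wins, p_tie]`, and `to_original_frame`, `swap_debias`, `flip_rate` and the toy scorer are hypothetical names used only here.

```python
def to_original_frame(probs):
    """Map probabilities scored in (B, A) order back to the (A, B, tie) frame."""
    p_first, p_second, p_tie = probs
    return [p_second, p_first, p_tie]


def swap_debias(score, prompt, resp_a, resp_b):
    """Score the pair in both orders and average in the original frame."""
    fwd = score(prompt, resp_a, resp_b)
    rev = to_original_frame(score(prompt, resp_b, resp_a))
    return [(f + r) / 2 for f, r in zip(fwd, rev)]


def flip_rate(score, pairs):
    """Fraction of pairs whose top verdict changes when the order is swapped."""
    flips = 0
    for prompt, a, b in pairs:
        fwd = score(prompt, a, b)
        rev = to_original_frame(score(prompt, b, a))
        flips += fwd.index(max(fwd)) != rev.index(max(rev))
    return flips / len(pairs)


def first_slot_biased(prompt, first, second):
    """Toy judge that always favors whichever response is shown first."""
    return [0.5, 0.3, 0.2]


pairs = [("q1", "x", "y"), ("q2", "u", "v")]
print(swap_debias(first_slot_biased, *pairs[0]))
print(flip_rate(first_slot_biased, pairs))
```

Because the averaged result is built from both orders, swapping the inputs only swaps the first two entries of the output, which is what "order-invariant by construction" means.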

Train your own judge

# Small judge on one consumer GPU (Qwen2.5-0.5B, ungated):
python -m pairjudge.training --cfg examples/configs/quickstart.yaml

# The competition setup (gemma-2-9b-it, 4x A100):
python -m pairjudge.training --cfg examples/configs/reproduce_competition.yaml

Input is either an Arena-format CSV (the Kaggle competition schema) or a parquet with canonical columns — prompt / response_a / response_b as per-round string lists plus one-hot (or soft) winner_* columns. pairjudge.data ships loaders for Arena CSVs and UltraFeedback-style chosen/rejected data, plus from_pairs() for plain Python lists.

Hard labels are validated as strict one-hot values. Soft labels must be finite, non-negative probability distributions that sum to one; malformed batches fail before training rather than being silently truncated or relabeled.
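The label rules above are simple enough to check by hand before handing data to the trainer. The following is a sketch under stated assumptions, not the library's validator: `validate_labels`, the `mode` argument and the `1e-6` tolerance are hypothetical choices for illustration.

```python
import math


def validate_labels(rows, mode="hard", tol=1e-6):
    """Raise ValueError on the first malformed (A wins, B wins, tie) label row."""
    for i, row in enumerate(rows):
        if len(row) != 3:
            raise ValueError(f"row {i}: expected 3 values, got {len(row)}")
        if not all(math.isfinite(x) for x in row):
            raise ValueError(f"row {i}: non-finite value")
        if any(x < 0 for x in row):
            raise ValueError(f"row {i}: negative probability")
        if abs(sum(row) - 1.0) > tol:
            raise ValueError(f"row {i}: values do not sum to one")
        if mode == "hard" and sorted(row) != [0, 0, 1]:
            raise ValueError(f"row {i}: hard label is not strictly one-hot")


validate_labels([[1, 0, 0], [0, 0, 1]], mode="hard")
validate_labels([[0.2, 0.3, 0.5]], mode="soft")
print("labels ok")
```

Failing loudly on the first bad row mirrors the behavior described above: a malformed batch stops the run instead of being silently repaired.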

The full two-phase distillation loop:

flowchart LR
    H["human-labeled pairs<br>Arena 55k + 33k"] -- "phase 1 · CE loss" --> J1["judge v1<br>(LoRA fine-tune)"]
    U["unlabeled pool<br>UltraFeedback 30k"] --> P["pseudo-label with judge v1<br>keep full distributions"]
    J1 --> P
    H -- "phase 2" --> J2["judge v2 — final"]
    P -- "soft labels · KL loss" --> J2
    J2 -- "swap-debias TTA" --> O["order-invariant<br>predictions"]

# Phase 1: train on human labels
python -m pairjudge.training --cfg phase1.yaml                  # label_mode: hard

# Pseudo-label an unlabeled pool with the phase-1 judge (soft labels)
python -m pairjudge.pseudo_label \
    --model ./output/judge/merged \
    --data pool.parquet --out pool_pl.parquet --swap-debias

# Phase 2: retrain from scratch on human + soft labels with KL loss
python -m pairjudge.training --cfg phase2.yaml                  # label_mode: soft

In the competition, this loop (88k human-labeled + 30k pseudo-labeled UltraFeedback conversations) was a decisive part of the gap between a good model and a gold-medal one.
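For readers who want to see the difference between the two phases' objectives, here is a dependency-free sketch of cross-entropy on a hard label versus KL divergence against a soft pseudo-label. The function names are hypothetical, and the KL direction (target relative to prediction, no temperature) is an assumption; the source only states that phase 2 uses soft-label KL distillation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def hard_label_ce(logits, label):
    """Phase-1 style loss: cross-entropy against a class index."""
    return -log_softmax(logits)[label]


def soft_label_kl(logits, target):
    """Phase-2 style loss: KL(target || softmax(logits)), with 0 * log 0 = 0."""
    logp = log_softmax(logits)
    return sum(t * (math.log(t) - lp) for t, lp in zip(target, logp) if t > 0)


logits = [1.2, 0.3, -0.5]                         # judge outputs for (A wins, B wins, tie)
print(hard_label_ce(logits, 0))                   # human label: A wins
print(soft_label_kl(logits, [0.6, 0.25, 0.15]))   # pseudo-label distribution
```

With a one-hot target the KL term reduces to ordinary cross-entropy, so soft labels can be seen as a generalization that also carries the phase-1 judge's uncertainty.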

Inference guardrails

Two degenerate cases (an empty response, or two identical responses) are worth handling outside the model. On competition data, overriding them with fixed priors gave a measurable log-loss improvement:

from pairjudge import empty_and_identical_masks

# `proba` is assumed to be an (n, 3) array of [p_a_wins, p_b_wins, p_tie]
# (for example, the output of judge.predict_proba), aligned row by row with `raw_df`.
a_empty, b_empty, identical = empty_and_identical_masks(raw_df)
proba[a_empty]  = [0.04, 0.88, 0.08]   # empty response loses, but never bet 1.0
proba[b_empty]  = [0.88, 0.04, 0.08]   # labels are noisy; log-loss punishes overconfidence
proba[identical] = [0.06, 0.06, 0.88]  # identical responses are a tie
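A short calculation shows why the overrides stop well short of 1.0. The sketch below compares expected log-loss for the hedged prior against a near-certain prediction; the label distribution `noisy_truth` is a made-up illustration of annotator noise and is not a figure from the competition.

```python
import math


def expected_log_loss(true_dist, predicted):
    """Expected negative log-likelihood of `predicted` under `true_dist`."""
    return -sum(q * math.log(p) for q, p in zip(true_dist, predicted) if q > 0)


# Hypothetical: when response A is empty, annotators still pick A or tie sometimes.
noisy_truth = [0.03, 0.92, 0.05]

hedged = [0.04, 0.88, 0.08]                  # the override used for an empty response A
overconfident = [1e-7, 1 - 2e-7, 1e-7]       # "B certainly wins"

print(expected_log_loss(noisy_truth, hedged))
print(expected_log_loss(noisy_truth, overconfident))
```

Under any label distribution with some noise, the near-certain prediction pays a large penalty on the few rows where the unexpected label appears, which outweighs the small gain on the rest.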

How it relates to TRL's `RewardTrainer`

Aspect	TRL `RewardTrainer`	`pairjudge`
Output	scalar reward (`num_labels=1`)	3-class distribution (A / B / tie)
Loss	Bradley–Terry (logsigmoid of reward gap)	CE on human labels, KL on soft pseudo-labels
Ties	not representable	first-class
Multi-turn pair truncation	generic	proportional, all-fields-guaranteed
Position bias	n/a at inference (scores singletons)	swap-debias averaging + flip-rate diagnostic

If you need a scalar reward for PPO-style RLHF, use TRL. If you need a judge that compares two concrete responses — for evaluation, reranking, data labeling, or arena prediction — and your data has ties, this is the recipe that placed 4th of 1,849 on exactly that task.
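The "Ties: not representable" row follows directly from the Bradley–Terry form. As a small illustration (the function name and reward values are hypothetical, and this is the textbook formula and not TRL's code), a scalar reward gap always yields two probabilities that sum to one, leaving no mass for a tie:

```python
import math


def bradley_terry(reward_a, reward_b):
    """P(A preferred) and P(B preferred) from two scalar rewards."""
    p_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    return p_a, 1.0 - p_a


# Equal rewards give a 50/50 split, which is indistinguishable from
# "annotators are split" and cannot express "annotators called it a tie".
print(bradley_terry(1.3, 1.3))
print(bradley_terry(2.0, 0.5))
```

A three-class head, by contrast, can put most of its probability on the tie class when two responses are equally good.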

Measured: position bias on real preference data

How big is position bias in practice? examples/position_bias_experiment.py trains a judge end to end through the library's public API and measures it on real data — Qwen2.5-0.5B-Instruct, LoRA, 16k training pairs from the public Arena 55k dataset, 2,000 held-out pairs, one RTX 4080 (16 GB), ~25 minutes:

The judge changes its verdict on 29.2% of pairs when the same two responses are presented in the opposite order.

metric (2,000 held-out pairs)	single pass (A, B)	swap-debiased
log-loss	1.0496	1.0462
accuracy	45.6%	45.1%

Swap debiasing improves the proper scoring metric (log-loss) and, by construction, makes the verdict independent of presentation order; top-1 accuracy stays flat within noise at this model scale. The same averaging was part of the gold-medal submission at 9B scale. Reproduce with:

pip install -e ".[train]"
python examples/position_bias_experiment.py

Numbers above are from a small judge trained in 25 minutes — treat them as a bias measurement, not a quality ceiling; a library-native config close to the competition phase-1 recipe (gemma-2-9b-it, ~88k pairs, max_length 2048) is in examples/configs/reproduce_competition.yaml. (The gold run's pseudo-label pass used max_length 3072 over ~100k pairs; see competition/.)

Provenance & validation

- The competition scripts, configs, inference notebook and certificate are preserved verbatim in competition/, including the full original write-up.
- tests/test_packing.py::TestCompetitionEquivalence fuzzes 1,500 conversations against a verbatim copy of the competition tokenizer (tests/reference_impl.py) and asserts byte-identical output with default settings: the library is the medal-winning code, not a reimplementation of it.
- Final leaderboard: 4th / 1,849 (gold medal, $20,000 prize).

Citation

@misc{li2024pairjudge,
  author = {Daoyuan Li},
  title  = {pairjudge: pairwise LLM judges with budget-aware packing and position-bias correction},
  year   = {2024},
  url    = {https://github.com/DaoyuanLi2816/pairjudge},
  note   = {Generalized from the 4th-place solution, Kaggle LMSYS Chatbot Arena Human Preference Predictions}
}

License

MIT — see LICENSE.

Author

Daoyuan Li — Kaggle (distiller) · lidaoyuan2816@gmail.com

Project details

Maintainers

lidaoyuan

Release history

- 0.2.0 (this version): Jul 24, 2026
- 0.1.2: Jul 11, 2026
- 0.1.1: Jun 10, 2026
- 0.1.0: Jun 10, 2026

Download files

Source Distribution

pairjudge-0.2.0.tar.gz (34.0 kB), uploaded Jul 24, 2026, Source

Built Distribution

pairjudge-0.2.0-py3-none-any.whl (25.5 kB), uploaded Jul 24, 2026, Python 3

File details

Details for the file pairjudge-0.2.0.tar.gz.

File metadata

Download URL: pairjudge-0.2.0.tar.gz
Upload date: Jul 24, 2026
Size: 34.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.14

File hashes

Hashes for pairjudge-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`2406cbf3723d2d8756f7c02dd7a3c846e56617d0c3283806835c44008665ecd3`
MD5	`da4fb0e0f120b5edeaf166bbaf1f228f`
BLAKE2b-256	`2932401684bef4c08c92dc677d0cc1b3f79fa1c29d8fa8089a11439a22bc3e4d`

Provenance

The following attestation bundles were made for pairjudge-0.2.0.tar.gz:

Publisher: release.yml on DaoyuanLi2816/pairjudge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pairjudge-0.2.0.tar.gz
- Subject digest: 2406cbf3723d2d8756f7c02dd7a3c846e56617d0c3283806835c44008665ecd3
- Sigstore transparency entry: 2230530560
- Sigstore integration time: Jul 24, 2026
Source repository:
- Permalink: DaoyuanLi2816/pairjudge@06f77034918e7c891e1d6a1f763737750bb50e1b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/DaoyuanLi2816
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@06f77034918e7c891e1d6a1f763737750bb50e1b
- Trigger Event: release

File details

Details for the file pairjudge-0.2.0-py3-none-any.whl.

File metadata

Download URL: pairjudge-0.2.0-py3-none-any.whl
Upload date: Jul 24, 2026
Size: 25.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.14

File hashes

Hashes for pairjudge-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`091040b7a4cce2947850e146deac76b54cd6a20d3910d729ca780d9f73c2e71d`
MD5	`be4d62279a6f1e2bfa85d118cececb55`
BLAKE2b-256	`f49b796b939aeb18cb7a5e9554b6d784448c022d4a460c4de57bfe84b06f573e`

Provenance

The following attestation bundles were made for pairjudge-0.2.0-py3-none-any.whl:

Publisher: release.yml on DaoyuanLi2816/pairjudge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pairjudge-0.2.0-py3-none-any.whl
- Subject digest: 091040b7a4cce2947850e146deac76b54cd6a20d3910d729ca780d9f73c2e71d
- Sigstore transparency entry: 2230530917
- Sigstore integration time: Jul 24, 2026
Source repository:
- Permalink: DaoyuanLi2816/pairjudge@06f77034918e7c891e1d6a1f763737750bb50e1b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/DaoyuanLi2816
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@06f77034918e7c891e1d6a1f763737750bb50e1b
- Trigger Event: release
