
autojudge-evaluate

Evaluation tools for the TREC AutoJudge framework: compute leaderboard correlations, inter-annotator agreement on qrels, and leaderboard statistics, and convert between evaluation result file formats.

Installation

uv pip install autojudge-evaluate

CLI Commands

All commands are available via auto-judge-evaluate <command>.


meta-evaluate — Leaderboard correlation

Correlate predicted leaderboards against a ground-truth leaderboard.

auto-judge-evaluate meta-evaluate \
    --truth-leaderboard truth.eval.jsonl --truth-format jsonl \
    --eval-format tot -i results/*eval.txt \
    --correlation kendall --correlation spearman --correlation tauap_b \
    --truth-measure nugget_coverage --truth-measure f1 \
    --on-missing default \
    --output correlations.jsonl

Key options:

| Option | Description |
| --- | --- |
| --truth-leaderboard FILE | Ground-truth leaderboard file (required) |
| --truth-format FMT | Truth format: trec_eval, tot, ir_measures, ranking, jsonl |
| --eval-format FMT | Format of input leaderboard files |
| -i FILE / positional | Input leaderboard file(s); supports globs. Repeatable |
| --correlation METHOD | Correlation method. Repeatable. Supports kendall, pearson, spearman, tauap_b, and top-k variants like kendall@15 |
| --truth-measure NAME | Truth measure(s) to correlate against. Repeatable. Omit for all |
| --eval-measure NAME | Eval measure(s) to include. Repeatable. Omit for all |
| --on-missing MODE | Handle run mismatches: error, warn, skip, default (fill 0.0) |
| --only-shared-topics | Intersect topics across truth and eval (default: --all-topics) |
| --only-shared-runs | Intersect runs across truth and eval (default: --all-runs) |
| --truth-drop-aggregate | Recompute aggregates from per-topic data |
| --output FILE | Output .jsonl or .txt |
| --out-format FMT | jsonl (default) or table |
| --aggregate | Report only the mean across all judges |

Output: One row per (Judge, TruthMeasure, EvalMeasure) with correlation values as columns.
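
Downstream scripts can consume this output directly. A minimal Python sketch for ranking judges by one correlation; the field names used here (judge, truth_measure, eval_measure, kendall) are assumptions inferred from the documented row layout, not a confirmed schema:

import json

# Hedged sketch: adjust the keys below to the actual schema in your
# correlations.jsonl -- the names are inferred, not documented.
with open("correlations.jsonl") as f:
    rows = [json.loads(line) for line in f]

# e.g. rank judges by Kendall correlation against nugget_coverage
for row in sorted(
    (r for r in rows if r.get("truth_measure") == "nugget_coverage"),
    key=lambda r: r.get("kendall", 0.0),
    reverse=True,
):
    print(row.get("judge"), row.get("eval_measure"), row.get("kendall"))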


qrel-evaluate — Inter-annotator agreement on qrels

Compare predicted relevance judgments (qrels) against truth qrels. Computes set overlap (precision, recall, F1) and agreement metrics (Cohen's Kappa, Krippendorff's Alpha, Jaccard, ARI).

auto-judge-evaluate qrel-evaluate \
    --truth-qrels official.qrels \
    --predict-qrels predicted.qrels

Key options:

| Option | Description |
| --- | --- |
| --truth-qrels FILE | Truth qrels in TREC format |
| --truth-nugget-docs DIR | Alternative: truth as a nugget-docs directory |
| --predict-qrels FILE | Predicted qrels in TREC format |
| --predict-nugget-docs DIR | Alternative: predictions as a nugget-docs directory |
| --truth-max-grade N | Grade scale upper bound for truth (default: 1 = binary) |
| --predict-max-grade N | Grade scale upper bound for predictions (default: 1) |
| --truth-relevance-threshold N | Binary threshold for the truth side (default: 1) |
| --predict-relevance-threshold N | Binary threshold for the predicted side (default: 1) |
| --on-missing MODE | Handle topics present in only one side: error, warn, default, skip |
| --output FILE | Output .jsonl or .txt |

Output: Per-topic table with Precision, Recall, F1, Jaccard, Kappa, Krippendorff's Alpha, ARI, plus a MEAN row.
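
To make the agreement numbers concrete, here is a generic sketch of Cohen's Kappa on binarized judgments. It illustrates the metric itself (observed agreement corrected for chance agreement), not the package's internal implementation:

# Generic illustration of Cohen's Kappa on binary relevance labels;
# not the package's own code.

def cohens_kappa(truth: list[int], pred: list[int]) -> float:
    n = len(truth)
    p_o = sum(t == p for t, p in zip(truth, pred)) / n  # observed agreement
    # chance agreement from each side's marginal label frequencies
    p_truth1 = sum(truth) / n
    p_pred1 = sum(pred) / n
    p_e = p_truth1 * p_pred1 + (1 - p_truth1) * (1 - p_pred1)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# two annotators who agree on 4 of 5 documents
print(cohens_kappa([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # ~0.62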


leaderboard — Leaderboard statistics

Compute per-run statistics (mean, stderr, stdev, min, max) from leaderboard files.

auto-judge-evaluate leaderboard \
    --eval-format tot -i results/*eval.txt --sort

Key options:

| Option | Description |
| --- | --- |
| --eval-format FMT | Input format (required) |
| -i FILE / positional | Input file(s); supports globs. Repeatable |
| --eval-measure NAME | Filter to specific measures. Repeatable |
| --sort | Sort runs by mean score (descending) |
| --output FILE | Output .jsonl or .csv |

Output: One row per (Judge, RunID, Measure) with Topics, Mean, Stderr, Stdev, Min, Max.
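
The statistics follow their usual definitions; in particular, the standard error is the sample standard deviation divided by the square root of the number of topics. A small illustrative sketch over hypothetical per-topic scores (not the package's own code):

import statistics

scores = [0.41, 0.55, 0.48, 0.62, 0.39]  # hypothetical per-topic values

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)          # sample standard deviation
stderr = stdev / len(scores) ** 0.5       # standard error of the mean
print(len(scores), mean, stderr, stdev, min(scores), max(scores))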


Analysis Module

Post-hoc analysis of meta-evaluate output: produces correlation tables and bar plots with judge categorization.

python -m autojudge_evaluate.analysis.correlation_table \
    -d ragtime:ragtime-correlations.jsonl \
    -d rag:rag-correlations.jsonl \
    -d dragun:dragun-correlations.jsonl \
    --judges judges.yml \
    --correlation kendall \
    --truth-measure nugget_coverage \
    --format latex \
    --plot-dir plots/

Judge configuration (judges.yml) maps cryptic filenames to display names and categories, with optional plot styling:

styles:
  colors:
    pointwise: "#4A90D9"
    pairwise:  "#D94A4A"
  hatches:
    gpt-4o:    ""
    llama-3:   "//"

judges:
  my-judge-A.eval:
    name: System A
    method: pointwise     # category column
    model: gpt-4o         # category column
  my-judge-B.eval:
    name: System B
    method: pairwise
    model: llama-3
• styles.colors maps category values to fill colors (any matplotlib color string).
• styles.hatches maps category values to hatch patterns (//, .., xx, \\, etc.).
• Color is picked from the first matching category value; hatches are combined from all matches (see the sketch after this list).
• Without a styles: section, bars fall back to a sequential grayscale.
• Judges not in the YAML are excluded unless --all-judges is passed.
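
A Python sketch of those resolution rules; resolve_style is a hypothetical helper written to match the bullets above, with a single gray fallback in place of the grayscale sequence:

# Hypothetical helper mirroring the documented rules: color from the
# first category value found in styles.colors, hatches concatenated
# across all matching category values.

def resolve_style(judge: dict, colors: dict, hatches: dict):
    categories = [v for k, v in judge.items() if k != "name"]
    color = next((colors[c] for c in categories if c in colors), "0.7")
    hatch = "".join(hatches.get(c, "") for c in categories)
    return color, hatch

judge_a = {"name": "System A", "method": "pointwise", "model": "gpt-4o"}
colors = {"pointwise": "#4A90D9", "pairwise": "#D94A4A"}
hatches = {"gpt-4o": "", "llama-3": "//"}
print(resolve_style(judge_a, colors, hatches))  # ('#4A90D9', '')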

Key options:

| Option | Description |
| --- | --- |
| --format | Table format: github, latex, tsv, plain, html, pipe |
| --columns | correlations or measures |
| --summary | Add mean/max rows |
| --aggregate | Aggregate across datasets |
| --same THRESHOLD | Highlight near-equal values |


eval-result — Format conversion and verification

Clean and convert evaluation result files.

# Convert tot to jsonl
auto-judge-evaluate eval-result data.txt -if tot -of jsonl -o data.jsonl

# Filter to specific runs and topics
auto-judge-evaluate eval-result data.txt -if tot -of jsonl -o filtered.jsonl \
    --filter-runs system_A --filter-runs system_B \
    --filter-topics topic_1

Key options:

| Option | Description |
| --- | --- |
| -if FMT | Input format: trec_eval, tot, ir_measures, ranking, jsonl |
| -of FMT | Output format (defaults to input format) |
| -o FILE | Output file. Omit for a roundtrip test to a temp file |
| --filter-runs ID | Keep only these runs. Repeatable |
| --filter-topics ID | Keep only these topics. Repeatable |
| --filter-measures NAME | Keep only these measures. Repeatable |
| --compare-aggregates | Compare file aggregates against values recomputed from per-topic data |
| --drop-aggregates | Drop existing aggregate rows |
| --recompute-aggregates | Recompute aggregates from per-topic data (implies --drop-aggregates) |
| --roundtrip / --no-roundtrip | Enable/disable roundtrip verification (default: on) |

Supported formats:

| Format | Columns |
| --- | --- |
| trec_eval | measure topic value (3 cols; run_id taken from the filename) |
| tot | run measure topic value (4 cols) |
| ir_measures | run topic measure value (4 cols) |
| ranking | topic Q0 doc_id rank score run (6 cols) |
| jsonl | JSON lines with run_id, topic_id, measure, value |
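
As an illustration of the simplest of these layouts, a sketch of a reader for the 4-column tot format; the package's own parsers may handle aggregate rows and edge cases differently:

# Hedged sketch of a tot reader (run measure topic value, whitespace-
# separated), based only on the column layout documented above.

def read_tot(path: str) -> list[dict]:
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            run, measure, topic, value = parts
            rows.append({"run_id": run, "measure": measure,
                         "topic_id": topic, "value": float(value)})
    return rows

# usage: rows = read_tot("data.txt")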
