# autojudge-evaluate

Evaluation tools for the TREC AutoJudge framework: leaderboard correlations (`meta-evaluate`), inter-annotator agreement on qrels (`qrel-evaluate`), leaderboard statistics (`leaderboard`), and format conversion for evaluation result files (`eval-result`).
## Installation

```bash
uv pip install autojudge-evaluate
```
## CLI Commands

All commands are available via `auto-judge-evaluate <command>`.
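Assuming the CLI follows standard `--help` conventions (not confirmed by this README), each subcommand should also document itself:

```bash
auto-judge-evaluate --help                # list subcommands
auto-judge-evaluate meta-evaluate --help  # options for one subcommand
```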
### meta-evaluate — Leaderboard correlation

Correlate predicted leaderboards against a ground-truth leaderboard.
```bash
auto-judge-evaluate meta-evaluate \
    --truth-leaderboard truth.eval.jsonl --truth-format jsonl \
    --eval-format tot -i results/*eval.txt \
    --correlation kendall --correlation spearman --correlation tauap_b \
    --truth-measure nugget_coverage --truth-measure f1 \
    --on-missing default \
    --output correlations.jsonl
```
Key options:
| Option | Description |
|---|---|
| `--truth-leaderboard FILE` | Ground-truth leaderboard file (required) |
| `--truth-format FMT` | Format: `trec_eval`, `tot`, `ir_measures`, `ranking`, `jsonl` |
| `--eval-format FMT` | Format of input leaderboard files |
| `-i FILE` / positional | Input leaderboard file(s); supports globs. Repeatable |
| `--correlation METHOD` | Correlation method. Repeatable. Supports `kendall`, `pearson`, `spearman`, `tauap_b`, and top-k variants like `kendall@15` |
| `--truth-measure NAME` | Truth measure(s) to correlate against. Repeatable. Omit for all |
| `--eval-measure NAME` | Eval measure(s) to include. Repeatable. Omit for all |
| `--on-missing MODE` | Handle run mismatches: `error`, `warn`, `skip`, `default` (fill 0.0) |
| `--only-shared-topics` | Intersect topics across truth and eval (default: `--all-topics`) |
| `--only-shared-runs` | Intersect runs across truth and eval (default: `--all-runs`) |
| `--truth-drop-aggregate` | Recompute aggregates from per-topic data |
| `--output FILE` | Output `.jsonl` or `.txt` |
| `--out-format FMT` | `jsonl` (default) or `table` |
| `--aggregate` | Report only the mean across all judges |
Output: One row per (Judge, TruthMeasure, EvalMeasure) with correlation values as columns.
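The exact JSONL field names aren't shown in this README, so the sketch below is purely illustrative of the shape: one object per (Judge, TruthMeasure, EvalMeasure) with one key per requested correlation (keys and numbers are invented):

```bash
head -n 1 correlations.jsonl
# Hypothetical row; field names and values are illustrative, not the tool's actual schema:
# {"judge": "my-judge-A", "truth_measure": "nugget_coverage",
#  "eval_measure": "f1", "kendall": 0.72, "spearman": 0.85, "tauap_b": 0.69}
```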
### qrel-evaluate — Inter-annotator agreement on qrels

Compare predicted relevance judgments (qrels) against truth qrels. Computes set overlap (precision, recall, F1) and agreement metrics (Cohen's Kappa, Krippendorff's Alpha, Jaccard, ARI).
```bash
auto-judge-evaluate qrel-evaluate \
    --truth-qrels official.qrels \
    --predict-qrels predicted.qrels
```
Key options:
| Option | Description |
|---|---|
| `--truth-qrels FILE` | Truth qrels in TREC format |
| `--truth-nugget-docs DIR` | Alternative: truth as nugget-docs directory |
| `--predict-qrels FILE` | Predicted qrels in TREC format |
| `--predict-nugget-docs DIR` | Alternative: predicted as nugget-docs directory |
| `--truth-max-grade N` | Grade scale upper bound for truth (default: 1 = binary) |
| `--predict-max-grade N` | Grade scale upper bound for predicted (default: 1) |
| `--truth-relevance-threshold N` | Binary threshold for the truth side (default: 1) |
| `--predict-relevance-threshold N` | Binary threshold for the predicted side (default: 1) |
| `--on-missing MODE` | Handle topics present in only one side: `error`, `warn`, `default`, `skip` |
| `--output FILE` | Output `.jsonl` or `.txt` |
Output: Per-topic table with Precision, Recall, F1, Jaccard, Kappa, Krippendorff's Alpha, ARI, plus a MEAN row.
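A rough sketch of that table follows; only the column names are grounded in this README, and every topic ID and number below is made up:

```bash
auto-judge-evaluate qrel-evaluate --truth-qrels official.qrels --predict-qrels predicted.qrels
# Illustrative output shape; all values are invented:
# Topic  Precision  Recall  F1     Jaccard  Kappa  Alpha  ARI
# 301    0.810      0.760   0.784  0.645    0.702  0.698  0.655
# 302    0.775      0.722   0.748  0.597    0.666  0.661  0.619
# MEAN   0.793      0.741   0.766  0.621    0.684  0.680  0.637
```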
### leaderboard — Leaderboard statistics

Compute per-run statistics (mean, stderr, stdev, min, max) from leaderboard files.
```bash
auto-judge-evaluate leaderboard \
    --eval-format tot -i results/*eval.txt --sort
```
Key options:
| Option | Description |
|---|---|
| `--eval-format FMT` | Input format (required) |
| `-i FILE` / positional | Input file(s); supports globs. Repeatable |
| `--eval-measure NAME` | Filter to specific measures. Repeatable |
| `--sort` | Sort runs by mean score (descending) |
| `--output FILE` | Output `.jsonl` or `.csv` |
Output: One row per (Judge, RunID, Measure) with Topics, Mean, Stderr, Stdev, Min, Max.
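An illustrative sketch of that layout (judge, run, and measure names reuse earlier examples; all numbers are invented):

```bash
auto-judge-evaluate leaderboard --eval-format tot -i results/*eval.txt --sort
# Illustrative output shape; values are made up:
# Judge       RunID     Measure          Topics  Mean   Stderr  Stdev  Min    Max
# my-judge-A  system_B  nugget_coverage  50      0.412  0.021   0.148  0.050  0.790
# my-judge-A  system_A  nugget_coverage  50      0.387  0.019   0.134  0.080  0.710
```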
## Analysis Module

Post-hoc analysis of `meta-evaluate` output: produces correlation tables and bar plots with judge categorization.
```bash
python -m autojudge_evaluate.analysis.correlation_table \
    -d ragtime:ragtime-correlations.jsonl \
    -d rag:rag-correlations.jsonl \
    -d dragun:dragun-correlations.jsonl \
    --judges judges.yml \
    --correlation kendall \
    --truth-measure nugget_coverage \
    --format latex \
    --plot-dir plots/
```
Judge configuration (`judges.yml`) maps cryptic file names to display names and categories, with optional plot styling:

```yaml
styles:
  colors:
    pointwise: "#4A90D9"
    pairwise: "#D94A4A"
  hatches:
    gpt-4o: ""
    llama-3: "//"

judges:
  my-judge-A.eval:
    name: System A
    method: pointwise   # category column
    model: gpt-4o       # category column
  my-judge-B.eval:
    name: System B
    method: pairwise
    model: llama-3
```
- `styles.colors`: maps category values to fill colors (any matplotlib color string).
- `styles.hatches`: maps category values to hatch patterns (`//`, `..`, `xx`, `\\`, etc.).
- Color is picked from the first matching category value; hatches are combined from all matches (see the worked example below).
- Without a `styles:` section, bars use a sequential grayscale fallback.
- Judges not in the YAML are excluded unless `--all-judges` is passed.
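Applying these rules to the YAML above would give the following styling (my reading of the rules, not verified against the code):

```bash
# my-judge-A (method: pointwise, model: gpt-4o):
#   color -> "#4A90D9"  (first matching category value: pointwise)
#   hatch -> ""         (all matching hatch values combined; gpt-4o contributes none)
# my-judge-B (method: pairwise, model: llama-3):
#   color -> "#D94A4A"
#   hatch -> "//"
```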
Key options: `--format` (`github`, `latex`, `tsv`, `plain`, `html`, `pipe`), `--columns` (`correlations` or `measures`), `--summary` (add mean/max rows), `--aggregate` (aggregate across datasets), `--same THRESHOLD` (highlight near-equal values).
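For instance, a variant of the earlier command that combines several of these flags (the 0.01 threshold is an arbitrary illustration):

```bash
python -m autojudge_evaluate.analysis.correlation_table \
    -d ragtime:ragtime-correlations.jsonl \
    -d rag:rag-correlations.jsonl \
    --judges judges.yml \
    --correlation kendall \
    --format github --summary --aggregate --same 0.01
```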
### eval-result — Format conversion and verification

Clean and convert evaluation result files.
```bash
# Convert tot to jsonl
auto-judge-evaluate eval-result data.txt -if tot -of jsonl -o data.jsonl

# Filter to specific runs and topics
auto-judge-evaluate eval-result data.txt -if tot -of jsonl -o filtered.jsonl \
    --filter-runs system_A --filter-runs system_B \
    --filter-topics topic_1
```
Key options:
| Option | Description |
|---|---|
| `-if FMT` | Input format: `trec_eval`, `tot`, `ir_measures`, `ranking`, `jsonl` |
| `-of FMT` | Output format (defaults to input format) |
| `-o FILE` | Output file. Omit for a roundtrip test to a temp file |
| `--filter-runs ID` | Keep only these runs. Repeatable |
| `--filter-topics ID` | Keep only these topics. Repeatable |
| `--filter-measures NAME` | Keep only these measures. Repeatable |
| `--compare-aggregates` | Compare file aggregates against values recomputed from per-topic data |
| `--drop-aggregates` | Drop existing aggregate rows |
| `--recompute-aggregates` | Recompute aggregates from per-topic data (implies `--drop-aggregates`) |
| `--roundtrip` / `--no-roundtrip` | Enable/disable roundtrip verification (default: on) |
Supported formats:
| Format | Columns |
|---|---|
| `trec_eval` | `measure topic value` (3 cols; run_id from filename) |
| `tot` | `run measure topic value` (4 cols) |
| `ir_measures` | `run topic measure value` (4 cols) |
| `ranking` | `topic Q0 doc_id rank score run` (6 cols) |
| `jsonl` | JSON lines with `run_id`, `topic_id`, `measure`, `value` |
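To make the column orders concrete, here is one invented record rendered in each format (a `ranking` row describes a retrieval run rather than a score, so that row differs; all values are made up):

```bash
# trec_eval:    ndcg_cut_10  301  0.4523            (run_id taken from the filename)
# tot:          system_A  ndcg_cut_10  301  0.4523
# ir_measures:  system_A  301  ndcg_cut_10  0.4523
# ranking:      301  Q0  doc_17  1  12.7  system_A
# jsonl:        {"run_id": "system_A", "topic_id": "301", "measure": "ndcg_cut_10", "value": 0.4523}
```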