
Tie-aware Retrieval Metrics (TRM)

A lightweight Python library for reliable evaluation of retrieval systems in the presence of tied relevance scores.

When retrieval models operate in low numerical precision (e.g., BF16, FP16), many candidate documents receive identical scores, creating spurious ties. Conventional tie-oblivious evaluation arbitrarily breaks these ties, leading to unstable and potentially misleading metric values. TRM resolves this by computing expected metric values over all possible orderings of tied candidates, along with score range and bias diagnostics.
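For illustration (this snippet is not part of the library), distinct full-precision scores can collapse to identical values after a half-precision round-trip, which Python's standard `struct` module can simulate with its `"e"` (IEEE 754 binary16) format code:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

scores_fp32 = [0.9900, 0.9712, 0.9711, 0.9710, 0.9500]
scores_fp16 = [to_fp16(s) for s in scores_fp32]

print(len(set(scores_fp32)))  # 5 distinct scores in full precision
print(len(set(scores_fp16)))  # 3 -- the middle three collapse into a tie
```

Any tie-oblivious metric then depends on the arbitrary order in which those three tied documents happen to be listed.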

Reference: Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim. Reliable Evaluation Protocol for Low-Precision Retrieval. ACL 2026.

Installation

pip install tie-aware-retrieval-metrics

Or install from source:

git clone https://github.com/KisuYang/tie-aware-retrieval-metrics.git
cd tie-aware-retrieval-metrics
pip install -e .

Quick Start

import trm

# Per-query relevance scores and labels
scores = [
    [0.99, 0.97, 0.97, 0.97, 0.95],  # query 1: three docs share score 0.97
]
is_relevant = [
    [False, True, False, True, False],  # query 1: docs at indices 1 and 3 are relevant
]

result = trm.evaluate(
    scores=scores,
    is_relevant=is_relevant,
    metrics=["ndcg", "mrr", "recall"],
    k_list=[3, 5],
)

# Macro-averaged results
for metric in ["ndcg", "mrr", "recall"]:
    for k in [3, 5]:
        r = result.metrics[metric][k]
        print(f"{metric}@{k}: E[M]={r.expected:.4f}  "
              f"M_obl={r.oblivious:.4f}  "
              f"M_max={r.maximum:.4f}  "
              f"M_min={r.minimum:.4f}  "
              f"range={r.range:.4f}  "
              f"bias={r.bias:.4f}")

Output:

ndcg@3: E[M]=0.4623  M_obl=0.3869  M_max=0.6934  M_min=0.3066  range=0.3869  bias=-0.0754
ndcg@5: E[M]=0.6383  M_obl=0.6509  M_max=0.6934  M_min=0.5706  range=0.1228  bias=0.0126
mrr@3:  E[M]=0.4444  M_obl=0.5000  M_max=0.5000  M_min=0.3333  range=0.1667  bias=0.0556
mrr@5:  E[M]=0.4444  M_obl=0.5000  M_max=0.5000  M_min=0.3333  range=0.1667  bias=0.0556
recall@3: E[M]=0.6667  M_obl=0.5000  M_max=1.0000  M_min=0.5000  range=0.5000  bias=-0.1667
recall@5: E[M]=1.0000  M_obl=1.0000  M_max=1.0000  M_min=1.0000  range=0.0000  bias=0.0000
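To see where an E[M] value comes from, the tie orderings can be enumerated by hand. In the Quick Start query, rank 1 (score 0.99) is non-relevant, and ranks 2-4 form a tie group in which two of three docs are relevant; averaging MRR over all orderings of that group reproduces the 0.4444 (= 4/9) shown above. This is an independent check, not the library's implementation:

```python
from itertools import permutations
from statistics import mean

tie_group = [True, False, True]  # the three docs scored 0.97; two are relevant

def mrr_at_5(order):
    # Rank 1 is non-relevant, so the tie group occupies ranks 2-4.
    for rank, rel in enumerate(order, start=2):
        if rel:
            return 1.0 / rank
    return 0.0

expected = mean(mrr_at_5(p) for p in permutations(tie_group))
print(round(expected, 4))  # 0.4444, matching E[M] for mrr@5 above
```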

API Reference

You can also import individual functions directly:

from trm import evaluate, build_tie_groups

trm.evaluate(scores, is_relevant, metrics=None, k_list=None)

Compute tie-aware retrieval metrics over a set of queries.

Parameters:

  • scores (list of list of float): Per-query relevance scores for each candidate document.
  • is_relevant (list of list of bool): Per-query binary relevance labels.
  • metrics (list of str, optional): Metrics to compute. Supported: "ndcg", "mrr", "map", "recall", "precision", "f1", "hits". Default: ["ndcg", "mrr", "map", "recall"].
  • k_list (list of int, optional): Cutoff values. Default: [1, 3, 5, 10, 20, 50, 100].

Returns: EvaluationOutput with:

  • .metrics[metric_name][k] → TieAwareResult (macro-averaged)
  • .per_query[metric_name][k] → list of per-query TieAwareResult
  • .to_dict() → flat dictionary for logging

TieAwareResult

Attribute Description
.expected E[M] — expected score over all tie orderings
.oblivious M_obl — tie-oblivious (index-preserving) score
.maximum M_max — best-case score
.minimum M_min — worst-case score
.range M_max - M_min (Eq. 4)
.bias M_obl - E[M] (Eq. 5)
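The last two attributes are derivable from the first four. As a hedged sketch (the library's actual class may differ), plugging in the rounded ndcg@3 row from the Quick Start output:

```python
from dataclasses import dataclass

@dataclass
class TieAwareResultSketch:
    """Illustrative stand-in for trm's TieAwareResult; fields follow the table above."""
    expected: float
    oblivious: float
    maximum: float
    minimum: float

    @property
    def range(self) -> float:
        return self.maximum - self.minimum  # Eq. 4

    @property
    def bias(self) -> float:
        return self.oblivious - self.expected  # Eq. 5

r = TieAwareResultSketch(expected=0.4623, oblivious=0.3869, maximum=0.6934, minimum=0.3066)
print(f"range={r.range:.4f}  bias={r.bias:.4f}")  # range=0.3868  bias=-0.0754
```

(The Quick Start prints range=0.3869 because it computes from unrounded values; the four-decimal inputs here give 0.3868.)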

trm.build_tie_groups(scores, is_relevant)

Build tie groups from raw scores and relevance labels.

Returns: list of (group_size, num_relevant) tuples sorted by descending score.
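A minimal sketch of the documented behavior for a single query (the library's implementation may differ):

```python
def build_tie_groups_sketch(scores, is_relevant):
    """Group candidates by identical score; return (group_size, num_relevant)
    tuples ordered by descending score, mirroring trm.build_tie_groups."""
    groups = {}
    for score, rel in zip(scores, is_relevant):
        size, num_rel = groups.get(score, (0, 0))
        groups[score] = (size + 1, num_rel + int(rel))
    return [groups[s] for s in sorted(groups, reverse=True)]

print(build_tie_groups_sketch(
    [0.99, 0.97, 0.97, 0.97, 0.95],
    [False, True, False, True, False],
))  # [(1, 0), (3, 2), (1, 0)]
```

With the Quick Start scores, the three docs at 0.97 form a single group of size 3 containing both relevant docs.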

Supported Metrics

Metric        Key          Paper Reference
nDCG@k        "ndcg"       Eq. 14-16
MRR@k         "mrr"        Eq. 17-21
MAP@k         "map"        Eq. 22-24
Recall@k      "recall"     Eq. 10
Precision@k   "precision"  Eq. 11
F1@k          "f1"         Eq. 12
Hits@k        "hits"       Eq. 9
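For set-based metrics such as recall, a tie group that straddles the cutoff k turns the expectation into an average over which tied docs land above the cutoff, since each same-size subset of the group is equally likely under a uniform random ordering. An independent check for recall@3 in the Quick Start (rank 1 non-relevant, ranks 2-3 filled by 2 of the 3 tied docs):

```python
from itertools import combinations
from statistics import mean

tie_group = [True, False, True]  # docs scored 0.97; two of three relevant
total_relevant = 2               # relevant docs in the whole candidate list

def recall_at_3(above_cutoff):
    # Rank 1 (score 0.99) is non-relevant; ranks 2-3 hold the chosen tied docs.
    return sum(above_cutoff) / total_relevant

expected = mean(recall_at_3(c) for c in combinations(tie_group, 2))
print(round(expected, 4))  # 0.6667, matching E[M] for recall@3 above
```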

Citation

@inproceedings{yang2026reliable,
    title     = {Reliable Evaluation Protocol for Low-Precision Retrieval},
    author    = {Yang, Kisu and Jang, Yoonna and Jang, Hwanseok and Choi, Kenneth and Augenstein, Isabelle and Lim, Heuiseok},
    booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
    year      = {2026},
}

License

Apache License 2.0
