# Tie-aware Retrieval Metrics (TRM)

A lightweight Python library for reliable evaluation of retrieval systems in the presence of tied relevance scores.
When retrieval models operate in low numerical precision (e.g., BF16, FP16), many candidate documents receive identical scores, creating spurious ties. Conventional tie-oblivious evaluation arbitrarily breaks these ties, leading to unstable and potentially misleading metric values. TRM resolves this by computing expected metric values over all possible orderings of tied candidates, along with score range and bias diagnostics.
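The effect is easy to reproduce: scores that are distinct in FP32 can collapse to a single value after half-precision rounding. A minimal sketch using NumPy's `float16` (standing in here for any low-precision format; this is an illustration, not part of the library):

```python
import numpy as np

# Three FP32 scores that differ only in the fifth decimal place.
scores_fp32 = np.array([0.97001, 0.97002, 0.97003], dtype=np.float32)

# After rounding to half precision they become indistinguishable,
# so a ranker sorting by score sees a three-way tie.
scores_fp16 = scores_fp32.astype(np.float16)
print(np.unique(scores_fp32).size)  # 3 distinct scores
print(np.unique(scores_fp16).size)  # 1 distinct score: a spurious tie
```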
**Reference:** Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim. *Reliable Evaluation Protocol for Low-Precision Retrieval.* ACL 2026.
## Installation

```bash
pip install tie-aware-retrieval-metrics
```

Or install from source:

```bash
git clone https://github.com/KisuYang/tie-aware-retrieval-metrics.git
cd tie-aware-retrieval-metrics
pip install -e .
```
## Quick Start

```python
import trm

# Per-query relevance scores and labels
scores = [
    [0.99, 0.97, 0.97, 0.97, 0.95],  # query 1: three docs share score 0.97
]
is_relevant = [
    [False, True, False, True, False],  # query 1: docs 1 and 3 (0-indexed) are relevant
]

result = trm.evaluate(
    scores=scores,
    is_relevant=is_relevant,
    metrics=["ndcg", "mrr", "recall"],
    k_list=[3, 5],
)

# Macro-averaged results
for metric in ["ndcg", "mrr", "recall"]:
    for k in [3, 5]:
        r = result.metrics[metric][k]
        print(f"{metric}@{k}: E[M]={r.expected:.4f} "
              f"M_obl={r.oblivious:.4f} "
              f"M_max={r.maximum:.4f} "
              f"M_min={r.minimum:.4f} "
              f"range={r.range:.4f} "
              f"bias={r.bias:.4f}")
```
Output:

```text
ndcg@3: E[M]=0.4623 M_obl=0.3869 M_max=0.6934 M_min=0.3066 range=0.3869 bias=-0.0754
ndcg@5: E[M]=0.6383 M_obl=0.6509 M_max=0.6934 M_min=0.5706 range=0.1228 bias=0.0126
mrr@3: E[M]=0.4444 M_obl=0.5000 M_max=0.5000 M_min=0.3333 range=0.1667 bias=0.0556
mrr@5: E[M]=0.4444 M_obl=0.5000 M_max=0.5000 M_min=0.3333 range=0.1667 bias=0.0556
recall@3: E[M]=0.6667 M_obl=0.5000 M_max=1.0000 M_min=0.5000 range=0.5000 bias=-0.1667
recall@5: E[M]=1.0000 M_obl=1.0000 M_max=1.0000 M_min=1.0000 range=0.0000 bias=0.0000
```
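The E[M] values above can be sanity-checked by brute force: enumerate every ordering of the tied documents and average the tie-oblivious metric over the orderings. A short sketch for MRR using only the standard library (independent of `trm` itself):

```python
from itertools import permutations

relevance = [False, True, False, True, False]  # from the example above
# Docs 1-3 share score 0.97, so they occupy ranks 2-4 in some order.
tied_block = relevance[1:4]

reciprocal_ranks = []
for perm in permutations(tied_block):
    ranking = [relevance[0], *perm, relevance[4]]
    # Reciprocal rank of the first relevant doc, 0.0 if none.
    rr = next((1.0 / (i + 1) for i, rel in enumerate(ranking) if rel), 0.0)
    reciprocal_ranks.append(rr)

expected_mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(f"{expected_mrr:.4f}")  # 0.4444, matching E[M] for mrr@5 above
```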
## API Reference

You can also import the public functions directly:

```python
from trm import evaluate, build_tie_groups
```

### `trm.evaluate(scores, is_relevant, metrics=None, k_list=None)`

Compute tie-aware retrieval metrics over a set of queries.

**Parameters:**

- `scores` (list of list of float): Per-query relevance scores for each candidate document.
- `is_relevant` (list of list of bool): Per-query binary relevance labels.
- `metrics` (list of str, optional): Metrics to compute. Supported: `"ndcg"`, `"mrr"`, `"map"`, `"recall"`, `"precision"`, `"f1"`, `"hits"`. Default: `["ndcg", "mrr", "map", "recall"]`.
- `k_list` (list of int, optional): Cutoff values. Default: `[1, 3, 5, 10, 20, 50, 100]`.

**Returns:** `EvaluationOutput` with:

- `.metrics[metric_name][k]` → `TieAwareResult` (macro-averaged)
- `.per_query[metric_name][k]` → list of per-query `TieAwareResult`
- `.to_dict()` → flat dictionary for logging
### `TieAwareResult`

| Attribute | Description |
|---|---|
| `.expected` | `E[M]` — expected score over all tie orderings |
| `.oblivious` | `M_obl` — tie-oblivious (index-preserving) score |
| `.maximum` | `M_max` — best-case score |
| `.minimum` | `M_min` — worst-case score |
| `.range` | `M_max - M_min` (Eq. 4) |
| `.bias` | `M_obl - E[M]` (Eq. 5) |
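The last two attributes are diagnostics derived from the first four, so they can be checked directly against the Quick Start output, e.g. for recall@3:

```python
expected, oblivious = 0.6667, 0.5000  # recall@3 values from the Quick Start output
maximum, minimum = 1.0000, 0.5000

tie_range = maximum - minimum   # Eq. 4: width of the tie-induced ambiguity
tie_bias = oblivious - expected  # Eq. 5: error made by tie-oblivious evaluation
print(f"range={tie_range:.4f} bias={tie_bias:.4f}")  # range=0.5000 bias=-0.1667
```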
### `trm.build_tie_groups(scores, is_relevant)`

Build tie groups from raw scores and relevance labels.

**Returns:** list of `(group_size, num_relevant)` tuples sorted by descending score.
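To illustrate the return format (a reference sketch consistent with the documented output, not the library's actual implementation), the Quick Start scores yield three groups:

```python
from itertools import groupby

def build_tie_groups_sketch(scores, is_relevant):
    """Group equal scores into (group_size, num_relevant), by descending score."""
    pairs = sorted(zip(scores, is_relevant), key=lambda p: p[0], reverse=True)
    groups = []
    for _, members in groupby(pairs, key=lambda p: p[0]):
        members = list(members)
        groups.append((len(members), sum(rel for _, rel in members)))
    return groups

print(build_tie_groups_sketch(
    [0.99, 0.97, 0.97, 0.97, 0.95],
    [False, True, False, True, False],
))  # [(1, 0), (3, 2), (1, 0)]
```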
## Supported Metrics

| Metric | Key | Paper Reference |
|---|---|---|
| nDCG@k | `"ndcg"` | Eq. 14-16 |
| MRR@k | `"mrr"` | Eq. 17-21 |
| MAP@k | `"map"` | Eq. 22-24 |
| Recall@k | `"recall"` | Eq. 10 |
| Precision@k | `"precision"` | Eq. 11 |
| F1@k | `"f1"` | Eq. 12 |
| Hits@k | `"hits"` | Eq. 9 |
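For set-based metrics like Recall@k, the expectation over tie orderings has a simple closed form: when a tie group straddles the cutoff, each member of the group is equally likely to fill one of the remaining slots. A sketch built on that observation and on the `(group_size, num_relevant)` tie-group format above (the library's internals may differ):

```python
def expected_recall_at_k(tie_groups, k):
    """tie_groups: list of (group_size, num_relevant), by descending score."""
    total_relevant = sum(num_rel for _, num_rel in tie_groups)
    if total_relevant == 0:
        return 0.0
    filled, expected_hits = 0, 0.0
    for size, num_rel in tie_groups:
        slots = min(size, k - filled)
        if slots <= 0:
            break
        # Each doc in the group is equally likely to occupy a slot, so the
        # expected number of relevant docs retrieved is num_rel * slots / size.
        expected_hits += num_rel * slots / size
        filled += slots
    return expected_hits / total_relevant

groups = [(1, 0), (3, 2), (1, 0)]  # tie groups from the Quick Start example
print(round(expected_recall_at_k(groups, 3), 4))  # 0.6667
print(round(expected_recall_at_k(groups, 5), 4))  # 1.0
```

At k=3 only two of the three tied documents fit under the cutoff, giving 2 × 2/3 expected relevant hits out of 2 relevant documents overall, which matches the `recall@3` expectation in the Quick Start output.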
## Citation

```bibtex
@inproceedings{yang2026reliable,
  title     = {Reliable Evaluation Protocol for Low-Precision Retrieval},
  author    = {Yang, Kisu and Jang, Yoonna and Jang, Hwanseok and Choi, Kenneth and Augenstein, Isabelle and Lim, Heuiseok},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2026},
}
```
## License

Apache License 2.0