Find informative examples to efficiently (human-)evaluate NLG models.

subset2evaluate

Package to select informative samples to human-evaluate for NLG tasks such as machine translation or summarization. It is based on a paper by Vilém Zouhar, Peng Cui, and Mrinmaya Sachan from ETH Zürich.

Selecting Examples to Efficiently Human-Evaluate Models: Human evaluation for language generation is the gold standard but expensive. To fit budgetary constraints, often only a random subset of the test set is chosen for evaluation. This random selection is grossly inefficient, and in this work we formalize the task of selecting the most informative items for evaluation. We show that methods based on variance in automated metric scores or diversity in system outputs outperform the commonly used, yet inefficient, random selection. However, these methods are not applicable when the system outputs are not yet available, as in blind test set creation or when selecting from a very large set of items. To this end, we introduce PreCOMET, which predicts item usefulness for human evaluation based on the input alone. We demonstrate the efficacy of our methods on two common language generation tasks, machine translation and summarization. We show that only 30%-60% of human annotations are needed to produce the same evaluation result.

Usage

In short, you pass a list of items to the package, and the package sorts the list in descending order of utility (the first item is the most useful), based on how suitable each item is for evaluation, such as with human annotations. In addition to sorting, the package stores each item's utility in its subset2evaluate_utility field. General recommendations based on MT evaluation:

When to use?	What is it?	How to use?
Good automated metric available, such as `MetricX-23`.	Variance in metric scores.	`method="metric_var", metric="MetricX-23"`
Metric not available but system outputs available.	Diversity of system outputs.	`method="diversity_bleu"`
System outputs not available, only sources.	Estimated diversity in system outputs.	`method="precomet_diversity"`
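To make the "variance in metric scores" idea concrete, here is a minimal pure-Python sketch that does not call the package. It assumes items carry a scores mapping of system to metric values (as in the loader output shown later), uses population variance, and stores the result under a hypothetical "utility" key. The helper name sort_by_metric_variance and the toy metric name "m" are illustrative only; the package's own implementation may differ.

```python
from statistics import pvariance

def sort_by_metric_variance(items, metric):
    """Return copies of items sorted by descending variance of `metric` across systems."""
    scored = []
    for item in items:
        # One score per system for the chosen metric.
        values = [sys_scores[metric] for sys_scores in item["scores"].values()]
        # Copy the item so the input list is left unchanged.
        # "utility" is a hypothetical key used only in this sketch.
        scored.append({**item, "utility": pvariance(values)})
    return sorted(scored, key=lambda x: x["utility"], reverse=True)

items = [
    {"i": 0, "scores": {"sysA": {"m": 0.9}, "sysB": {"m": 0.9}}},
    {"i": 1, "scores": {"sysA": {"m": 0.2}, "sysB": {"m": 0.8}}},
    {"i": 2, "scores": {"sysA": {"m": 0.5}, "sysB": {"m": 0.6}}},
]
ranked = sort_by_metric_variance(items, metric="m")
print([item["i"] for item in ranked])  # prints [1, 2, 0]
```

Items on which the systems disagree most come first; the item where both systems score identically has zero variance and lands last.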

The package supports multiple methods. We show a benchmark of the methods on machine translation evaluation:

Method	Requirements	Accuracy	Cluster count
Random		91.0%	2.25
Output-based selection
MetricX-23 var	MetricX-23 scores	92.0%	3.22
MetricX-23 avg	MetricX-23 scores	91.8%	3.16
Diversity BLEU	Outputs	92.1%	2.99
Diversity unigram	Outputs	91.1%	2.62
IRT diff.×disc.	MetricX-23 scores	91.2%	3.14
Source-based selection
PreCOMET var model	Sources	91.2%	2.58
PreCOMET avg model	Sources	91.1%	2.68
PreCOMET diversity [model]	Sources	92.1%	2.86
PreCOMET diff.×disc. [model1, model2]	Sources	93.1%	3.22

And a benchmark of the methods for summarization:

Method	Requirements	Accuracy	Cluster count
Random		90.5%	2.00
Output-based selection
Coverage var	Coverage scores	92.2%	2.30
Coverage avg	Coverage scores	91.8%	2.20
IRT diff.×disc.	Coverage scores	92.6%	2.44
Diversity BLEU	Outputs	89.3%	2.90
Diversity unigram	Outputs	87.2%	2.80
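The output-diversity rows in the tables above rank items by how much the system outputs differ from each other. The sketch below illustrates that idea with a Jaccard overlap of word sets, which is a stand-in chosen for brevity and not the package's BLEU or unigram implementation. The function names unigram_overlap and output_diversity are hypothetical, and items are assumed to carry a tgt mapping of system to output string.

```python
from itertools import combinations

def unigram_overlap(a, b):
    """Jaccard overlap between the lowercased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty outputs are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def output_diversity(item):
    """One minus the mean pairwise overlap of the system outputs in item['tgt']."""
    pairs = list(combinations(item["tgt"].values(), 2))
    if not pairs:
        return 0.0  # fewer than two systems: no diversity to measure
    mean_overlap = sum(unigram_overlap(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_overlap

items = [
    {"i": 0, "tgt": {"sysA": "the cat sat", "sysB": "the cat sat"}},
    {"i": 1, "tgt": {"sysA": "the cat sat", "sysB": "a dog stood up"}},
]
# Most diverse item first.
ranked = sorted(items, key=output_diversity, reverse=True)
print([item["i"] for item in ranked])  # prints [1, 0]
```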

Example for Machine Translation

Install the package and download WMT data:

pip3 install subset2evaluate
# optionally these two packages for IRT and PreCOMET based selections
pip3 install git+https://github.com/zouharvi/PreCOMET.git git+https://github.com/zouharvi/py-irt.git

Then in Python we compute the baseline:

import subset2evaluate

data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
len(data_full)
> 1098

# take only top 100 segments to "human-evaluate"
data_new = subset2evaluate.select_subset.run_select_subset(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_new[:100])
> 1

# compare it to something better:
data_new = subset2evaluate.select_subset.run_select_subset(data_full, method="metric_var", metric="MetricX-23")
subset2evaluate.evaluate.eval_subset_clusters(data_new[:100])
> 3

Example for Summarization

import subset2evaluate

data_full = subset2evaluate.utils.load_data("summeval")
len(data_full)
> 100

# take only top 25 segments to "human-evaluate"
data_new = subset2evaluate.select_subset.run_select_subset(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_new[:25], metric="human_relevance")
> 2

data_new = subset2evaluate.select_subset.run_select_subset(data_full, method="diversity_bleu")
subset2evaluate.evaluate.eval_subset_clusters(data_new[:25], metric="human_relevance")
> 3
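The benchmark tables also report an accuracy column, which this page does not define. One plausible reading, given here purely as an assumption, is the fraction of system pairs that a subset ranks in the same order as the full data. The sketch below implements that reading in plain Python; the function names system_means and pairwise_accuracy are hypothetical, and the package's own evaluation may compute something different.

```python
from itertools import combinations
from statistics import mean

def system_means(items, metric):
    """Average `metric` per system over the given items."""
    systems = items[0]["scores"].keys()
    return {s: mean(item["scores"][s][metric] for item in items) for s in systems}

def pairwise_accuracy(subset, full, metric="human"):
    """Fraction of system pairs ordered the same way on `subset` as on `full`."""
    means_sub = system_means(subset, metric)
    means_full = system_means(full, metric)
    pairs = list(combinations(means_full, 2))
    agree = sum(
        (means_sub[a] > means_sub[b]) == (means_full[a] > means_full[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Toy data: on the full set the systems rank A > B > C.
full = [
    {"scores": {"A": {"human": 0.9}, "B": {"human": 0.5}, "C": {"human": 0.1}}},
    {"scores": {"A": {"human": 0.8}, "B": {"human": 0.6}, "C": {"human": 0.2}}},
    {"scores": {"A": {"human": 0.3}, "B": {"human": 0.7}, "C": {"human": 0.5}}},
    {"scores": {"A": {"human": 0.9}, "B": {"human": 0.4}, "C": {"human": 0.3}}},
]
print(pairwise_accuracy(full[:2], full))   # subset agrees on all three pairs
print(pairwise_accuracy([full[2]], full))  # subset agrees only on the B vs. C pair
```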

Example for Custom Dataset

The intended usage is for your own custom datasets where you wish to choose which items to evaluate. The input to subset2evaluate needs to be a list of items. What each item needs to contain depends on the method. For example, the diversity methods require tgt on each item so that the output diversity can be computed. As another example, the var methods require scores for the chosen metric on each item so that the metric variance can be computed. An item can contain any extra fields, even if they are not explicitly used. As an example, look at the output of the existing loaders:

import subset2evaluate
import json
data = subset2evaluate.utils.load_data("wmt23/en-de")

len(data)
> 549

json.dumps(data[0], indent=2)
> {
>   "i": 0,
>   "src": "Police arrest 15 after violent protest outside UK refugee hotel",
>   "ref": "Polizei verhaftet 15 Menschen nach gewalttätigen Protesten vor einer Flüchtlingsunterkunft in Großbritannien",
>   "tgt": {
>     "Lan-BridgeMT": "Polizei verhaftet 15 nach gewalttätigem Protest vor britischem Flüchtlingshotel",
>     "NLLB_MBR_BLEU": "Polizei verhaftet 15 nach gewaltsamen Protesten vor einem britischen Flüchtlingshotel",
>     "ZengHuiMT": "Die Polizei verhaftet 15 Personen nach gewalttätigem Protest vor britischem Flüchtlingshotel.",
>     "ONLINE-A": "Polizei nimmt 15 nach gewalttätigen Protesten vor britischem Flüchtlingshotel fest",
>     "ONLINE-W": "Polizei nimmt 15 Personen nach gewaltsamen Protesten vor einem britischen Flüchtlingshotel fest",
>     "ONLINE-B": "Polizei verhaftet 15 Personen nach gewalttätigem Protest vor britischem Flüchtlingshotel",
>     "NLLB_Greedy": "Polizei verhaftet 15 nach gewalttätigen Protesten vor einem Flüchtlingshotel in Großbritannien",
>     "ONLINE-M": "Polizei verhaftet 15 nach gewalttätigem Protest vor britischem Flüchtlingshotel",
>     "AIRC": "Polizeiverhaftung 15 nach gewaltsamen Protesten außerhalb des britischen Flüchtlingshotels",
>     "ONLINE-Y": "Die Polizei verhaftet 15 Personen nach gewaltsamen Protesten vor einem britischen Flüchtlingshotel",
>     "GPT4-5shot": "Die Polizei nimmt 15 Personen nach gewalttätigen Protesten vor einem britischen Flüchtlingshotel fest.",
>     "ONLINE-G": "Polizei verhaftet 15 nach gewalttätigem Protest vor britischem Flüchtlingshotel"
>   },
>   "time": 0.2119810263850096,
>   "domain": "news",
>   "doc": "aj-english.33941",
>   "scores": {
>     "Lan-BridgeMT": {
>       "human": 0.9175257731958762,
>       "XCOMET-XL": 0.9867596612701105,
>       "f200spBLEU": 0.2759278681802151,
>       ...
>     },
>     "GPT4-5shot": {
>       "human": 0.9948453608247423,
>       "XCOMET-XL": 0.988012809964431,
>       "f200spBLEU": 0.3275118410766353,
>       ...
>     },
>     "ONLINE-G": {
>       "human": 0.8762886597938144,
>       "XCOMET-XL": 0.9867596612701105,
>       "f200spBLEU": 0.2759278681802151,
>       ...
>     }
>   }
> }
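Following the structure above, a custom dataset can be assembled as a plain list of dictionaries. The sketch below builds two toy items and checks for the fields that the diversity and variance methods are described as needing. The check_items helper, the metric name my_metric, and the note field are hypothetical and not part of the package.

```python
def check_items(items, needs_tgt=False, needs_scores=False):
    """Return a list of human-readable problems found in `items` (empty if none)."""
    problems = []
    for idx, item in enumerate(items):
        if needs_tgt and not isinstance(item.get("tgt"), dict):
            problems.append(f"item {idx}: missing 'tgt' mapping of system -> output")
        if needs_scores and not isinstance(item.get("scores"), dict):
            problems.append(f"item {idx}: missing 'scores' mapping of system -> metrics")
    return problems

my_data = [
    {
        "i": 0,
        "src": "Hello world",
        "tgt": {"sysA": "Hallo Welt", "sysB": "Hallo, Welt!"},
        "scores": {"sysA": {"my_metric": 0.9}, "sysB": {"my_metric": 0.7}},
        "note": "extra fields are allowed",
    },
    # This item has outputs but no metric scores yet.
    {"i": 1, "src": "Good morning", "tgt": {"sysA": "Guten Morgen", "sysB": "Morgen"}},
]

print(check_items(my_data, needs_tgt=True))     # no problems for diversity-style methods
print(check_items(my_data, needs_scores=True))  # item 1 lacks scores for variance-style methods
```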

Command-line Interface

We recommend using the Python interface but the package can also be used from the command line:

subset2evaluate wmt23/en-cs --method metric_var --args "{'metric': 'MetricX-23'}" > wmt23_encs_sorted.jsonl
subset2evaluate-eval wmt23/en-cs wmt23_encs_sorted.jsonl 
> Clusters: 2.30
> Accuracy: 86.7%
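The sorted output can then be consumed by other tools. The sketch below assumes, based only on the .jsonl extension, that the file holds one JSON object per line in sorted order, and reads the first k items. The read_top_k helper is hypothetical, and the file written here is a stand-in so that the snippet runs without the package.

```python
import json
import os
import tempfile

def read_top_k(path, k):
    """Read the first k non-empty lines of a JSON Lines file as dictionaries."""
    top = []
    if k <= 0:
        return top
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            top.append(json.loads(line))
            if len(top) == k:
                break
    return top

# Stand-in for the CLI output: three items already in sorted order.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sorted.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for i in [7, 3, 5]:
            f.write(json.dumps({"i": i}) + "\n")
    top2 = read_top_k(path, 2)

print([item["i"] for item in top2])  # prints [7, 3]
```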

Contact & Contributions

We look forward to contributions, especially (1) using subset2evaluate for other tasks, (2) adding new methods, and (3) finding bugs and improving package usability. Please file a GitHub issue or send us an email.

The repository is structured as follows:

subset2evaluate/ contains the primary package and all methods
experiments/ contains scripts to run experiments in the paper


subset2evaluate 0.0.1a7
