MENLI metrics v1

MENLI

This repository contains the code and data for our TACL paper: MENLI: Robust Evaluation Metrics from Natural Language Inference.

Abstract: Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).

🚀 MENLI Benchmark

We release our adversarial datasets. See the dataset description and the evaluation script for details on how to run metrics on them.

2023-04-11 Update: We uploaded a new version of the adversarial datasets for reference-based MT evaluation, which fixes some spacing and casing errors (more details).

🚀 MENLI Metrics

2023-11-09 Update: MENLI is now available on PyPI:

pip install menli

We provide a demo implementation of the ensemble metrics; note that this implementation is still imperfect.

Example of Usage

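# Import from the PyPI package; inside a repository checkout the import
# may instead be `from MENLI import MENLI`.
from menli.MENLI import MENLI

scorer = MENLI(direction="rh", formula="e", nli_weight=0.2,
               combine_with="MoverScore", model="D", cross_lingual=False)
# refs and hyps are equal-length lists of strings (illustrative sentences)
refs = ["The cat sat on the mat."]
hyps = ["A cat was sitting on the mat."]
scores = scorer.score_all(refs=refs, hyps=hyps)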

E.g., run XNLI-D on WMT15:

python wmt.py --year 2015 --cross_lingual --direction avg --formula e --model D

Run the combined metric with BERTScore-F1 on WMT17:

python wmt.py --year 2017 --combine_with BERTScore-F --nli_weight 0.2 --model R
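The nli_weight option controls how much the NLI score contributes to the combined metric. The following is a minimal sketch of such a weighted combination, not the package's actual code. It assumes both score lists are min-max normalized to [0, 1] before being mixed linearly, and the helper names min_max_normalize and combine_scores are hypothetical.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(nli_scores: List[float],
                   metric_scores: List[float],
                   nli_weight: float = 0.2) -> List[float]:
    """Linear mix of NLI and base-metric scores (assumed form, see lead-in)."""
    if len(nli_scores) != len(metric_scores):
        raise ValueError("nli_scores and metric_scores must have equal length")
    if not 0.0 <= nli_weight <= 1.0:
        raise ValueError("nli_weight must lie in [0, 1]")
    nli_norm = min_max_normalize(nli_scores)
    metric_norm = min_max_normalize(metric_scores)
    return [nli_weight * n + (1.0 - nli_weight) * m
            for n, m in zip(nli_norm, metric_norm)]


# Illustrative values only: two hypotheses scored by both components.
combined = combine_scores([0.0, 1.0], [1.0, 0.0], nli_weight=0.2)
```

Under this assumption, nli_weight=0 reduces to the normalized base metric and nli_weight=1 to the normalized NLI score.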

We implemented the combination with MoverScore, BERTScore-F1, and XMoverScore. To combine MENLI with other metrics, add the corresponding code to metric_utils.py.

Specifically, initialize the scorer in the init_scorer() function, for example:

def init_scorer(**metric_config):

    from bert_score.scorer import BERTScorer
    scorer = BERTScorer(lang='en', idf=True)
    metric_hash = scorer.hash

    return scorer, metric_hash

Then call the metric scoring function in scoring():

def scoring(scorer, refs, hyps, sources):

    if scorer.idf:
        scorer.compute_idf(refs)
    scores = scorer.score(hyps, refs)[2].detach().numpy().tolist()  # F1
    # Note: the outputs of the metric should be a list.
    return scores
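The same two-function pattern should carry over to other metrics. As a rough sketch, the block below plugs in a toy whitespace-token F1 metric. The UnigramF1Scorer class and its hash string are hypothetical stand-ins, not part of MENLI or bert_score; only the init_scorer() and scoring() signatures follow the example above.

```python
from collections import Counter
from typing import List


class UnigramF1Scorer:
    """Toy stand-in metric (hypothetical): whitespace-token overlap F1."""

    hash = "unigram-f1-toy"  # illustrative identifier for this configuration

    def score(self, hyps: List[str], refs: List[str]) -> List[float]:
        scores = []
        for hyp, ref in zip(hyps, refs):
            hyp_counts = Counter(hyp.lower().split())
            ref_counts = Counter(ref.lower().split())
            # Multiset intersection counts tokens shared by both sides.
            overlap = sum((hyp_counts & ref_counts).values())
            if overlap == 0:
                scores.append(0.0)
                continue
            precision = overlap / sum(hyp_counts.values())
            recall = overlap / sum(ref_counts.values())
            scores.append(2 * precision * recall / (precision + recall))
        return scores


def init_scorer(**metric_config):
    # Build the scorer and return it with an identifying hash,
    # mirroring the BERTScore example.
    scorer = UnigramF1Scorer()
    metric_hash = scorer.hash
    return scorer, metric_hash


def scoring(scorer, refs, hyps, sources):
    # sources is unused by this reference-based toy metric.
    scores = scorer.score(hyps, refs)
    # The output must be a plain list with one score per hypothesis.
    return list(scores)


scorer, metric_hash = init_scorer()
example_scores = scoring(scorer, refs=["a b c", "a b"], hyps=["a b c", "c d"], sources=None)
```

A real replacement would swap UnigramF1Scorer for the metric's own scorer object while keeping the return types unchanged.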

🚀 Experiments

To reproduce the experiments from the paper, see the experiments folder.

Citation

If you use the code or data from this work, please include the following citation:

@article{chen_menli:2023,
    author = {Chen, Yanran and Eger, Steffen},
    title = "{MENLI: Robust Evaluation Metrics from Natural Language Inference}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {804-825},
    year = {2023},
    month = {07},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00576},
}

If you have any questions, feel free to contact us!

Yanran Chen (yanran.chen@stud.tu-darmstadt.de) and Steffen Eger (steffen.eger@uni-bielefeld.de)

Check our group page (NLLG) for other ongoing projects!
