MENLI metrics v1

MENLI

This repository contains the code and data for our TACL paper: MENLI: Robust Evaluation Metrics from Natural Language Inference.

Abstract: Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).

🚀 MENLI Benchmark

We release our adversarial datasets. Please check here and the evaluation script for more details about how to run metrics on them.

2023-4-11 Update: we uploaded a new version of adversarial datasets for ref-based MT evaluation, which fixes some space and case errors (more details).
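As a rough illustration of what a preference-based adversarial evaluation measures, the sketch below computes how often a metric scores a meaning-preserving paraphrase above an adversarially perturbed candidate for the same reference. The function names, the three-list layout, and the toy overlap metric are assumptions made for this example; they do not reflect the actual dataset format or the evaluation script.

```python
from typing import Callable, List, Sequence


def preference_accuracy(
    metric: Callable[[List[str], List[str]], Sequence[float]],
    refs: List[str],
    paraphrases: List[str],
    adversarials: List[str],
) -> float:
    """Fraction of examples where the paraphrase outscores the adversarial text."""
    if not (len(refs) == len(paraphrases) == len(adversarials)):
        raise ValueError("all three lists must have the same length")
    if not refs:
        raise ValueError("need at least one example")
    good = metric(refs, paraphrases)   # scores for (ref, paraphrase) pairs
    bad = metric(refs, adversarials)   # scores for (ref, adversarial) pairs
    wins = sum(1 for g, b in zip(good, bad) if g > b)
    return wins / len(refs)


def toy_overlap(refs: List[str], hyps: List[str]) -> List[float]:
    """Hypothetical stand-in metric: Jaccard overlap of lowercased tokens."""
    scores = []
    for ref, hyp in zip(refs, hyps):
        r, h = set(ref.lower().split()), set(hyp.lower().split())
        scores.append(len(r & h) / len(r | h) if r | h else 0.0)
    return scores


refs = ["the cat sat", "he paid 10 dollars"]
paraphrases = ["the cat sat down", "he gave ten bucks"]
adversarials = ["the dog ran", "he paid 20 dollars"]
accuracy = preference_accuracy(toy_overlap, refs, paraphrases, adversarials)
```

In the second example the surface-overlap metric prefers the candidate with the wrong number, which is the kind of failure the adversarial datasets are meant to expose.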

🚀 MENLI Metrics

2023-11-09 Update: MENLI is now available on PyPI:

pip install menli

We provide a demo implementation of the ensemble metrics. Note that this implementation is still imperfect.

Example of Usage

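#from MENLI import MENLI
from menli.MENLI import MENLI

# Illustrative inputs: parallel lists of strings, one hypothesis per reference.
refs = ["The cat sat on the mat."]
hyps = ["A cat was sitting on the mat."]

scorer = MENLI(direction="rh", formula="e", nli_weight=0.2,
               combine_with="MoverScore", model="D", cross_lingual=False)
# refs and hyps must each be a list of strings
scores = scorer.score_all(refs=refs, hyps=hyps)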

For example, to run XNLI-D on WMT15:

python wmt.py --year 2015 --cross_lingual --direction avg --formula e --model D

To run the combined metric with BERTScore-F1 on WMT17:

python wmt.py --year 2017 --combine_with BERTScore-F --nli_weight 0.2 --model R
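To give an intuition for what the --nli_weight flag controls, here is a minimal sketch of a weighted ensemble. It assumes the combined score is a weighted sum of min-max-normalized NLI and base-metric scores; that assumption follows the general description of the combination in the paper and has not been checked against this package's code. The helper names `min_max` and `combine` are hypothetical.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(
    nli_scores: Sequence[float],
    metric_scores: Sequence[float],
    nli_weight: float = 0.2,
) -> List[float]:
    """Weighted sum of normalized NLI scores and normalized base-metric scores."""
    if len(nli_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= nli_weight <= 1.0:
        raise ValueError("nli_weight must lie in [0, 1]")
    n = min_max(nli_scores)
    m = min_max(metric_scores)
    return [nli_weight * a + (1.0 - nli_weight) * b for a, b in zip(n, m)]


combined = combine([0.0, 1.0, 0.5], [1.0, 0.0, 0.5], nli_weight=0.2)
```

With a weight of 0.2 the base metric dominates, so the first segment (low NLI score, high base-metric score) still ranks above the second.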

We implemented the combination with MoverScore, BERTScore-F1, and XMoverScore here. To combine MENLI with other metrics, add the corresponding code to metric_utils.py.

Specifically, initialize the scorer in the init_scorer() function, for example:

def init_scorer(**metric_config):

    from bert_score.scorer import BERTScorer
    scorer = BERTScorer(lang='en', idf=True)
    metric_hash = scorer.hash

    return scorer, metric_hash

Then call the metric scoring function in scoring():

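def scoring(scorer, refs, hyps, sources):

    # Compute IDF weights from the references when the scorer uses them.
    if scorer.idf:
        scorer.compute_idf(refs)
    # score() returns (precision, recall, F1); index 2 selects F1.
    # .cpu() keeps the conversion to NumPy working when the model runs on a GPU.
    scores = scorer.score(hyps, refs)[2].detach().cpu().numpy().tolist()
    # Note: the output of the metric must be a list.
    return scores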
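The same two-function pattern can be followed for any other metric. The sketch below plugs in a deliberately simple unigram F1 scorer to show the expected shape: init_scorer() returns a scorer plus an identifying hash, and scoring() returns one float per hypothesis as a plain list. The TokenF1Scorer class and its hash string are hypothetical and not part of MENLI; how metric_utils.py uses the hash is not shown here.

```python
from collections import Counter


class TokenF1Scorer:
    """Hypothetical stand-in metric: unigram F1 between hypothesis and reference."""

    hash = "token-f1-demo"  # made-up identifier for this example

    def score(self, hyps, refs):
        scores = []
        for hyp, ref in zip(hyps, refs):
            h = Counter(hyp.lower().split())
            r = Counter(ref.lower().split())
            overlap = sum((h & r).values())  # clipped count of shared tokens
            if overlap == 0:
                scores.append(0.0)
                continue
            precision = overlap / sum(h.values())
            recall = overlap / sum(r.values())
            scores.append(2 * precision * recall / (precision + recall))
        return scores


def init_scorer(**metric_config):
    scorer = TokenF1Scorer()
    metric_hash = scorer.hash
    return scorer, metric_hash


def scoring(scorer, refs, hyps, sources):
    # Reference-based metric: sources are accepted but unused here.
    scores = scorer.score(hyps, refs)
    return list(scores)  # the output must be a list


scorer, metric_hash = init_scorer()
demo_scores = scoring(scorer, ["a b c d", "the cat sat"], ["a b", "the cat sat"], None)
```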

🚀 Experiments

To reproduce the experiments conducted in this work, please check the folder experiments.

Citation

If you use the code or data from this work, please include the following citation:

@article{chen_menli:2023,
    author = {Chen, Yanran and Eger, Steffen},
    title = "{MENLI: Robust Evaluation Metrics from Natural Language Inference}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {804-825},
    year = {2023},
    month = {07},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00576},
}

If you have any questions, feel free to contact us!

Yanran Chen (yanran.chen@stud.tu-darmstadt.de) and Steffen Eger (steffen.eger@uni-bielefeld.de)

Check our group page (NLLG) for other ongoing projects!
