PyTorch implementation of BERT score

BERTScore

Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020).

News:

Updated to version 0.3.2
- Bug fixed: a bug in v0.3.1 that occurred when using multiple reference sentences.
- Supporting multiple reference sentences with our command line tool.
Updated to version 0.3.1
- A new BERTScorer object that caches the model to avoid re-loading it multiple times. Please see our Jupyter notebook example for usage.
- Supporting multiple reference sentences for each example. The score function can now take a list of lists of strings as the references and returns the score between the candidate sentence and its closest reference sentence.
Updated to version 0.3.0
- Supporting Baseline Rescaling: we apply a simple linear transformation to enhance the readability of BERTScore using pre-computed "baselines". It has been pointed out (e.g. by #20, #23) that the numerical range of BERTScore is exceedingly small when computed with RoBERTa models. In other words, although BERTScore correctly distinguishes examples through ranking, the numerical scores of good and bad examples are very similar. We detail our approach in a separate post.
Updated to version 0.2.3
- Supporting DistilBERT (Sanh et al.), ALBERT (Lan et al.), and XLM-R (Conneau et al.) models.
- Including the version of huggingface's transformers in the hash code for reproducibility.
BERTScore was accepted at ICLR 2020. Please come to our poster in Addis Ababa, Ethiopia!
Updated to version 0.2.2
- Bug fixed: when using RoBERTaTokenizer, we now set add_prefix_space=True, which was the default setting in huggingface's pytorch_transformers (when we ran the experiments in the paper) before they migrated it to transformers. This breaking change in transformers leads to a lower correlation with human evaluation. To reproduce our RoBERTa results in the paper, please use version 0.2.2.
- The best number of layers for DistilRoBERTa is included
- Supporting loading a custom model
Updated to version 0.2.1
- SciBERT (Beltagy et al.) models are now included. Thanks to AI2 for sharing the models. By default, we use the 9th layer (the same as BERT-base), but this is not tuned.
Our arXiv paper has been updated to v2 with more experiments and analysis.
Updated to version 0.2.0
- Supporting BERT, XLM, XLNet, and RoBERTa models using huggingface's Transformers library
- Automatically picking the best model for a given language
- Automatically picking the layer based on the model
- IDF is no longer enabled by default, as we show in the new version of the paper that the improvement brought by importance weighting is not consistent

Authors:

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, Yoav Artzi

*: Equal Contribution

Overview

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

[Figure: illustration of how BERTScore precision is computed]
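The greedy matching behind precision, recall, and F1 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the package's implementation: toy 2-dimensional vectors stand in for BERT embeddings, idf weighting is omitted, and the names `cosine` and `greedy_match_scores` are made up for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy cosine matching of token vectors."""
    # sim[i][j] is the similarity of candidate token i and reference token j.
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: every candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: every reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs))) for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    # Guard against division by zero when both scores are zero.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two candidate tokens, one reference token.
p, r, f1 = greedy_match_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(p, r, f1)
```

In this toy case only one of the two candidate tokens has a counterpart in the reference, so precision is lower than recall.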

If you find this repo useful, please cite:

@inproceedings{bert-score,
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=SkeHuCVFDr}
}

Installation

Python version >= 3.6
PyTorch version >= 1.0.0

Install from PyPI with pip:

pip install bert-score

Install the latest unstable version from the master branch on GitHub:

pip install git+https://github.com/Tiiiger/bert_score

Install from source:

git clone https://github.com/Tiiiger/bert_score
cd bert_score
pip install .

You can then test your installation with:

python -m unittest discover

Usage

Command Line Interface (CLI)

We provide a command line interface (CLI) for BERTScore as well as a Python module. The CLI can be used as follows:

To evaluate English text files:

We provide example inputs under ./example.

bert-score -r example/refs.txt -c example/hyps.txt --lang en

You will get the following output at the end:

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0) P: 0.957378 R: 0.961325 F1: 0.959333

where "roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)" is the hash code.
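If you collect hash codes from several runs, a small parser can help compare settings. The sketch below is inferred only from the example hash codes shown in this document; the function name `parse_hash_code` is hypothetical, the `idf` spelling for runs with idf enabled is an assumption, and other versions of the package may format the hash differently.

```python
import re

# Pattern inferred from the example hash codes in this README (an assumption,
# not a documented format).
HASH_PATTERN = re.compile(
    r"^(?P<model>.+)_L(?P<layer>\d+)_(?P<idf>idf|no-idf)"
    r"_version=(?P<version>[\d.]+)"
    r"(?:\(hug_trans=(?P<hug_trans>[\d.]+)\))?"
    r"(?P<rescaled>-rescaled)?$"
)


def parse_hash_code(code):
    """Split a BERTScore hash code into its components."""
    match = HASH_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized hash code: {code!r}")
    fields = match.groupdict()
    fields["layer"] = int(fields["layer"])
    fields["idf"] = fields["idf"] == "idf"
    fields["rescaled"] = fields["rescaled"] is not None
    return fields


print(parse_hash_code("roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)"))
```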

Starting from version 0.3.0, we support rescaling the scores with baseline scores:

bert-score -r example/refs.txt -c example/hyps.txt --lang en --rescale-with-baseline

You will get:

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)-rescaled P: 0.747044 R: 0.770484 F1: 0.759045

This makes the range of the scores larger and more human-readable. Please see this post for details.
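The rescaling is a linear transformation. A minimal sketch, assuming the form (score - b) / (1 - b) for a pre-computed baseline b; the function name and the baseline value below are placeholders for illustration, not the package's pre-computed baselines:

```python
def rescale_with_baseline(score, baseline):
    """Map the baseline to 0 while keeping a perfect score of 1 at 1."""
    return (score - baseline) / (1.0 - baseline)


# Hypothetical baseline of 0.8: raw scores squeezed into [0.8, 1.0]
# are spread over [0.0, 1.0].
for raw in (0.8, 0.9, 1.0):
    print(raw, rescale_with_baseline(raw, baseline=0.8))
```

Because the transformation is monotonic, it changes the numerical range of the scores but not the ranking of examples.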

When you have multiple reference sentences, use:

bert-score -r example/refs.txt example/refs2.txt -c example/hyps.txt --lang en

where the -r argument supports an arbitrary number of reference files. Each reference file should have the same number of lines as your candidate/hypothesis file. The i-th line in each reference file corresponds to the i-th line in the candidate file.
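The same line-by-line alignment can be reproduced in Python when preparing inputs for the score function, which accepts a list of lists of strings as references. The helper below is a sketch; `load_multi_references` is a hypothetical name and not part of the package.

```python
from pathlib import Path


def load_multi_references(ref_paths, cand_path):
    """Read candidates and group the i-th line of every reference file together."""
    cands = Path(cand_path).read_text(encoding="utf-8").splitlines()
    ref_columns = [
        Path(path).read_text(encoding="utf-8").splitlines() for path in ref_paths
    ]
    # Every reference file must have the same number of lines as the candidate file.
    for path, column in zip(ref_paths, ref_columns):
        if len(column) != len(cands):
            raise ValueError(
                f"{path} has {len(column)} lines, expected {len(cands)}"
            )
    # refs[i] holds all references for the i-th candidate.
    refs = [list(group) for group in zip(*ref_columns)]
    return cands, refs
```

For example, `load_multi_references(["example/refs.txt", "example/refs2.txt"], "example/hyps.txt")` would return the candidates together with one list of references per candidate.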

To evaluate text files in other languages:

We currently support the 104 languages in multilingual BERT (full list).

Please specify the two-letter abbreviation of the language. For instance, use --lang zh for Chinese text.

See more options with bert-score -h.

To load your own custom model: Please specify the path to the model and the number of layers to use by --model and --num_layers.

bert-score -r example/refs.txt -c example/hyps.txt --model path_to_my_bert --num_layers 9

To visualize matching scores:

bert-score-show --lang en -r "There are two bananas on the table." -c "On the table are two apples." -f out.png

The figure will be saved to out.png.

Python Function

For the Python module, we provide a demo. Please refer to bert_score/score.py for more details.

Running BERTScore can be computationally intensive (because it uses BERT :p). Therefore, a GPU is usually necessary. If you don't have access to a GPU, you can try our demo on Google Colab.
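A minimal sketch of calling the Python module is shown below. The wrapper name `score_english` is made up for this example; it assumes bert-score 0.3.0 or later is installed, since it passes the rescale_with_baseline option, and it averages the per-sentence scores to obtain corpus-level numbers like those printed by the CLI.

```python
def score_english(cands, refs, rescale=False):
    """Return mean precision, recall, and F1 over all candidates as floats."""
    # Imported lazily so this function can be defined without bert-score installed.
    from bert_score import score

    # `score` returns three tensors with one entry per candidate sentence.
    # `refs` may be a list of strings or, for multiple references, a list of lists.
    P, R, F1 = score(cands, refs, lang="en", rescale_with_baseline=rescale)
    return P.mean().item(), R.mean().item(), F1.mean().item()
```

Calling `score_english(["On the table are two apples."], ["There are two bananas on the table."])` loads the default English model on first use, which can take a while without a GPU.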

Practical Tips

- Report the hash code (e.g., roberta-large_L17_no-idf_version=0.2.1) in your paper so that people know which setting you use. This is inspired by sacreBLEU.
- Unlike BERT, RoBERTa uses a GPT2-style tokenizer, which creates additional " " tokens when multiple spaces appear together. It is recommended to remove additional spaces with sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent).
- Using inverse document frequency (idf) on the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate or invalid, so idf is now optional. To use idf, please set --idf when using the CLI tool or idf=True when calling the bert_score.score function.
- When you are low on GPU memory, consider setting batch_size when calling the bert_score.score function.
- To use a particular model, please set -m MODEL_TYPE when using the CLI tool or model_type=MODEL_TYPE when calling the bert_score.score function.
- We tune the layer to use on the WMT16 metric evaluation dataset. You may use a different layer by setting -l LAYER or num_layers=LAYER.
- Limitation: Because BERT, RoBERTa, and XLM with learned positional embeddings are pre-trained on sentences with max length 512, BERTScore is undefined between sentences longer than 510 tokens (512 after adding [CLS] and [SEP] tokens). Sentences longer than this will be truncated. Please consider using XLNet, which can support much longer inputs.
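The whitespace tip above can be applied as a preprocessing step before scoring. The sketch below wraps the two regular expressions from the tip in helper functions (the function names are made up for this example) to show how they differ: the first collapses only runs of spaces, the second collapses any whitespace, including tabs and newlines.

```python
import re


def collapse_spaces(sent):
    """Replace each run of space characters with a single space."""
    return re.sub(r" +", " ", sent)


def collapse_whitespace(sent):
    """Replace each run of any whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", sent)


print(repr(collapse_spaces("two   bananas\t\ton the table")))
print(repr(collapse_whitespace("two   bananas\t\ton the table")))
```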

Default Behavior

Default Model

Language	Model
en	roberta-large
en-sci	scibert-scivocab-uncased
zh	bert-base-chinese
others	bert-base-multilingual-cased
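
The table above amounts to a lookup with a multilingual fallback. The following sketch mirrors the table only; the names `DEFAULT_MODELS`, `FALLBACK_MODEL`, and `default_model_for` are hypothetical and do not describe the package's internals.

```python
# Language-specific defaults copied from the table above.
DEFAULT_MODELS = {
    "en": "roberta-large",
    "en-sci": "scibert-scivocab-uncased",
    "zh": "bert-base-chinese",
}
# Every other language falls back to multilingual BERT.
FALLBACK_MODEL = "bert-base-multilingual-cased"


def default_model_for(lang):
    """Return the default model name for a language code."""
    return DEFAULT_MODELS.get(lang, FALLBACK_MODEL)


print(default_model_for("en"))
print(default_model_for("de"))
```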

Default Layers

Model	Best Layer	Max Length
bert-base-uncased	9	512
bert-large-uncased	18	512
bert-base-cased-finetuned-mrpc	9	512
bert-base-multilingual-cased	9	512
bert-base-chinese	8	512
roberta-base	10	512
roberta-large	17	512
roberta-large-mnli	19	512
roberta-base-openai-detector	7	512
roberta-large-openai-detector	19	512
xlnet-base-cased	5	1000000000000
xlnet-large-cased	7	1000000000000
xlm-mlm-en-2048	7	512
xlm-mlm-100-1280	11	512
scibert-scivocab-uncased	9*	512
scibert-scivocab-cased	9*	512
scibert-basevocab-uncased	9*	512
scibert-basevocab-cased	9*	512
distilroberta-base	5	512
distilbert-base	5	512
distilbert-base-uncased	5	512
distilbert-base-uncased-distilled-squad	4	512
distilbert-base-multilingual-cased	5	512
albert-base-v1	10	512
albert-large-v1	17	512
albert-xlarge-v1	16	512
albert-xxlarge-v1	8	512
albert-base-v2	9	512
albert-large-v2	14	512
albert-xlarge-v2	13	512
albert-xxlarge-v2	8	512
xlm-roberta-base	9	512
xlm-roberta-large	17	512

*: Not tuned

Acknowledgement

This repo wouldn't be possible without the awesome bert, fairseq, and transformers.

Release history

Version	Release date
0.3.13	Feb 20, 2023
0.3.12	Oct 14, 2022
0.3.11	Dec 10, 2021
0.3.10	Aug 5, 2021
0.3.9	Apr 17, 2021
0.3.8	Mar 3, 2021
0.3.7	Dec 6, 2020
0.3.6	Sep 3, 2020
0.3.5	Jul 17, 2020
0.3.4	Jun 10, 2020
0.3.3	May 10, 2020
0.3.2 (this version)	Apr 18, 2020
0.3.1	Mar 5, 2020
0.3.0	Jan 14, 2020
0.2.3	Dec 22, 2019
0.2.2	Nov 30, 2019
0.2.1	Oct 29, 2019
0.2.0	Oct 2, 2019
0.1.2	Apr 27, 2019
0.1.1	Apr 27, 2019
0.1.0	Apr 23, 2019

Download files

Source Distribution

bert_score-0.3.2.tar.gz (39.7 kB)

Uploaded Apr 18, 2020 Source

Built Distribution

bert_score-0.3.2-py3-none-any.whl (52.2 kB)

Uploaded Apr 18, 2020 Python 3

File details

Details for the file bert_score-0.3.2.tar.gz.

File metadata

Download URL: bert_score-0.3.2.tar.gz
Upload date: Apr 18, 2020
Size: 39.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.3

File hashes

Hashes for bert_score-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`0acfc11d93c3514eca8ad95a034293c2ee852aabd704d0babc97f4e6880a5521`
MD5	`4910784becf90d1123903e4fa2493c58`
BLAKE2b-256	`8af2fe7f8090da39c908082f08f2f65cb5bf0e9d3144d594efbcabd5dbafe421`

File details

Details for the file bert_score-0.3.2-py3-none-any.whl.

File metadata

Download URL: bert_score-0.3.2-py3-none-any.whl
Upload date: Apr 18, 2020
Size: 52.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.3

File hashes

Hashes for bert_score-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07ddc7a7995188201f107300f877474f88208c82b9bb92172288de621868e7fe`
MD5	`d5b24857273d9c1b6de62d39b7fa8200`
BLAKE2b-256	`5bdb391c067946b946ab6818d49bd38fe5c53c7d7108bea4f0135bde614f41bb`
