PyTorch implementation of BERT score
BERTScore
Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020).
News:
- Updated to version 0.2.3
- Supporting DistilBERT (Sanh et al.), ALBERT (Lan et al.), and XLM-R (Conneau et al.) models.
- Including the version of huggingface's transformers in the hash code for reproducibility
- BERTScore has been accepted at ICLR 2020. Please come to our poster in Addis Ababa, Ethiopia!
- Updated to version 0.2.2
- Bug fixed: when using RoBERTaTokenizer, we now set add_prefix_space=True, which was the default setting in huggingface's pytorch_transformers (when we ran the experiments in the paper) before they migrated it to transformers. This breaking change in transformers leads to a lower correlation with human evaluation. To reproduce our RoBERTa results in the paper, please use version 0.2.2.
- The best number of layers for DistilRoBERTa is included
- Supporting loading a custom model
- Updated to version 0.2.1
- SciBERT (Beltagy et al.) models are now included. Thanks to AI2 for sharing the models. By default, we use the 9th layer (the same as BERT-base), but this is not tuned.
- Our arXiv paper has been updated to v2 with more experiments and analysis.
- Updated to version 0.2.0
- Supporting BERT, XLM, XLNet, and RoBERTa models using huggingface's Transformers library
- Automatically picking the best model for a given language
- Automatically picking the layer based on the model
- IDF is not enabled by default, as we show in the new version that the improvement brought by importance weighting is not consistent
Authors: Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi
*: Equal Contribution
Overview
BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
For an illustration, BERTScore precision is the average, over candidate tokens, of each token's maximum cosine similarity to any reference token; recall is defined symmetrically over the reference tokens, and F1 is their harmonic mean.
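The greedy matching above can be sketched in a few lines of plain Python, using toy 2-d unit vectors in place of real BERT token embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_precision(cand_emb, ref_emb):
    """Average over candidate tokens of the maximum cosine similarity
    to any reference token (greedy matching)."""
    return sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)

# Toy "embeddings": the first candidate token matches a reference token exactly,
# the second only partially.
cand = [(1.0, 0.0), (0.0, 1.0)]
ref = [(1.0, 0.0), (0.6, 0.8)]
print(round(bertscore_precision(cand, ref), 2))  # 0.9
```

Recall swaps the roles of candidate and reference; in practice the embeddings come from a pre-trained transformer layer rather than toy vectors.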
If you find this repo useful, please cite:
@inproceedings{bert-score,
title={BERTScore: Evaluating Text Generation with BERT},
author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=SkeHuCVFDr}
}
Installation
- Python version >= 3.6
- PyTorch version >= 1.0.0
Install from pip by
pip install bert-score
Install it from the source by:
git clone https://github.com/Tiiiger/bert_score
cd bert_score
pip install .
and you may test your installation by:
python -m unittest discover
Usage
Command Line Interface (CLI)
We provide a command line interface (CLI) of BERTScore as well as a python module. For the CLI, you can use it as follows:
- To evaluate English text files:
We provide example inputs under ./example.
bert-score -r example/refs.txt -c example/hyps.txt --lang en
You will get the following output at the end:
roberta-large_L17_no-idf_version=0.2.3(hug_trans=2.3.0) BERT-P: 0.957378 BERT-R: 0.961325 BERT-F1: 0.959333
where "roberta-large_L17_no-idf_version=0.2.3(hug_trans=2.3.0)" is the hash code.
- To evaluate text files in other languages:
We currently support the 104 languages in multilingual BERT (full list).
Please specify the two-letter abbreviation of the language. For instance, use --lang zh for Chinese text.
See more options with bert-score -h.
- To load your own custom model:
Please specify the path to the model and the number of layers to use with --model and --num_layers.
bert-score -r example/refs.txt -c example/hyps.txt --model path_to_my_bert --num_layers 9
- To visualize matching scores:
bert-score-show --lang en -r "There are two bananas on the table." -c "On the table are two apples." -f out.png
The figure will be saved to out.png.
Python Function
For the python module, we provide a demo. Please refer to bert_score/score.py for more details.
Running BERTScore can be computationally intensive (because it uses BERT :p). Therefore, a GPU is usually necessary. If you don't have access to a GPU, you can try our demo on Google Colab.
Practical Tips
- Report the hash code (e.g., roberta-large_L17_no-idf_version=0.2.1) in your paper so that people know what setting you used. This is inspired by sacreBLEU.
- Unlike BERT, RoBERTa uses a GPT2-style tokenizer which creates additional " " tokens when multiple spaces appear together. It is recommended to remove the additional spaces with sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent).
- Using inverse document frequency (idf) on the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate/invalid. We now make it optional. To use idf, please set --idf when using the CLI tool or idf=True when calling the bert_score.score function.
- When you are low on GPU memory, consider setting batch_size when calling the bert_score.score function.
- To use a particular model, please set -m MODEL_TYPE when using the CLI tool or model_type=MODEL_TYPE when calling the bert_score.score function.
- We tune the layer to use based on the WMT16 metric evaluation dataset. You may use a different layer by setting -l LAYER or num_layers=LAYER.
- Limitation: Because BERT, RoBERTa, and XLM with learned positional embeddings are pre-trained on sentences with max length 512, BERTScore is undefined between sentences longer than 510 (512 after adding [CLS] and [SEP] tokens). Sentences longer than this will be truncated. Please consider using XLNet, which can support much longer inputs.
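The space-normalization tip for RoBERTa's GPT2-style tokenizer can be applied with the standard library alone, for example:

```python
import re

def normalize_spaces(sent: str) -> str:
    """Collapse runs of whitespace into single spaces, as recommended
    before scoring with RoBERTa-based models."""
    return re.sub(r'\s+', ' ', sent).strip()

print(normalize_spaces("There  are   two\tbananas on the table."))
# There are two bananas on the table.
```

Note that r'\s+' also collapses tabs and newlines, while the stricter r' +' variant from the tip touches only runs of plain spaces.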
Default Behavior
Default Model
Language | Model |
---|---|
en | roberta-large |
en-sci | scibert-scivocab-uncased |
zh | bert-base-chinese |
others | bert-base-multilingual-cased |
Default Layers
Model | Best Layer | Max Length |
---|---|---|
bert-base-uncased | 9 | 512 |
bert-large-uncased | 18 | 512 |
bert-base-cased-finetuned-mrpc | 9 | 512 |
bert-base-multilingual-cased | 9 | 512 |
bert-base-chinese | 8 | 512 |
roberta-base | 10 | 512 |
roberta-large | 17 | 512 |
roberta-large-mnli | 19 | 512 |
roberta-base-openai-detector | 7 | 512 |
roberta-large-openai-detector | 19 | 512 |
xlnet-base-cased | 5 | 1000000000000 |
xlnet-large-cased | 7 | 1000000000000 |
xlm-mlm-en-2048 | 7 | 512 |
xlm-mlm-100-1280 | 11 | 512 |
scibert-scivocab-uncased | 9* | 512 |
scibert-scivocab-cased | 9* | 512 |
scibert-basevocab-uncased | 9* | 512 |
scibert-basevocab-cased | 9* | 512 |
distilroberta-base | 5 | 512 |
distilbert-base | 5 | 512 |
distilbert-base-uncased | 5 | 512 |
distilbert-base-uncased-distilled-squad | 4 | 512 |
distilbert-base-multilingual-cased | 5 | 512 |
albert-base-v1 | 10 | 512 |
albert-large-v1 | 17 | 512 |
albert-xlarge-v1 | 16 | 512 |
albert-xxlarge-v1 | 8 | 512 |
albert-base-v2 | 9 | 512 |
albert-large-v2 | 14 | 512 |
albert-xlarge-v2 | 13 | 512 |
albert-xxlarge-v2 | 8 | 512 |
xlm-roberta-base | 9 | 512 |
xlm-roberta-large | 17 | 512 |
*: Not tuned
Acknowledgement
This repo wouldn't be possible without the awesome bert, fairseq, and transformers.