
Toolkit for summarization evaluation


Summarization Repository

Authors: Alex Fabbri*, Wojciech Kryściński*, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

This project is a collaboration between the Yale LILY Lab and Salesforce Research.


* denotes equal contribution

Table of Contents

  1. Updates
  2. Data
  3. Evaluation Toolkit
  4. Citation
  5. Get Involved

Updates

04/09/2021 - Please see this comment for code to compute system-level metric correlations!
11/12/2020 - Added the reference-less BLANC and SUPERT metrics!
7/16/2020 - Initial commit! :)

Data

As part of this release, we share summaries generated by recent summarization models trained on the CNN/DailyMail dataset here.
We also share human annotations, collected from both crowdsource workers and experts here.

Both datasets are shared WITHOUT the source articles that were used to generate the summaries.
To recreate the full dataset, please follow the instructions listed here.

Model Outputs

Model | Paper | Outputs | Type
M0 | Lead-3 Baseline | Link | Extractive
M1 | Neural Document Summarization by Jointly Learning to Score and Select Sentences | Link | Extractive
M2 | BANDITSUM: Extractive Summarization as a Contextual Bandit | Link | Extractive
M3 | Neural Latent Extractive Document Summarization | Link | Extractive
M4 | Ranking Sentences for Extractive Summarization with Reinforcement Learning | Link | Extractive
M5 | Learning to Extract Coherent Summary via Deep Reinforcement Learning | Link | Extractive
M6 | Neural Extractive Text Summarization with Syntactic Compression | Link | Extractive
M7 | STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings | Link | Extractive
M8 | Get To The Point: Summarization with Pointer-Generator Networks | Link | Abstractive
M9 | Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting | Link | Abstractive
M10 | Bottom-Up Abstractive Summarization | Link | Abstractive
M11 | Improving Abstraction in Text Summarization | Link | Abstractive
M12 | A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss | Link | Abstractive
M13 | Multi-Reward Reinforced Summarization with Saliency and Entailment | Link | Abstractive
M14 | Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation | Link | Abstractive
M15 | Closed-Book Training to Improve Summarization Encoder Memory | Link | Abstractive
M16 | An Entity-Driven Framework for Abstractive Summarization | Link | Abstractive
M17 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Link | Abstractive
M18 | Better Rewards Yield Better Summaries: Learning to Summarise Without References | Link | Abstractive
M19 | Text Summarization with Pretrained Encoders | Link | Abstractive
M20 | Fine-Tuning GPT-2 from Human Preferences | Link | Abstractive
M21 | Unified Language Model Pre-training for Natural Language Understanding and Generation | Link | Abstractive
M22 | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | Link | Abstractive
M23 | PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization | Link | Abstractive

IMPORTANT:

All model outputs were obtained from the original authors of the models and shared with their consent.
When using any of the model outputs, please also cite the original paper.

Human annotations

Human annotations of model-generated summaries can be found here.

The annotations include summaries generated by 16 models from 100 source news articles (1600 examples in total).
Each of the summaries was annotated by 5 independent crowdsource workers and 3 independent experts (8 annotations in total).
Summaries were evaluated across 4 dimensions: coherence, consistency, fluency, relevance.
Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsourced reference summaries.
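The annotation layout described above can be sketched in a few lines. The field names below are purely illustrative, not the released schema; they only show how per-dimension scores from the two rater pools might be averaged:

```python
from statistics import mean

# One annotated summary: raters score 4 dimensions on a 1-5 scale.
# Field names here are illustrative only, not the released data format.
annotations = [
    {"rater": "expert", "coherence": 4, "consistency": 5, "fluency": 5, "relevance": 4},
    {"rater": "expert", "coherence": 3, "consistency": 5, "fluency": 4, "relevance": 4},
    {"rater": "crowd",  "coherence": 5, "consistency": 4, "fluency": 5, "relevance": 3},
]

def mean_scores(annots, dims=("coherence", "consistency", "fluency", "relevance")):
    """Average each quality dimension over a set of annotations."""
    return {d: mean(a[d] for a in annots) for d in dims}

# Expert-only aggregate, as used in many of the paper's analyses.
expert_only = [a for a in annotations if a["rater"] == "expert"]
print(mean_scores(expert_only))
```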

Data preparation

Both the model outputs and the human-annotated data require pairing with the original CNN/DailyMail articles.

To recreate the datasets, follow these instructions:

  1. Download the CNN Stories and Daily Mail Stories from https://cs.nyu.edu/~kcho/DMQA/.
  2. Create a cnndm directory and unpack the downloaded files into it.
  3. Download and unpack the model outputs or human annotations.
  4. Run the pair_data.py script to pair the data with the original articles.

Example call for model outputs:

python3 data_processing/pair_data.py --model_outputs <file-with-model-outputs> --story_files <dir-with-stories>

Example call for human annotations:

python3 data_processing/pair_data.py --data_annotations <file-with-data-annotations> --story_files <dir-with-stories>
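The pairing step can be approximated as follows. This is a schematic only: the real pair_data.py handles the dataset's filename hashing and formatting details, and the `id` and `decoded` field names here are assumptions for illustration:

```python
import json
from pathlib import Path

def pair_outputs(outputs_path, story_dir):
    """Schematic pairing of model outputs with CNN/DailyMail .story files.

    Assumes each jsonl line has an 'id' matching a story filename stem and
    a 'decoded' field holding the generated summary (illustrative names;
    the released files and pair_data.py define the real schema).
    """
    paired = []
    with open(outputs_path) as f:
        for line in f:
            rec = json.loads(line)
            story_file = Path(story_dir) / f"{rec['id']}.story"
            # Attach the source article text when the story file is present.
            rec["article"] = story_file.read_text() if story_file.exists() else None
            paired.append(rec)
    return paired
```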

Evaluation Toolkit

We provide a toolkit for summarization evaluation to unify metrics and promote robust comparison of summarization systems. The toolkit contains popular and recent metrics for summarization as well as several machine translation metrics.

Metrics

Below are the metrics included in the toolkit, followed by the associated paper and the code used within the toolkit:

Metric | Paper | Code
ROUGE | ROUGE: A Package for Automatic Evaluation of Summaries | Link
ROUGE-we | Better Summarization Evaluation with Word Embeddings for ROUGE | Link
MoverScore | MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance | Link
BertScore | BertScore: Evaluating Text Generation with BERT | Link
Sentence Mover's Similarity | Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts | Link
SummaQA | Answers Unite! Unsupervised Metrics for Reinforced Summarization Models | Link
BLANC | Fill in the BLANC: Human-free quality estimation of document summaries | Link
SUPERT | SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization | Link
METEOR | METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments | Link
S3 | Learning to Score System Summaries for Better Content Selection Evaluation | Link
Misc. statistics (extractiveness, novel n-grams, repetition, length) | Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies | Link
Syntactic Evaluation | Automatic Analysis of Syntactic Complexity in Second Language Writing | Link
CIDEr | CIDEr: Consensus-based Image Description Evaluation | Link
CHRF | CHRF++: words helping character n-grams | Link
BLEU | BLEU: a Method for Automatic Evaluation of Machine Translation | Link
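To make concrete what the simplest of these metrics measures, here is a self-contained ROUGE-1 sketch (unigram precision, recall, and F1). This is not the toolkit's implementation, which wraps the official ROUGE package with stemming and bootstrap resampling; it only illustrates the underlying overlap computation:

```python
from collections import Counter

def rouge_1(summary, reference):
    """Simplified ROUGE-1: unigram overlap between a summary and a reference.

    Illustration only -- the official ROUGE package used by the toolkit
    adds stemming, stopword options, and confidence intervals.
    """
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most min(count_summ, count_ref).
    overlap = sum((sum_counts & ref_counts).values())
    if overlap == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    p = overlap / sum(sum_counts.values())
    r = overlap / sum(ref_counts.values())
    return {"p": p, "r": r, "f": 2 * p * r / (p + r)}

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```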

Setup

First install the summ_eval toolkit:

git clone https://github.com/Yale-LILY/SummEval.git
cd SummEval/evaluation
pip install -e .

To finish the setup, please run and follow the prompts in this script:

python setup_finalize.py

You can test your installation and get familiar with the library through the tests/ directory:

python -m unittest discover

Command-line interface

We provide a command-line interface calc-scores which makes use of gin config files to set metric parameters.
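A gin config binds parameters to metric classes one assignment per line. The snippet below is a hypothetical illustration of the format only; the class and parameter names are assumptions, so check examples/basic.config in the repository for the real bindings:

```
# Hypothetical gin bindings -- names are illustrative only;
# see examples/basic.config for the toolkit's actual parameters.
BertScoreMetric.lang = 'en'
RougeMetric.rouge_args = None
```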

Examples

Run ROUGE on given source and target files and write to rouge.jsonl, analogous to files2rouge.

calc-scores --config-file=examples/basic.config --metrics "rouge" --summ-file summ_eval/1.summ --ref-file summ_eval/1.ref --output-file rouge.jsonl --eos " . " --aggregate True

NOTE: if you're seeing slow-ish startup time, try commenting out the metrics you're not using in the config; otherwise this will load all modules.

Run ROUGE and BertScore on a jsonl file which contains "reference" and "summary" keys and write to output.jsonl.

calc-scores --config-file=examples/basic.config --metrics "rouge, bert_score" --jsonl-file data.jsonl --output-file rouge_bertscore.jsonl

For a full list of options, please run:

calc-scores --help

For use in scripts

If you want to use the evaluation metrics as part of other scripts, we have you covered!

from summ_eval.rouge_metric import RougeMetric
rouge = RougeMetric()

Evaluate on a batch

summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]

rouge_dict = rouge.evaluate_batch(summaries, references)

Evaluate on a single example

rouge_dict = rouge.evaluate_example(summaries[0], references[0])

Evaluate with multiple references

For simplicity, the command-line tool currently does not support multiple references. Each metric has a supports_multi_ref property that tells you whether it supports multiple references.

print(rouge.supports_multi_ref) # True
multi_references = [["This is ref 1 for summ 1", "This is ref 2 for summ 1"], ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]
rouge_dict = rouge.evaluate_batch(summaries, multi_references)
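Once per-example scores are computed, the system-level correlation analysis mentioned in the Updates section reduces to averaging scores per system and correlating the means with mean human judgments. A minimal sketch with a hand-rolled Pearson correlation (toy numbers, not real results):

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Per-example scores keyed by system id (toy numbers for illustration).
metric_scores = {"M8": [0.30, 0.35], "M17": [0.40, 0.45], "M22": [0.50, 0.52]}
human_scores  = {"M8": [3.1, 3.3],  "M17": [3.8, 4.0],  "M22": [4.4, 4.6]}

# System-level correlation: average within each system, then correlate.
systems = sorted(metric_scores)
metric_means = [mean(metric_scores[s]) for s in systems]
human_means = [mean(human_scores[s]) for s in systems]
print(pearson(metric_means, human_means))
```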

Citation

@article{fabbri2020summeval,
  title={SummEval: Re-evaluating Summarization Evaluation},
  author={Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
  journal={arXiv preprint arXiv:2007.12626},
  year={2020}
}

Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!
