
Toolkit for summarization evaluation

Project description

# Summarization Repository

Authors: Alex Fabbri*, Wojciech Kryściński*, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

This project is a collaboration between the Yale LILY Lab and Salesforce Research.



\* - Equal contributions from authors

## Table of Contents

1. Updates
2. Data
3. Evaluation Toolkit
4. Citation
5. Get Involved

## Updates

_04/19/2020_ - Updated the human annotation file to include all models from the paper and metric scores.
_04/19/2020_ - SummEval is now pip-installable! Check out the pypi page.
_04/09/2020_ - Please see this comment with code for computing system-level metric correlations! (A minimal sketch of one way to compute such correlations appears just after this list.)
_11/12/2020_ - Added the reference-less BLANC and SUPERT metrics!
_7/16/2020_ - Initial commit! :)
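System-level correlations compare a metric to human judgments after aggregating scores per system rather than per example. The snippet below is a minimal sketch of that general idea, not the code from the linked comment; the record structure, the example numbers, and the choice of Kendall's tau are illustrative assumptions.

```python
# Hypothetical sketch of system-level metric correlation:
# average each metric and the human ratings per system, then correlate
# the resulting system-level means. Field names and values are illustrative.
from collections import defaultdict
from scipy.stats import kendalltau

# One record per (system, example) pair -- assumed structure.
scores = [
    {"system": "M8",  "metric_score": 0.31, "human_score": 3.2},
    {"system": "M8",  "metric_score": 0.28, "human_score": 2.9},
    {"system": "M22", "metric_score": 0.44, "human_score": 4.1},
    {"system": "M22", "metric_score": 0.40, "human_score": 4.3},
    {"system": "M23", "metric_score": 0.43, "human_score": 4.0},
    {"system": "M23", "metric_score": 0.45, "human_score": 4.2},
]

by_system = defaultdict(lambda: {"metric": [], "human": []})
for row in scores:
    by_system[row["system"]]["metric"].append(row["metric_score"])
    by_system[row["system"]]["human"].append(row["human_score"])

# System-level means for the metric and for the human ratings.
metric_means = [sum(v["metric"]) / len(v["metric"]) for v in by_system.values()]
human_means = [sum(v["human"]) / len(v["human"]) for v in by_system.values()]

tau, p_value = kendalltau(metric_means, human_means)
print(f"System-level Kendall tau: {tau:.3f} (p={p_value:.3f})")
```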
## Data

As part of this release, we share summaries generated by recent summarization models trained on the CNN/DailyMail dataset here.
We also share human annotations, collected from both crowdsource workers and experts, here.

Both datasets are shared WITHOUT the source articles that were used to generate the summaries.
To recreate the full dataset please follow the instructions listed here.

### Model Outputs

|Model|Paper|Outputs|Type|
|-|-|-|-|
|M0|Lead-3 Baseline|Link|Extractive|
|M1|Neural Document Summarization by Jointly Learning to Score and Select Sentences|Link|Extractive|
|M2|BANDITSUM: Extractive Summarization as a Contextual Bandit|Link|Extractive|
|M3|Neural Latent Extractive Document Summarization|Link|Extractive|
|M4|Ranking Sentences for Extractive Summarization with Reinforcement Learning|Link|Extractive|
|M5|Learning to Extract Coherent Summary via Deep Reinforcement Learning|Link|Extractive|
|M6|Neural Extractive Text Summarization with Syntactic Compression|Link|Extractive|
|M7|STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings|Link|Extractive|
|M8|Get To The Point: Summarization with Pointer-Generator Networks|Link|Abstractive|
|M9|Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting|Link|Abstractive|
|M10|Bottom-Up Abstractive Summarization|Link|Abstractive|
|M11|Improving Abstraction in Text Summarization|Link|Abstractive|
|M12|A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss|Link|Abstractive|
|M13|Multi-Reward Reinforced Summarization with Saliency and Entailment|Link|Abstractive|
|M14|Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation|Link|Abstractive|
|M15|Closed-Book Training to Improve Summarization Encoder Memory|Link|Abstractive|
|M16|An Entity-Driven Framework for Abstractive Summarization|Link|Abstractive|
|M17|Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer|Link|Abstractive|
|M18|Better Rewards Yield Better Summaries: Learning to Summarise Without References|Link|Abstractive|
|M19|Text Summarization with Pretrained Encoders|Link|Abstractive|
|M20|Fine-Tuning GPT-2 from Human Preferences|Link|Abstractive|
|M21|Unified Language Model Pre-training for Natural Language Understanding and Generation|Link|Abstractive|
|M22|BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension|Link|Abstractive|
|M23|PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization|Link|Abstractive|

IMPORTANT:

All model outputs were obtained from the original authors of the models and shared with their consent.
When using any of the model outputs, please also cite the original paper.

### Human annotations

Human annotations of model generated summaries can be found here.

The annotations include summaries generated by 16 models from 100 source news articles (1600 examples in total).
Each of the summaries was annotated by 5 independent crowdsource workers and 3 independent experts (8 annotations in total).
Summaries were evaluated across 4 dimensions: coherence, consistency, fluency, relevance.
Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsourced reference summaries.
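To make the annotation data concrete, below is a minimal, hypothetical sketch of reading the annotation file and averaging the expert scores per dimension. The file name and field names (`expert_annotations`, `model_id`) are assumptions about the release format, not a documented schema; adjust them to the actual files you download.

```python
import json

# Hypothetical sketch: average expert scores per dimension for each summary.
# File name and field names are assumptions about the annotation format,
# not a documented schema -- adjust to the actual release.
dimensions = ["coherence", "consistency", "fluency", "relevance"]

with open("model_annotations.aligned.jsonl") as f:   # assumed file name
    for line in f:
        record = json.loads(line)
        expert = record.get("expert_annotations", [])  # assumed field name
        if not expert:
            continue
        averages = {
            dim: sum(ann[dim] for ann in expert) / len(expert)
            for dim in dimensions
        }
        print(record.get("model_id", "?"), averages)   # assumed field name
```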
### Data preparation

Both model generated outputs and human annotated data require pairing with the original CNN/DailyMail articles.

To recreate the datasets follow the instructions:
1. Download CNN Stories and Daily Mail Stories from https://cs.nyu.edu/~kcho/DMQA/
2. Create a cnndm directory and unpack downloaded files into the directory
3. Download and unpack model outputs or human annotations.
4. Run the pair_data.py script to pair the data with original articles

Example call for model outputs:

```
python3 data_processing/pair_data.py --model_outputs <file-with-data-annotations> --story_files <dir-with-stories>
```

Example call for human annotations:

```
python3 data_processing/pair_data.py --data_annotations <file-with-data-annotations> --story_files <dir-with-stories>
```

## Evaluation Toolkit

We provide a toolkit for summarization evaluation to unify metrics and promote robust comparison of summarization systems. The toolkit contains popular and recent metrics for summarization as well as several machine translation metrics.

### Metrics ###

Below are the metrics included in the toolkit, followed by the associated paper and code used within the toolkit:

|Metric|Paper|Code|
|-|-|-|
|ROUGE|ROUGE: A Package for Automatic Evaluation of Summaries|Link|
|ROUGE-we|Better Summarization Evaluation with Word Embeddings for ROUGE|Link|
|MoverScore|MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance|Link|
|BertScore|BertScore: Evaluating Text Generation with BERT|Link|
|Sentence Mover's Similarity|Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts|Link|
|SummaQA|Answers Unite! Unsupervised Metrics for Reinforced Summarization Models|Link|
|BLANC|Fill in the BLANC: Human-free quality estimation of document summaries|Link|
|SUPERT|SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization|Link|
|METEOR|METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments|Link|
|S3|Learning to Score System Summaries for Better Content Selection Evaluation|Link|
|Misc. statistics (extractiveness, novel n-grams, repetition, length)|Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies|Link|
|Syntactic Evaluation|Automatic Analysis of Syntactic Complexity in Second Language writing|Link|
|CIDer|CIDEr: Consensus-based Image Description Evaluation|Link|
|CHRF|CHRF++: words helping character n-grams|Link|
|BLEU|BLEU: a Method for Automatic Evaluation of Machine Translation|Link|

#### SETUP ####

You can install summ_eval via pip:

```bash
pip install summ-eval
```

You can also install summ_eval from source:

```
git clone https://github.com/Yale-LILY/SummEval.git
cd evaluation
pip install -e .
```

You can test your installation (assuming you're in the ./summ_eval folder) and get familiar with the library through tests/:

```
python -m unittest discover
```

### Command-line interface

We provide a command-line interface `calc-scores` which makes use of gin config files to set metric parameters.

##### Examples

Run ROUGE on given source and target files and write to rouge.jsonl, analogous to files2rouge.

```
calc-scores --config-file=examples/basic.config --metrics "rouge" --summ-file summ_eval/1.summ --ref-file summ_eval/1.ref --output-file rouge.jsonl --eos " . " --aggregate True
```

NOTE: if you're seeing slow-ish startup time, try commenting out the metrics you're not using in the config; otherwise this will load all modules.

Run ROUGE and BertScore on a .jsonl file which contains `reference` and `decoded` (i.e., system output) keys and write to rouge_bertscore.jsonl (a sketch of such an input file appears at the end of this section).

```
calc-scores --config-file=examples/basic.config --metrics "rouge, bert_score" --jsonl-file data.jsonl --output-file rouge_bertscore.jsonl
```

For a full list of options, please run:

```
calc-scores --help
```
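For reference, below is a hypothetical sketch of how a `data.jsonl` file with the `reference` and `decoded` keys described above could be put together; the example texts are placeholders and any additional fields the toolkit may accept are not shown.

```python
import json

# Hypothetical sketch: build a data.jsonl file with one JSON object per line,
# using the "reference" and "decoded" keys described above. The texts are
# placeholders -- replace them with your own references and system outputs.
pairs = [
    {"reference": "The quick brown fox jumped over the lazy dog.",
     "decoded": "A quick brown fox jumps over a lazy dog."},
    {"reference": "Officials announced the new policy on Tuesday.",
     "decoded": "The new policy was announced on Tuesday."},
]

with open("data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

With a file like this in place, the `calc-scores` invocation shown above can point `--jsonl-file` at it.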
### For use in scripts

If you want to use the evaluation metrics as part of other scripts, we have you covered!

```
from summ_eval.rouge_metric import RougeMetric
rouge = RougeMetric()
```

#### Evaluate on a batch

```
summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]

rouge_dict = rouge.evaluate_batch(summaries, references)
```

#### Evaluate on a single example

```
rouge_dict = rouge.evaluate_example(summaries[0], references[0])
```

#### Evaluate with multiple references

Currently the command-line tool does not use multiple references for simplicity. Each metric has a `supports_multi_ref` property to tell you if it supports multiple references.

```
print(rouge.supports_multi_ref) # True
multi_references = [["This is ref 1 for summ 1", "This is ref 2 for summ 1"], ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]
rouge_dict = rouge.evaluate_batch(summaries, multi_references)
```

## Citation

```
@article{fabbri2020summeval,
  title={SummEval: Re-evaluating Summarization Evaluation},
  author={Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
  journal={arXiv preprint arXiv:2007.12626},
  year={2020}
}
```

### Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports.
We welcome PRs!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

summ_eval-0.88.tar.gz (80.5 kB)

Uploaded Source

Built Distribution

summ_eval-0.88-py3-none-any.whl (111.9 kB)

Uploaded Python 3
