Summarize long documents in multiple languages
Project description
Generating Extended and Multilingual Summaries with Pre-trained Transformers
Code for the paper Generating Extended and Multilingual Summaries with Pre-trained Transformers accepted at LREC 2022.
Getting started
Create the environment, activate it, and install the requirements.
conda create -n mdmls python=3.7
conda activate mdmls
pip install -r requirements.txt
WikinewsSum dataset
Please refer to https://github.com/airklizz/wikinewssum to download the dataset.
Place the train.json, validation.json, and test.json files in the wikinewssum/ folder.
Preprocessing
Prepare the dataset to fine-tune an abstractive model using an extractive pre-abstractive step.
python mdmls/main.py preprocess extractive-bert \
wikinewssum/train.json \
wikinewssum/train_pre_abstractive.json \
--model-checkpoint distilbert-base-multilingual-cased \
--pre-abstractive \
--abstractive-model-checkpoint google/mt5-small
Tokenize the dataset.
python mdmls/main.py preprocess tokenize \
wikinewssum/train_pre_abstractive.json \
wikinewssum/train_pre_abstractive_tokenized.json \
--source distilbert-base-multilingual-cased_extractive_summary \
--model-checkpoint google/mt5-small
The same steps need to be performed for the validation set.
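For example, a sketch mirroring the two commands above, with the validation files substituted for the training files:
python mdmls/main.py preprocess extractive-bert \
wikinewssum/validation.json \
wikinewssum/validation_pre_abstractive.json \
--model-checkpoint distilbert-base-multilingual-cased \
--pre-abstractive \
--abstractive-model-checkpoint google/mt5-small

python mdmls/main.py preprocess tokenize \
wikinewssum/validation_pre_abstractive.json \
wikinewssum/validation_pre_abstractive_tokenized.json \
--source distilbert-base-multilingual-cased_extractive_summary \
--model-checkpoint google/mt5-small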
Fine-tuning
Use the command line interface to fine-tune a new model on the WikinewsSum dataset.
python mdmls/main.py train run \
--train-data-files wikinewssum/train_pre_abstractive_tokenized.json \
--validation-data-files wikinewssum/validation_pre_abstractive_tokenized.json \
--training-scenario "new-fine-tuning" \
--model-checkpoint google/mt5-base
To see all the parameters:
> python mdmls/main.py train run --help
Usage: main.py train run [OPTIONS]
Options:
--train-data-files TEXT
--validation-data-files TEXT
--training-scenario TEXT
--model-checkpoint TEXT [default: google/mt5-small]
--batch-size INTEGER [default: 8]
--gradient-accumulation-steps INTEGER
[default: 1]
--num-train-epochs INTEGER [default: 8]
--learning-rate FLOAT [default: 5.6e-05]
--weight-decay FLOAT [default: 0.01]
--save-total-limit INTEGER [default: 3]
--push-to-hub / --no-push-to-hub
[default: push-to-hub]
--language TEXT
--max-number-training-sample INTEGER
--help Show this message and exit.
option | description |
---|---|
--language | if specified, only the samples of the specified language are kept; for example, --language en trains on the English samples only (see the example after this table) |
--max-number-training-sample | if specified, limits the number of training samples to the given value |
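For instance, a sketch of restricting fine-tuning to English samples with a capped training set; the scenario name "english-only" and the sample cap of 10000 are only illustrative values:
python mdmls/main.py train run \
--train-data-files wikinewssum/train_pre_abstractive_tokenized.json \
--validation-data-files wikinewssum/validation_pre_abstractive_tokenized.json \
--training-scenario "english-only" \
--model-checkpoint google/mt5-small \
--language en \
--max-number-training-sample 10000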
Evaluation
ROUGE scores
| Methods | Metrics | English | German | French | Spanish | Portuguese | Polish | Italian | All Languages |
|---|---|---|---|---|---|---|---|---|---|
| Extractive Summarisation | | | | | | | | | |
| DistilmBERT | R-1 F | 41.37 | 29.37 | 29.80 | 29.70 | 29.62 | 24.83 | 35.18 | 33.51 |
| | R-2 F | 14.35 | 8.42 | 12.57 | 12.52 | 14.33 | 10.48 | 12.59 | 12.34 |
| | R-L F | 19.66 | 13.65 | 17.10 | 17.07 | 18.75 | 15.03 | 18.43 | 17.30 |
| mBERT | R-1 F | 41.37 | 29.74 | 29.74 | 35.50 | 29.66 | 24.82 | 34.93 | 33.60 |
| | R-2 F | 14.48 | 8.70 | 12.62 | 13.31 | 14.51 | 10.55 | 12.68 | 12.51 |
| | R-L F | 19.63 | 13.83 | 17.13 | 18.10 | 18.86 | 15.07 | 18.86 | 17.36 |
| XLM-RoBERTa | R-1 F | 40.92 | 29.00 | 29.70 | 35.40 | 29.39 | 24.74 | 35.68 | 33.27 |
| | R-2 F | 14.22 | 8.33 | 12.52 | 13.03 | 14.13 | 10.49 | 12.54 | 12.26 |
| | R-L F | 19.66 | 13.54 | 17.07 | 18.05 | 18.43 | 15.03 | 19.54 | 17.26 |
| Oracle | R-1 F | 49.50 | 37.21 | 34.41 | 42.24 | 35.32 | 29.89 | 41.85 | 40.29 |
| | R-2 F | 25.72 | 15.77 | 17.31 | 20.89 | 21.40 | 15.72 | 19.94 | 20.35 |
| | R-L F | 22.67 | 15.93 | 17.38 | 20.54 | 19.19 | 15.33 | 18.61 | 19.16 |
| Abstractive Summarisation after Oracle Pre-Abstractive Extractive Step | | | | | | | | | |
| mT5 Cross-lingual zero-shot transfer | R-1 F | 44.26 | 9.13 | 9.63 | 11.23 | 10.77 | 6.93 | 9.71 | 19.99 |
| | R-2 F | 21.73 | 2.85 | 2.52 | 3.71 | 3.26 | 1.76 | 2.48 | 8.53 |
| | R-L F | 24.25 | 6.31 | 6.32 | 7.81 | 7.51 | 5.05 | 6.53 | 11.92 |
| mT5 In-language multi-task | R-1 F | 43.19 | 33.14 | 36.92 | 37.69 | 34.54 | 27.95 | 37.00 | 37.05 |
| | R-2 F | 21.33 | 13.47 | 17.40 | 17.46 | 18.05 | 13.65 | 13.87 | 17.51 |
| | R-L F | 23.70 | 17.00 | 21.44 | 21.33 | 21.44 | 16.98 | 19.01 | 20.78 |
| mT5 In-language | R-1 F | 44.26 | 35.06 | 39.41 | 43.81 | 41.00 | 32.26 | 4.27 | 40.04 |
| | R-2 F | 21.73 | 13.63 | 17.76 | 19.29 | 20.22 | 14.34 | 0.58 | 18.23 |
| | R-L F | 24.25 | 17.53 | 22.03 | 23.76 | 24.44 | 18.67 | 3.06 | 21.93 |
| Abstractive Summarisation after mBERT Pre-Abstractive Extractive Step | | | | | | | | | |
| mT5 Cross-lingual zero-shot transfer | R-1 F | 37.24 | 7.19 | 9.14 | 10.02 | 9.56 | 6.30 | 12.40 | 17.08 |
| | R-2 F | 13.00 | 1.68 | 1.87 | 2.48 | 2.27 | 1.30 | 2.82 | 5.25 |
| | R-L F | 19.68 | 5.08 | 5.97 | 6.89 | 6.74 | 4.58 | 7.37 | 10.00 |
| mT5 In-language multi-task | R-1 F | 35.56 | 27.05 | 32.59 | 32.94 | 30.01 | 23.53 | 32.90 | 31.30 |
| | R-2 F | 12.28 | 7.84 | 13.06 | 11.65 | 13.14 | 9.37 | 10.24 | 11.24 |
| | R-L F | 18.70 | 13.71 | 18.93 | 18.16 | 18.82 | 14.22 | 16.93 | 17.25 |
| mT5 In-language | R-1 F | 37.24 | 29.65 | 36.02 | 39.79 | 37.21 | 28.47 | 4.32 | 35.03 |
| | R-2 F | 13.00 | 8.32 | 14.08 | 13.86 | 15.46 | 10.66 | 0.10 | 12.37 |
| | R-L F | 19.68 | 14.76 | 20.08 | 21.17 | 13.20 | 16.65 | 2.80 | 18.04 |
ROUGE F-measure results of the three evaluations presented in the paper on WikinewsSum. We compare the extractive models and mT5 in the three training scenarios, with two different pre-abstractive extractive steps (Oracle and mBERT), for each language of the WikinewsSum dataset as well as for the whole dataset.
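For reference, a minimal sketch of how ROUGE F-measures like the ones above can be computed with the rouge-score package; this is only illustrative, and the exact configuration used in the paper (e.g. stemming or multilingual tokenization) may differ.
from rouge_score import rouge_scorer

# Hypothetical example pair; in practice these come from the model output
# and the WikinewsSum reference summaries.
generated_summary = "The generated summary of the Wikinews article."
reference_summary = "The reference summary written by Wikinews editors."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(reference_summary, generated_summary)  # score(target, prediction)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)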
BERTScore scores
| Methods | Metrics | English | German | French | Spanish | Portuguese | Polish | Italian | All Languages |
|---|---|---|---|---|---|---|---|---|---|
| Extractive Summarisation | | | | | | | | | |
| DistilmBERT | B-S P | 0.6920 | 0.6669 | 0.6357 | 0.6807 | 0.6680 | 0.6455 | 0.6706 | 0.6697 |
| | B-S R | 0.7196 | 0.6890 | 0.6846 | 0.7104 | 0.7084 | 0.6834 | 0.7068 | 0.7021 |
| | B-S F | 0.7052 | 0.6774 | 0.6585 | 0.6949 | 0.6869 | 0.6633 | 0.6879 | 0.6850 |
| mBERT | B-S P | 0.6908 | 0.6679 | 0.6354 | 0.6810 | 0.6673 | 0.6456 | 0.6618 | 0.6695 |
| | B-S R | 0.7215 | 0.6931 | 0.6855 | 0.7124 | 0.7084 | 0.6848 | 0.7033 | 0.7041 |
| | B-S F | 0.7055 | 0.6799 | 0.6587 | 0.6960 | 0.6865 | 0.6640 | 0.6816 | 0.6859 |
| XLM-RoBERTa | B-S P | 0.6900 | 0.6658 | 0.6351 | 0.6794 | 0.6660 | 0.6451 | 0.6752 | 0.6684 |
| | B-S R | 0.7173 | 0.6878 | 0.6834 | 0.7087 | 0.7061 | 0.6831 | 0.7099 | 0.7005 |
| | B-S F | 0.7031 | 0.6762 | 0.6576 | 0.6934 | 0.6848 | 0.6629 | 0.6917 | 0.6836 |
| Oracle | B-S P | 0.7238 | 0.6947 | 0.6528 | 0.7058 | 0.6930 | 0.6638 | 0.6919 | 0.6955 |
| | B-S R | 0.7436 | 0.7144 | 0.6967 | 0.7228 | 0.7266 | 0.7024 | 0.7190 | 0.7217 |
| | B-S F | 0.7332 | 0.7039 | 0.6731 | 0.7138 | 0.7087 | 0.6818 | 0.7047 | 0.7077 |
| Abstractive Summarisation after Oracle Pre-Abstractive Extractive Step | | | | | | | | | |
| mT5 Cross-lingual zero-shot transfer | B-S P | 0.7526 | 0.6814 | 0.6687 | 0.7014 | 0.6864 | 0.6468 | 0.6820 | 0.7009 |
| | B-S R | 0.7199 | 0.6431 | 0.6579 | 0.6650 | 0.6641 | 0.6218 | 0.6480 | 0.6717 |
| | B-S F | 0.7354 | 0.6614 | 0.6627 | 0.6824 | 0.6746 | 0.6337 | 0.6644 | 0.6855 |
| mT5 In-language multi-task | B-S P | 0.7494 | 0.7219 | 0.7130 | 0.7306 | 0.7274 | 0.6887 | 0.7203 | 0.7274 |
| | B-S R | 0.7190 | 0.6937 | 0.7174 | 0.7030 | 0.7140 | 0.6847 | 0.6942 | 0.7074 |
| | B-S F | 0.7334 | 0.7070 | 0.7138 | 0.7161 | 0.7197 | 0.6857 | 0.7066 | 0.7165 |
| mT5 In-language | B-S P | 0.7526 | 0.7264 | 0.7164 | 0.7374 | 0.7381 | 0.6974 | 0.4603 | 0.7321 |
| | B-S R | 0.7199 | 0.6939 | 0.7179 | 0.7073 | 0.7194 | 0.6908 | 0.5261 | 0.7092 |
| | B-S F | 0.7354 | 0.7093 | 0.7153 | 0.7216 | 0.7277 | 0.6931 | 0.4905 | 0.7196 |
| Abstractive Summarisation after mBERT Pre-Abstractive Extractive Step | | | | | | | | | |
| mT5 Cross-lingual zero-shot transfer | B-S P | 0.7202 | 0.6680 | 0.6571 | 0.6858 | 0.6757 | 0.6412 | 0.6693 | 0.6828 |
| | B-S R | 0.7004 | 0.6363 | 0.6517 | 0.6576 | 0.6586 | 0.6180 | 0.6459 | 0.6615 |
| | B-S F | 0.7098 | 0.6515 | 0.6538 | 0.6712 | 0.6666 | 0.6290 | 0.6572 | 0.6716 |
| mT5 In-language multi-task | B-S P | 0.7157 | 0.6958 | 0.6953 | 0.7069 | 0.7094 | 0.6700 | 0.7045 | 0.7022 |
| | B-S R | 0.6981 | 0.6774 | 0.7033 | 0.6891 | 0.7011 | 0.6702 | 0.6869 | 0.6910 |
| | B-S F | 0.7065 | 0.6861 | 0.6982 | 0.6976 | 0.7046 | 0.6693 | 0.6952 | 0.6960 |
| mT5 In-language | B-S P | 0.7202 | 0.7043 | 0.7020 | 0.7151 | 0.7186 | 0.6836 | 0.4495 | 0.7091 |
| | B-S R | 0.7004 | 0.6807 | 0.7069 | 0.6948 | 0.7064 | 0.6803 | 0.5213 | 0.6949 |
| | B-S F | 0.7098 | 0.6919 | 0.7026 | 0.7044 | 0.7116 | 0.6811 | 0.4822 | 0.7012 |
BERTScore (Zhang et al., 2020b) precision (B-S P), recall (B-S R), and F1 (B-S F) results of the three evaluations presented in the paper on WikinewsSum. We compare the extractive models and mT5 in the three training scenarios, with two different pre-abstractive extractive steps (Oracle and mBERT), for each language of the WikinewsSum dataset as well as for the whole dataset. Hash code for the BERTScore metric: bert-base-multilingual-cased_L9_no-idf_version=0.3.11(hug_trans=4.13.0)_fast-tokenizer
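For reference, a minimal sketch of computing BERTScore with the bert-score package, using settings consistent with the hash code reported above (bert-base-multilingual-cased, layer 9, no IDF weighting); the example texts are hypothetical.
from bert_score import score

# Hypothetical example pairs; in practice these are the generated and reference summaries.
predictions = ["The generated summary of the Wikinews article."]
references = ["The reference summary written by Wikinews editors."]

P, R, F = score(
    predictions,
    references,
    model_type="bert-base-multilingual-cased",
    num_layers=9,
    idf=False,
)
print(P.mean().item(), R.mean().item(), F.mean().item())  # B-S P, B-S R, B-S F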
Usage
The mdmls pip package allows running the combination of an extractive summarization method with an abstractive one.
pip install mdmls
It can be used as follows in Python.
from mdmls import Summarizer
summarizer = Summarizer()
summary = summarizer(LONG_TEXT_TO_SUMMARIZE)
Or directly using the CLI.
mdmls "LONG_TEXT_TO_SUMMARIZE"
Models
All the fine-tuned abstractive models are available on the HuggingFace Hub: https://huggingface.co/models?sort=downloads&search=airklizz+mt5+wikinewssum
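These checkpoints can also be loaded directly with transformers. A minimal sketch follows; the model identifier below is only an example (pick one from the Hub search above), and the generation parameters are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "airKlizz/mt5-base-wikinewssum-english"  # example identifier from the Hub search
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the input document and generate a summary (parameters are illustrative).
inputs = tokenizer(LONG_TEXT_TO_SUMMARIZE, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=512, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))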
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file mdmls-0.1.3.tar.gz.
File metadata
- Download URL: mdmls-0.1.3.tar.gz
- Upload date:
- Size: 9.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.10.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | b980466e59980135eb64efe55f699db2353fb5acefa8d3f37a6e469229c20c03 |
MD5 | 58cab18edeb982c75acd4e86f0ab885a |
BLAKE2b-256 | 2f58bf1551c07bd3e5ddd1b7d4e4efef5d0b7fbd74a3a9a79d5e4356725d0b67 |
File details
Details for the file mdmls-0.1.3-py3-none-any.whl.
File metadata
- Download URL: mdmls-0.1.3-py3-none-any.whl
- Upload date:
- Size: 7.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.10.4
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 321210d3fee8c90ccfe60fe92c481a73d1214c231e7c0e9fcf6d8be54c0a7810 |
MD5 | d880d4d1abc5f94352ab5f7ddd8a868d |
BLAKE2b-256 | 9a5618206acccec6df7a439c5d2a0af5f4aa2a213742edcb281544cc7801cafb |