UNMASS - Unsupervised NMT with Masked Sequence-to-Sequence training

These details have not been verified by PyPI

Project links

Homepage

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.7
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

MASS

MASS is a novel pre-training method for sequence to sequence based language generation tasks. It randomly masks a sentence fragment in the encoder, and then predicts it in the decoder.

MASS can be applied on cross-lingual tasks such as neural machine translation (NMT), and monolingual tasks such as text summarization. The current codebase supports unsupervised NMT (implemented based on XLM).

Credits: the original developers/researchers:

facebookresearch/XLM
   |---microsoft/MASS
        |---<this>

Unsupervised NMT

Unsupervised Neural Machine Translation just uses monolingual data to train the models. During MASS pre-training, the source and target languages are pre-trained in one model, with the corresponding language embeddings to differentiate the languages. During MASS fine-tuning, back-translation is used to train the unsupervised models. We provide pre-trained and fine-tuned models:

Languages	Pre-trained Model	Fine-tuned Model	BPE codes	Vocabulary
EN - FR	MODEL	MODEL	BPE codes	Vocabulary
EN - DE	MODEL	MODEL	BPE codes	Vocabulary
En - RO	MODEL	MODEL	BPE_codes	Vocabulary

We are also preparing larger models on more language pairs, and will release them in the future.

Dependencies

Currently we implement MASS for unsupervised NMT based on the codebase of XLM. The depencies are as follows:

Python 3
NumPy
PyTorch (version 0.4 and 1.0)
fastBPE (for BPE codes)
Moses (for tokenization)
Apex (for fp16 training)

Data Ready

We use the same BPE codes and vocabulary with XLM. Here we take English-French as an example.

cd MASS

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr

Pre-training:

python train.py                                      \
--exp_name unsupMT_enfr                              \
--data_path ./data/processed/en-fr/                  \
--lgs 'en-fr'                                        \
--mass_steps 'en,fr'                                 \
--encoder_only false                                 \
--emb_dim 1024                                       \
--n_layers 6                                         \
--n_heads 8                                          \
--dropout 0.1                                        \
--attention_dropout 0.1                              \
--gelu_activation true                               \
--tokens_per_batch 3000                              \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000                                  \
--max_epoch 100                                      \
--eval_bleu true                                     \
--word_mass 0.5                                      \
--min_len 5                                          \

During the pre-training prcess, even without any back-translation, you can observe the model can achieve some intial BLEU scores:

epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu ->  7.81
test_fr-en_mt_bleu  -> 11.72
test_en-fr_mt_bleu  ->  8.80

Distributed Training

To use multiple GPUs e.g. 3 GPUs on same node

export NGPU=3; CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=$NGPU train.py [...args]

To use multiple GPUS across many nodes, use Slurm to request multi-node job and launch the above command. The code automatically detects the SLURM_* environment vars to distribute the training.

Fine-tuning

After pre-training, we use back-translation to fine-tune the pre-trained model on unsupervised machine translation:

MODEL=mass_enfr_1024.pth

python train.py \
  --exp_name unsupMT_enfr                              \
  --data_path ./data/processed/en-fr/                  \
  --lgs 'en-fr'                                        \
  --bt_steps 'en-fr-en,fr-en-fr'                       \
  --encoder_only false                                 \
  --emb_dim 1024                                       \
  --n_layers 6                                         \
  --n_heads 8                                          \
  --dropout 0.1                                        \
  --attention_dropout 0.1                              \
  --gelu_activation true                               \
  --tokens_per_batch 2000                              \
  --batch_size 32	                                     \
  --bptt 256                                           \
  --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
  --epoch_size 200000                                  \
  --max_epoch 30                                       \
  --eval_bleu true                                     \
  --reload_model "$MODEL,$MODEL"                       \

We also provide a demo to use MASS pre-trained model on the WMT16 en-ro bilingual dataset. We provide pre-trained and fine-tuned models:

Model	Ro-En BLEU (with BT)
Baseline	34.0
XLM	38.5
MASS	39.1

Download dataset by the below command:

wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro

./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro

After download the mass pre-trained model from the above link. And use the following command to fine tune:

MODEL=mass_enro_1024.pth

python train.py \
	--exp_name unsupMT_enro                              \
	--data_path ./data/processed/en-ro                   \
	--lgs 'en-ro'                                        \
	--bt_steps 'en-ro-en,ro-en-ro'                       \
	--encoder_only false                                 \
	--mt_steps 'en-ro,ro-en'                             \
	--emb_dim 1024                                       \
	--n_layers 6                                         \
	--n_heads 8                                          \
	--dropout 0.1                                        \
	--attention_dropout 0.1                              \
	--gelu_activation true                               \
	--tokens_per_batch 2000                              \
	--batch_size 32                                      \
	--bptt 256                                           \
	--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
	--epoch_size 200000                                  \
	--max_epoch 50                                       \
	--eval_bleu true                                     \
	--reload_model "$MODEL,$MODEL"

Training Details

MASS-base-uncased uses 32x NVIDIA 32GB V100 GPUs and trains on (Wikipekia + BookCorpus, 16GB) for 20 epochs (float32), batch size is simulated as 4096.

Reference

If you find MASS useful in your work, you can cite the paper as below:

@inproceedings{song2019mass,
    title={MASS: Masked Sequence to Sequence Pre-training for Language Generation},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    booktitle={International Conference on Machine Learning},
    pages={5926--5936},
    year={2019}
}

Related Works

MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. GitHub: https://github.com/microsoft/MPNet

Project details

These details have not been verified by PyPI

Project links

Homepage

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.7
Topic
- Scientific/Engineering :: Artificial Intelligence

Release history Release notifications | RSS feed

This version

0.1.0

Aug 25, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unmass-0.1.0.tar.gz (54.5 kB view details)

Uploaded Aug 25, 2020 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

unmass-0.1.0-py3-none-any.whl (67.8 kB view details)

Uploaded Aug 25, 2020 Python 3

File details

Details for the file unmass-0.1.0.tar.gz.

File metadata

Download URL: unmass-0.1.0.tar.gz
Upload date: Aug 25, 2020
Size: 54.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3.post20200330 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.7

File hashes

Hashes for unmass-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d1aa3b436e275f98f6de427bfa5c66e203cfa5acbe1b8c581fb8b0eb7e922bbe`
MD5	`6f958881590b332fc7bc7643e9540843`
BLAKE2b-256	`c80e06607790252232db1bdafe0e6a5d586bbe773b6a6a86f9be72699f2a32f6`

See more details on using hashes here.

File details

Details for the file unmass-0.1.0-py3-none-any.whl.

File metadata

Download URL: unmass-0.1.0-py3-none-any.whl
Upload date: Aug 25, 2020
Size: 67.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3.post20200330 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.7

File hashes

Hashes for unmass-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aba11d51260a75118d3f5b1c68d32bf4c59968ae8f58c41cfe69642d02cd82de`
MD5	`89b0649a6a791f80c6d1ae88bd3823b0`
BLAKE2b-256	`9d8b00b55116bb99ef85184c9f0ab882f424097c9de89bb5681621a28f194444`

See more details on using hashes here.

unmass 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

MASS

Unsupervised NMT

Dependencies

Data Ready

Pre-training:

Distributed Training

Fine-tuning

Training Details

Other questions

Reference

Related Works

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes