
An awesome word alignment tool

Project description

AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

awesome-align is a tool for extracting word alignments from multilingual BERT (mBERT); an online demo is available. It also allows you to fine-tune mBERT on parallel corpora for better alignment quality (see our paper for more details).

Dependencies

First, you need to install the dependencies:

pip install -r requirements.txt
python setup.py install

Input format

Inputs should be tokenized, and each line should contain a source-language sentence and its target-language translation separated by the ||| delimiter. You can find sample files in the examples folder.
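As an illustration, a line in this format can be split into source and target token lists like so (the sentence pair below is invented, not taken from the examples folder):

```python
# One line of input: tokenized source and target separated by " ||| ".
# The sentence pair is a made-up example, not from the examples folder.
line = "das ist ein Test . ||| this is a test ."

src, tgt = line.split(" ||| ")
src_tokens = src.split()  # ['das', 'ist', 'ein', 'Test', '.']
tgt_tokens = tgt.split()  # ['this', 'is', 'a', 'test', '.']
```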

Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:

DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32

This produces outputs in the i-j Pharaoh format: a pair i-j indicates that the i-th word (zero-indexed) of the source sentence is aligned to the j-th word of the target sentence.
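For illustration, a minimal sketch of parsing one line of this output into index pairs (the helper name is hypothetical, not part of awesome-align itself):

```python
# Parse a line of Pharaoh-format output ("i-j" pairs, zero-indexed)
# into (source_index, target_index) tuples.
def parse_pharaoh(line):
    return [tuple(map(int, pair.split("-"))) for pair in line.split()]

parse_pharaoh("0-0 1-1 2-3")  # [(0, 0), (1, 1), (2, 3)]
```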

You can set --output_prob_file if you want to obtain the alignment probabilities, and --output_word_file if you want to obtain the aligned word pairs (in the src_word<sep>tgt_word format). You can also set --cache_dir to specify where multilingual BERT should be cached.

You can also set MODEL_NAME_OR_PATH to the path of your fine-tuned model as shown below.

Fine-tuning on parallel data

If there is parallel data available, you can fine-tune embedding models on that data.

Here is an example of fine-tuning mBERT that balances well between efficiency and effectiveness:

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_tlm \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 4000 \
    --max_steps 20000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

You can also fine-tune the model a bit longer with more training objectives for better quality:

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

If you want higher alignment recall, you can turn on the --train_co option, but note that alignment precision may drop. You can also set --cache_dir to specify where multilingual BERT should be cached.

Supervised settings

In supervised settings where gold word alignments are available for your training data, you can incorporate the supervised signals into our self-training objective (--train_so). Here is an example command:

TRAIN_FILE=/path/to/train/file
TRAIN_GOLD_FILE=/path/to/train/gold/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --train_gold_file=$TRAIN_GOLD_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 5 \
    --learning_rate 1e-4 \
    --save_steps 200

See examples/*.gold for the expected format of the gold alignments. Turn on the --gold_one_index option if the gold alignments are 1-indexed, and turn on the --ignore_possible_alignments option if you want to ignore possible alignments.
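If your gold alignments are 1-indexed and you prefer to normalize them yourself instead of using --gold_one_index, a minimal sketch (assuming each pair is a plain "i-j" token; check examples/*.gold for the authoritative format):

```python
# Convert 1-indexed "i-j" alignment pairs to the 0-indexed convention.
# Hypothetical helper, not part of awesome-align itself.
def to_zero_indexed(line):
    pairs = (pair.split("-") for pair in line.split())
    return " ".join(f"{int(i) - 1}-{int(j) - 1}" for i, j in pairs)

to_zero_indexed("1-1 2-3")  # "0-0 1-2"
```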

Model performance

The following table shows the alignment error rates (AERs) of our models and popular statistical word aligners on five language pairs. The De-En, Fr-En, Ro-En datasets can be obtained following this repo, the Ja-En data is from this link and the Zh-En data is available at this link. The best scores are in bold.
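For reference, AER compares a predicted alignment set A against sure gold links S and possible gold links P (with S a subset of P): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). A minimal sketch with toy sets (this is not the tool's own evaluation code):

```python
# Alignment Error Rate: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|),
# where A = predicted links, S = sure gold links, P ⊇ S = possible
# gold links. Lower is better; 0.0 means a perfect score here.
def aer(predicted, sure, possible):
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # enforce S ⊆ P
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

aer({(0, 0), (1, 1)}, {(0, 0)}, {(0, 0), (1, 1)})  # 0.0
```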

                                            De-En     Fr-En     Ro-En     Ja-En     Zh-En
fast_align                                   27.0      10.5      32.1      51.1      38.1
eflomal                                      22.6       8.2      25.1      47.5      28.7
Mgiza                                        20.6       5.9      26.4      48.0      35.1
Ours (w/o fine-tuning, softmax)              17.4       5.6      27.9      45.6      18.1
Ours (multilingually fine-tuned
  w/o --train_co, softmax) [Download]        15.2     **4.1**    22.6    **37.4**  **13.4**
Ours (multilingually fine-tuned
  w/ --train_co, softmax) [Download]       **15.1**     4.5    **20.7**    38.4      14.5

Citation

If you use our tool, we'd appreciate it if you cited the following paper:

@inproceedings{dou2021word,
  title={Word Alignment by Fine-tuning Embeddings on Parallel Corpora},
  author={Dou, Zi-Yi and Neubig, Graham},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2021}
}

Acknowledgements

Some of the code is borrowed from HuggingFace Transformers, licensed under Apache 2.0, and the entmax implementation is from this repo.

Download files

Download the file for your platform.

Source Distribution

awesome_align-0.1.7.tar.gz (81.8 kB)

Uploaded Source

Built Distribution

awesome_align-0.1.7-py3-none-any.whl (87.3 kB)

Uploaded Python 3

File details

Details for the file awesome_align-0.1.7.tar.gz.

File metadata

  • Download URL: awesome_align-0.1.7.tar.gz
  • Size: 81.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.3

File hashes

Hashes for awesome_align-0.1.7.tar.gz:

  SHA256       efa5f450ef9ab9db7437f01f46c2da8ab226cb55ecbacbe579db5ec4e8d1b87f
  MD5          d6637ea06c649ef197fe9327e66ef140
  BLAKE2b-256  4fb0ef1f2a92a4a67b261a82aae887044b592cea72217fd5def45bc36ba484f0


File details

Details for the file awesome_align-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: awesome_align-0.1.7-py3-none-any.whl
  • Size: 87.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0.post20200714 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.3

File hashes

Hashes for awesome_align-0.1.7-py3-none-any.whl:

  SHA256       3cbcfd9798fc6b0e53c74ff198c28a1496a71e5503406313c12d4f5fe4595591
  MD5          a9d61c3d8f33e68d0878ffe0c11ddcee
  BLAKE2b-256  92d50c12ed0591a524c57559513b1b7b7eb2e9aeadea6f2188c826776b2af4ac

