Word Alignments using Pretrained Language Models

SimAlign: Similarity Based Word Aligner

SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and does not require parallel training data.

The following table shows how it compares to popular statistical alignment models:

Method	ENG-CES	ENG-DEU	ENG-FAS	ENG-FRA	ENG-HIN	ENG-RON
fast-align	.78	.71	.46	.84	.38	.68
eflomal	.85	.77	.63	.93	.52	.72
mBERT-Argmax	.87	.81	.67	.94	.55	.65

The scores shown are F1, taking the maximum across the subword and word levels. For more details, see the paper.

Installation and Usage

Tested with Python 3.7, Transformers 3.1.0, and Torch 1.5.0. NetworkX 2.4 is optional (only required for the Match algorithm). For the full list of dependencies, see setup.py. For installing transformers, see their repo.

Download the repo to use it directly, or install from PyPI:

pip install simalign

or install directly from GitHub with pip:

pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign

An example of using our code:

from simalign import SentenceAligner

# Create an aligner instance.
# The embedding model and all alignment settings are specified in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# The source and target sentences should be tokenized to words.
src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]

# The output is a dictionary keyed by matching method.
# Each value is a list of (source, target) index pairs for the aligned words (zero-indexed).
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])

# Expected output:
# mwmf (Match): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# inter (ArgMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# itermax (IterMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

For more examples of how to use our code see scripts/align_example.py.
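
The returned pairs are often easier to store or compare as a single line of text. The following is a minimal sketch, not part of SimAlign itself: the helper name to_pharaoh is hypothetical, and it assumes the "source-target" index notation with a '-' between indices. The dictionary below is a hand-written stand-in with the same shape as the expected output above, so the snippet runs without loading a model.

```python
def to_pharaoh(pairs):
    """Format (source_index, target_index) pairs as space-separated 'i-j' tokens."""
    return " ".join(f"{src}-{trg}" for src, trg in sorted(pairs))


# Hand-written stand-in with the same shape as the output of get_word_aligns.
alignments = {
    "mwmf": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
    "inter": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
    "itermax": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
}

# One text line per matching method.
lines = {method: to_pharaoh(pairs) for method, pairs in alignments.items()}

for method, line in lines.items():
    print(method, ":", line)
```

Sorting the pairs keeps the text stable regardless of the order in which a matching method emits its edges.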

Demo

An online demo is available here.

Gold Standards

Links to the gold standards used in the paper are here:

Language Pair	Citation	Type	Link
ENG-CES	Marecek et al. 2008	Gold Alignment	http://ufal.mff.cuni.cz/czech-english-manual-word-alignment
ENG-DEU	EuroParl-based	Gold Alignment	www-i6.informatik.rwth-aachen.de/goldAlignment/
ENG-FAS	Tavakoli et al. 2014	Gold Alignment	http://eceold.ut.ac.ir/en/node/940
ENG-FRA	WPT2003, Och et al. 2000	Gold Alignment	http://web.eecs.umich.edu/~mihalcea/wpt/
ENG-HIN	WPT2005	Gold Alignment	http://web.eecs.umich.edu/~mihalcea/wpt05/
ENG-RON	WPT2005, Mihalcea et al. 2003	Gold Alignment	http://web.eecs.umich.edu/~mihalcea/wpt05/

Evaluation Script

For evaluating the output alignments use scripts/calc_align_score.py.

The gold alignment file should have the same format as SimAlign outputs. Sure alignment edges in the gold standard have a '-' between the source and the target indices and the possible edges have a 'p' between indices. For sample parallel sentences and their gold alignments from ENG-DEU, see samples.
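
To make the sure/possible notation concrete, here is a minimal sketch of how one gold line could be parsed and scored. It is not scripts/calc_align_score.py: the function names are hypothetical, each line is assumed to contain only whitespace-separated edges, sure edges are assumed to count as possible edges too, and the metrics follow the commonly used definitions (precision against possible edges, recall against sure edges).

```python
def parse_gold(line):
    """Split one gold line into (sure, possible) sets of (src, trg) index pairs."""
    sure, possible = set(), set()
    for token in line.split():
        if "p" in token:
            # Possible edge, e.g. "2p2".
            src, trg = token.split("p")
            possible.add((int(src), int(trg)))
        else:
            # Sure edge, e.g. "0-0".
            src, trg = token.split("-")
            sure.add((int(src), int(trg)))
    # Assumption: every sure edge is also a possible edge.
    return sure, possible | sure


def score(predicted, sure, possible):
    """Return (precision, recall, F1) for one sentence pair."""
    predicted = set(predicted)
    if not predicted or not sure:
        return 0.0, 0.0, 0.0
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up gold line and prediction, for illustration only.
sure, possible = parse_gold("0-0 1-1 2p2 3-3")
predicted = [(0, 0), (1, 1), (2, 2), (3, 4)]
precision, recall, f1 = score(predicted, sure, possible)
print(precision, recall, f1)
```

In this toy case the predicted edge (2, 2) matches a possible edge, so it counts toward precision without being required for recall, while (3, 4) matches nothing and lowers precision.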

Publication

If you use the code, please cite:

@inproceedings{jalili-sabet-etal-2020-simalign,
    title = "{S}im{A}lign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings",
    author = {Jalili Sabet, Masoud  and
      Dufter, Philipp  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.147",
    pages = "1627--1643",
}

Feedback

Feedback and contributions are more than welcome! Just reach out to @masoudjs or @pdufter.

FAQ

Do I need parallel data to train the system?

No, no parallel training data is required.

Which languages can be aligned?

This depends on the underlying pretrained multilingual language model used. For example, if mBERT is used, it covers 104 languages as listed here.

Do I need GPUs for running this?

Each alignment simply requires a single forward pass in the pretrained language model. While this is certainly faster on GPU, it runs fine on CPU. On one GPU (GeForce GTX 1080 Ti) it takes around 15-20 seconds to align 500 parallel sentences.

License

A full copy of the license can be found in LICENSE.

Release history

Version	Release date
0.4 (this version)	Nov 7, 2023
0.3	Sep 16, 2021

Download files

Source Distribution

simalign-0.4.tar.gz (7.6 kB)

Uploaded Nov 7, 2023 Source

Built Distribution

simalign-0.4-py3-none-any.whl (8.1 kB)

Uploaded Nov 7, 2023 Python 3

File details

Details for the file simalign-0.4.tar.gz.

File metadata

Download URL: simalign-0.4.tar.gz
Upload date: Nov 7, 2023
Size: 7.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.7.0

File hashes

Hashes for simalign-0.4.tar.gz
Algorithm	Hash digest
SHA256	`7010699c15987a102860df136ca6b2fd761401b06e8df6138e2033f560bb4039`
MD5	`d252c083453a3584cf5708869b2df130`
BLAKE2b-256	`c423160b158e2b70ded0a91afd5d75bd8e431a8b4a6512a817946b7501640518`

File details

Details for the file simalign-0.4-py3-none-any.whl.

File metadata

Download URL: simalign-0.4-py3-none-any.whl
Upload date: Nov 7, 2023
Size: 8.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.7.0

File hashes

Hashes for simalign-0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e28942b861d99416788c773f0694257aec99b6e976bc96b61ea474d51495d99e`
MD5	`3b86fd03a797daf666924d28947224db`
BLAKE2b-256	`0676b122f58e411c79c3ec335e9c606ab57142506b704feb276697754b90f226`
