LinguAligner is a Python library for aligning annotations in parallel corpora: given an annotation in the source-language text, it locates the corresponding annotation in the target-language text.

LinguAligner

LinguAligner is a corpus translation and alignment pipeline designed to facilitate the translation of annotated corpora across languages. It translates corpora using machine translation and then aligns the translated annotations within the translated text. Initially developed for the automatic translation of ACE-2005 into Portuguese, LinguAligner has since been adapted into a general-purpose package for translating other corpora.

It has two main components:

- Text translation: the DeepL Translator, Google Translator, and Microsoft Translator APIs are supported.
- Annotation alignment: a pipeline that combines several alignment techniques to locate the translated annotations within the translated text.

Annotation Alignment Modules

Our pipeline is composed of a total of five annotation alignment components:

- Lemmatization
- Multiple word translation
- BERT-based word aligner
- Gestalt Pattern Matching
- Levenshtein distance

The pipeline operates sequentially: once an annotation has been aligned by one method, it is not passed on to the subsequent methods. In our experiments, the order listed above gave the best results.
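The sequential fallback described above can be illustrated with a minimal, self-contained sketch. It is not LinguAligner's implementation: the stage functions, the thresholds (0.7 and 2), and the token-level matching are assumptions chosen for illustration, and only exact matching and the two character-based techniques are modelled, using the standard library.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

Stage = Callable[[str, str], Optional[str]]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def exact_stage(sentence: str, annotation: str) -> Optional[str]:
    """Succeed only if the annotation appears as a whole token."""
    return annotation if annotation in sentence.split() else None


def gestalt_stage(sentence: str, annotation: str, threshold: float = 0.7) -> Optional[str]:
    """Pick the most similar token by difflib's gestalt-style ratio (threshold is an assumption)."""
    def ratio(token: str) -> float:
        return SequenceMatcher(None, annotation, token).ratio()
    best = max(sentence.split(), key=ratio)
    return best if ratio(best) >= threshold else None


def levenshtein_stage(sentence: str, annotation: str, max_distance: int = 2) -> Optional[str]:
    """Pick the token with the smallest edit distance (max_distance is an assumption)."""
    best = min(sentence.split(), key=lambda token: levenshtein(annotation, token))
    return best if levenshtein(annotation, best) <= max_distance else None


def run_pipeline(sentence: str, annotation: str,
                 stages: List[Tuple[str, Stage]]) -> Optional[Tuple[str, str]]:
    """Try each stage in order and stop at the first one that aligns the annotation."""
    for name, stage in stages:
        match = stage(sentence, annotation)
        if match is not None:
            return name, match
    return None


STAGES = [("exact", exact_stage), ("gestalt", gestalt_stage), ("levenshtein", levenshtein_stage)]
sentence = "Os soldados receberam ordens para disparar as suas armas"

print(run_pipeline(sentence, "disparar", STAGES))  # ('exact', 'disparar')
print(run_pipeline(sentence, "disparou", STAGES))  # ('gestalt', 'disparar')
print(run_pipeline(sentence, "xyz", STAGES))       # None
```

Because the loop returns at the first successful stage, the order of the list determines which technique gets the first chance at each annotation.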

Usage

Translate Corpora

You can translate your corpus with one of the supported translation APIs or with an external tool. Some of the translation APIs require an API key.

from LinguAligner import translation

# Google Translator
translator = translation.GoogleTranslator(source_lang="en", target_lang="pt")
translated_text = translator.translate("The soldiers were ordered to fire their weapons")

# DeepL Translator
translator = translation.DeepLTranslator(source_lang="en", target_lang="pt", key="DEEPL_KEY")
translated_text = translator.translate("The soldiers were ordered to fire their weapons")

# Microsoft Translator
translator = translation.MicrosoftTranslator(source_lang="en", target_lang="pt", key="MICROSOFT_TRANSLATOR_KEY")
translated_text = translator.translate("The soldiers were ordered to fire their weapons")
print(translated_text)
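When translating a whole corpus rather than a single sentence, it can help to avoid sending the same sentence to the API twice. The sketch below is a hypothetical helper, not part of LinguAligner: translate_corpus and UppercaseTranslator are names invented for illustration, and the only assumption about the translator object is that it exposes the translate(text) method used in the examples above.

```python
from typing import Dict, Iterable, List


class UppercaseTranslator:
    """Offline stand-in for a real translator (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.calls = 0

    def translate(self, text: str) -> str:
        self.calls += 1
        return text.upper()


def translate_corpus(translator, sentences: Iterable[str]) -> List[str]:
    """Translate each distinct sentence once and reuse the result for duplicates."""
    cache: Dict[str, str] = {}
    translated = []
    for sentence in sentences:
        if sentence not in cache:
            cache[sentence] = translator.translate(sentence)
        translated.append(cache[sentence])
    return translated


corpus = ["The soldiers fired.", "The war ended.", "The soldiers fired."]
stub = UppercaseTranslator()
result = translate_corpus(stub, corpus)
print(result)
print(stub.calls)  # 2: the duplicate sentence is translated only once
```

Any of the translator objects created above could be passed in place of the stand-in, provided they behave as assumed.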

Run the Annotation Alignment Pipeline

Users can select which aligners to use and the order in which they are applied. To find the best component order, we evaluated every permutation of the components against a manually aligned corpus. The best-performing order is the one shown in the example below; however, we encourage you to experiment with different orders for your specific use case.
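A search over component orders like the one described above can be sketched with itertools.permutations. The scoring function here is a placeholder with no empirical meaning; in a real experiment it would run the pipeline with the given order and return its accuracy on a manually aligned corpus. The names rank_orders and placeholder_score are assumptions made for this sketch, and the component names follow the method list in the example below.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

# Component names as given in the method list of this section.
COMPONENTS = ["lemma", "M_Trans", "word_aligner", "gestalt", "levenshtein"]


def rank_orders(score: Callable[[Sequence[str]], float]) -> List[Tuple[float, Tuple[str, ...]]]:
    """Score every ordering of the components and sort them best-first."""
    scored = [(score(order), order) for order in permutations(COMPONENTS)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def placeholder_score(order: Sequence[str]) -> float:
    """Placeholder only: counts positions that agree with COMPONENTS.

    Replace with a function that evaluates the pipeline on annotated data.
    """
    return float(sum(a == b for a, b in zip(order, COMPONENTS)))


ranked = rank_orders(placeholder_score)
print(len(ranked))   # 120 orderings of five components
print(ranked[0][1])  # the top-ranked order under the placeholder score
```

With five components there are 5! = 120 orderings, so an exhaustive search is cheap as long as each evaluation is fast.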

Certain alignment methods, such as multiple translations (M_Trans), require multiple translations of each annotation to be computed beforehand (as explained in the Note at the end of this section).

from LinguAligner import AlignmentPipeline

"""
(By default, the first method used is string matching. If unsuccessful, the alignment pipeline is employed.)
Methods:
- lemma: Lemmatization
- M_Trans: Multiple Translations of a word
- word_aligner: mBERT-based word aligner
- gestalt: Gestalt pattern matching (character-based)
- levenshtein: Levenshtein distance (character-based)
"""

config= {
    "pipeline": ["lemma", "M_Trans", "word_aligner", "gestalt", "levenshtein"],  # can be changed according to the desired pipeline
    "spacy_model": "pt_core_news_lg", # change according to the target language
    "WAligner_model": "bert-base-multilingual-uncased", # needed for word_aligner
}

aligner = AlignmentPipeline(config)

src_sentence = "The soldiers were ordered to fire their weapons."
src_annotation = "fire"
translated_sentence = "Os soldados receberam ordens para disparar as suas armas."
translated_annotation = "incêndio"

target_annotation = aligner.align_annotation(src_sentence, src_annotation, translated_sentence, translated_annotation)
print(target_annotation)

>>> ('disparar', (34, 41))

In the example above, the word 'fire' is annotated in the source sentence 'The soldiers were ordered to fire their weapons.' Translated in isolation, 'fire' becomes 'incêndio' (the noun), but within the translated sentence 'Os soldados receberam ordens para disparar as suas armas.' it is rendered as 'disparar' (the verb). The pipeline resolves this mismatch and returns 'disparar' together with its character offsets in the translated sentence.
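The mismatch can be reproduced with plain string matching. The sketch below is illustrative only: find_span is a hypothetical helper, and the inclusive end offset is an assumption inferred from the output shown above.

```python
from typing import Optional, Tuple


def find_span(sentence: str, annotation: str) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return the annotation with its (start, inclusive end) offsets, or None if absent."""
    start = sentence.find(annotation)
    if start == -1:
        return None
    return annotation, (start, start + len(annotation) - 1)


translated_sentence = "Os soldados receberam ordens para disparar as suas armas."

# The isolated translation does not occur in the sentence, so string matching fails.
print(find_span(translated_sentence, "incêndio"))  # None

# The contextual translation does occur, at the offsets reported by the pipeline.
print(find_span(translated_sentence, "disparar"))  # ('disparar', (34, 41))
```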

*The spaCy model for the target language must be installed beforehand, for example with: python -m spacy download pt_core_news_lg

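Specify the source annotation start index to find the closest target annotation

When the translated annotation occurs more than once in the translated sentence, pass the start index of the source annotation (src_ann_start) so that the closest occurrence is selected.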

src_sentence = "he was a good man because he had a kind heart"
src_annotation = "he"
translated_sentence = "ele era um bom homem porque ele tinha um bom coração" # there are multiple tokens "ele" (he)
translated_annotation = "ele"

# Add the src_ann_start argument: the second "he" starts at character 26 of src_sentence
target_annotation = aligner.align_annotation(src_sentence, src_annotation, translated_sentence, translated_annotation, src_ann_start=26)
print(target_annotation)

>>> ('ele', (28, 30))
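One plausible way to pick among repeated occurrences is to choose the one whose start offset is nearest to the source annotation's start offset. The sketch below assumes this criterion and inclusive end offsets; closest_occurrence is a hypothetical helper, not the library's function, and the library's actual selection rule may differ.

```python
import re
from typing import Optional, Tuple


def closest_occurrence(sentence: str, annotation: str,
                       src_start: int) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return the whole-word occurrence whose start is nearest to src_start."""
    pattern = r"\b" + re.escape(annotation) + r"\b"
    starts = [match.start() for match in re.finditer(pattern, sentence)]
    if not starts:
        return None
    start = min(starts, key=lambda s: abs(s - src_start))
    return annotation, (start, start + len(annotation) - 1)


src_sentence = "he was a good man because he had a kind heart"
translated_sentence = "ele era um bom homem porque ele tinha um bom coração"

# Start offset of the second "he" in the source sentence.
second_he = src_sentence.index("he", 1)
print(second_he)                                                   # 26
print(closest_occurrence(translated_sentence, "ele", second_he))   # ('ele', (28, 30))
print(closest_occurrence(translated_sentence, "ele", 0))           # ('ele', (0, 2))
```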

Note

To use the M_Trans method, multiple translations of the annotations must be computed beforehand and passed as an argument to the align_annotation function. These translations should be contained in a Python dictionary in which each source annotation is a key and the corresponding value is a list of alternative translations. You can generate this dictionary with the following code (a Microsoft Translator API key is required):

from LinguAligner import translation
# Change the language codes according to the desired languages
translator = translation.MicrosoftTranslator(source_lang="en", target_lang="pt", auth_key="MICROSOFT_TRANSLATOR_KEY")
lookupTable = {}
annotations_list = ["war", "land", "fire"]
for word in annotations_list:
    # Each value is a list of alternative translations of the source annotation
    lookupTable[word] = translator.getMultipleTranslations(word)

# Then, pass the lookupTable to the align_annotation method
target_annotation = aligner.align_annotation("The soldiers were ordered to fire their weapons", "fire", "Os soldados receberam ordens para disparar as suas armas", "incêndio", lookupTable)

The lookup table should resemble the following example:

{
    'fire': 
        [
            'fogo',
            'incêndio',
            'demitir',
            'despedir',
            'fogueira',
            'disparar',
            'chamas',
            'dispare',
            'lareira',
            'atirar',
            'atire'
        ],
    ...
}
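To show why such a table helps, the sketch below searches the translated sentence for each alternative translation in turn and returns the first one found. This is a simplified illustration under stated assumptions, not the library's M_Trans implementation: align_with_lookup is a hypothetical helper, whole-word matching and inclusive end offsets are assumed.

```python
import re
from typing import Dict, List, Optional, Tuple

lookup_table: Dict[str, List[str]] = {
    "fire": ["fogo", "incêndio", "demitir", "despedir", "fogueira", "disparar",
             "chamas", "dispare", "lareira", "atirar", "atire"],
}


def align_with_lookup(translated_sentence: str, src_annotation: str,
                      table: Dict[str, List[str]]) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return the first alternative translation that occurs as a whole word."""
    for candidate in table.get(src_annotation, []):
        match = re.search(r"\b" + re.escape(candidate) + r"\b", translated_sentence)
        if match:
            return candidate, (match.start(), match.end() - 1)
    return None


sentence = "Os soldados receberam ordens para disparar as suas armas"
print(align_with_lookup(sentence, "fire", lookup_table))  # ('disparar', (34, 41))
print(align_with_lookup(sentence, "war", lookup_table))   # None: no entry for 'war'
```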

Evaluation

To measure the effectiveness of the alignment pipeline, we tested it on the ACE-2005 corpus. The entire ACE-2005-PT test set, which contains 1,310 annotations, was aligned manually by an expert linguist following the annotation guidelines of the original ACE-2005 corpus to ensure high-quality annotations. We then compared these manual alignments against the ones generated by our pipeline.

The evaluation results are presented in Table 1:

Table 1: Evaluation results by pipeline component
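A comparison between manual and automatic alignments, as described in this section, could be computed along the following lines. The metric (exact match of text and offsets) and the helper name alignment_accuracy are assumptions, since the metric is not specified here, and the sample data is made up purely to exercise the function; it does not come from ACE-2005-PT.

```python
from typing import List, Tuple

Alignment = Tuple[str, Tuple[int, int]]


def alignment_accuracy(gold: List[Alignment], predicted: List[Alignment]) -> float:
    """Fraction of annotations whose predicted text and offsets equal the manual ones."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up sample data for illustration only.
gold = [("disparar", (34, 41)), ("ele", (28, 30)), ("guerra", (3, 8))]
predicted = [("disparar", (34, 41)), ("ele", (28, 30)), ("guerra", (10, 15))]

print(round(alignment_accuracy(gold, predicted), 3))  # 0.667
```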

License

This project is licensed under the MIT License.

Citation

Coming Soon.

Project details

Version 0.31, released May 15, 2024.