LinguAligner is a Python library for aligning annotations in parallel corpora: it maps annotations in the source-language text to the corresponding annotations in the target-language text.

LinguAligner

LinguAligner is a corpus translation and alignment pipeline designed to facilitate the translation of annotated corpora across languages. It translates a corpus using machine translation and then aligns the translated annotations with the translated text. Initially developed for the automatic translation of ACE-2005 into Portuguese, LinguAligner has since been adapted into a general-purpose package for translating other corpora.

It is composed of two main components:

  • Text translation: the DeepL Translator, Google Translator and Microsoft Translator APIs are supported.
  • Annotation alignment: an annotation alignment pipeline that combines several alignment techniques to locate the translated annotations within the translated text.
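
To illustrate why the second component is needed, here is a minimal sketch with invented toy data. The sentences, the assumed machine-translation output and the `find_span` helper are illustrative assumptions, not part of LinguAligner. An annotation translated in isolation may not occur verbatim in the translated sentence, so a plain substring search fails.

```python
from typing import Optional, Tuple


def find_span(text: str, annotation: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of annotation in text, or None."""
    start = text.find(annotation)
    if start == -1:
        return None
    return start, start + len(annotation)


# Toy data, invented for illustration only.
translated_text = "Os soldados atacaram a aldeia."
# Assumed MT output for the trigger "attacked" when translated out of context.
translated_annotation = "atacou"

# An annotation that survives translation unchanged is found directly.
print(find_span(translated_text, "aldeia"))
# The inflected form in the sentence is "atacaram", so exact search finds nothing.
print(find_span(translated_text, translated_annotation))
```

Cases like the second one are what the alignment techniques listed in this document are meant to resolve.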

Annotation Alignment Modules

Our pipeline comprises five annotation alignment components:

- Lemmatization
- Multiple word translation
- Synonyms
- BERT-based word aligner
- Fuzzy Match (Gestalt Pattern Matching and Levenshtein distance)
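
As a rough illustration of the last component, the sketch below computes both similarity measures with the Python standard library. It is not LinguAligner's implementation: the function names, the word-window candidate generation and the 0.5 threshold are assumptions chosen for the example. `difflib.SequenceMatcher` is used because its matching algorithm is based on Gestalt Pattern Matching (Ratcliff/Obershelp).

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def gestalt_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_fuzzy_match(annotation: str, text: str,
                     threshold: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return the best-scoring word window in text, or None below threshold.

    Candidates are windows with as many words as the annotation
    (an assumption made for this sketch).
    """
    words = [w.strip(".,;:!?") for w in text.split()]
    n = max(1, len(annotation.split()))
    best = None
    for i in range(len(words) - n + 1):
        candidate = " ".join(words[i:i + n])
        score = gestalt_ratio(annotation.lower(), candidate.lower())
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best


# Toy data, invented for illustration only.
print(best_fuzzy_match("atacou", "Os soldados atacaram a aldeia."))
print(levenshtein("atacou", "atacaram"))
```

A ratio is convenient for thresholding because it is normalized by string length, whereas the raw Levenshtein distance grows with the length of the strings being compared.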

The pipeline operates sequentially: once an annotation has been aligned by one method, it is not passed on to the subsequent methods. In our experiments, the order listed above performed best.
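
The sequential behaviour can be sketched as a simple cascade. This is a minimal illustration, not LinguAligner's code: `run_cascade` and the two stand-in aligners (exact and case-insensitive substring search) are hypothetical placeholders for the five components listed above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]
Item = Tuple[str, str]  # (translated annotation, translated text)
Aligner = Callable[[str, str], Optional[Span]]


def exact_match(annotation: str, text: str) -> Optional[Span]:
    """Stand-in aligner: exact substring search."""
    start = text.find(annotation)
    return None if start == -1 else (start, start + len(annotation))


def case_insensitive_match(annotation: str, text: str) -> Optional[Span]:
    """Stand-in aligner: substring search ignoring case."""
    start = text.lower().find(annotation.lower())
    return None if start == -1 else (start, start + len(annotation))


def run_cascade(items: List[Item], aligners: List[Tuple[str, Aligner]]):
    """Try each aligner in order; later aligners only see unaligned items."""
    aligned: Dict[Item, Tuple[str, Span]] = {}
    pending = list(items)
    for name, aligner in aligners:
        unresolved = []
        for annotation, text in pending:
            span = aligner(annotation, text)
            if span is None:
                unresolved.append((annotation, text))
            else:
                # Record which method produced the alignment.
                aligned[(annotation, text)] = (name, span)
        pending = unresolved
    return aligned, pending


# Toy data, invented for illustration only.
text = "Os soldados atacaram a aldeia."
items = [("aldeia", text), ("Aldeia", text), ("atacou", text)]
aligned, pending = run_cascade(
    items,
    [("exact", exact_match), ("case_insensitive", case_insensitive_match)],
)
print(aligned)
print(pending)
```

Because each item leaves the pending list as soon as one method succeeds, the order of the aligners determines which method gets credit for an alignment, which is why the ordering matters.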

Usage

  1. Translate corpora. An API key is needed to use the translation APIs.

  2. Run the Annotation Alignment Pipeline

    Users can select the aligners they intend to use and must provide, for each selected alignment component, the path to the resources it needs, such as multiple translations of the annotations, previously computed lemmas, synonyms, etc.
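
Before running either step, it can help to check that the API key and the resource files are in place. The sketch below is an assumption-laden illustration, not LinguAligner's interface: the aligner names, the `TRANSLATION_API_KEY` environment variable, the example path and both helper functions are hypothetical. It also assumes that only the lemma, multiple-translation and synonym components need a resource file.

```python
import os
from pathlib import Path
from typing import Dict, List

# Hypothetical names for the components that rely on precomputed resources.
RESOURCE_BACKED = {"lemmatization", "multiple_translations", "synonyms"}


def get_api_key(env_var: str = "TRANSLATION_API_KEY") -> str:
    """Read the translation API key from an environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


def missing_resources(selected: List[str], resources: Dict[str, str]) -> List[str]:
    """Return the selected aligners whose resource file is unset or absent."""
    missing = []
    for name in selected:
        if name not in RESOURCE_BACKED:
            continue  # assumed to need no resource file
        path = resources.get(name)
        if path is None or not Path(path).is_file():
            missing.append(name)
    return missing


selected = ["lemmatization", "synonyms", "fuzzy_match"]
resources = {"lemmatization": "resources/lemmas.json"}  # hypothetical path
print(missing_resources(selected, resources))
```

Keeping the key in the environment rather than in source files avoids committing credentials alongside the corpus scripts.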

Evaluation

To measure the effectiveness of the alignment pipeline, the entire ACE-2005-PT test set, which includes 1,310 annotations (triggers and arguments), was aligned manually. The alignments were performed by an expert linguist to ensure high-quality annotations, following the same annotation guidelines as the original ACE-2005 corpus.
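
One simple way to compare automatic alignments against manual ones is exact-span accuracy, sketched below. This is an assumption for illustration: the source does not state which metric was used, and the function name, annotation identifiers and spans are invented toy values, not results from ACE-2005-PT.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]


def alignment_accuracy(predicted: Dict[str, Span], gold: Dict[str, Span]) -> float:
    """Fraction of gold annotations whose predicted span equals the gold span."""
    if not gold:
        raise ValueError("gold alignments must not be empty")
    correct = sum(1 for key, span in gold.items() if predicted.get(key) == span)
    return correct / len(gold)


# Toy data, invented for illustration only.
gold = {"trigger_1": (12, 20), "arg_1": (3, 11), "arg_2": (23, 29)}
predicted = {"trigger_1": (12, 20), "arg_1": (0, 11), "arg_2": (23, 29)}
print(alignment_accuracy(predicted, gold))
```

Annotations missing from the predictions count as errors here, so the score reflects both wrong and absent alignments.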

The evaluation results are presented in Table 1:

Results
Table 1: Evaluation Results by pipeline component

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License.

Citation

Coming soon.
