ACE-2005 Translation and Annotation Alignment Pipeline

This pipeline translates the ACE 2005 corpus into Portuguese and aligns the translated annotations with the corresponding text.

Overview

This repository contains a Python pipeline that machine-translates the ACE 2005 dataset into Portuguese and then aligns the translated annotations with the translated text. The pipeline was developed for Portuguese but can be adapted to other languages with little effort.

It is composed of two main components:

  • Text translation: We used DeepL Translator and Google Translate to translate the ACE-2005 texts and annotations into European and Brazilian Portuguese.
  • Annotation alignment: We developed an annotation alignment pipeline that locates each translated annotation within the translated text.
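
To illustrate why the alignment component is needed: when a sentence and its annotation are translated separately, the translated annotation is not guaranteed to appear verbatim in the translated sentence. The sketch below shows the simplest possible check, a case-insensitive substring lookup. The function name `locate_annotation` and the example strings are illustrative assumptions and are not taken from the package.

```python
from typing import Optional, Tuple


def locate_annotation(sentence: str, annotation: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of `annotation` in `sentence`, or None.

    Hypothetical helper: a plain case-insensitive substring search.
    """
    start = sentence.lower().find(annotation.lower())
    if start == -1:
        return None
    return start, start + len(annotation)


# Illustrative strings only (not from the ACE-2005 corpus).
sentence_pt = "O exército atacou a cidade ontem."

span = locate_annotation(sentence_pt, "atacou")
if span is not None:
    print(sentence_pt[span[0]:span[1]])  # the annotation is present verbatim

# A translation that differs in form ("ataque" vs. "atacou") is not found,
# which is the kind of case the alignment components are meant to handle.
print(locate_annotation(sentence_pt, "ataque"))  # None
```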

Prerequisites

  1. Prepare ACE 2005 dataset

    Download: https://catalog.ldc.upenn.edu/LDC2006T06 (note that the ACE 2005 dataset is not free).

  2. ACE-2005 Pre-processing

    We adopted a commonly used ACE-2005 pre-processing procedure, which can be found in this repository.

  3. Install the packages. Optionally, create a Python virtual environment first:

    python3 -m venv myenv
    # Windows
    myenv\Scripts\activate
    # Linux/macOS
    source myenv/bin/activate
    

    Install the Python requirements:

    pip install -r ./src/requirements.txt
    

Annotation Alignment Modules

Our pipeline is composed of a total of five annotation alignment components:

- Lemmatization
- Multiple word translation
- Synonyms
- BERT-based word aligner
- Fuzzy Match (Gestalt Pattern Matching and Levenshtein distance)
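
The two similarity measures named in the last item of the list can be sketched with the Python standard library alone. This is a minimal illustration, not the package's implementation: `difflib.SequenceMatcher` stands in for Gestalt Pattern Matching, the Levenshtein distance is a textbook dynamic-programming version, and the helper `best_fuzzy_token` with its 0.6 threshold is a hypothetical choice made for this example.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def gestalt_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def best_fuzzy_token(annotation: str, tokens: List[str],
                     threshold: float = 0.6) -> Optional[Tuple[int, str]]:
    """Return (index, token) of the most similar token, or None if below threshold.

    Hypothetical helper; the threshold value is an arbitrary assumption.
    """
    scored = [(gestalt_ratio(annotation.lower(), tok.lower()), i, tok)
              for i, tok in enumerate(tokens)]
    if not scored:
        return None
    score, index, token = max(scored)
    return (index, token) if score >= threshold else None


# Illustrative sentence only (not from the corpus).
tokens = "O exército atacou a cidade ontem".split()
print(best_fuzzy_token("ataque", tokens))
print(levenshtein("ataque", "atacou"))
```

In practice the threshold trades recall for precision: a lower value aligns more annotations but risks matching unrelated words.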

The pipeline operates sequentially: once an annotation has been aligned by one component, it is not passed on to the components that follow. According to our experiments, the order listed above gives the best results.
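
The sequential behaviour described above can be sketched as a cascade in which each aligner only sees the annotations that earlier aligners left unaligned. The two toy aligners below (`exact_match`, `case_insensitive_match`) are placeholders chosen to keep the example short; they are not the five components of the package, and `run_cascade` is a hypothetical name.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An aligner receives (annotation, translated_sentence) and returns the
# matched surface string, or None when it cannot align the annotation.
Aligner = Callable[[str, str], Optional[str]]


def exact_match(annotation: str, sentence: str) -> Optional[str]:
    """Placeholder aligner: verbatim substring match."""
    return annotation if annotation in sentence else None


def case_insensitive_match(annotation: str, sentence: str) -> Optional[str]:
    """Placeholder aligner: substring match ignoring case."""
    start = sentence.lower().find(annotation.lower())
    if start == -1:
        return None
    return sentence[start:start + len(annotation)]


def run_cascade(
    annotations: List[str],
    sentence: str,
    aligners: List[Tuple[str, Aligner]],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Apply aligners in order; aligned annotations are never revisited."""
    aligned: Dict[str, Tuple[str, str]] = {}
    pending = list(annotations)
    for name, aligner in aligners:
        still_pending = []
        for annotation in pending:
            match = aligner(annotation, sentence)
            if match is None:
                still_pending.append(annotation)
            else:
                aligned[annotation] = (name, match)
        pending = still_pending
    return aligned, pending


# Illustrative data only.
sentence = "O exército atacou a cidade ontem."
aligned, unaligned = run_cascade(
    ["cidade", "Exército", "bomba"],
    sentence,
    [("exact", exact_match), ("case_insensitive", case_insensitive_match)],
)
print(aligned)
print(unaligned)
```

Because earlier aligners claim annotations first, placing the most precise methods at the front keeps looser methods from overriding reliable matches.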

Usage

  1. Translate ACE-2005 to Portuguese

    By default, Google Translate is used for translation. An API key is needed to use DeepL Translator.

    Usage: python3 translation.py <input_file> <output_dir>
    Example: python src/translate.py data/sample_en.json data/sample_pt.json
    
  2. Run the Annotation Alignment Pipeline

    To align the translated annotations, the alignment pipeline can be executed with the following command:

    Usage: python3 pipeline.py <input_file> <output_dir>
    Example: python src/pipeline.py data/sample_pt.json data/sample_pt_aligned.json
    

    The pipeline is configured through the config.yaml file. There, users select the aligners they want to run and indicate the path to the resources each alignment component requires, such as multiple translations of annotations, previously calculated lemmas, and synonyms. All of these resources are already pre-calculated for Portuguese in the resources folder. The input and output files can also be set in config.yaml.
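
The actual keys of config.yaml are not listed in this description, so the following is only a sketch of what validating such a configuration could look like. Every name in it is an assumption made for illustration: the keys `aligners` and `resources`, the aligner identifiers, the resource paths, and the choice of which aligners require a resource file. The configuration is written as a Python dictionary to keep the example free of third-party dependencies.

```python
from typing import Dict, List

# Hypothetical identifiers, listed in the pipeline order given in this README.
KNOWN_ALIGNERS = [
    "lemmatization",
    "multiple_translations",
    "synonyms",
    "bert_aligner",
    "fuzzy_match",
]

# Assumption: only these aligners depend on a pre-calculated resource file.
NEEDS_RESOURCE = {"lemmatization", "multiple_translations", "synonyms"}


def validate_config(config: Dict) -> List[str]:
    """Return the enabled aligners in pipeline order, or raise ValueError."""
    enabled = config.get("aligners", [])
    unknown = [name for name in enabled if name not in KNOWN_ALIGNERS]
    if unknown:
        raise ValueError(f"Unknown aligners: {unknown}")

    resources = config.get("resources", {})
    missing = [name for name in enabled
               if name in NEEDS_RESOURCE and name not in resources]
    if missing:
        raise ValueError(f"Missing resource paths for: {missing}")

    return sorted(enabled, key=KNOWN_ALIGNERS.index)


# Hypothetical configuration mirroring what a config.yaml might contain.
example_config = {
    "aligners": ["fuzzy_match", "lemmatization", "synonyms"],
    "resources": {
        "lemmatization": "resources/lemmas.json",  # hypothetical path
        "synonyms": "resources/synonyms.json",     # hypothetical path
    },
}
print(validate_config(example_config))
```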

Evaluation

To measure the effectiveness of the alignment pipeline, the entire ACE-2005-PT test set, which includes 1,310 annotations (triggers and arguments), was aligned manually. The alignments were performed by an expert linguist to ensure high-quality annotations, following the same annotation guidelines as the original ACE-2005 corpus.

The evaluation results are presented in Table 1:

Table 1: Evaluation results by pipeline component
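
This description does not state which metric was used to compare the pipeline output with the manual alignments. One plausible option, shown here purely as an assumption, is exact-span accuracy: the share of gold annotations for which the pipeline predicts exactly the same character offsets. The function name, the annotation identifiers, and the data are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def alignment_accuracy(gold: Dict[str, Span],
                       predicted: Dict[str, Optional[Span]]) -> float:
    """Fraction of gold annotations whose predicted span matches exactly.

    Annotations that are missing from `predicted`, or mapped to None,
    count as errors.
    """
    if not gold:
        raise ValueError("gold alignments must not be empty")
    correct = sum(1 for ann_id, span in gold.items()
                  if predicted.get(ann_id) == span)
    return correct / len(gold)


# Made-up example data, not results from the paper or the package.
gold = {"a1": (0, 5), "a2": (10, 16), "a3": (20, 27), "a4": (30, 33)}
predicted = {"a1": (0, 5), "a2": (10, 16), "a3": (21, 27), "a4": (30, 33)}
print(alignment_accuracy(gold, predicted))  # 0.75
```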

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License.
