LinguAligner
LinguAligner is a Python package for automatically translating annotated corpora while preserving their annotations. It supports multiple translation APIs and alignment strategies, making it a valuable tool for NLP researchers building multilingual datasets, particularly for low-resource languages.
Natural Language Processing (NLP) research remains heavily centered on English, creating a language imbalance in AI. One way to improve linguistic diversity is to adapt annotated corpora from high-resource languages to others. However, preserving span-based annotation quality after translation requires precise alignment of annotations between the source and translated texts, a challenging task due to lexical, syntactic, and semantic divergences between languages. LinguAligner provides an automated pipeline to align annotations within translated texts using several annotation alignment strategies.
🚀 Features

- 🌐 Translation Module: supports external translation services:
  - Google Translate
  - Microsoft Translator
  - DeepL
- 🧠 Annotation Alignment Module: implements multiple techniques:
  - Exact / Fuzzy Matching: Levenshtein, Gestalt
  - Lemmatization-based Matching using spaCy
  - Pre-compiled Translation Dictionaries via the Microsoft Lookup API
  - Multilingual Contextual Embeddings using multilingual BERT
The pipeline operates sequentially: annotations aligned by an earlier method are not revisited by later ones. In our experiments, the order listed above gave the best results.
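The sequential fallback behavior can be sketched roughly as follows. The function and strategy names here are hypothetical and purely illustrative, not LinguAligner's actual internals:

```python
def align_sequentially(methods, annotation, sentence):
    """Try each alignment method in order and return the first hit."""
    for name, method in methods:
        span = method(annotation, sentence)
        if span is not None:
            return name, span  # later methods never revisit this annotation
    return None

# Two toy strategies: exact match first, then case-insensitive match.
def exact(ann, sent):
    i = sent.find(ann)
    return (i, i + len(ann)) if i != -1 else None

def exact_ci(ann, sent):
    i = sent.lower().find(ann.lower())
    return (i, i + len(ann)) if i != -1 else None

result = align_sequentially(
    [("exact", exact), ("exact_ci", exact_ci)],
    "Soldados", "Os soldados aterraram na costa.",
)
print(result)  # ('exact_ci', (3, 11))
```

Because the first strategy fails, the second one gets a chance; once any strategy succeeds, the remaining ones are skipped.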
📦 Installation
Install via PyPI:
pip install LinguAligner
🧪 Example Usage
1. Translate Corpora
You can translate your corpus with the built-in translation APIs (an API key is required) or with an external tool.
```python
from LinguAligner import translation

# Google Translate
translator = translation.GoogleTranslator(source_lang="en", target_lang="pt")
translated_text = translator.translate("The soldiers were ordered to fire their weapons")

# DeepL
translator = translation.DeepLTranslator(source_lang="en", target_lang="pt", key="DEEPL_KEY")
translated_text = translator.translate("The soldiers were ordered to fire their weapons")

# Microsoft
translator = translation.MicrosoftTranslator(source_lang="en", target_lang="pt", key="MICROSOFT_KEY")
translated_text = translator.translate("The soldiers were ordered to fire their weapons")

print(translated_text)
```
2. Align Annotations
Users can select the aligner strategies they want to use and the order in which they are applied. In our experiments, the sequence shown in the example below performed best; however, we encourage you to experiment with different orders for your specific use case.
```python
from LinguAligner import AlignmentPipeline

# Define pipeline and model configuration
config = {
    "pipeline": ["lemma", "M_trans", "w_aligner", "gestalt", "levenshtein"],
    "spacy_model": "pt_core_news_lg",
    "w_aligner_model": "bert-base-multilingual-uncased",
}
aligner = AlignmentPipeline(config)

# Source and translated data
src_sent = "The soldiers land on the shore..."
src_ann = "land"
trans_sent = "Os soldados aterraram na costa."
trans_ann = "terra"  # Expected direct translation

# Perform annotation alignment
target_annotation = aligner.align_annotation(
    src_sent, src_ann, trans_sent, trans_ann
)
print(target_annotation)
# Output: ('aterraram', (12, 21))
```
In this example, the word land translates to terra (the noun) in isolation, but to aterraram (the verb, "landed") in context. Although terra is a valid translation of the annotation, it does not occur in the translated sentence and therefore cannot be aligned directly. Such mismatches highlight the need for additional processing to determine the correct annotation offsets in the translated text, in this case mapping terra to aterraram.
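The Gestalt strategy in the pipeline refers to Ratcliff/Obershelp pattern matching, which Python's standard difflib also implements. As an independent illustration (not LinguAligner's internal code), fuzzy-matching terra against the tokens of the translated sentence already singles out aterraram:

```python
import difflib

trans_sent = "Os soldados aterraram na costa."
candidate = "terra"  # direct translation of the annotation "land"

# Score each token of the translated sentence against the candidate.
tokens = trans_sent.strip(".").split()
scores = {tok: difflib.SequenceMatcher(None, candidate, tok.lower()).ratio()
          for tok in tokens}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # aterraram 0.71
```

Because "terra" is a substring of "aterraram", the similarity ratio is high (2·5 matched characters over 14 total), while every other token scores well below it.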
🔧 Configuration
You can customize the alignment behavior in the config variable:
```python
config = {
    "pipeline": ["lemma", "w_aligner", "levenshtein"],    # pipeline elements and their order
    "spacy_model": "fr_core_news_md",                     # spaCy model
    "w_aligner_model": "bert-base-multilingual-uncased",  # multilingual embedding model
}
```
🔧 Advanced Options
Specify the source annotation's start index to resolve ambiguity when the annotation occurs multiple times in the source sentence:
```python
src_sent = "he was a good man because he had a kind heart"
src_ann = "he"
trans_sent = "ele era um bom homem porque ele tinha um bom coração"
trans_ann = "ele"

target_annotation = aligner.align_annotation(
    src_sent, src_ann, trans_sent, trans_ann, src_ann_start=29
)
print(target_annotation)
# Output: ('ele', (28, 30))
```
Using the M_trans Method
The M_trans method relies on having multiple possible translations for each annotation. These must be prepared in advance and stored in a Python dictionary, where each key is a source annotation and the value is a list of alternative translations.
You can generate this translation dictionary using the Microsoft Translator API (requires a MICROSOFT_TRANSLATOR_KEY):
```python
from LinguAligner import translation

translator = translation.MicrosoftTranslator(
    source_lang="en", target_lang="pt", auth_key="MICROSOFT_TRANSLATOR_KEY"
)

annotations_list = ["war", "land", "fire"]
lookup_table = {}
for word in annotations_list:
    lookup_table[word] = translator.getMultipleTranslations(word)

# Use the lookup table in align_annotation
aligner.align_annotation(
    "The soldiers were ordered to fire their weapons",
    "fire",
    "Os soldados receberam ordens para disparar as suas armas",
    "incêndio",
    M_trans_dict=lookup_table,
)
```
🔎 Example output of a lookup table:
```python
{
    "fire": [
        "fogo",
        "incêndio",
        "demitir",
        "despedir",
        "fogueira",
        "disparar",
        "chamas",
        "dispare",
        "lareira",
        "atirar",
        "atire",
    ]
}
```
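Conceptually, the M_trans strategy can then scan the translated sentence for any of the alternatives in the lookup table. The following is a simplified sketch under that assumption, not the package's actual implementation:

```python
# Abridged lookup table, as would be produced by the Microsoft Lookup API.
lookup_table = {"fire": ["fogo", "incêndio", "demitir", "disparar"]}
trans_sent = "Os soldados receberam ordens para disparar as suas armas"

def find_via_lookup(src_ann, sent, table):
    """Return (alternative, (start, end)) for the first alternative found in the sentence."""
    for alt in table.get(src_ann, []):
        i = sent.lower().find(alt.lower())
        if i != -1:
            return alt, (i, i + len(alt))
    return None

print(find_via_lookup("fire", trans_sent, lookup_table))
# ('disparar', (34, 42))
```

Even though the literal translation incêndio is absent from the sentence, one of the pre-compiled alternatives (disparar) is present, so the annotation can still be anchored to an offset.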
📚 Use Cases
LinguAligner was used to create translated versions of the following annotated corpora:
- ACE-2005 (EN → PT): event extraction benchmark, now available in Portuguese via the LDC
- T2S LUSA (PT → EN): Portuguese news event corpus adapted to English (DOI: 10.25747/ESFS-1P16)
- MAVEN (EN → PT): high-coverage event trigger corpus from Wikipedia, translated to Portuguese (available in this repository)
- WikiEvents (EN → PT): document-level event extraction dataset translated to Portuguese (available in this repository)
🧩 References
Coming soon...