
lexi-align

Word alignment between two languages using structured generation with Large Language Models.

Installation

Install from PyPI:

pip install lexi-align

(or your favorite method)

The library is API-backend agnostic and only directly depends on Pydantic, so you will need to bring your own API code or use the provided litellm integration.
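If you bring your own backend, the adapter is the integration point. The sketch below is hypothetical: the class name, method signature, and return value are assumptions for illustration, not the library's documented adapter protocol (see lexi_align.adapters for the actual interface):

# Hypothetical sketch only; not the actual lexi-align adapter API.
# Assumption: an adapter is a callable that receives chat-style messages
# and returns the model's structured (JSON) completion for Pydantic validation.
class MyCustomAdapter:
    def __init__(self, client):
        self.client = client  # your own API client

    def __call__(self, messages):
        # Forward the messages to your backend and return its raw completion.
        return self.client.complete(messages)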

For LLM support via litellm (recommended), install with the optional dependency:

pip install lexi-align[litellm]

Using uv:

uv add lexi-align --extra litellm

Usage

Basic Usage

The library expects pre-tokenized input - it does not perform any tokenization. You must provide tokens as lists of strings:

from lexi_align.adapters.litellm_adapter import LiteLLMAdapter
from lexi_align.core import align_tokens

# Initialize the LLM adapter
llm_adapter = LiteLLMAdapter(model_params={
    "model": "gpt-4",
    "temperature": 0.0
})

# Provide pre-tokenized input with repeated tokens
source_tokens = ["the", "big", "cat", "saw", "the", "cat"]  # Note: "the" and "cat" appear twice
target_tokens = ["le", "gros", "chat", "a", "vu", "le", "chat"]

alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)

# Example output will show the uniquified tokens:
# the₁ -> le₁
# big -> gros
# cat₁ -> chat₁
# saw -> a
# saw -> vu
# the₂ -> le₂
# cat₂ -> chat₂

Performance

Here are some preliminary results on the test EN-SL subset of XL-WA:

gpt-4o-2024-08-06 (1-shot, seed=42)

Language Pair    Precision    Recall    F1
EN-SL            0.863        0.829     0.846
Average          0.863        0.829     0.846

claude-3-haiku-20240307 (1-shot)

Language Pair    Precision    Recall    F1
EN-SL            0.651        0.630     0.640
Average          0.651        0.630     0.640
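
As a quick sanity check on the tables above, F1 is the harmonic mean of precision and recall; a plain-Python check (not part of the library) reproduces the reported values:

# F1 as the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.863, 0.829), 3))  # 0.846 (gpt-4o-2024-08-06, EN-SL)
print(round(f1(0.651, 0.630), 3))  # 0.640 (claude-3-haiku-20240307, EN-SL)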

For reference, the 1-shot (1 example) gpt-4o-2024-08-06 results are better than all systems presented in the XL-WA paper (Table 2).

Pharaoh Format Export

While the core alignment functions work with pre-tokenized input, the Pharaoh format utilities currently assume space-separated tokens when parsing/exporting. If your tokens contain spaces or require special tokenization, you'll need to handle this separately.

from lexi_align.utils import export_pharaoh_format

# Note: Pharaoh format assumes space-separated tokens
pharaoh_format = export_pharaoh_format(
    source_tokens,  # Pre-tokenized list of strings
    target_tokens,  # Pre-tokenized list of strings
    alignment
)

print(pharaoh_format)
# Output for the example tokens above (alignments will differ depending on the chosen model):
# the big cat saw the cat    le gros chat a vu le chat    0-0 1-1 2-2 3-3 3-4 4-5 5-6

The Pharaoh format consists of three tab-separated fields:

  1. Source sentence (space-separated tokens)
  2. Target sentence (space-separated tokens)
  3. Alignments as space-separated pairs of indices (source-target)
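
For illustration, a Pharaoh-format line can be parsed back with plain Python (this helper is not part of lexi-align):

# Minimal sketch: split a Pharaoh line into tokens and alignment index pairs.
def parse_pharaoh_line(line: str):
    source, target, pairs = line.rstrip("\n").split("\t")
    source_tokens = source.split()
    target_tokens = target.split()
    # Each pair is "source_index-target_index" into the token lists above.
    alignments = [tuple(map(int, pair.split("-"))) for pair in pairs.split()]
    return source_tokens, target_tokens, alignments

src, tgt, pairs = parse_pharaoh_line(
    "the big cat saw the cat\tle gros chat a vu le chat\t0-0 1-1 2-2 3-3 3-4 4-5 5-6"
)
print(pairs)  # [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 5), (5, 6)]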

Running Evaluations

The package includes scripts to evaluate alignment performance on the XL-WA dataset (CC BY-NC-SA 4.0):

# Install dependencies
pip install lexi-align[litellm]

# Basic evaluation on a single language pair
python evaluations/xl-wa.py --lang-pairs EN-SL

# Evaluate on all language pairs
python evaluations/xl-wa.py --lang-pairs all

# Full evaluation with custom parameters
python evaluations/xl-wa.py \
    --lang-pairs EN-FR EN-DE \
    --model gpt-4 \
    --temperature 0.0 \
    --seed 42 \
    --num-train-examples 3 \
    --output results.json

Available command-line arguments:

  • --lang-pairs: Language pairs to evaluate (e.g., EN-SL EN-DE) or "all"
  • --model: LLM model to use (default: gpt-4)
  • --temperature: Temperature for LLM sampling (default: 0.0)
  • --seed: Random seed for example selection (default: 42)
  • --model-seed: Seed for LLM sampling (optional)
  • --num-train-examples: Number of training examples for few-shot learning
  • --sample-size: Number of test examples to evaluate per language pair
  • --output: Path to save results JSON file
  • --verbose: Enable verbose logging

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this software in your research, please cite:

@software{lexi_align,
  title = {lexi-align: Word Alignment via Structured Generation},
  author = {Hodošček, Bor},
  year = {2024},
  url = {https://github.com/borh-lab/lexi-align}
}

References

We use the XL-WA dataset (repository) to perform evaluations:

@InProceedings{martelli-EtAl:2023:clicit,
  author    = {Martelli, Federico  and  Bejgu, Andrei Stefan  and  Campagnano, Cesare  and  Čibej, Jaka  and  Costa, Rute  and  Gantar, Apolonija  and  Kallas, Jelena  and  Koeva, Svetla  and  Koppel, Kristina  and  Krek, Simon  and  Langemets, Margit  and  Lipp, Veronika  and  Nimb, Sanni  and  Olsen, Sussi  and  Pedersen, Bolette Sandford  and  Quochi, Valeria  and  Salgado, Ana  and  Simon, László  and  Tiberius, Carole  and  Ureña-Ruiz, Rafael-J  and  Navigli, Roberto},
  title     = {XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs},
  booktitle      = {Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  month          = {November},
  year           = {2023}
}

This code was spun out of the hachidaishu-translation project, presented at JADH2024.

Development

Contributions are welcome! Please feel free to submit a Pull Request.

To set up the development environment:

git clone https://github.com/borh-lab/lexi-align.git
cd lexi-align
pip install -e ".[dev]"

Run tests:

pytest
