# lexi-align

Word alignment of multilingual sentences using structured generation with Large Language Models.
## Installation

Install from PyPI (or with your favorite method):

```bash
pip install lexi-align
```

The library is API-backend agnostic and only directly depends on Pydantic, so you will need to bring your own API code or use the provided litellm integration.

For LLM support via litellm (recommended), install with the optional dependency:

```bash
pip install lexi-align[litellm]
# or, using uv:
uv add lexi-align --extra litellm
```

For LLM support via Outlines (for local models):

```bash
pip install lexi-align[outlines]
# or, using uv:
uv add lexi-align --extra outlines
```

For LLM support via llama.cpp (for local models):

```bash
pip install lexi-align[llama]
# or, using uv:
uv add lexi-align --extra llama
```
## Usage

### Basic Usage

The library expects pre-tokenized input; it does not perform any tokenization itself. You must provide tokens as lists of strings:

```python
from lexi_align.adapters.litellm_adapter import LiteLLMAdapter
from lexi_align.core import align_tokens

# Initialize the LLM adapter
llm_adapter = LiteLLMAdapter(model_params={
    "model": "gpt-4",
    "temperature": 0.0
})

# Provide pre-tokenized input with repeated tokens
source_tokens = ["the", "big", "cat", "saw", "the", "cat"]  # Note: "the" and "cat" appear twice
target_tokens = ["le", "gros", "chat", "a", "vu", "le", "chat"]

alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)

# Example output will show the uniquified tokens:
# the₁ -> le₁
# big -> gros
# cat₁ -> chat₁
# saw -> a
# saw -> vu
# the₂ -> le₂
# cat₂ -> chat₂
```
### Using Custom Guidelines and Examples

You can provide custom alignment guidelines and examples to improve alignment quality:

```python
from lexi_align.adapters.litellm_adapter import LiteLLMAdapter
from lexi_align.core import align_tokens
from lexi_align.models import TextAlignment, TokenAlignment

# Initialize adapter as before
llm_adapter = LiteLLMAdapter(model_params={
    "model": "gpt-4",
    "temperature": 0.0
})

# Define custom guidelines
guidelines = """
1. Align content words (nouns, verbs, adjectives) first
2. Function words should be aligned when they have clear correspondences
3. Handle idiomatic expressions by aligning all components
4. One source token can align to multiple target tokens and vice versa
"""

# Provide examples to demonstrate desired alignments
examples = [
    (
        "The cat".split(),  # source tokens
        "Le chat".split(),  # target tokens
        TextAlignment(  # gold alignment
            alignment=[
                TokenAlignment(source_token="The", target_token="Le"),
                TokenAlignment(source_token="cat", target_token="chat"),
            ]
        )
    ),
    # Add more examples as needed
]

# Use guidelines and examples in alignment
alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French",
    guidelines=guidelines,
    examples=examples
)
```
### Raw Message Control

For more control over the prompt, you can use `align_tokens_raw` to provide custom messages:

```python
from lexi_align.core import align_tokens_raw

custom_messages = [
    {"role": "system", "content": "You are an expert translator aligning English to French."},
    {"role": "user", "content": "Follow these guidelines:\n" + guidelines},
    # Add any other custom messages
]

alignment = align_tokens_raw(
    llm_adapter,
    source_tokens,
    target_tokens,
    custom_messages
)
```
### Token Uniquification

The library automatically handles repeated tokens by adding unique markers:

```python
from lexi_align.utils import make_unique, remove_unique

# Tokens with repeats
tokens = ["the", "cat", "the", "mat"]

# Add unique markers
unique_tokens = make_unique(tokens)
print(unique_tokens)  # ['the₁', 'cat', 'the₂', 'mat']

# Remove markers
original_tokens = remove_unique(unique_tokens)
print(original_tokens)  # ['the', 'cat', 'the', 'mat']
```

You can also customize the marker style:

```python
from lexi_align.text_processing import create_underscore_generator

# Use underscore markers instead of subscripts
marker_gen = create_underscore_generator()
unique_tokens = make_unique(tokens, marker_gen)
print(unique_tokens)  # ['the_1', 'cat', 'the_2', 'mat']
```
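The idea behind uniquification is straightforward; the following standalone sketch (not the library's implementation) appends subscript indices only to tokens that occur more than once, which lets alignments refer unambiguously to a specific occurrence:

```python
from collections import Counter

SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def uniquify(tokens: list[str]) -> list[str]:
    """Append subscript indices (₁, ₂, ...) to tokens that occur more than once."""
    totals = Counter(tokens)   # how often each token appears overall
    seen: Counter = Counter()  # how many occurrences we have emitted so far
    out = []
    for tok in tokens:
        if totals[tok] > 1:
            seen[tok] += 1
            out.append(tok + str(seen[tok]).translate(SUBSCRIPTS))
        else:
            out.append(tok)
    return out

print(uniquify(["the", "cat", "the", "mat"]))  # ['the₁', 'cat', 'the₂', 'mat']
```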
### Using Local Models with Outlines

For running local models, you can use the Outlines adapter:

```python
from lexi_align.adapters.outlines_adapter import OutlinesAdapter
from lexi_align.core import align_tokens

# Initialize the Outlines adapter with a local model
llm_adapter = OutlinesAdapter(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",  # or any local model path
    dtype="bfloat16",  # optional: choose quantization
    device="cuda"      # optional: specify device
)

# Use the same API as with other adapters
alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)
```
### Using Local Models with llama.cpp

For running local models with llama.cpp:

```python
from lexi_align.adapters.llama_cpp_adapter import LlamaCppAdapter
from lexi_align.core import align_tokens

# Initialize the llama.cpp adapter with a local model
llm_adapter = LlamaCppAdapter(
    model_path="path/to/model.gguf",
    n_gpu_layers=-1,  # Use GPU acceleration
)

# Note: for some GGUF models the pre-tokenizer might fail, in which
# case you can specify tokenizer_repo_id, which should point to the
# base model's repo_id on Huggingface.

# Use the same API as with other adapters
alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)
```
## Performance

Here are some preliminary results on the EN-SL test subset of XL-WA:

**gpt-4o-2024-08-06 (1-shot, seed=42)**

| Language Pair | Precision | Recall | F1 |
|---|---|---|---|
| EN-SL | 0.863 | 0.829 | 0.846 |
| Average | 0.863 | 0.829 | 0.846 |

**claude-3-haiku-20240307 (1-shot)**

| Language Pair | Precision | Recall | F1 |
|---|---|---|---|
| EN-SL | 0.651 | 0.630 | 0.640 |
| Average | 0.651 | 0.630 | 0.640 |

**meta-llama/Llama-3.2-3B-Instruct (1-shot)**

| Language Pair | Precision | Recall | F1 |
|---|---|---|---|
| EN-SL | 0.606 | 0.581 | 0.593 |
| Average | 0.606 | 0.581 | 0.593 |

For reference, the 1-shot (one example) gpt-4o-2024-08-06 results for EN-SL outperform all systems presented in the XL-WA paper (Table 2), while the smaller LLMs perform below the state of the art.
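Precision, recall, and F1 in these tables are computed over predicted versus gold alignment pairs; a minimal, library-independent sketch of that computation (this is an illustration, not the project's evaluation script):

```python
def alignment_prf(
    predicted: set[tuple[int, int]], gold: set[tuple[int, int]]
) -> tuple[float, float, float]:
    """Compute precision/recall/F1 over (source, target) index pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: one wrong pair (2-3) and one missed pair (2-2)
pred = {(0, 0), (1, 1), (2, 3)}
gold = {(0, 0), (1, 1), (2, 2)}
print(alignment_prf(pred, gold))  # precision = recall = F1 = 2/3
```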
Pharaoh Format Export
While the core alignment functions work with pre-tokenized input, the Pharaoh format utilities currently assume space-separated tokens when parsing/exporting. If your tokens contain spaces or require special tokenization, you'll need to handle this separately.
from lexi_align.utils import export_pharaoh_format
# Note: Pharaoh format assumes space-separated tokens
pharaoh_format = export_pharaoh_format(
source_tokens, # Pre-tokenized list of strings
target_tokens, # Pre-tokenized list of strings
alignment
)
print(pharaoh_format)
# Output (will differ depending on chosen model):
# The cat sat on the mat Le chat était assis sur le tapis 0-0 1-1 2-2 2-3 3-4 4-5 5-6
The Pharaoh format consists of three tab-separated fields:
- Source sentence (space-separated tokens)
- Target sentence (space-separated tokens)
- Alignments as space-separated pairs of indices (source-target)
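To make the three-field layout concrete, a Pharaoh line can be parsed with a few lines of standard Python (this helper is an illustration, not part of lexi-align):

```python
def parse_pharaoh_line(line: str) -> tuple[list[str], list[str], list[tuple[int, int]]]:
    """Split a Pharaoh-format line into source tokens, target tokens, and index pairs."""
    source, target, pairs = line.rstrip("\n").split("\t")
    alignments = [
        (int(s), int(t))
        for s, t in (pair.split("-") for pair in pairs.split())
    ]
    return source.split(), target.split(), alignments

src, tgt, pairs = parse_pharaoh_line("The cat\tLe chat\t0-0 1-1")
print(pairs)  # [(0, 0), (1, 1)]
```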
## Running Evaluations

The package includes scripts to evaluate alignment performance on the XL-WA dataset (CC BY-NC-SA 4.0):

```bash
# Install dependencies
pip install lexi-align[litellm]

# Basic evaluation on a single language pair
python evaluations/xl-wa.py --lang-pairs EN-SL

# Evaluate on all language pairs
python evaluations/xl-wa.py --lang-pairs all

# Full evaluation with custom parameters
python evaluations/xl-wa.py \
    --lang-pairs EN-FR EN-DE \
    --model gpt-4 \
    --temperature 0.0 \
    --seed 42 \
    --num-train-examples 3 \
    --output results.json
```

Available command-line arguments:

- `--lang-pairs`: Language pairs to evaluate (e.g., EN-SL EN-DE) or "all"
- `--model`: LLM model to use (default: gpt-4)
- `--temperature`: Temperature for LLM sampling (default: 0.0)
- `--seed`: Random seed for example selection (default: 42)
- `--model-seed`: Seed for LLM sampling (optional)
- `--num-train-examples`: Number of training examples for few-shot learning
- `--sample-size`: Number of test examples to evaluate per language pair
- `--output`: Path to save results JSON file
- `--verbose`: Enable verbose logging
## Planned improvements

- Structured generation support (adapter additions) for local models via Outlines and llama.cpp GBNF
- Retries on errors or invalid alignments

## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Citation

If you use this software in your research, please cite:

```bibtex
@software{lexi_align,
  title  = {lexi-align: Word Alignment via Structured Generation},
  author = {Hodošček, Bor},
  year   = {2024},
  url    = {https://github.com/borh-lab/lexi-align}
}
```
## References

We use the XL-WA dataset (repository) to perform evaluations:

```bibtex
@InProceedings{martelli-EtAl:2023:clicit,
  author    = {Martelli, Federico and Bejgu, Andrei Stefan and Campagnano, Cesare and Čibej, Jaka and Costa, Rute and Gantar, Apolonija and Kallas, Jelena and Koeva, Svetla and Koppel, Kristina and Krek, Simon and Langemets, Margit and Lipp, Veronika and Nimb, Sanni and Olsen, Sussi and Pedersen, Bolette Sandford and Quochi, Valeria and Salgado, Ana and Simon, László and Tiberius, Carole and Ureña-Ruiz, Rafael-J and Navigli, Roberto},
  title     = {XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs},
  booktitle = {Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  month     = {November},
  year      = {2023}
}
```

This code was spun out of the hachidaishu-translation project, presented at JADH2024.
## Development

Contributions are welcome! Please feel free to submit a Pull Request.

To set up the development environment:

```bash
git clone https://github.com/borh-lab/lexi-align.git
cd lexi-align
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```