A text alignment library with special support for Hebrew texts

TRAligner

A Python package for text alignment with specialized support for Hebrew text processing.

Installation

pip install traligner

Features

  • Text alignment using the Smith-Waterman algorithm
  • Support for Hebrew text with specialized matching methods
  • Handling of abbreviations, gematria, and other Hebrew-specific text features
  • Sequence scoring and merging capabilities
  • Word embedding support for semantic similarity
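
To illustrate the Smith-Waterman local alignment named in the feature list, here is a minimal sketch that works on token lists with exact matching only. It is not TRAligner's implementation: the function name `smith_waterman`, the `gap_score` parameter, and the negative default for `mismatch_score` are assumptions made for this example.

```python
def smith_waterman(a, b, match_score=3, mismatch_score=-1, gap_score=-1):
    """Return (best_score, pairs) for the best local alignment of token lists a and b.

    pairs is a list of (index_in_a, index_in_b) tuples for tokens aligned diagonally.
    Illustrative sketch only; scoring defaults are assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match_score if a[i - 1] == b[j - 1] else mismatch_score

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                    # start a new local alignment
                H[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap_score,               # skip a token in a
                H[i][j - 1] + gap_score,               # skip a token in b
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + substitution(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap_score:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


suspect = ["x", "בראשית", "ברא", "אלהים"]
source = ["בראשית", "ברא", "אלוהים"]
best_score, aligned_pairs = smith_waterman(suspect, source)
print(best_score, aligned_pairs)
```

With exact matching alone, the spelling variants אלהים and אלוהים do not align, which is the kind of gap the Hebrew-specific matching methods are meant to close.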

Quick Start

import traligner as ta

# Align two texts
suspect_tokens = ["בראשית", "ברא", "אלהים", ]
source_tokens = ["בראשית", "ברא", "אלוהים",  ]

alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    match_score=3,
    mismatch_score=1,
    methods={}
)

# Score the alignment
score, sequences = ta.alignmentScore(alignment_sequences)
print(f"Alignment score: {score}")

The Results

The alignment_sequences will look like this: [[(0, 0, 1, 'exact_match'), (1, 1, 1, 'exact_match'), (2, 2, 1, 'exact_match')]]

The alignment_sequences variable is a list of lists, where each inner list represents one local alignment between the two texts. Each local alignment is a list of tuples, one per matched token pair. Each tuple contains four elements:

  • The index of the aligned token in the first input list (suspect_tokens)
  • The index of the aligned token in the second input list (source_tokens)
  • The alignment score assigned to this pair
  • The reason for the match (for example, 'exact_match')
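
Because the first two elements of each tuple are positions in the input lists, it can help to map them back to the tokens themselves. The sketch below does this on the Quick Start output copied verbatim from above; `decode_alignment` is a hypothetical helper written for this example, not part of TRAligner.

```python
suspect_tokens = ["בראשית", "ברא", "אלהים"]
source_tokens = ["בראשית", "ברא", "אלוהים"]

# Output of the Quick Start example, copied from above.
alignment_sequences = [
    [(0, 0, 1, 'exact_match'), (1, 1, 1, 'exact_match'), (2, 2, 1, 'exact_match')]
]


def decode_alignment(sequences, suspect, source):
    """Replace token indices with the tokens they point to (hypothetical helper)."""
    return [
        [(suspect[i], source[j], score, reason) for i, j, score, reason in sequence]
        for sequence in sequences
    ]


decoded = decode_alignment(alignment_sequences, suspect_tokens, source_tokens)
for suspect_token, source_token, score, reason in decoded[0]:
    print(suspect_token, source_token, score, reason)
```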

Advanced Usage

Using Word Embeddings and Lexicons

# Initialize embedding model
import fasttext
embeding_model = fasttext.load_model("path/to/fasttext/model.bin")

# Initialize lexicons
# To use Elasticsearch synonyms, load them from a file.
import trelasticext as ee
synonyms = ee.load_synonyms("path/to/elasticsearch/analysis/your_lexicon")


# In the following alignment example, the two texts differ through word boundary errors,
# typographical mistakes, orthographic variations, gematria (numbers written as letters),
# and synonyms. Even so, every token in one text has a counterpart in the other.

suspect_tokens = ["בראשית", "כרא", "ה'", "ח", "השמים", "ואת", "הארץ"]
source_tokens = ["בראשית", "ברא", "אלוהים", "שמונה", "השמיים", "ואתהארץ" ]


methods = {
    "ortography": ["י", "ו"],
    "extra_seperators": [""],
    "missing_seperators": [""],
    "abbreviation": ["'"],
    "morphology-embeding": [(embeding_model, 0.702)],
}

alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    methods=methods
)

Results

[[(0, 0, 1, 'exact_match'), (1, 1, 0.8, 'ocr_replacables'), (2, 2, 1.0, 'synonym_simple_match'), (3, 3, 0.75, 'single_gematria_match'), (4, 4, 0.8280513747171923, 'morphology_embeding_match'), (5, 5, 0.8, 'missing_spaces_match'), (6, 5, 0.8, 'missing_spaces_match')]]
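
In this result the last two tuples share source index 5: the suspect tokens ואת and הארץ both align to the single source token ואתהארץ, which is how a missing space shows up. The following sketch groups suspect tokens by the source token they map to, using the result copied verbatim from above; `group_by_source` is a hypothetical helper for illustration, not a TRAligner function.

```python
from collections import defaultdict

suspect_tokens = ["בראשית", "כרא", "ה'", "ח", "השמים", "ואת", "הארץ"]
source_tokens = ["בראשית", "ברא", "אלוהים", "שמונה", "השמיים", "ואתהארץ"]

# Result of the advanced example, copied from above.
result = [[
    (0, 0, 1, 'exact_match'),
    (1, 1, 0.8, 'ocr_replacables'),
    (2, 2, 1.0, 'synonym_simple_match'),
    (3, 3, 0.75, 'single_gematria_match'),
    (4, 4, 0.8280513747171923, 'morphology_embeding_match'),
    (5, 5, 0.8, 'missing_spaces_match'),
    (6, 5, 0.8, 'missing_spaces_match'),
]]


def group_by_source(sequence, suspect, source):
    """Collect the suspect tokens aligned to each source token (hypothetical helper)."""
    groups = defaultdict(list)
    for suspect_idx, source_idx, _score, _reason in sequence:
        groups[source_idx].append(suspect[suspect_idx])
    return [(source[idx], groups[idx]) for idx in sorted(groups)]


grouped = group_by_source(result[0], suspect_tokens, source_tokens)
for source_token, matched in grouped:
    print(source_token, "<-", " ".join(matched))
```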

License

MIT
