A text alignment library with special support for Hebrew texts
TRAligner
A Python package for text alignment with specialized support for Hebrew text processing.
Installation
pip install traligner
Features
- Text alignment using the Smith-Waterman algorithm
- Support for Hebrew text with specialized matching methods
- Handling of abbreviations, gematria, and other Hebrew-specific text features
- Sequence scoring and merging capabilities
- Word embedding support for semantic similarity
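For readers unfamiliar with Smith-Waterman, the following is a minimal, generic sketch of token-level local alignment. It is not TRAligner's implementation: the function name smith_waterman, the gap penalty, and the default scores are assumptions chosen for illustration, and tokens only count as a match when they are identical.

```python
from typing import List, Tuple


def smith_waterman(
    a: List[str],
    b: List[str],
    match: int = 3,      # assumed reward for identical tokens
    mismatch: int = -1,  # assumed penalty for differing tokens
    gap: int = -1,       # assumed penalty for skipping a token
) -> Tuple[int, List[Tuple[int, int]]]:
    """Return the best local alignment score and the aligned (i, j) index pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            pairs.append((i - 1, j - 1))  # diagonal move: tokens are paired
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1  # token of a left unaligned
        else:
            j -= 1  # token of b left unaligned
    pairs.reverse()
    return best, pairs


score, pairs = smith_waterman(
    ["x", "בראשית", "ברא", "אלהים"],
    ["בראשית", "ברא", "אלהים", "y"],
)
print(score, pairs)  # 9 [(1, 0), (2, 1), (3, 2)]
```

The library layers Hebrew-specific matching methods on top of this basic idea, so its scores and outputs will differ from this sketch.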
Quick Start
import traligner as ta
# Align two texts
suspect_tokens = ["בראשית", "ברא", "אלהים"]
source_tokens = ["בראשית", "ברא", "אלוהים"]
alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    match_score=3,
    mismatch_score=1,
    methods={}
)
# Score the alignment
score, sequences = ta.alignmentScore(alignment_sequences)
print(f"Alignment score: {score}")
The Results
The alignment_sequences will look like this: [[(0, 0, 1, 'exact_match'), (1, 1, 1, 'exact_match'), (2, 2, 1, 'exact_match')]]
The alignment_sequences variable is a list of lists, where each inner list represents one local alignment between the two texts. Each local alignment is a list of tuples, and each tuple contains four elements:
- The position of the aligned token in the first input list (suspect_tokens)
- The position of the aligned token in the second input list (source_tokens)
- The alignment score assigned to this pair of tokens
- The reason for the alignment, such as 'exact_match'
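To make the tuple layout concrete, here is a small sketch that turns the sample output above back into token pairs. It assumes, as the sample suggests, that the first two elements of each tuple are zero-based positions in the two input lists. The helper describe_alignment is a hypothetical name and not part of TRAligner, and the result is pasted in literally so the snippet runs without the package installed.

```python
from typing import List, Sequence, Tuple

# (suspect position, source position, score, reason)
Match = Tuple[int, int, float, str]


def describe_alignment(
    sequences: Sequence[Sequence[Match]],
    suspect_tokens: List[str],
    source_tokens: List[str],
) -> List[Tuple[int, str, str, float, str]]:
    """Flatten the alignments into (alignment id, suspect token, source token, score, reason)."""
    rows = []
    for seq_id, sequence in enumerate(sequences):
        for suspect_idx, source_idx, score, reason in sequence:
            rows.append(
                (seq_id, suspect_tokens[suspect_idx], source_tokens[source_idx], score, reason)
            )
    return rows


suspect_tokens = ["בראשית", "ברא", "אלהים"]
source_tokens = ["בראשית", "ברא", "אלוהים"]
# Sample output copied from the Quick Start result.
alignment_sequences = [
    [(0, 0, 1, "exact_match"), (1, 1, 1, "exact_match"), (2, 2, 1, "exact_match")]
]

rows = describe_alignment(alignment_sequences, suspect_tokens, source_tokens)
for row in rows:
    print(row)
```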
Advanced Usage
Using Word Embeddings and Lexicons
# Initialize the embedding model
import fasttext
embeding_model = fasttext.load_model("path/to/fasttext/model.bin")
# Initialize lexicons
# If you would like to use Elasticsearch synonyms, you may load them from a file.
import trelasticext as ee
synonyms = ee.load_synonyms("path/to/elasticsearch/analysis/your_lexicon")
# In the following alignment example, the two texts differ through word boundary errors,
# typographical mistakes, orthographic variations, differences in Gematria,
# and the use of synonyms. Nevertheless, the two texts are equivalent in content.
suspect_tokens = ["בראשית", "כרא", "ה'", "ח", "השמים", "ואת", "הארץ"]
source_tokens = ["בראשית", "ברא", "אלוהים", "שמונה", "השמיים", "ואתהארץ" ]
methods = {
    "ortography": ["י", "ו"],
    "extra_seperators": [""],
    "missing_seperators": [""],
    "abbreviation": ["'"],
    "morphology-embeding": [(embeding_model, 0.702)],
}
alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    methods=methods
)
Results
[[(0, 0, 1, 'exact_match'),
  (1, 1, 0.8, 'ocr_replacables'),
  (2, 2, 1.0, 'synonym_simple_match'),
  (3, 3, 0.75, 'single_gematria_match'),
  (4, 4, 0.8280513747171923, 'morphology_embeding_match'),
  (5, 5, 0.8, 'missing_spaces_match'),
  (6, 5, 0.8, 'missing_spaces_match')]]
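The single_gematria_match entry pairs the letter ח with the number word שמונה (eight). As an illustration of the idea only, the sketch below computes standard Gematria letter values and compares them with a spelled-out number. GEMATRIA, NUMBER_WORDS, gematria_value and is_gematria_match are hypothetical names and not part of TRAligner, and the library's own matching logic may work differently.

```python
# Standard letter values; final forms share the value of their base letter.
GEMATRIA = {
    "א": 1, "ב": 2, "ג": 3, "ד": 4, "ה": 5, "ו": 6, "ז": 7, "ח": 8, "ט": 9,
    "י": 10, "כ": 20, "ך": 20, "ל": 30, "מ": 40, "ם": 40, "נ": 50, "ן": 50,
    "ס": 60, "ע": 70, "פ": 80, "ף": 80, "צ": 90, "ץ": 90,
    "ק": 100, "ר": 200, "ש": 300, "ת": 400,
}

# Hypothetical, deliberately tiny lookup of spelled-out numbers.
NUMBER_WORDS = {"אחת": 1, "שתיים": 2, "שלוש": 3, "שמונה": 8}


def gematria_value(token: str) -> int:
    """Sum the letter values, ignoring any character that is not a Hebrew letter."""
    return sum(GEMATRIA.get(ch, 0) for ch in token)


def is_gematria_match(letters: str, number_word: str) -> bool:
    """True when the letters' Gematria value equals the spelled-out number."""
    value = NUMBER_WORDS.get(number_word)
    return value is not None and gematria_value(letters) == value


print(gematria_value("ח"))              # 8
print(is_gematria_match("ח", "שמונה"))  # True
```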
License
MIT