Text Reuse Alignment for Hebrew and multi-language texts

TRAligner Documentation

TRAligner (Text Reuse Aligner) is a Python package for detecting and analyzing text reuse, optimized for Hebrew and other Semitic languages. It implements sequence alignment algorithms, including Smith-Waterman local alignment, to identify similar passages between a suspect text and a source text.

Overview
Features
Installation
Quick Start
Core Components
API Reference
Understanding Results
Advanced Usage
Examples
Dependencies
Performance Considerations

Overview

TRAligner is intended for academic research in text reuse detection, plagiarism detection, and comparative textual analysis. It offers several matching methods and supports:

Hebrew text processing with support for gematria, abbreviations, and number conversion
Multiple alignment algorithms including Smith-Waterman and custom matching methods
Flexible scoring systems with customizable parameters
Comprehensive output formats including DataFrames and HTML visualizations

Features

Core Alignment Features

Smith-Waterman Algorithm: Optimal local sequence alignment
Multi-method Matching: Combines multiple matching strategies
Gap Handling: Sophisticated gap penalty systems
Internal Word Swapping: Detects transpositions within alignment spans

Hebrew Language Support

Gematria Matching: Numerical value-based word comparison
Hebrew Number Conversion: Convert Hebrew text numbers to integers
Abbreviation Detection: Identify and expand Hebrew abbreviations
Orthographic Variations: Handle different spelling conventions
Final Letters (Sofiot): Manage Hebrew final letter variations

Advanced Matching Methods

Edit Distance: Levenshtein distance-based similarity
Stemming: Support for multiple languages including Greek
Embedding Similarity: Vector-based word similarity
LLM Integration: Large language model-based comparisons
Synonym Detection: Semantic similarity matching

Output and Visualization

DataFrame Integration: Pandas-compatible result structures
HTML Visualization: Rich web-based alignment display
Scoring Metrics: Comprehensive alignment quality assessment
Alignment Matrices: Detailed position-based analysis

Installation

Prerequisites

pip install numpy pandas python-Levenshtein hebrew-numbers

Package Structure

TRAligner/
├── __init__.py
├── text_alignment_clean.py    # Main alignment algorithms
├── alignment_tools.py         # Hebrew analysis tools
└── README.md                  # This documentation

Import

import TRAligner.text_alignment_clean as ta
from TRAligner import alignment_tools

Quick Start

Basic Alignment Example

import TRAligner.text_alignment_clean as ta

# Simple Hebrew text alignment
suspect_tokens = ["בראשית", "ברא", "אלהים"]
source_tokens = ["בראשית", "ברא", "אלוהים"]

alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    match_score=3,
    mismatch_score=1,
    methods={}
)

# Score the alignment
score, sequences = ta.alignmentScore(alignment_sequences)
print(f"Alignment score: {score}")

The Results:

# alignment_sequences will look like this:
[[(0, 0, 1, 'exact_match'),
  (1, 1, 1, 'exact_match'),
  (2, 2, 1, 'exact_match')]]

The alignment_sequences variable is a list of lists, where each inner list represents a local alignment between the two texts. Each local alignment is a list of tuples containing four elements:

Position in suspect text (0-indexed)
Position in source text (0-indexed)
Alignment score assigned to these tokens
Reason for alignment (matching method used)
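
To read an alignment as word pairs rather than indices, the two positions in each tuple can be used to index the original token lists. The helper below is a minimal sketch: `pairs_from_alignment` is a hypothetical name, not part of TRAligner, and it assumes only the four-element tuple layout described above. The sample alignment is written by hand in that layout, not produced by a live run.

```python
def pairs_from_alignment(alignment_sequences, suspect_tokens, source_tokens):
    """Map (suspect_pos, source_pos, score, method) tuples back to token pairs."""
    readable = []
    for sequence in alignment_sequences:
        readable.append([
            (suspect_tokens[sus_pos], source_tokens[src_pos], score, method)
            for sus_pos, src_pos, score, method in sequence
        ])
    return readable


# Hand-written sample data in the documented tuple format
sample_suspect = ["בראשית", "ברא", "אלהים"]
sample_source = ["בראשית", "ברא", "אלוהים"]
sample_alignment = [[(0, 0, 1, "exact_match"), (1, 1, 1, "exact_match")]]

for sequence in pairs_from_alignment(sample_alignment, sample_suspect, sample_source):
    for sus_word, src_word, score, method in sequence:
        print(f"{sus_word} <-> {src_word} (score={score}, method={method})")
```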

Core Components

1. Main Alignment Function

alignment(suspect_t, src_t, match_score=3, mismatch_score=1, methods={}, gap_score=1, minimum_alignment_size=2)

The primary function for performing text alignment between two token sequences.

Parameters:

suspect_t: List of tokens from the suspect text
src_t: List of tokens from the source text
match_score: Score for matching tokens (default: 3)
mismatch_score: Penalty for mismatching tokens (default: 1)
methods: Dictionary of matching methods and their parameters
gap_score: Penalty for gaps in alignment (default: 1)
minimum_alignment_size: Minimum length of valid alignments (default: 2)

Returns:

alignment_sequences: List of alignment sequences
df_alignment: Pandas DataFrame with detailed alignment information
suspect_matrix: Binary matrix indicating aligned positions in suspect text
source_matrix: Binary matrix indicating aligned positions in source text
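
The relationship between the alignment sequences and the two binary matrices can be illustrated with a short sketch. It assumes that a position is marked 1 whenever it appears in any alignment tuple. How TRAligner builds the matrices internally is not shown in this README, so `build_position_matrices` is a hypothetical helper, and it uses plain lists rather than arrays.

```python
def build_position_matrices(alignment_sequences, suspect_len, source_len):
    """Mark every suspect/source position that takes part in some alignment."""
    suspect_matrix = [0] * suspect_len
    source_matrix = [0] * source_len
    for sequence in alignment_sequences:
        for sus_pos, src_pos, _score, _method in sequence:
            suspect_matrix[sus_pos] = 1
            source_matrix[src_pos] = 1
    return suspect_matrix, source_matrix


# Hand-written sample: two local alignments over a 6-token suspect and 7-token source
sample_alignment = [
    [(0, 0, 1, "exact_match"), (1, 1, 1, "exact_match")],
    [(4, 5, 1, "exact_match"), (5, 6, 1, "exact_match")],
]
suspect_matrix, source_matrix = build_position_matrices(sample_alignment, 6, 7)
print(suspect_matrix)
print(source_matrix)
```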

2. Smith-Waterman Algorithm

smith_waterman(suspect_t, src_t, match_score=10, mismatch_score=1, methods={}, swap=False, gap_score=1, minimum_alignment_size=2)

Implements the classic Smith-Waterman algorithm for local sequence alignment.
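
For readers unfamiliar with the algorithm, the following is a minimal textbook sketch of the Smith-Waterman scoring recurrence over token lists. It is not TRAligner's implementation: it compares tokens by exact equality only, it omits traceback, and it assumes that `mismatch_score` and `gap_score` are subtracted as penalties. The parameter names mirror the documented signature, while `smith_waterman_score` itself is a hypothetical name.

```python
def smith_waterman_score(suspect_t, src_t, match_score=3, mismatch_score=1, gap_score=1):
    """Return the best local alignment score and the matrix cell where it ends."""
    rows, cols = len(suspect_t) + 1, len(src_t) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = suspect_t[i - 1] == src_t[j - 1]
            diag = H[i - 1][j - 1] + (match_score if same else -mismatch_score)
            up = H[i - 1][j] - gap_score      # gap in the source
            left = H[i][j - 1] - gap_score    # gap in the suspect
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return best, best_cell


score, end_cell = smith_waterman_score(["x", "a", "b", "c", "y"], ["a", "b", "c"])
print(score, end_cell)  # 9 (4, 3): three consecutive matches at 3 points each
```

Clamping each cell at 0 is what makes the alignment local: a poorly matching region resets the score instead of dragging down a later good span.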

3. Word Comparison Engine

compare_words(sus_t, src_t, loc_sus, loc_src, methods={})

Compares individual words using multiple matching strategies.

Supported Methods:

exact: Exact string matching
edit_distance: Levenshtein distance threshold
gematria: Hebrew numerical value matching
stemming: Root word comparison
embedding: Vector similarity
orthography: Spelling variation handling
sofiot: Hebrew final letter normalization
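
As an illustration of what final-letter normalization involves, the sketch below maps the five Hebrew final forms to their regular forms before comparing two words. It is an assumption about the general technique, not TRAligner's code; `normalize_sofiot` and `sofiot_equal` are hypothetical names.

```python
# Map each final (sofit) form to its regular form
SOFIOT_MAP = str.maketrans({
    "\u05da": "\u05db",  # final kaf   -> kaf
    "\u05dd": "\u05de",  # final mem   -> mem
    "\u05df": "\u05e0",  # final nun   -> nun
    "\u05e3": "\u05e4",  # final pe    -> pe
    "\u05e5": "\u05e6",  # final tsadi -> tsadi
})


def normalize_sofiot(token):
    """Replace final letter forms with their regular counterparts."""
    return token.translate(SOFIOT_MAP)


def sofiot_equal(token_a, token_b):
    """Compare two tokens while ignoring final-form differences."""
    return normalize_sofiot(token_a) == normalize_sofiot(token_b)


# alef-resh-final tsadi vs. alef-resh-tsadi
print(sofiot_equal("\u05d0\u05e8\u05e5", "\u05d0\u05e8\u05e6"))  # True
```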

API Reference

Hebrew Text Processing Functions

`hebtext2num(txt)`

Converts Hebrew text numbers to integers.

# Examples
ta.hebtext2num("שלושה")  # Returns: 3
ta.hebtext2num("עשרים")  # Returns: 20
ta.hebtext2num("מאה")    # Returns: 100

`is_abbreviation(token, get_spliter=False, indicator="'")`

Detects Hebrew abbreviations and optionally splits them.

# Examples
is_abbrev, tokens = ta.is_abbreviation("ר'משה", get_spliter=True)
# Returns: (True, ["ר", "משה"])

`replace_chars(exchange, replacables, s)`

Replaces characters in a string based on mapping rules.

Scoring and Analysis Functions

`alignmentScore(alignment_sequences, increment2one=0.3, decrement_gap=0.1, verbose=False, prune=0.0)`

Calculates comprehensive scores for alignment sequences.

Parameters:

increment2one: Bonus for consecutive alignments
decrement_gap: Penalty for gaps between alignments
prune: Minimum score threshold for inclusion

`word_edit_distance(tokens1, tokens2, mode='distance')`

Calculates edit distance between token sequences.

Modes:

'distance': Raw edit distance
'ratio': Normalized similarity ratio
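
The difference between the two modes can be sketched with a token-level Levenshtein distance. This is a standalone illustration, not the library function: `token_edit_distance` and `token_similarity_ratio` are hypothetical names, and normalizing by the length of the longer sequence is an assumption, so the library's 'ratio' mode may use a different formula.

```python
def token_edit_distance(tokens1, tokens2):
    """Levenshtein distance where each token counts as one symbol."""
    previous = list(range(len(tokens2) + 1))
    for i, tok1 in enumerate(tokens1, start=1):
        current = [i]
        for j, tok2 in enumerate(tokens2, start=1):
            cost = 0 if tok1 == tok2 else 1
            current.append(min(
                previous[j] + 1,         # delete tok1
                current[j - 1] + 1,      # insert tok2
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def token_similarity_ratio(tokens1, tokens2):
    """Similarity in [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(tokens1), len(tokens2))
    if longest == 0:
        return 1.0
    return 1.0 - token_edit_distance(tokens1, tokens2) / longest


print(token_edit_distance(["a", "b", "c"], ["a", "c"]))  # 1
print(round(token_similarity_ratio(["a", "b", "c"], ["a", "c"]), 3))  # 0.667
```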

Visualization Functions

`synopsis_2_html(src_t, df_suspect_alignment)`

Generates HTML visualization of alignments.

suspect_html, source_html = ta.synopsis_2_html(source_tokens, df_alignment)

`synopsis2htmlTable(text1_t, text2_t, align_sequenses)`

Creates HTML table representation of alignments.

Understanding Results

TRAligner provides multiple output formats that offer different perspectives on the alignment analysis. Understanding these results is crucial for effective text reuse detection and analysis.

1. Alignment Sequences

Structure: List of alignment sequences, where each sequence contains tuples of matched positions.

alignment_sequences = [
    [(sus_pos1, src_pos1, score1, method1), (sus_pos2, src_pos2, score2, method2), ...],
    [(sus_pos3, src_pos3, score3, method3), ...]
]

Real Example from TRAligner:

# Simple case:
[[(0, 0, 1, 'exact_match'),
  (1, 1, 1, 'exact_match'),
  (2, 2, 1, 'exact_match')]]

# Complex case with multiple matching methods:
[[(0, 0, 1, 'exact_match'),                    # Perfect match
  (1, 1, 0.8, 'ocr_replacables'),             # OCR correction: כרא → ברא
  (2, 2, 1.0, 'synonym_simple_match'),        # Abbreviation: ה' → אלוהים
  (3, 3, 0.75, 'single_gematria_match'),      # Gematria: ח (8) → שמונה
  (4, 4, 0.828, 'morphology_embeding_match'), # Morphological similarity
  (5, 5, 0.8, 'missing_spaces_match'),        # Missing space handling
  (6, 5, 0.8, 'missing_spaces_match')]]       # Continuation of missing space

Interpretation:

Each sequence represents a continuous alignment span
Each tuple represents a matched word pair:
- sus_pos: Position in suspect text (0-indexed)
- src_pos: Position in source text (0-indexed)
- score: Match confidence (0.0-1.0)
- method: Matching method used

Key Matching Methods:

'exact_match': Perfect string match (score = 1.0)
'ocr_replacables': OCR error correction (score ~0.8)
'synonym_simple_match': Synonym or abbreviation expansion (score = 1.0)
'single_gematria_match': Hebrew numerical value match (score ~0.75)
'morphology_embeding_match': Embedding-based similarity (score variable)
'missing_spaces_match': Word boundary error correction (score ~0.8)

2. Alignment DataFrame

Structure: Pandas DataFrame with detailed token-level information.

Column	Type	Description
`token`	str	The actual token text
`position`	int	Position in the suspect text sequence
`match`	float	Match score (0.0 = no match, 1.0 = perfect match)
`match_procesure`	str	Method used for matching
`suspect_pos`	int	Position in suspect text (-1 if unmatched)
`source_pos`	int	Position in source text (-1 if unmatched)

Example DataFrame:

    token  position  match match_procesure  suspect_pos  source_pos
0  בראשית         0   1.00           exact            0           0
1     ברא         1   1.00           exact            1           1
2   אלהים         2   1.00           exact            2           2
3      את         3   0.00            none           -1          -1
4   השמים         4   1.00           exact            4           4
5      את         5   0.00            none           -1          -1
6    הארץ         6   1.00           exact            6           6
7   והארץ         7   0.85    edit_distance            7           8
8    היתה         8   0.92        gematria            8          10

Key Insights from DataFrame:

High match scores (0.8-1.0): Strong evidence of text reuse
Medium scores (0.5-0.8): Possible paraphrasing or variations
Zero scores: Unique content or significant modifications
Method distribution: Shows which matching strategies were most effective

3. Position Matrices

Structure: Binary numpy arrays indicating aligned positions.

suspect_matrix = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]  # 1 = aligned, 0 = unaligned
source_matrix  = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

Interpretation:

Index: Position in the token sequence
Value 1: Token participates in an alignment
Value 0: Token is unaligned (unique content)

Usage Examples:

# Calculate alignment coverage
suspect_coverage = sum(suspect_matrix) / len(suspect_matrix)
source_coverage = sum(source_matrix) / len(source_matrix)

print(f"Suspect text alignment coverage: {suspect_coverage:.2%}")
print(f"Source text alignment coverage: {source_coverage:.2%}")

# Find unaligned regions
unaligned_suspect = [i for i, val in enumerate(suspect_matrix) if val == 0]
unaligned_source = [i for i, val in enumerate(source_matrix) if val == 0]
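
Individual unaligned indices are often easier to review when grouped into spans. The sketch below collects consecutive zeros into inclusive (start, end) ranges; `unaligned_runs` is a hypothetical helper that works on any sequence of 0/1 values, and the sample input is the illustrative `suspect_matrix` shown above.

```python
def unaligned_runs(matrix):
    """Return inclusive (start, end) index spans of consecutive zeros."""
    runs, start = [], None
    for i, val in enumerate(matrix):
        if val == 0 and start is None:
            start = i                      # a new unaligned run begins
        elif val != 0 and start is not None:
            runs.append((start, i - 1))    # the run ended at the previous index
            start = None
    if start is not None:                  # run extends to the end of the text
        runs.append((start, len(matrix) - 1))
    return runs


print(unaligned_runs([1, 1, 1, 0, 1, 0, 1, 1, 1, 0]))  # [(3, 3), (5, 5), (9, 9)]
```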

4. Scoring Results

Structure: Dictionary with detailed scoring information.

max_score, scored_sequences = ta.alignmentScore(alignment_sequences)

# scored_sequences structure:
{
    'sequence_0': {
        'score': 8.75,
        'subsequences': [
            {'start': 0, 'end': 4, 'score': 6.2, 'length': 4},
            {'start': 7, 'end': 9, 'score': 2.55, 'length': 2}
        ],
        'gaps': [{'start': 4, 'end': 7, 'penalty': 0.3}]
    }
}

Score Components:

Base Score: Sum of individual match scores
Consecutive Bonus: Added for uninterrupted alignments
Gap Penalty: Subtracted for breaks in alignment
Length Bonus: Reward for longer alignment spans
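
To make the interaction of these components concrete, here is a toy scorer. It is explicitly not TRAligner's formula, which this README does not spell out. It assumes that each adjacent pair of matches advancing by exactly one position in both texts earns `increment2one`, that each skipped position costs `decrement_gap`, and it leaves out any length bonus. `toy_sequence_score` is a hypothetical name.

```python
def toy_sequence_score(sequence, increment2one=0.3, decrement_gap=0.1):
    """Illustrative score: base sum + consecutive bonuses - gap penalties."""
    if not sequence:
        return 0.0
    total = float(sequence[0][2])
    for prev, curr in zip(sequence, sequence[1:]):
        total += curr[2]                       # base score of this match
        sus_step = curr[0] - prev[0]
        src_step = curr[1] - prev[1]
        if sus_step == 1 and src_step == 1:
            total += increment2one             # uninterrupted continuation
        else:
            skipped = max(sus_step, src_step, 1) - 1
            total -= decrement_gap * skipped   # penalty per skipped position
    return total


contiguous = [(0, 0, 1, "exact_match"), (1, 1, 1, "exact_match"), (2, 2, 1, "exact_match")]
gapped = [(0, 0, 1, "exact_match"), (1, 1, 1, "exact_match"), (4, 4, 1, "exact_match")]
print(round(toy_sequence_score(contiguous), 2))  # 3.6
print(round(toy_sequence_score(gapped), 2))      # 3.1
```

Under these assumptions the same three matches score lower when a gap separates them, which is the behavior the component list describes.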

Interpretation Guidelines:

High scores (>10): Strong evidence of direct copying
Medium scores (5-10): Likely paraphrasing or close similarity
Low scores (1-5): Weak similarity or coincidental matches
Very low scores (<1): Minimal or no meaningful similarity

5. HTML Visualization Output

Structure: Lists of HTML elements for web display.

suspect_html, source_html = ta.synopsis_2_html(source_tokens, df_alignment)

# Example output:
suspect_html = [
    '<span style="background-color: #90EE90;">בראשית</span>',  # High match
    '<span style="background-color: #FFB6C1;">היתה</span>',   # Medium match
    '<span>והארץ</span>'                                      # No match
]

Color Coding:

Green shades: Strong matches (score > 0.8)
Yellow/Orange: Medium matches (score 0.5-0.8)
Pink/Red: Weak matches (score 0.3-0.5)
No highlighting: Unmatched text
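
A score-to-color mapping of this kind could look like the sketch below. It is not the code behind `synopsis_2_html`: `score_to_span` is a hypothetical helper, the green and pink values are taken from the example output above, and the gold value for medium matches is an assumption.

```python
import html


def score_to_span(token, score):
    """Wrap a token in a <span>, colored by the thresholds listed above."""
    if score > 0.8:
        color = "#90EE90"   # light green: strong match
    elif score >= 0.5:
        color = "#FFD700"   # gold: medium match (assumed color)
    elif score >= 0.3:
        color = "#FFB6C1"   # light pink: weak match
    else:
        return f"<span>{html.escape(token)}</span>"  # unmatched, no highlight
    return f'<span style="background-color: {color};">{html.escape(token)}</span>'


for token, score in [("בראשית", 1.0), ("ברא", 0.6), ("אלהים", 0.0)]:
    print(score_to_span(token, score))
```

Escaping the token with `html.escape` keeps characters such as `<` or `&` in the source text from breaking the generated markup.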

6. Comprehensive Result Analysis

Coverage Analysis

def analyze_coverage(suspect_matrix, source_matrix, alignment_sequences):
    # Calculate basic coverage
    suspect_coverage = sum(suspect_matrix) / len(suspect_matrix)
    source_coverage = sum(source_matrix) / len(source_matrix)
    
    # Calculate alignment density
    total_alignments = sum(len(seq) for seq in alignment_sequences)
    avg_alignment_length = total_alignments / len(alignment_sequences) if alignment_sequences else 0
    
    # Find longest continuous alignment
    max_continuous = 0
    current_continuous = 0
    for val in suspect_matrix:
        if val == 1:
            current_continuous += 1
            max_continuous = max(max_continuous, current_continuous)
        else:
            current_continuous = 0
    
    return {
        'suspect_coverage': suspect_coverage,
        'source_coverage': source_coverage,
        'alignment_density': avg_alignment_length,
        'max_continuous_alignment': max_continuous
    }

Match Quality Distribution

def analyze_match_quality(df_alignment):
    matched_tokens = df_alignment[df_alignment['match'] > 0]
    
    # No early return: with no matches every count below is 0,
    # so callers always receive a dict of the same shape
    
    quality_distribution = {
        'perfect_matches': len(matched_tokens[matched_tokens['match'] == 1.0]),
        'strong_matches': len(matched_tokens[matched_tokens['match'] >= 0.8]),
        'medium_matches': len(matched_tokens[(matched_tokens['match'] >= 0.5) & 
                                           (matched_tokens['match'] < 0.8)]),
        'weak_matches': len(matched_tokens[matched_tokens['match'] < 0.5])
    }
    
    return quality_distribution

Method Effectiveness Analysis

def analyze_methods(df_alignment):
    matched_tokens = df_alignment[df_alignment['match'] > 0]
    method_stats = matched_tokens['match_procesure'].value_counts()
    method_scores = matched_tokens.groupby('match_procesure')['match'].mean()
    
    return {
        'method_frequency': method_stats.to_dict(),
        'method_average_scores': method_scores.to_dict()
    }

7. Interpretation Guidelines

Text Reuse Classification

Based on the combined results, you can classify text reuse as:

Direct Copying (High Confidence)

Coverage > 70%
Average match score > 0.9
Multiple perfect matches
Long continuous alignments

Close Paraphrasing (Medium-High Confidence)

Coverage 40-70%
Average match score 0.7-0.9
Mix of exact and edit-distance matches
Some gaps but clear structural similarity

Loose Similarity (Medium Confidence)

Coverage 20-40%
Average match score 0.5-0.7
Diverse matching methods
Fragmented alignments

Minimal Similarity (Low Confidence)

Coverage < 20%
Average match score < 0.5
Few scattered matches
May be coincidental

Red Flags and Validation

Single method dominance: If only one method produces matches, validate manually
Very short alignments: Multiple 1-2 word matches may be coincidental
Extremely high scores: Verify for potential exact duplicates
Inconsistent patterns: Mixed high/low scores may indicate selective copying

Advanced Usage

Advanced Matching Methods

# Complex Hebrew text alignment with multiple challenges:
# - Word boundary errors, typographical mistakes
# - Orthographic variations, Gematria differences
# - Use of synonyms and abbreviations

suspect_tokens = ["בראשית", "כרא", "ה'", "ח", "השמים", "ואת", "הארץ"]
source_tokens = ["בראשית", "ברא", "אלוהים", "שמונה", "השמיים", "ואתהארץ"]

# Configure comprehensive matching methods
methods = {
    "ortography": ["י", "ו"],                    # Handle orthographic variations
    "extra_seperators": [""],                    # Handle extra word separators
    "missing_seperators": [""],                  # Handle missing word separators
    "abbreviation": ["'"],                       # Handle Hebrew abbreviations
    "edit_distance": 0.7,                       # Edit distance threshold
    "gematria": True,                           # Hebrew numerical value matching
    "internal_swap": True                        # Allow word transpositions
}

alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    methods=methods
)

Advanced Results:

# This complex example produces sophisticated alignments:
[[(0, 0, 1, 'exact_match'),                    # "בראשית" matches exactly
  (1, 1, 0.8, 'ocr_replacables'),             # "כרא" → "ברא" (OCR correction)
  (2, 2, 1.0, 'synonym_simple_match'),        # "ה'" → "אלוהים" (abbreviation expansion)
  (3, 3, 0.75, 'single_gematria_match'),      # "ח" → "שמונה" (gematria: 8)
  (4, 4, 0.828, 'morphology_embeding_match'), # "השמים" → "השמיים" (morphological similarity)
  (5, 5, 0.8, 'missing_spaces_match'),        # "ואת" → "ואתהארץ" (missing space)
  (6, 5, 0.8, 'missing_spaces_match')]]       # "הארץ" → "ואתהארץ" (continuation)

Advanced Scoring

# Perform alignment
alignment_sequences, df_alignment, _, _ = ta.alignment(
    suspect_tokens, source_tokens, methods=methods
)

# Calculate detailed scores
max_score, scored_sequences = ta.alignmentScore(
    alignment_sequences,
    increment2one=0.3,    # Bonus for consecutive matches
    decrement_gap=0.1,    # Gap penalty
    verbose=True,         # Print detailed information
    prune=0.2            # Remove low-scoring sequences
)

print(f"Maximum alignment score: {max_score}")

Hebrew Language Analysis

from TRAligner.alignment_tools import HebAnalysis

# Initialize Hebrew analysis
heb_analyzer = HebAnalysis(
    txt="sample Hebrew text",
    compare_method="base"
)

# Use in alignment
methods = {
    "llm": heb_analyzer,
    "edit_distance": 0.7
}

Examples

Example 1: Basic Hebrew Text Alignment

import TRAligner.text_alignment_clean as ta

# Simple alignment example
suspect_tokens = ["בראשית", "ברא", "אלהים"]
source_tokens = ["בראשית", "ברא", "אלוהים"]

alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens, 
    source_tokens, 
    methods={}
)

# Score the alignment
score, sequences = ta.alignmentScore(alignment_sequences)
print(f"Alignment score: {score}")

# Results: 
# [[(0, 0, 1, 'exact_match'),
#   (1, 1, 1, 'exact_match'),
#   (2, 2, 1, 'exact_match')]]

Example 2: Advanced Multi-Method Alignment

# Complex text with multiple Hebrew-specific challenges
suspect_tokens = ["בראשית", "כרא", "ה'", "ח", "השמים", "ואת", "הארץ"]
source_tokens = ["בראשית", "ברא", "אלוהים", "שמונה", "השמיים", "ואתהארץ"]

# Comprehensive method configuration
methods = {
    "ortography": ["י", "ו"],           # Orthographic variations
    "extra_seperators": [""],           # Handle extra separators
    "missing_seperators": [""],         # Handle missing separators
    "abbreviation": ["'"],              # Hebrew abbreviations
    "edit_distance": 0.7,              # Edit distance matching
    "gematria": True,                   # Numerical value matching
    "internal_swap": True               # Word transpositions
}

alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens,
    source_tokens,
    methods=methods
)

# Complex results showing different matching methods:
# [[(0, 0, 1, 'exact_match'),
#   (1, 1, 0.8, 'ocr_replacables'),
#   (2, 2, 1.0, 'synonym_simple_match'),
#   (3, 3, 0.75, 'single_gematria_match'),
#   (4, 4, 0.828, 'morphology_embeding_match'),
#   (5, 5, 0.8, 'missing_spaces_match'),
#   (6, 5, 0.8, 'missing_spaces_match')]]

Example 3: Word Embedding Integration

# Using word embeddings for semantic similarity
import fasttext  # or any embedding model

# Initialize embedding model
embedding_model = fasttext.load_model("path/to/fasttext/model.bin")

# Configure methods with embeddings
methods = {
    "morphology-embeding": [(embedding_model, 0.702)],  # Embedding threshold
    "edit_distance": 0.7,
    "gematria": True,
    "orthography": ["י", "ו"]
}

suspect_tokens = ["בראשית", "כרא", "השמים"]
source_tokens = ["בראשית", "ברא", "השמיים"]

alignment_sequences, df_alignment, _, _ = ta.alignment(
    suspect_tokens, source_tokens, methods=methods
)

# Results will include embedding-based matches:
# [[(0, 0, 1, 'exact_match'),
#   (1, 1, 0.8, 'ocr_replacables'),
#   (2, 2, 0.828, 'morphology_embeding_match')]]

Example 4: Hebrew Number and Gematria Processing

# Test Hebrew number conversion
hebrew_numbers = ["אחד", "שנים", "שלושה", "עשרה", "עשרים"]

for heb_num in hebrew_numbers:
    numeric_value = ta.hebtext2num(heb_num)
    print(f"'{heb_num}' = {numeric_value}")

# Test gematria functionality
from hebrew_numbers import gematria_to_int

gematria_examples = ["יג", "כה", "לו"]
for gem in gematria_examples:
    value = gematria_to_int(gem)
    print(f"Gematria '{gem}' = {value}")

# Example of gematria matching in alignment
suspect_tokens = ["ח"]      # Gematria value: 8
source_tokens = ["שמונה"]   # Hebrew word for "eight"

methods = {"gematria": True}
alignment_sequences, _, _, _ = ta.alignment(suspect_tokens, source_tokens, methods=methods)
# Result: [(0, 0, 0.75, 'single_gematria_match')]
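
For reference, the standard additive gematria value of a word is the sum of its letter values. The sketch below implements only that basic scheme; `simple_gematria` is a hypothetical helper, and `gematria_to_int` from `hebrew_numbers` may handle conventions (such as thousands markers) that this sketch ignores.

```python
# Standard letter values; final forms share the value of the regular letter
GEMATRIA_VALUES = {
    "א": 1, "ב": 2, "ג": 3, "ד": 4, "ה": 5, "ו": 6, "ז": 7, "ח": 8, "ט": 9,
    "י": 10, "כ": 20, "ל": 30, "מ": 40, "נ": 50, "ס": 60, "ע": 70, "פ": 80, "צ": 90,
    "ק": 100, "ר": 200, "ש": 300, "ת": 400,
    "ך": 20, "ם": 40, "ן": 50, "ף": 80, "ץ": 90,
}


def simple_gematria(word):
    """Sum the letter values of a word, skipping characters without a value."""
    return sum(GEMATRIA_VALUES.get(ch, 0) for ch in word)


for word in ["יג", "כה", "לו", "ח"]:
    print(word, simple_gematria(word))  # 13, 25, 36, 8
```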

Example 5: Abbreviation Detection

# Hebrew abbreviations
abbreviations = ["ר'משה", "ד'ברים", "בעה'ב"]

for abbrev in abbreviations:
    is_abbrev, tokens = ta.is_abbreviation(abbrev, get_spliter=True)
    print(f"'{abbrev}' -> Abbreviation: {is_abbrev}, Tokens: {tokens}")

# Example of abbreviation matching in alignment
suspect_tokens = ["ה'"]        # Abbreviation for God
source_tokens = ["אלוהים"]     # Full word for God

methods = {"abbreviation": ["'"]}
alignment_sequences, _, _, _ = ta.alignment(suspect_tokens, source_tokens, methods=methods)
# Result: [(0, 0, 1.0, 'synonym_simple_match')]

Example 6: Complete Result Analysis Pipeline

import TRAligner.text_alignment_clean as ta
import numpy as np
import pandas as pd

# Sample texts for comprehensive analysis
suspect = "בראשית ברא אלהים את השמים ואת הארץ והארץ היתה תהו ובהו"
source = "בראשית ברא אלהים את השמים ואת הארץ והארץ הייתה תהו ובהו וחושך על פני תהום"

suspect_tokens = suspect.split()
source_tokens = source.split()

# Comprehensive method configuration
methods = {
    "edit_distance": 0.7,
    "gematria": True,
    "internal_swap": True,
    "stemming": True,
    "orthography": True
}

# Perform alignment
print("🔍 Performing alignment analysis...")
alignment_sequences, df_alignment, suspect_matrix, source_matrix = ta.alignment(
    suspect_tokens, source_tokens, 
    match_score=4, 
    gap_score=1, 
    methods=methods
)

# 1. Basic Statistics
print(f"\n📊 BASIC ALIGNMENT STATISTICS")
print(f"Alignment sequences found: {len(alignment_sequences)}")
print(f"Total tokens in suspect: {len(suspect_tokens)}")
print(f"Total tokens in source: {len(source_tokens)}")

# 2. Coverage Analysis
suspect_coverage = sum(suspect_matrix) / len(suspect_matrix)
source_coverage = sum(source_matrix) / len(source_matrix)
print(f"\n📈 COVERAGE ANALYSIS")
print(f"Suspect text coverage: {suspect_coverage:.1%} ({sum(suspect_matrix)}/{len(suspect_matrix)} tokens)")
print(f"Source text coverage: {source_coverage:.1%} ({sum(source_matrix)}/{len(source_matrix)} tokens)")

# 3. Match Quality Distribution
# Defaults keep steps 4 and 7 safe when there is no DataFrame or nothing matched
matched_tokens = None
avg_score = 0.0
if df_alignment is not None and len(df_alignment) > 0:
    matched_tokens = df_alignment[df_alignment['match'] > 0]
    print(f"\n🎯 MATCH QUALITY DISTRIBUTION")
    print(f"Total matched tokens: {len(matched_tokens)}")
    
    if len(matched_tokens) > 0:
        perfect_matches = len(matched_tokens[matched_tokens['match'] == 1.0])
        strong_matches = len(matched_tokens[matched_tokens['match'] >= 0.8])
        medium_matches = len(matched_tokens[(matched_tokens['match'] >= 0.5) & 
                                          (matched_tokens['match'] < 0.8)])
        weak_matches = len(matched_tokens[matched_tokens['match'] < 0.5])
        
        print(f"Perfect matches (1.0): {perfect_matches}")
        print(f"Strong matches (≥0.8): {strong_matches}")
        print(f"Medium matches (0.5-0.8): {medium_matches}")
        print(f"Weak matches (<0.5): {weak_matches}")
        
        avg_score = matched_tokens['match'].mean()
        print(f"Average match score: {avg_score:.3f}")

# 4. Method Effectiveness
if matched_tokens is not None and len(matched_tokens) > 0:
    method_stats = matched_tokens['match_procesure'].value_counts()
    method_scores = matched_tokens.groupby('match_procesure')['match'].mean()
    
    print(f"\n🔧 METHOD EFFECTIVENESS")
    for method in method_stats.index:
        count = method_stats[method]
        # Separate name so the overall avg_score used in step 7 is not overwritten
        method_avg = method_scores[method]
        print(f"{method}: {count} matches (avg score: {method_avg:.3f})")

# 5. Alignment Sequence Analysis
print(f"\n🔗 ALIGNMENT SEQUENCE DETAILS")
for i, seq in enumerate(alignment_sequences):
    print(f"\nSequence {i+1}: {len(seq)} token pairs")
    for sus_pos, src_pos, score, method in seq[:3]:  # Show first 3 pairs
        sus_word = suspect_tokens[sus_pos]
        src_word = source_tokens[src_pos]
        print(f"  '{sus_word}' ↔ '{src_word}' (score: {score:.3f}, method: {method})")
    if len(seq) > 3:
        print(f"  ... and {len(seq)-3} more pairs")

# 6. Scoring Analysis
if alignment_sequences:
    max_score, scored_sequences = ta.alignmentScore(
        alignment_sequences, 
        increment2one=0.3, 
        decrement_gap=0.1, 
        verbose=False
    )
    
    print(f"\n🏆 SCORING ANALYSIS")
    print(f"Maximum alignment score: {max_score:.2f}")
    print(f"Number of scored sequences: {len(scored_sequences)}")
    
    for seq_id, seq_data in list(scored_sequences.items())[:2]:  # Show top 2
        print(f"\nSequence {seq_id}:")
        print(f"  Total score: {seq_data['score']:.2f}")
        print(f"  Subsequences: {len(seq_data['subsequences'])}")

# 7. Text Reuse Classification
print(f"\n🎯 TEXT REUSE CLASSIFICATION")
if suspect_coverage >= 0.7 and avg_score >= 0.9:
    classification = "DIRECT COPYING (High Confidence)"
elif suspect_coverage >= 0.4 and avg_score >= 0.7:
    classification = "CLOSE PARAPHRASING (Medium-High Confidence)"
elif suspect_coverage >= 0.2 and avg_score >= 0.5:
    classification = "LOOSE SIMILARITY (Medium Confidence)"
else:
    classification = "MINIMAL SIMILARITY (Low Confidence)"

print(f"Classification: {classification}")

# 8. Detailed Token-by-Token Analysis
print(f"\n📝 DETAILED TOKEN ANALYSIS")
print("Suspect Text with Alignment Status:")
for i, token in enumerate(suspect_tokens):
    status = "✓" if suspect_matrix[i] == 1 else "✗"
    if df_alignment is not None and i < len(df_alignment):
        match_score = df_alignment.iloc[i]['match'] if df_alignment.iloc[i]['match'] > 0 else 0
        print(f"  {status} {i:2d}: '{token}' (score: {match_score:.2f})")
    else:
        print(f"  {status} {i:2d}: '{token}'")

print("\nSource Text with Alignment Status:")
for i, token in enumerate(source_tokens):
    status = "✓" if source_matrix[i] == 1 else "✗"
    print(f"  {status} {i:2d}: '{token}'")

# 9. Generate HTML Visualization
try:
    suspect_html, source_html = ta.synopsis_2_html(source_tokens, df_alignment)
    print(f"\n🎨 HTML VISUALIZATION")
    print("HTML elements generated successfully")
    print(f"Suspect HTML elements: {len(suspect_html)}")
    print(f"Source HTML elements: {len(source_html)}")
    
    # Show sample HTML
    print("\nSample HTML output (first 3 tokens):")
    for i, (sus, src) in enumerate(zip(suspect_html[:3], source_html[:3])):
        print(f"  {i}: Suspect: {sus}")
        print(f"     Source:  {src}")
        
except Exception as e:
    print(f"HTML generation error: {e}")

print(f"\n✅ Analysis complete!")

Expected Output Interpretation:

High coverage + high scores: Strong evidence of text reuse
Method diversity: Multiple methods confirm matches (more reliable)
Continuous alignments: Better evidence than scattered matches
HTML visualization: Provides intuitive visual confirmation

Dependencies

Required Packages

# Core dependencies
import numpy as np           # Numerical computations
import pandas as pd          # Data manipulation
import Levenshtein as lev   # Edit distance calculations
import math, re             # Mathematical and regex operations

# Hebrew language support
from hebrew_numbers import gematria_to_int  # Gematria calculations

# Extended functionality
import TRelasticExt as ee   # Elastic search extensions

Optional Dependencies

# For advanced Hebrew analysis
from transformers import AutoModel, AutoTokenizer  # HuggingFace models

# For Greek text processing
from greek_stemmer import GreekStemmer  # Greek language stemming

Performance Considerations

Optimization Tips

Token Preprocessing: Clean and normalize tokens before alignment
Method Selection: Choose appropriate matching methods for your use case
Score Thresholds: Adjust thresholds to balance precision and recall
Sequence Length: Consider breaking long texts into smaller segments

Memory Usage

Large texts may require significant memory for score matrices
Consider processing in chunks for very long documents
Use pruning to remove low-scoring alignments
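
Chunked processing can be sketched as a sliding window over the token list. The helper below is hypothetical (`chunk_tokens` is not part of TRAligner) and its default sizes are arbitrary. The overlap is there so that a reused passage straddling a chunk boundary still falls entirely inside at least one chunk, and the returned offsets let chunk-local positions be mapped back to the full text.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split tokens into overlapping windows; returns (offset, chunk) pairs."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


for offset, chunk in chunk_tokens(list(range(10)), chunk_size=4, overlap=1):
    print(offset, chunk)
# 0 [0, 1, 2, 3]
# 3 [3, 4, 5, 6]
# 6 [6, 7, 8, 9]
```

Each chunk can then be aligned separately, adding the chunk's offset to the positions in the resulting tuples.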

Speed Optimization

# Fast alignment for large texts
methods = {
    "edit_distance": 0.8,  # Higher threshold = fewer comparisons
    "internal_swap": False,  # Disable for speed
    "gematria": False      # Disable if not needed
}

# Use minimum alignment size to filter short matches
alignment_sequences, _, _, _ = ta.alignment(
    suspect_tokens, source_tokens,
    methods=methods,
    minimum_alignment_size=3  # Only consider alignments of 3+ tokens
)

Error Handling

Common Issues and Solutions

Import Errors: Ensure all dependencies are installed

try:
    import TRAligner.text_alignment_clean as ta
except ImportError as e:
    print(f"TRAligner import failed: {e}")

Hebrew Processing Errors: Check hebrew_numbers package

try:
    from hebrew_numbers import gematria_to_int
    gematria_available = True
except ImportError:
    gematria_available = False
    print("Hebrew gematria functions not available")

Empty Alignment Results: Adjust matching thresholds

if len(alignment_sequences) == 0:
    print("No alignments found. Try adjusting thresholds:")
    print("- Lower edit_distance threshold")
    print("- Decrease minimum_alignment_size")
    print("- Enable more matching methods")

Contributing

TRAligner is designed for research in text reuse detection. For contributions or issues:

Ensure compatibility with Hebrew text processing
Maintain performance for large-scale analysis
Follow the established API patterns
Include comprehensive test cases

Citation

If you use TRAligner in your research, please cite our paper:

@article{miller2024text,
  title={Text Alignment in the Service of Text Reuse Detection},
  author={Miller, Hadar and Kuflik, Tsvi and Lavee, Moshe},
  journal={Applied Sciences},
  volume={15},
  number={6},
  pages={3395},
  year={2025},
  publisher={MDPI},
  doi={10.3390/app15063395},
  url={https://www.mdpi.com/2076-3417/15/6/3395}
}

Miller, H.; Kuflik, T.; Lavee, M. Text Alignment in the Service of Text Reuse Detection. Applied Sciences 2025, 15(6), 3395. https://doi.org/10.3390/app15063395

License

This package is developed for academic research purposes. Please cite appropriately when using in publications.

Version History

Current: Advanced Hebrew text alignment with multiple matching methods
Features: Smith-Waterman algorithm, gematria support, HTML visualization
Optimization: Performance improvements for large-scale text analysis

For more examples and advanced usage, see the accompanying Jupyter notebook: TRAligner_test.ipynb

Release history

0.2.3: Jan 14, 2026
0.2.2: Jan 14, 2026
0.2.1: Jan 14, 2026
0.2.0: Nov 6, 2025 (this version)
0.1.0: Jun 12, 2025

File details

Details for the file traligner-0.2.0.tar.gz.

File metadata

Download URL: traligner-0.2.0.tar.gz
Upload date: Nov 6, 2025
Size: 52.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for traligner-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`c0030dc2bd54aeb97750dc0c667982ac2da445983c24687b65f73042eb0b1ab8`
MD5	`b5f49d4f14ec9b22a2eee8bc82f00ade`
BLAKE2b-256	`1c45b1662bd5799d87fa90864f6ee6ff196de3fbda927f9b9a7a38a901a10403`

File details

Details for the file traligner-0.2.0-py3-none-any.whl.

File metadata

Download URL: traligner-0.2.0-py3-none-any.whl
Upload date: Nov 6, 2025
Size: 28.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for traligner-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0601bc135924ef3bde4283413bf2f294f341975b8ee22ae0ba64ad1aab22c37c`
MD5	`353d5ad66f8b1c82772db7fef2312432`
BLAKE2b-256	`ef0559d75ec3b1f2fb4ab32fc67d17da35bf14f98e48ee0470d5c647e15f3230`
