
A pipeline tool for performing customized text alignment procedures


License: MIT

The Text Alignment Tool

This Python text alignment tool is intended to be a general purpose tool for aligning texts in a robust and easily extensible way. It tracks any changes to the original text so that there is an end-to-end mapping of the alignment data.

Architecture

Figure: Diagram of Alignment Tool Pipeline Structure

  1. The alignment tool consists of a main class TextAlignmentTool, which coordinates the alignment process.
  2. The alignment tool receives a single TextLoader for the query text and a single TextLoader for the target text. Each TextLoader must keep track of the mapping from its original input text(s) to the output it passes to the rest of the pipeline.
  3. The alignment tool is then fed any number of TextTransformers for each text and any number of AlignmentAlgorithms. These can be used in any combination and order. For example, the query text could pass through 3 TextTransformers and the target text through 1 TextTransformer before both go through a single AlignmentAlgorithm; the target text could then pass through 2 more TextTransformers, followed by a final AlignmentAlgorithm on the pair of texts.
  4. find_alignment_to_query and find_alignment_to_target will backtrack through the text mappings and provide a key for mapping either the query to the target or the target to the query.

A somewhat basic alignment process could look something like this:

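# Assumes the loader, transformer, and algorithm classes used below have been imported,
# and that QUERY_TEMP_FOLDER and TARGET_TEMP_FOLDER are pathlib.Path objects.

# Create text loaders for query and target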
query_loader = PgpAltoXMLTextLoader(list(QUERY_TEMP_FOLDER.glob("**/*.xml")))
target_loader = PgpXmlTeiTextLoader(list(TARGET_TEMP_FOLDER.glob("**/*.xml")))

# Create the alignment tool
aligner = TextAlignmentTool(query_loader, target_loader)

# Perform three transformation operations on the target
normalize_target_sigla = PgpTeiNormalizeSiglaTransformer()
remove_target_extras = PgpTeiRemoveExtrasTransformer()
relocate_insertions = PgpTeiRelocateInsertionsTransformer()

aligner.target_text_transforms(
    [normalize_target_sigla, remove_target_extras, relocate_insertions]
)

# Create and run one alignment process
first_alignment_algorithm = LineAlignmentAlgorithm()
aligner.align_text(first_alignment_algorithm)

# Get the mapping information for the alignment
alignment_mappings = aligner.latest_alignment

Functionality

Text changes and mappings to aligned text are tracked with a system of index maps. The TextLoader ingests the input text and outputs a 1-dimensional numpy uint32 array containing one number for each character of the input text, in the order the characters occur (the number is simply the Unicode code point of the character, as returned by Python's ord function).

Text Loader

For example, let's imagine we have our initial text in a simple text file, and we will assume the line breaks are significant for the alignment process:

Lorem ipsum dolor sit amet 
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

We would write a simple loader for this to ingest the text and preserve a record of the line breaks:

from text_alignment_tool import TextChunk
import numpy as np

text = """Lorem ipsum dolor sit amet 
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"""


def parse_text(text: str) -> tuple[list[tuple[int, int]], list[TextChunk], np.ndarray]:
    input_output_map: list[tuple[int, int]] = []
    text_chunk_indices: list[TextChunk] = []
    output_text: list[int] = []

    text_chunk_start_idx = 0
    for input_idx, char in enumerate(text):
        # Map every input character (newlines included) to the current output position
        output_idx = len(output_text)
        input_output_map.append((input_idx, output_idx))
        if char == "\n":
            # A line break closes the current chunk; the newline is not written to the output
            text_chunk_indices.append(TextChunk(text_chunk_start_idx, output_idx))
            text_chunk_start_idx = output_idx + 1
            continue
        # Store the character as its Unicode code point
        output_text.append(ord(char))

    # Text after the last line break is not recorded as a chunk by this loop
    return input_output_map, text_chunk_indices, np.array(output_text, dtype=np.uint32)

input_output_map, text_chunk_indices, output_text = parse_text(text)

# Inspect the results
print(input_output_map[25:30])
print(text_chunk_indices)
print(output_text[0:5]) 

# Deserialize text
print(''.join([chr(x) for x in output_text[0:5]]))

Output:

[(25, 25), (26, 26), (27, 27), (28, 27), (29, 28)]
[TextChunk(start_idx=0, end_idx=27), TextChunk(start_idx=28, end_idx=54)]
[ 76 111 114 101 109]
Lorem

When creating a custom text loader, subclass TextLoader and make sure it populates self._output, self._input_output_map, and self._text_chunk_indices. You can change the __init__() method to take whatever arguments you need, and otherwise adapt the class as needed to perform the parsing operation. It is also useful to give the custom TextLoader a method that rebuilds text in the input format from the data produced by the alignment operation.

Text Transformer

The output of the TextLoader may be exactly what is needed for the alignment process, but often it will be necessary to perform other alterations such as stripping out unnecessary characters, performing some rule-based character conversions, or refining the text chunks. Any number of TextTransformers can be used in series to accomplish this. Narrowly focused TextTransformers are easier to debug and easier to mix and match as needed to achieve the desired alignment.

When passing a text through a TextTransformer, the transformer must use its _input_output_map to track how it has changed the input. For instance, if we wanted to create a transformer to remove the word "the", we might start with a text input like "the quick brown dog jumped over the lazy fox.", which in the alignment tool is: [116, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]

The output from the TextTransformer would be: [113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]

And the _input_output_map would show the mappings from the index of the input array to the index of the output array: [(4,0),(5,1),(6,2),(7,3), ...]

input val    map input idx to output idx    output val
113          (4,0)                          113
117          (5,1)                          117
105          (6,2)                          105
99           (7,3)                          99
107          (8,4)                          107
32           (9,5)                          32
...          ...                            ...
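
The removal example above can be sketched as a standalone function. This is only an illustration under stated assumptions: remove_word is a hypothetical helper, not part of the TextTransformer API, plain Python lists stand in for the uint32 numpy arrays, and the matching is deliberately naive (it would also strip "the " from the end of a longer word such as "bathe ").

```python
def remove_word(
    text: list[int], word: str
) -> tuple[list[int], list[tuple[int, int]]]:
    """Drop every occurrence of `word` plus one following space (hypothetical helper)."""
    pattern = [ord(c) for c in word + " "]
    output: list[int] = []
    input_output_map: list[tuple[int, int]] = []
    i = 0
    while i < len(text):
        if text[i : i + len(pattern)] == pattern:
            # Skip the match; removed characters get no entry in the map
            i += len(pattern)
            continue
        # Record where this surviving input character lands in the output
        input_output_map.append((i, len(output)))
        output.append(text[i])
        i += 1
    return output, input_output_map


sentence = "the quick brown dog jumped over the lazy fox."
encoded = [ord(c) for c in sentence]
stripped, mapping = remove_word(encoded, "the")

print("".join(chr(x) for x in stripped))  # quick brown dog jumped over lazy fox.
print(mapping[:4])  # [(4, 0), (5, 1), (6, 2), (7, 3)]
```

Characters that were removed simply have no entry in the map, so any later lookup has to allow for input indices with no output counterpart.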

Changing the order of individual elements in the list is also possible, for instance for the same input above we could instead have the output: [98, 114, 111, 119, 110, 32, 113, 117, 105, 99, 107, 32, 116, 104, 101, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]

The words "the" and "brown" have been transposed, and the resulting _input_output_map would be: [(10,0),(11,1),(12,2),(13,3),(14,4),(3,5),(4,6),(5,7),(6,8),(7,9),(8,10),(9,11),(0,12),(1,13),(2,14),(15,15), ...]
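
As a quick check of the reordering map, the sketch below applies it to the encoded sentence. The helper apply_map is hypothetical, plain lists stand in for the uint32 arrays, and the tail of the map (output index 15 onward) is assumed to be the identity, which is what the output array above implies.

```python
def apply_map(text: list[int], input_output_map: list[tuple[int, int]]) -> list[int]:
    """Build the output by copying each input value to its mapped output index.

    Assumes every output index appears exactly once in the map.
    """
    output = [0] * len(input_output_map)
    for input_idx, output_idx in input_output_map:
        output[output_idx] = text[input_idx]
    return output


sentence = "the quick brown dog jumped over the lazy fox."
encoded = [ord(c) for c in sentence]

# Head of the map as listed above
head = [
    (10, 0), (11, 1), (12, 2), (13, 3), (14, 4), (3, 5), (4, 6), (5, 7),
    (6, 8), (7, 9), (8, 10), (9, 11), (0, 12), (1, 13), (2, 14),
]
# Assumption: everything from index 15 onward is left in place
tail = [(i, i) for i in range(15, len(encoded))]

reordered = apply_map(encoded, head + tail)
print("".join(chr(x) for x in reordered))  # brown quick the dog jumped over the lazy fox.
```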

The TextTransformer may also redefine text chunks with the _text_chunk_indices property, which is a simple ordered list of starting + ending indices that define n sections of the output text (you may use overlapping sections if desired), e.g., [(0,20),(21,35),(30,91)] with three chunks of the text using indices: 0–20, 21–35, and 30–91.
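
A minimal sketch of reading text back out of such chunk indices follows. It assumes that both the start and the end index are inclusive, which is how the ranges "0–20, 21–35" read; check this against the library before relying on it. The helper chunk_texts is hypothetical, chunks are written as plain (start, end) tuples, and plain lists stand in for the uint32 arrays.

```python
def chunk_texts(text: list[int], chunks: list[tuple[int, int]]) -> list[str]:
    """Decode each (start, end) chunk, treating both indices as inclusive (assumption)."""
    return ["".join(chr(x) for x in text[start : end + 1]) for start, end in chunks]


sentence = "the quick brown dog jumped over the lazy fox."
encoded = [ord(c) for c in sentence]

# Three chunks; the last two overlap on the word "dog"
chunks = [(0, 8), (10, 18), (16, 25)]
print(chunk_texts(encoded, chunks))  # ['the quick', 'brown dog', 'dog jumped']
```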

Alignment Algorithm

The AlignmentAlgorithm class can be subclassed to perform analysis of both the query and target text at the same time. Any number of such classes may be used at any place within the alignment pipeline. The AlignmentAlgorithm will always receive a self._query and a self._target property, both of which are provided automatically to it by the TextAlignmentTool from the output of the latest transformation of the query and target texts. It will also automatically receive the latest _text_chunk_indices for the query and for the target as self._input_query_text_chunk_indices and self._input_target_text_chunk_indices.

An AlignmentAlgorithm will produce a mapping in the _alignment property. As a simplified example, query = ['h','e','l','l','o',' ','w','o','r','l','d'] and target = ['h','e','l','l','o',' ','w','a','d','d'] (in the actual system these would be arrays of uint32 values, not strings) could be aligned as [(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(10,9)] (for Wadd, see https://en.wikipedia.org/wiki/Wadd).
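
To make the shape of this alignment concrete, the following sketch reads the pairs back against the two texts and lists the positions left unaligned. It only consumes the example alignment given above and does not compute one; the helper unaligned is hypothetical, and plain lists stand in for the uint32 arrays.

```python
query = [ord(c) for c in "hello world"]
target = [ord(c) for c in "hello wadd"]

# The example alignment: (query index, target index) pairs
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (10, 9)]


def unaligned(length: int, aligned_indices: set[int]) -> list[int]:
    """Return the indices in range(length) that no alignment pair refers to."""
    return [i for i in range(length) if i not in aligned_indices]


# Every aligned pair in this example points at the same character in both texts
pairs = [(chr(query[q]), chr(target[t])) for q, t in alignment]
print(pairs[-1])  # ('d', 'd')

unaligned_query = unaligned(len(query), {q for q, _ in alignment})
unaligned_target = unaligned(len(target), {t for _, t in alignment})
print(unaligned_query)   # [7, 8, 9]
print(unaligned_target)  # [7, 8]
```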

An AlignmentAlgorithm can also be used to redefine text chunks based on mutual analysis of the query and target texts. That is, it may be used both for gross alignment, defining possibly corresponding text chunks with the properties _output_query_text_chunk_indices and _output_target_text_chunk_indices, and for fine-grained alignment using the _alignment property, which is simply a list of the corresponding character indices in the query and target texts.

Alignment Operation Tracking

The TextAlignmentTool automatically keeps track of the order of operations and the transforms that have been performed in the __operation_list property, which contains a list of AlignmentOperations. This simplifies peeking in on any part of the alignment process for debugging purposes and also enables custom mappings between query and target.

The convenience methods find_alignment_to_query and find_alignment_to_target let you walk the alignments and transforms back to the initial input provided by the TextLoader. You will need to provide your own function within the TextLoader to transform the aligned text into your desired format.
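
The idea of walking an index back through a chain of maps can be sketched as follows. This is not the implementation of find_alignment_to_query or find_alignment_to_target; backtrack_index is a hypothetical helper, and it assumes each output index appears at most once in every map.

```python
from typing import Optional


def backtrack_index(
    output_idx: int, maps: list[list[tuple[int, int]]]
) -> Optional[int]:
    """Walk an index in the final output back to the original input.

    `maps` holds one input_output_map per operation, first operation first.
    Returns None if the index has no counterpart at some stage.
    """
    idx = output_idx
    for input_output_map in reversed(maps):
        # Invert the map for this stage: output index -> input index
        lookup = {out_idx: in_idx for in_idx, out_idx in input_output_map}
        if idx not in lookup:
            return None
        idx = lookup[idx]
    return idx


original = "the quick fox"
# Stage 1 removes "the ": "the quick fox" -> "quick fox"
stage_1 = [(i + 4, i) for i in range(9)]
# Stage 2 removes the remaining space: "quick fox" -> "quickfox"
stage_2 = [(i, i) for i in range(5)] + [(i + 1, i) for i in range(5, 8)]

origin = backtrack_index(5, [stage_1, stage_2])
print(origin, original[origin])  # 10 f
```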

Debugging Help

When you use the TextAlignmentTool in a debugging context, it will inject an instance of the DebugHelper class into the global context as dbg. This helper provides four convenience methods to inspect your alignment pipeline: dbg.display_text, dbg.display_text_chunk, dbg.display_text_chunks, and dbg.display_text_region. These methods output the human-readable text for the internal uint32 numpy array representation of the text and can also extract specified ranges and text chunks.

