A pipeline tool for performing customized text alignment procedures

License: MIT

The Text Alignment Tool

This Python text alignment tool is intended to be a general purpose tool for aligning texts in a robust and easily extensible way. It tracks any changes to the original text so that there is an end-to-end mapping of the alignment data.

Architecture

[Figure: Diagram of Alignment Tool Pipeline Structure]

  1. The alignment tool consists of a main class TextAlignmentTool, which coordinates the alignment process.
  2. The alignment tool receives a single TextLoader for the query text and a single TextLoader for the target text. Each TextLoader must keep track of the mapping between its original input text(s) and the output it passes to the rest of the pipeline.
  3. The alignment tool is then fed n TextTransformers for each text and n AlignmentAlgorithms. These can be used in any combination and order. For example, the query text could pass through 3 TextTransformers and the target text through 1 TextTransformer before both go through a single AlignmentAlgorithm; the target text could then pass through 2 more TextTransformers, followed by a final AlignmentAlgorithm on the pair of texts.
  4. find_alignment_to_query and find_alignment_to_target will backtrack through the text mappings and provide a key for mapping either the query to the target or the target to the query.

A somewhat basic alignment process could look something like this:

# Create text loaders for query and target
query_loader = PgpAltoXMLTextLoader(list(QUERY_TEMP_FOLDER.glob("**/*.xml")))
target_loader = PgpXmlTeiTextLoader(list(TARGET_TEMP_FOLDER.glob("**/*.xml")))

# Create the alignment tool
aligner = TextAlignmentTool(query_loader, target_loader)

# Perform three transformation operations on the target
normalize_target_sigla = PgpTeiNormalizeSiglaTransformer()
remove_target_extras = PgpTeiRemoveExtrasTransformer()
relocate_insertions = PgpTeiRelocateInsertionsTransformer()

aligner.target_text_transforms(
    [normalize_target_sigla, remove_target_extras, relocate_insertions]
)

# Create and run one alignment process
first_alignment_algorithm = LineAlignmentAlgorithm()
aligner.align_text(first_alignment_algorithm)

# Get the mapping information for the alignment
alignment_mappings = aligner.latest_alignment

Functionality

Text changes and mappings to the aligned text are tracked with a system of index maps. The TextLoader ingests the input text and outputs a 1-dimensional numpy uint32 array containing one number for each character of the input text, in the order the characters occur. Each number is simply the Unicode code point of the character, as returned by Python's ord function.
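The representation described above can be sketched in a few lines. This is only an illustration: encode_text and decode_text are hypothetical helper names and are not part of the tool's API. It shows that a non-ASCII character still occupies exactly one array element, because the array stores code points rather than bytes.

```python
import numpy as np


def encode_text(text: str) -> np.ndarray:
    """Encode a string as a 1-D uint32 array of Unicode code points."""
    return np.array([ord(char) for char in text], dtype=np.uint32)


def decode_text(codes: np.ndarray) -> str:
    """Turn an array of code points back into a string."""
    return "".join(chr(int(code)) for code in codes)


# "caf\u00e9" ends with a precomposed e-acute (code point 233)
encoded = encode_text("caf\u00e9")
print(encoded.dtype, encoded.shape)
print(decode_text(encoded))
```

Because every character maps to one element, positions in the array can be used directly as character indices in the index maps.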

Text Loader

For example, let's imagine we have our initial text in a simple text file, and we will assume the line breaks are significant for the alignment process:

Lorem ipsum dolor sit amet 
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua

We would write a simple loader for this to ingest the text and preserve a record of the line breaks:

from text_alignment_tool import TextChunk
import numpy as np

text = """Lorem ipsum dolor sit amet 
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"""


def parse_text(text: str) -> tuple[list[tuple[int, int]], list[TextChunk], np.ndarray]:
    input_output_map: list[tuple[int, int]] = []
    text_chunk_indices: list[TextChunk] = []
    output_text: list[int] = []

    text_chunk_start_idx = 0
    for input_idx, char in enumerate(text):
        # Index the next kept character will occupy in the output
        output_idx = len(output_text)
        input_output_map.append((input_idx, output_idx))
        if char == "\n":
            # A line break closes the current chunk and is not copied to the output
            text_chunk_indices.append(TextChunk(text_chunk_start_idx, output_idx))
            text_chunk_start_idx = output_idx + 1
            continue
        # Store each kept character as its Unicode code point
        output_text.append(ord(char))

    # Only chunks terminated by a line break are recorded, so the final line gets none here
    return input_output_map, text_chunk_indices, np.array(output_text, dtype=np.uint32)

input_output_map, text_chunk_indices, output_text = parse_text(text)

# Inspect the results
print(input_output_map[25:30])
print(text_chunk_indices)
print(output_text[0:5]) 

# Deserialize text
print(''.join([chr(x) for x in output_text[0:5]]))

Output:

[(25, 25), (26, 26), (27, 27), (28, 27), (29, 28)]
[TextChunk(start_idx=0, end_idx=27), TextChunk(start_idx=28, end_idx=54)]
[ 76 111 114 101 109]
Lorem

When creating a custom text loader, you should subclass TextLoader and make sure to calculate self._output, self._input_output_map, and self._text_chunk_indices. You can modify the __init__() method to take whatever variables you need, and adapt the class in whatever way the parsing operation requires. It is also useful to include a method in the custom TextLoader that rebuilds text in the input format from the data produced by the alignment operation.

Text Transformer

The output of the TextLoader may be exactly what is needed for the alignment process, but often it will be necessary to perform other alterations such as stripping out unnecessary characters, performing some rule-based character conversions, or refining the text_chunks. Any number of TextTransformers can be used in series to accomplish this. Using narrowly focused TextTransformers makes it easier to debug and to mix and match TextTransformers as needed to achieve the desired alignment.

When passing a text through a TextTransformer, the transformer must use its _input_output_map to track how it has changed the input. For instance, if we wanted to create a transformer to remove the word "the", we might start with a text input like "the quick brown dog jumped over the lazy fox.", which in the alignment tool is: [116, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]

The output from the TextTransformer would be: [113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]

And the _input_output_map would show the mappings from the index of the input array to the index of the output array: [(4,0),(5,1),(6,2),(7,3), ...]

input val | map (input idx, output idx) | output val
113       | (4,0)                       | 113
117       | (5,1)                       | 117
105       | (6,2)                       | 105
99        | (7,3)                       | 99
107       | (8,4)                       | 107
32        | (9,5)                       | 32
...       | ...                         | ...
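The removal example above can be reproduced with a small standalone function. This is a sketch under stated assumptions, not the tool's TextTransformer class: remove_subsequence is a hypothetical name, and it removes the literal string "the " (word plus trailing space) by naive matching, which would also hit the tail of a word such as "bathe ".

```python
import numpy as np


def remove_subsequence(
    input_text: np.ndarray, pattern: str
) -> tuple[np.ndarray, list[tuple[int, int]]]:
    """Drop every occurrence of `pattern`, recording (input_idx, output_idx) pairs."""
    pattern_codes = [ord(char) for char in pattern]
    output: list[int] = []
    input_output_map: list[tuple[int, int]] = []

    input_idx = 0
    while input_idx < len(input_text):
        window = input_text[input_idx : input_idx + len(pattern_codes)].tolist()
        if pattern_codes and window == pattern_codes:
            # Skip the match; removed characters get no entry in the map
            input_idx += len(pattern_codes)
            continue
        input_output_map.append((input_idx, len(output)))
        output.append(int(input_text[input_idx]))
        input_idx += 1

    return np.array(output, dtype=np.uint32), input_output_map


sentence = "the quick brown dog jumped over the lazy fox."
encoded = np.array([ord(char) for char in sentence], dtype=np.uint32)
output, input_output_map = remove_subsequence(encoded, "the ")

print("".join(chr(int(code)) for code in output))  # quick brown dog jumped over lazy fox.
print(input_output_map[:6])  # [(4, 0), (5, 1), (6, 2), (7, 3), (8, 4), (9, 5)]
```

In a real transformer the same two results would be stored on the instance (the output array and _input_output_map) rather than returned.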

Changing the order of individual elements in the list is also possible, for instance for the same input above we could instead have the output: [98, 114, 111, 119, 110, 32, 113, 117, 105, 99, 107, 32, 116, 104, 101, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]

The words "the" and "brown" have been transposed, and the resulting _input_output_map would be: [(10,0),(11,1),(12,2),(13,3),(14,4),(3,5),(4,6),(5,7),(6,8),(7,9),(8,10),(9,11),(0,12),(1,13),(2,14),(15,15), ...]
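One way to build such a reordering, sketched here outside of any TextTransformer subclass, is to list the input indices in the order they should appear in the output and let NumPy integer-array indexing do the rest. The variable name new_order is an assumption made for this illustration.

```python
import numpy as np

sentence = "the quick brown dog jumped over the lazy fox."
encoded = np.array([ord(char) for char in sentence], dtype=np.uint32)

# Input indices in output order:
# "brown" (10-14), " quick " (3-9), "the" (0-2), then the remainder unchanged.
new_order = (
    list(range(10, 15))
    + list(range(3, 10))
    + list(range(0, 3))
    + list(range(15, len(encoded)))
)

# Indexing with a list of integers reorders the array
output = encoded[new_order]

# Each pair is (input_idx, output_idx)
input_output_map = [
    (input_idx, output_idx) for output_idx, input_idx in enumerate(new_order)
]

print("".join(chr(int(code)) for code in output))  # brown quick the dog jumped over the lazy fox.
print(input_output_map[:6])  # [(10, 0), (11, 1), (12, 2), (13, 3), (14, 4), (3, 5)]
```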

The TextTransformer may also redefine text chunks with the _text_chunk_indices property, which is a simple ordered list of starting + ending indices that define n sections of the output text (you may use overlapping sections if desired), e.g., [(0,20),(21,35),(30,91)] with three chunks of the text using indices: 0–20, 21–35, and 30–91.
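As a rough illustration of overlapping chunks, the following sketch slices an encoded text using plain (start, end) tuples. Two things here are assumptions: the chunks are tuples rather than the tool's own chunk objects, and the end index is treated as inclusive (matching the "0–20" reading above); the library's actual convention may differ, so the helper exposes a flag for it. chunk_text is a hypothetical helper name.

```python
import numpy as np

text = "Lorem ipsum dolor sit amet consectetur adipiscing elit"
encoded = np.array([ord(char) for char in text], dtype=np.uint32)

# (start, end) pairs; the second and third chunks overlap on purpose
text_chunk_indices = [(0, 10), (12, 25), (18, 37)]


def chunk_text(codes: np.ndarray, chunk: tuple[int, int], inclusive_end: bool = True) -> str:
    """Decode one chunk of the encoded text (inclusive end is an assumption)."""
    start, end = chunk
    stop = end + 1 if inclusive_end else end
    return "".join(chr(int(code)) for code in codes[start:stop])


chunks = [chunk_text(encoded, chunk) for chunk in text_chunk_indices]
print(chunks)  # ['Lorem ipsum', 'dolor sit amet', 'sit amet consectetur']
```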

Alignment Algorithm

The AlignmentAlgorithm class can be subclassed to perform analysis of both the query and target text at the same time. Any number of such classes may be used at any place within the alignment pipeline. The AlignmentAlgorithm will always receive a self._query and a self._target property, both of which are provided automatically to it by the TextAlignmentTool from the output of the latest transformation of the query and target texts. It will also automatically receive the latest _text_chunk_indices for the query and for the target as self._input_query_text_chunk_indices and self._input_target_text_chunk_indices.

An AlignmentAlgorithm will produce a mapping in the _alignment property, a simplified example of which might be: query = ['h','e','l','l','o',' ','w','o','r','l','d'] and target = ['h','e','l','l','o',' ','w','a','d','d'] (of course these would be lists of uint32's in our system, not strings) could be aligned as [(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(10,9)] (for Wadd, see https://en.wikipedia.org/wiki/Wadd).
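To make the shape of such a mapping concrete, here is a sketch that derives character-level pairs with difflib.SequenceMatcher from the Python standard library. This is a stand-in technique chosen for illustration, not one of the tool's AlignmentAlgorithm classes. Note that it pairs the final "d" of the query with target index 8 rather than 9; both targets are "d", so either pairing is a valid alignment.

```python
from difflib import SequenceMatcher

import numpy as np

query = np.array([ord(char) for char in "hello world"], dtype=np.uint32)
target = np.array([ord(char) for char in "hello wadd"], dtype=np.uint32)

# SequenceMatcher works on any sequences of hashable items, so plain int lists are fine
matcher = SequenceMatcher(a=query.tolist(), b=target.tolist(), autojunk=False)

# Expand each matching block into (query_idx, target_idx) pairs
alignment: list[tuple[int, int]] = []
for block in matcher.get_matching_blocks():
    for offset in range(block.size):
        alignment.append((block.a + offset, block.b + offset))

print(alignment)
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (10, 8)]
```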

An AlignmentAlgorithm can also be used to redefine text chunks based on mutual analysis of the query and target texts. That is, the AlignmentAlgorithm may be used both for gross alignment, defining possibly corresponding text chunks with the properties _output_query_text_chunk_indices and _output_target_text_chunk_indices, and for fine-grained alignment using the _alignment property, which is simply a list of the corresponding character indices in the query and target texts.

Alignment Operation Tracking

The TextAlignmentTool automatically keeps track of the order of operations and the transforms that have been performed in the __operation_list property which contains a list of AlignmentOperations. This simplifies peeking in on any part of the alignment process for debugging purposes and also enables custom mappings between query and target.

The convenience methods find_alignment_to_query and find_alignment_to_target enable you to walk the alignments and transforms back to the initial input provided by the TextLoader. You will need to provide your own function within the TextLoader to transform the aligned text into your desired format.
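The idea of walking back through a chain of index maps can be sketched independently of the tool. The function below is not the implementation of find_alignment_to_query or find_alignment_to_target; trace_back and the two example maps are hypothetical, and each map is assumed to be a list of (input_idx, output_idx) pairs as described earlier.

```python
from typing import Optional

IndexMap = list[tuple[int, int]]  # (input_idx, output_idx) pairs


def trace_back(output_idx: int, maps: list[IndexMap]) -> Optional[int]:
    """Follow an index from the final output back to the original input.

    `maps` is ordered from the first pipeline stage to the last.
    Returns None if the index has no source at some stage.
    """
    current = output_idx
    for input_output_map in reversed(maps):
        # If several inputs share one output index, the last pair wins here
        reverse_lookup = {out_idx: in_idx for in_idx, out_idx in input_output_map}
        if current not in reverse_lookup:
            return None
        current = reverse_lookup[current]
    return current


original = "the cat"
strip_the_map = [(4, 0), (5, 1), (6, 2)]  # stage 1: "the cat" -> "cat"
reverse_map = [(2, 0), (1, 1), (0, 2)]    # stage 2: "cat" -> "tac"
final_text = "tac"

origins = [trace_back(idx, [strip_the_map, reverse_map]) for idx in range(len(final_text))]
print(origins)  # [6, 5, 4]
print("".join(original[idx] for idx in origins))  # tac
```

Applying the same walk to both members of each alignment pair would express an alignment in terms of the original query and target inputs.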

Debugging Help

When you use the TextAlignmentTool in a debugging context, it will inject an instance of the DebugHelper class into the global context as dbg. This helper provides four convenience methods to inspect your alignment pipeline: dbg.display_text, dbg.display_text_chunk, dbg.display_text_chunks, and dbg.display_text_region. These methods output the human-readable text for the internal uint32 numpy array representation of the text, and can also extract specified ranges and text chunks.
