A pipeline tool for performing customized text alignment procedures
The Text Alignment Tool
This Python text alignment tool is intended to be a general purpose tool for aligning texts in a robust and easily extensible way. It tracks any changes to the original text so that there is an end-to-end mapping of the alignment data.
Architecture
- The alignment tool consists of a main class, TextAlignmentTool, which coordinates the alignment process.
- The alignment tool receives a single TextLoader for the query text and a single TextLoader for the target text (you must keep track of the mapping from the original input text(s) in the TextLoader and its output to the rest of the pipeline).
- The alignment tool is then fed n TextTransformers for each text and n AlignmentAlgorithms. These can be used in any combination and order. For example, the query text could pass through 3 TextTransformers and the target text through 1 TextTransformer; then they go through a single AlignmentAlgorithm; the target text then passes through 2 more TextTransformers; and a final AlignmentAlgorithm could be performed on the pair of texts.

find_alignment_to_query and find_alignment_to_target will backtrack through the text mappings and provide a key for mapping either the query to the target or the target to the query.
A somewhat basic alignment process could look something like this:
# Create text loaders for query and target
query_loader = PgpAltoXMLTextLoader(list(QUERY_TEMP_FOLDER.glob("**/*.xml")))
target_loader = PgpXmlTeiTextLoader(list(TARGET_TEMP_FOLDER.glob("**/*.xml")))
# Create the alignment tool
aligner = TextAlignmentTool(query_loader, target_loader)
# Perform three transformation operations on the target
normalize_target_sigla = PgpTeiNormalizeSiglaTransformer()
remove_target_extras = PgpTeiRemoveExtrasTransformer()
relocate_insertions = PgpTeiRelocateInsertionsTransformer()
aligner.target_text_transforms(
    [normalize_target_sigla, remove_target_extras, relocate_insertions]
)
# Create and run one alignment process
first_alignment_algorithm = LineAlignmentAlgorithm()
aligner.align_text(first_alignment_algorithm)
# Get the mapping information for the alignment
alignment_mappings = aligner.latest_alignment
Functionality
Tracking of text changes and mappings to aligned text uses a system of index maps. The TextLoader will ingest the input text and output a 1-dimensional numpy uint32 array consisting of one number for each character in the input text, in the order it occurs within the text (the number is simply the Unicode code point of the character, as returned by Python's ord function).
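As a minimal sketch of this representation (plain NumPy only, not the package's own classes; the helper names encode_text and decode_text are hypothetical), a string can be converted to such an array and back:

```python
import numpy as np


def encode_text(text: str) -> np.ndarray:
    """Return one uint32 code point per character, in text order."""
    return np.array([ord(char) for char in text], dtype=np.uint32)


def decode_text(codes: np.ndarray) -> str:
    """Rebuild the string from an array of code points."""
    return "".join(chr(int(code)) for code in codes)


codes = encode_text("Lorem")
print(codes.tolist())      # [76, 111, 114, 101, 109]
print(decode_text(codes))  # Lorem
```

Because every character becomes exactly one array element, an index into the array is also an index into the text, which is what makes the index maps described below possible.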
Text Loader
For example, let's imagine we have our initial text in a simple text file, and we will assume the line breaks are significant for the alignment process:
Lorem ipsum dolor sit amet
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
We would write a simple loader for this to ingest the text and preserve a record of the line breaks:
from text_alignment_tool import TextChunk
import numpy as np

text = """Lorem ipsum dolor sit amet
consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua"""

def parse_text(text: str) -> tuple[list[tuple[int, int]], list[TextChunk], np.ndarray]:
    input_output_map: list[tuple[int, int]] = []
    text_chunk_indices: list[TextChunk] = []
    output_text: list[int] = []
    text_chunk_start_idx = 0
    for input_idx, char in enumerate(text):
        output_idx = len(output_text)
        input_output_map.append((input_idx, output_idx))
        if char == "\n":
            # A line break closes the current chunk and is left out of the output
            text_chunk_indices.append(TextChunk(text_chunk_start_idx, output_idx))
            text_chunk_start_idx = output_idx + 1
            continue
        output_text.append(ord(char))
    # Note: the final line has no trailing line break, so no chunk is recorded for it
    return input_output_map, text_chunk_indices, np.array(output_text, dtype=np.uint32)

input_output_map, text_chunk_indices, output_text = parse_text(text)
# Inspect the results
print(input_output_map[25:30])
print(text_chunk_indices)
print(output_text[0:5])
# Deserialize text
print(''.join([chr(x) for x in output_text[0:5]]))
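Output (the indices shown correspond to input text in which the first two lines each end with a trailing space before the line break):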
[(25, 25), (26, 26), (27, 27), (28, 27), (29, 28)]
[TextChunk(start_idx=0, end_idx=27), TextChunk(start_idx=28, end_idx=55)]
[ 76 111 114 101 109 ]
Lorem
When creating a custom text loader, you should subclass TextLoader and make sure to calculate self._output, self._input_output_map, and self._text_chunk_indices. You can modify the __init__() method to take whatever variables you need, and you can modify the class however it is needed to perform the parsing operation. It is a nice addition to include a method in the custom TextLoader to rebuild text in the input format based upon the data from the alignment operation.
Text Transformer
The output of the TextLoader may be exactly what is needed for the alignment process, but often it will be necessary to perform other alterations such as stripping out unnecessary characters, performing some rule-based character conversions, or refining the text chunks. Any number of TextTransformers can be used in series to accomplish this. Using narrowly focused TextTransformers will make it easier to debug and to mix and match TextTransformers as needed to achieve the desired alignment.
When passing a text through a TextTransformer, the transformer must use its _input_output_map to track how it has changed the input. For instance, if we wanted to create a transformer to remove the word "the", we might start with a text input like "the quick brown dog jumped over the lazy fox.", which in the alignment tool is:
[116, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]
The output from the TextTransformer would be:
[113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]
And the _input_output_map would show the mappings from the index of the input array to the index of the output array:
[(4,0),(5,1),(6,2),(7,3), ...]
input val | map input idx to output idx | output val |
---|---|---|
113 | (4,0) | 113 |
117 | (5,1) | 117 |
105 | (6,2) | 105 |
99 | (7,3) | 99 |
107 | (8,4) | 107 |
32 | (9,5) | 32 |
... | ... | ... |
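The removal above can be sketched in plain Python, independent of the package's TextTransformer base class. The function name remove_word is hypothetical, and the matching is deliberately naive: it drops the character sequence "the " wherever it occurs, without checking word boundaries.

```python
def remove_word(codes: list[int], word: str) -> tuple[list[int], list[tuple[int, int]]]:
    """Drop every occurrence of `word` plus one following space.

    Returns the output code points and an (input_idx, output_idx) map that
    covers only the characters that were kept.
    """
    pattern = [ord(char) for char in word + " "]
    output: list[int] = []
    input_output_map: list[tuple[int, int]] = []
    input_idx = 0
    while input_idx < len(codes):
        if codes[input_idx : input_idx + len(pattern)] == pattern:
            input_idx += len(pattern)  # skip the removed span; it gets no map entries
            continue
        input_output_map.append((input_idx, len(output)))
        output.append(codes[input_idx])
        input_idx += 1
    return output, input_output_map


text = "the quick brown dog jumped over the lazy fox."
codes = [ord(char) for char in text]
output, input_output_map = remove_word(codes, "the")
print("".join(chr(code) for code in output))  # quick brown dog jumped over lazy fox.
print(input_output_map[:4])                   # [(4, 0), (5, 1), (6, 2), (7, 3)]
```

Removed characters simply have no entry in the map, so any output index can still be traced back to exactly one input index.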
Changing the order of individual elements in the list is also possible, for instance for the same input above we could instead have the output:
[98, 114, 111, 119, 110, 32, 113, 117, 105, 99, 107, 32, 116, 104, 101, 32, 100, 111, 103, 32, 106, 117, 109, 112, 101, 100, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 102, 111, 120, 46]
The words "the" and "brown" have been transposed, and the resulting _input_output_map would be:
[(10,0),(11,1),(12,2),(13,3),(14,4),(3,5),(4,6),(5,7),(6,8),(7,9),(8,10),(9,11),(0,12),(1,13),(2,14),(15,15), ...]
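A reordering like this one can be sketched by listing the input indices in the order they should appear in the output; the map then follows directly from that list. This is an illustration in plain Python, not the package's transformer API, and the variable name order is hypothetical.

```python
text = "the quick brown dog jumped over the lazy fox."
codes = [ord(char) for char in text]

# Input indices in output order: "brown" (10-14), the space at 3,
# "quick" (4-8), the space at 9, "the" (0-2), then the rest unchanged.
order = (
    list(range(10, 15)) + [3] + list(range(4, 9)) + [9] + list(range(0, 3))
    + list(range(15, len(codes)))
)

output = [codes[input_idx] for input_idx in order]
input_output_map = [(input_idx, output_idx) for output_idx, input_idx in enumerate(order)]

print("".join(chr(code) for code in output))  # brown quick the dog jumped over the lazy fox.
print(input_output_map[:6])  # [(10, 0), (11, 1), (12, 2), (13, 3), (14, 4), (3, 5)]
```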
The TextTransformer may also redefine text chunks with the _text_chunk_indices property, which is a simple ordered list of starting and ending indices that define n sections of the output text (you may use overlapping sections if desired), e.g., [(0,20),(21,35),(30,91)] with three chunks of the text using indices: 0–20, 21–35, and 30–91.
Alignment Algorithm
The AlignmentAlgorithm class can be subclassed to perform analysis of both the query and target text at the same time. Any number of such classes may be used at any place within the alignment pipeline. The AlignmentAlgorithm will always receive a self._query and a self._target property, both of which are provided automatically to it by the TextAlignmentTool from the output of the latest transformation of the query and target texts. It will also automatically receive the latest _text_chunk_indices for the query and for the target as self._input_query_text_chunk_indices and self._input_target_text_chunk_indices.
An AlignmentAlgorithm will produce a mapping in the _alignment property. As a simplified example, query = ['h','e','l','l','o',' ','w','o','r','l','d'] and target = ['h','e','l','l','o',' ','w','a','d','d'] (of course these would be lists of uint32s in our system, not strings) could be aligned as [(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(10,9)] (for Wadd, see https://en.wikipedia.org/wiki/Wadd).
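To make the shape of such a mapping concrete, here is a rough sketch that uses the standard library's difflib.SequenceMatcher as a stand-in matcher. It is not one of the package's AlignmentAlgorithm classes, and the function name align_matching_characters is hypothetical.

```python
from difflib import SequenceMatcher


def align_matching_characters(query: list[int], target: list[int]) -> list[tuple[int, int]]:
    """Pair up the indices of characters that fall inside common matching blocks."""
    matcher = SequenceMatcher(None, query, target, autojunk=False)
    alignment: list[tuple[int, int]] = []
    for block in matcher.get_matching_blocks():
        # Each block states that query[a:a+size] == target[b:b+size]
        for offset in range(block.size):
            alignment.append((block.a + offset, block.b + offset))
    return alignment


query = [ord(char) for char in "hello world"]
target = [ord(char) for char in "hello wadd"]
alignment = align_matching_characters(query, target)
print(alignment)
```

The first seven pairs run from (0,0) to (6,6). Which of the two "d" characters in the target is paired with the final "d" of the query depends on the matcher's tie-breaking, so the last pair may differ from the hand-written example above.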
An AlignmentAlgorithm can also be used to redefine text chunks based on mutual analysis of the query and target texts. That is, the AlignmentAlgorithm may be used both for gross alignments, defining possibly corresponding text chunks with the properties _output_query_text_chunk_indices and _output_target_text_chunk_indices, and for the fine-grained alignment using the _alignment property, which is simply a list of the corresponding character indices in the query and target text.
Alignment Operation Tracking
The TextAlignmentTool automatically keeps track of the order of operations and the transforms that have been performed in the __operation_list property, which contains a list of AlignmentOperations. This simplifies peeking in on any part of the alignment process for debugging purposes and also enables custom mappings between query and target.
The convenience methods find_alignment_to_query and find_alignment_to_target enable you to walk the alignments and transforms back to the initial input provided by the TextLoader. You will need to provide your own function within the TextLoader to transform the aligned text into your desired format.
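The idea of walking back through the maps can be sketched as follows. This is only an illustration of composing (input_idx, output_idx) maps in reverse, not the implementation of find_alignment_to_query or find_alignment_to_target; the names invert_map and backtrack are hypothetical.

```python
def invert_map(input_output_map: list[tuple[int, int]]) -> dict[int, int]:
    """Build an output_idx -> input_idx lookup.

    If several input indices share an output index, the last one listed wins.
    """
    return {output_idx: input_idx for input_idx, output_idx in input_output_map}


def backtrack(
    alignment: list[tuple[int, int]],
    query_maps: list[list[tuple[int, int]]],
    target_maps: list[list[tuple[int, int]]],
) -> list[tuple[int, int]]:
    """Translate aligned (query_idx, target_idx) pairs back to original indices.

    The map lists are given in the order the transforms were applied, so they
    are walked in reverse.
    """
    query_lookups = [invert_map(m) for m in reversed(query_maps)]
    target_lookups = [invert_map(m) for m in reversed(target_maps)]

    def walk_back(idx: int, lookups: list[dict[int, int]]) -> int:
        for lookup in lookups:
            idx = lookup[idx]
        return idx

    return [
        (walk_back(q, query_lookups), walk_back(t, target_lookups))
        for q, t in alignment
    ]


# Query "the cat" had "the " removed (one transform); target "cat" was untouched.
query_maps = [[(4, 0), (5, 1), (6, 2)]]
aligned = [(0, 0), (1, 1), (2, 2)]
print(backtrack(aligned, query_maps, []))  # [(4, 0), (5, 1), (6, 2)]
```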
Debugging Help
When you use the TextAlignmentTool in a debugging context, it will inject an instance of the DebugHelper class into the global context as dbg. This helper provides four convenience methods to inspect your alignment pipeline: dbg.display_text, dbg.display_text_chunk, dbg.display_text_chunks, and dbg.display_text_region. These methods will output the human-readable text for the internal uint32 numpy array representation of the text and can extract specified ranges and text chunks as well.