Span Aligner
A utility for aligning and mapping text spans between different text representations, particularly useful for Label Studio annotation compatibility.
Features
- Sanitize span boundaries to avoid special characters.
- Find exact and fuzzy matches of text segments in original documents.
- Map spans from one text representation to another.
- Rebuild tagged text with nested annotations.
- Merge result objects containing span annotations.
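The README does not describe how fuzzy matching is implemented. As a rough illustration of the idea only, the sketch below tries an exact match first and then slides a fixed-width window over the original, scoring each window with the standard library's `difflib.SequenceMatcher`. The function name `find_fuzzy_span` and its return shape are hypothetical and are not part of `span_aligner`.

```python
from difflib import SequenceMatcher


def find_fuzzy_span(original, segment, min_ratio=0.8):
    """Return (start, end, ratio) for the best match of `segment`, or None.

    Hypothetical helper for illustration; not the span_aligner API.
    """
    # Exact match first: cheap and unambiguous.
    exact = original.find(segment)
    if exact != -1:
        return exact, exact + len(segment), 1.0

    # Otherwise score every window of the same width as the segment.
    best = None
    width = len(segment)
    for start in range(max(1, len(original) - width + 1)):
        window = original[start:start + width]
        ratio = SequenceMatcher(None, window, segment).ratio()
        if ratio >= min_ratio and (best is None or ratio > best[2]):
            best = (start, start + len(window), ratio)
    return best


print(find_fuzzy_span("Hello, World!", "World"))
print(find_fuzzy_span("The quick brwon fox", "brown", min_ratio=0.7))
```

A fixed-width window is the simplest choice; it cannot absorb insertions or deletions that change the length of the matched text, which a production matcher would need to handle.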
Installation
Install from source:
pip install .
For development:
pip install -e ".[dev]"
Usage
from span_aligner import SpanAligner
original = "Hello, World!"
result_obj = {
    "spans": [{"start": 0, "end": 5, "text": "Hello", "labels": ["greeting"]}],
    "entities": [],
    "task": {"data": {"text": ""}}
}
success, mapped = SpanAligner.map_spans_to_original(original, result_obj)
print(mapped)
Map Tags to Original
Align annotated spans from a tagged string back to their positions in the original text. The output keeps the original text exactly as written, mistakes included; only the tags are carried over.
original_text = "The quick brown fox jumps\n\n over the dog."
# Imagine the text was slightly modified or translated, but tags are present
tagged_text = "The <adj>quick</adj> brown fox jumps over the <animal>dog</animal>."
mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.8
)
print(mapped_tagged_text)
# Output might look like: "The <adj>quick</adj> brown fox jumps\n\n over the <animal>dog</animal>."
# (If original text differed slightly, tags would be placed on best matching spans)
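To make the idea of carrying tags over concrete, here is a minimal sketch that handles only flat (non-nested) tags and exact matches. It is not how `map_tags_to_original` is implemented; the helper name `project_flat_tags` is hypothetical, and the real function presumably falls back to fuzzy matching (controlled by `min_ratio`) where this sketch simply skips a tag.

```python
import re

# Matches a flat tag pair such as <adj>quick</adj>.
PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def project_flat_tags(original, tagged):
    """Re-insert flat tags from `tagged` around exact matches in `original`.

    Hypothetical helper for illustration; not the span_aligner API.
    """
    out, cursor = [], 0
    for match in PAIR.finditer(tagged):
        label, content = match.group(1), match.group(2)
        # Search left to right so repeated words map in document order.
        start = original.find(content, cursor)
        if start == -1:
            continue  # a fuzzy matcher would take over here
        end = start + len(content)
        out.append(original[cursor:start])           # untouched original text
        out.append(f"<{label}>{original[start:end]}</{label}>")
        cursor = end
    out.append(original[cursor:])
    return "".join(out)


original_text = "The quick brown fox jumps\n\n over the dog."
tagged_text = "The <adj>quick</adj> brown fox jumps over the <animal>dog</animal>."
print(project_flat_tags(original_text, tagged_text))
```

Because the output is assembled from slices of `original_text`, whitespace and spelling in the original survive even when the tagged text was normalized.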
Rebuild Tagged Text
Reconstruct a string with XML-like tags from raw text and span/entity lists.
text = "Hello World"
spans = [{"start": 0, "end": 11, "labels": ["sentence"]}]
entities = [{"start": 6, "end": 11, "labels": ["location"]}]
tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <sentence>Hello <location>World</location></sentence>
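The nesting in the output above can be reproduced with a short standalone sketch. It assumes the annotations are non-empty and properly nested (no crossing boundaries) and uses only the first label of each annotation. The function name `rebuild_tagged` is hypothetical, and the code is an illustration of the technique rather than the library's implementation.

```python
def rebuild_tagged(text, spans, entities):
    """Wrap `text` in XML-like tags built from span and entity offsets.

    Hypothetical helper for illustration; assumes non-empty, properly
    nested annotations.
    """
    opens, closes = {}, {}
    annotations = list(spans) + list(entities)
    # Outer annotations first: earlier start, then longer span.
    for ann in sorted(annotations, key=lambda a: (a["start"], -a["end"])):
        label = ann["labels"][0]
        opens.setdefault(ann["start"], []).append(f"<{label}>")
        # Prepend so that inner tags close before outer ones.
        closes.setdefault(ann["end"], []).insert(0, f"</{label}>")

    out = []
    for i in range(len(text) + 1):
        out.extend(closes.get(i, []))  # close tags ending at this offset
        out.extend(opens.get(i, []))   # then open tags starting here
        if i < len(text):
            out.append(text[i])
    return "".join(out)


text = "Hello World"
spans = [{"start": 0, "end": 11, "labels": ["sentence"]}]
entities = [{"start": 6, "end": 11, "labels": ["location"]}]
print(rebuild_tagged(text, spans, entities))
# <sentence>Hello <location>World</location></sentence>
```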
Rebuild Tagged Text from Task
Generate tagged text directly from a Label Studio task object.
# Assuming 'task' is a Label Studio task object (or similar structure)
# with .data['text'] and .annotations attributes
mapping = {"Location": "loc", "Person": "per"}
tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)
Get Annotations from Tagged Text
Extract structured spans and entities from a string with inline tags.
tagged_input = "Visit <loc>Paris</loc> and see the <landmark>Eiffel Tower</landmark>."
annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map={"loc": "Location", "landmark": "Location"}
)
print(annotations["entities"])
# Output:
# [
#     {"start": 6, "end": 11, "text": "Paris", "labels": ["Location"]},
#     {"start": 24, "end": 36, "text": "Eiffel Tower", "labels": ["Location"]}
# ]
print(annotations["plain_text"])
# Output: "Visit Paris and see the Eiffel Tower."
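The offsets in that output refer to positions in the plain text, after the tags are removed. The sketch below shows one way such offsets can be derived with a regular expression and a stack. It is an assumption-laden illustration: the name `parse_tagged` is hypothetical, only simple `<tag>` and `</tag>` forms without attributes are handled, and the library's own parser may behave differently.

```python
import re

# Simple opening or closing tag without attributes, e.g. <loc> or </loc>.
TAG = re.compile(r"<(/?)(\w+)>")


def parse_tagged(tagged, ner_map=None):
    """Strip inline tags and return plain text plus entity offsets.

    Hypothetical helper for illustration; not the span_aligner API.
    """
    ner_map = ner_map or {}
    plain, entities, stack = [], [], []
    length, last = 0, 0  # plain-text length so far, end of previous tag
    for match in TAG.finditer(tagged):
        chunk = tagged[last:match.start()]
        plain.append(chunk)
        length += len(chunk)
        last = match.end()
        closing, tag = match.group(1), match.group(2)
        if not closing:
            stack.append((tag, length))  # remember where this tag opened
        elif stack and stack[-1][0] == tag:
            _, start = stack.pop()
            entities.append({
                "start": start,
                "end": length,
                "labels": [ner_map.get(tag, tag)],
            })
    plain.append(tagged[last:])
    plain_text = "".join(plain)
    for entity in entities:
        entity["text"] = plain_text[entity["start"]:entity["end"]]
    entities.sort(key=lambda e: (e["start"], -e["end"]))
    return {"plain_text": plain_text, "entities": entities}


tagged_input = "Visit <loc>Paris</loc> and see the <landmark>Eiffel Tower</landmark>."
result = parse_tagged(tagged_input, {"loc": "Location", "landmark": "Location"})
print(result["plain_text"])
print(result["entities"])
```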
File details
Details for the file span_aligner-0.1.0.tar.gz.
File metadata
- Download URL: span_aligner-0.1.0.tar.gz
- Upload date:
- Size: 17.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 83e478f7e9489fe25b83654c5327d9f4a431699ad590f796e20b435bd5c0437e |
| MD5 | edf335dd9632579dfe2e647985f602b4 |
| BLAKE2b-256 | 4f7e388403905ba688d2c9ec5ba0228bce5ee4511b8f67711eec8c239b24e927 |
File details
Details for the file span_aligner-0.1.0-py3-none-any.whl.
File metadata
- Download URL: span_aligner-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8147c1885f91f9d9d8f1e43dda06c63e7d96b30288db77fc9cee1bc9499a766 |
| MD5 | ae2bb2b42da4dcb78950066a9388473b |
| BLAKE2b-256 | 0a5764fd623ebf44feee7cbfa93bfa216af982aafa84ca455a3a4962857452b6 |