Span Aligner
A utility for aligning and mapping text spans between different text representations, particularly useful for Label Studio annotation compatibility.
Features
- Sanitize span boundaries to avoid special characters.
- Find exact and fuzzy matches of text segments in original documents.
- Map spans from one text representation to another.
- Rebuild tagged text with nested annotations.
- Merge result objects containing span annotations.
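As an illustration of the first feature, boundary sanitization can be pictured as shrinking a span until it no longer starts or ends on whitespace or punctuation. The following is a minimal sketch under that assumption; `sanitize_span` and its `strip_chars` default are hypothetical names chosen for this example and are not part of the package's API.

```python
def sanitize_span(text: str, start: int, end: int,
                  strip_chars: str = " \t\n\r.,;:!?") -> tuple[int, int]:
    """Shrink the half-open span [start, end) so it neither begins nor ends
    on a character listed in strip_chars (hypothetical helper)."""
    # Move the left boundary right past unwanted characters.
    while start < end and text[start] in strip_chars:
        start += 1
    # Move the right boundary left past unwanted characters.
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


text = "The committee met in  central park, yesterday."
start, end = sanitize_span(text, 20, 35)
print(start, end, repr(text[start:end]))
# Output: 22 34 'central park'
```

A span made up entirely of stripped characters collapses to an empty span (`start == end`), which a caller would typically discard.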
Installation
Install from PyPI:
pip install span-aligner
Usage
Get Annotations from Tagged Text
Extract structured spans and entities from a string with inline tags.
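# Import path assumed from the package name; adjust it if SpanAligner lives elsewhere.
from span_aligner import SpanAligner

# The outer <motivation> tag yields the span shown in the output below.
tagged_input = "<motivation><administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>.</motivation>"
# Tag names mapped to entity labels
ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION"
}
# Tag names mapped to span labels
span_map = {
    "motivation": "MOTIVATION"
}
annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map=ner_map,
    span_map=span_map
)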
print(annotations["entities"])
# Output:
#[
# {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
# {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
# {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
#]
print(annotations["spans"])
# Output:
#[
# {'start': 0, 'end': 76, 'text': 'Environmental Committee discussed the central park renovation on 2025-12-15.', 'labels': ['MOTIVATION']}
#]
print(annotations["plain_text"])
# Output: "Environmental Committee discussed the central park renovation on 2025-12-15."
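To make the offsets in such output easier to reason about, here is a minimal sketch of how inline tags could be stripped while recording positions in the resulting plain text. It is not the package's implementation: `extract_annotations` and `TAG_RE` are hypothetical names, and the sketch assumes simple, attribute-free, properly nested tags.

```python
import re

# Matches <tag> and </tag> without attributes (an assumption of this sketch).
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def extract_annotations(tagged: str) -> tuple[str, list[dict]]:
    """Strip XML-like tags and return (plain_text, annotations), where each
    annotation carries offsets into the plain text."""
    plain_parts: list[str] = []
    plain_len = 0
    open_tags: list[tuple[str, int]] = []  # stack of (tag name, start offset)
    annotations: list[dict] = []
    cursor = 0
    for match in TAG_RE.finditer(tagged):
        chunk = tagged[cursor:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        cursor = match.end()
        closing, name = match.group(1), match.group(2)
        if not closing:
            open_tags.append((name, plain_len))
        elif open_tags and open_tags[-1][0] == name:
            _, start = open_tags.pop()
            annotations.append({"start": start, "end": plain_len, "tag": name})
        # Unmatched closing tags are removed from the text and otherwise ignored.
    plain_parts.append(tagged[cursor:])
    plain = "".join(plain_parts)
    for ann in annotations:
        ann["text"] = plain[ann["start"]:ann["end"]]
    # Outer annotations first when several start at the same offset.
    annotations.sort(key=lambda a: (a["start"], -a["end"]))
    return plain, annotations


plain, anns = extract_annotations(
    "<motivation><administrative_body>Budget Committee</administrative_body> met.</motivation>"
)
print(plain)
# Output: Budget Committee met.
for ann in anns:
    print(ann["tag"], ann["start"], ann["end"])
# Output:
# motivation 0 21
# administrative_body 0 16
```

Splitting the result into entities and spans, as the library does through `ner_map` and `span_map`, would then be a matter of looking up each tag name in the corresponding mapping.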
Rebuild Tagged Text
Reconstruct a string with XML-like tags from raw text and span/entity lists.
text = "On 2026-01-12, the Budget Committee finalized the annual report."
# Spans corresponding to 'MOTIVATION' label, mapped to 'motivation' tag
spans = [{"start": 0, "end": 64, "labels": ["motivation"]}]
# Entities corresponding to 'ADMINISTRATIVE BODY' label, mapped to 'administrative_body' tag
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]
tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <motivation>On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.</motivation>
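The reverse direction can be sketched as inserting opening and closing markers at the recorded offsets. The example below is a simplified illustration, not the library's algorithm: `rebuild_tagged` is a hypothetical helper that assumes non-empty, properly nested annotations, no two of which share identical boundaries, and it uses the first label of each annotation as the tag name.

```python
def rebuild_tagged(text: str, annotations: list[dict]) -> str:
    """Insert <tag>...</tag> pairs into text for properly nested annotations."""
    events = []  # (offset, kind, tiebreak, markup); kind 0 = close, 1 = open
    for ann in annotations:
        tag = ann["labels"][0]
        length = ann["end"] - ann["start"]
        # At the same offset: closing tags come before opening tags,
        # inner (shorter) spans close first, outer (longer) spans open first.
        events.append((ann["start"], 1, -length, f"<{tag}>"))
        events.append((ann["end"], 0, length, f"</{tag}>"))
    events.sort(key=lambda e: e[:3])

    out, cursor = [], 0
    for offset, _, _, markup in events:
        out.append(text[cursor:offset])  # plain text up to the marker
        out.append(markup)
        cursor = offset
    out.append(text[cursor:])
    return "".join(out)


sample = "The Budget Committee met."
sample_annotations = [
    {"start": 0, "end": 25, "labels": ["motivation"]},
    {"start": 4, "end": 20, "labels": ["administrative_body"]},
]
print(rebuild_tagged(sample, sample_annotations))
# Output: <motivation>The <administrative_body>Budget Committee</administrative_body> met.</motivation>
```

Overlapping annotations that are not nested would produce ill-formed markup with this approach, so a real implementation needs a strategy for them, such as splitting or dropping one of the spans.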
Rebuild Tagged Text from Task
Generate tagged text directly from a Label Studio task object.
# Assuming 'task' is a Label Studio task object (or similar structure)
# with .data['text'] and .annotations attributes
mapping = {
    "DECISION": "decision",
    "LEGAL FRAMEWORK": "legal_framework",
    "EXPIRATION DATE": "expiry_date"
}
tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)
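For readers unfamiliar with the task structure, the sketch below shows how labeled regions could be pulled out of a task and renamed through such a mapping. It assumes a plain dictionary that follows Label Studio's exported JSON layout (`data`, `annotations`, `result`, `value`); `spans_from_task` is a hypothetical helper, and the object the library actually accepts may differ.

```python
def spans_from_task(task: dict, mapping: dict[str, str]) -> tuple[str, list[dict]]:
    """Collect labeled regions from a Label Studio-style task dict and
    translate their labels into tag names via mapping."""
    text = task["data"]["text"]
    spans = []
    for annotation in task.get("annotations", []):
        for item in annotation.get("result", []):
            value = item.get("value", {})
            for label in value.get("labels", []):
                tag = mapping.get(label)
                if tag is None:
                    continue  # label not covered by the mapping, skip it
                spans.append(
                    {"start": value["start"], "end": value["end"], "labels": [tag]}
                )
    return text, spans


# Hypothetical task used only for illustration.
example_task = {
    "data": {"text": "The permit expires on 2027-06-30."},
    "annotations": [{
        "result": [
            {"type": "labels",
             "value": {"start": 22, "end": 32, "text": "2027-06-30",
                       "labels": ["EXPIRATION DATE"]}},
            {"type": "labels",
             "value": {"start": 4, "end": 10, "text": "permit",
                       "labels": ["UNMAPPED LABEL"]}},
        ]
    }],
}
task_text, task_spans = spans_from_task(example_task, {"EXPIRATION DATE": "expiry_date"})
print(task_spans)
# Output: [{'start': 22, 'end': 32, 'labels': ['expiry_date']}]
```

The resulting text and span list have the same shape as the inputs used in the "Rebuild Tagged Text" example.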
Map Tags to Original
Align annotated spans from a tagged string back to their positions in the original text. The output keeps the original text exactly as written, including its typos, duplicated words, and whitespace; only the tags are transferred.
original_text = "Budget Budget Committee met on 2026-01-12 to view\n\n the central park prject."
# Imagine the text was slightly modified or translated, but tags are present
tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."
mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
print(mapped_tagged_text)
# Output might look like: "Budget <administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view\n\n the <impact_location>central park</impact_location> prject."
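The role of a threshold such as `min_ratio` can be illustrated with a brute-force fuzzy search built on the standard library's `difflib.SequenceMatcher`. This is a sketch of the general idea only; the library's actual matching strategy is not documented here, and `find_fuzzy` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def find_fuzzy(original: str, segment: str, min_ratio: float = 0.7):
    """Return (start, end, ratio) of the best match for segment in original,
    or None if no window reaches min_ratio."""
    exact = original.find(segment)
    if exact != -1:
        return exact, exact + len(segment), 1.0

    best = None
    n = len(segment)
    # Try windows slightly shorter and longer than the segment to absorb
    # missing or extra characters.
    for size in range(max(1, n - 2), n + 3):
        for start in range(0, len(original) - size + 1):
            window = original[start:start + size]
            ratio = SequenceMatcher(None, segment, window).ratio()
            if best is None or ratio > best[2]:
                best = (start, start + size, ratio)
    if best is not None and best[2] >= min_ratio:
        return best
    return None


noisy = "Budget Committee met to view the central park prject."
match = find_fuzzy(noisy, "project", min_ratio=0.7)
if match is not None:
    start, end, ratio = match
    print(repr(noisy[start:end]), round(ratio, 3))
```

Here the misspelled `prject` is the closest window to `project`, so the span is reported at the position of the typo in the original text. Scanning every window is quadratic or worse, which is acceptable for a short example but would need a smarter candidate search on long documents.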
Map Tags to Original and Get Positions
Map tags onto the original text, then extract entities whose labels and character offsets refer to that original text.
original_text = "Legal basis: Art. 5. The Env. Committee met on 2026-01-12."
tagged_text = "Legal basis: <article>Art. 5</article>. The <administrative_body>Environmental Committee</administrative_body> met on <session_date>2026-01-12</session_date>."
# 1. Map tags to the noisy original text
mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
# 2. Extract annotations using the mapping
ner_label_mapping = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "session_date": "SESSION DATE",
    "article": "ARTICLE"
}
annotations = SpanAligner.get_annotations_from_tagged_text(
    mapped_tagged_text,
    ner_map=ner_label_mapping
)
print(annotations["entities"])
# Output:
# [
# {'start': 13, 'end': 19, 'text': 'Art. 5', 'labels': ['ARTICLE']},
# {'start': 47, 'end': 57, 'text': '2026-01-12', 'labels': ['SESSION DATE']}
# ]
File details
Details for the file span_aligner-0.1.2.tar.gz.
File metadata
- Download URL: span_aligner-0.1.2.tar.gz
- Upload date:
- Size: 18.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15f210343b8d71b39f9c2f9516e147d8e2e366ccb2ee3344856a21385c424187 |
| MD5 | b984f3a891783e9910806f7cc386da6e |
| BLAKE2b-256 | a42ff631ffc7268d14f48e6b74b991d950ba2fed0c3452a9e33c71cce958c826 |
File details
Details for the file span_aligner-0.1.2-py3-none-any.whl.
File metadata
- Download URL: span_aligner-0.1.2-py3-none-any.whl
- Upload date:
- Size: 15.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 17c6cc8ac7021dfa8b8aa08b2b8eed4f8f87f73dfcbca5887593e2ac3ff12e22 |
| MD5 | aa78df5ff476b97dfa1c74db8fbddeaa |
| BLAKE2b-256 | 9db7c38b5ff377f52054d80b58996493014470735447549c895ceef337f1a5fb |