Span Aligner

A utility for aligning and mapping text spans between different text representations, particularly useful for Label Studio annotation compatibility.

Features

Sanitize span boundaries to avoid special characters.
Find exact and fuzzy matches of text segments in original documents.
Map spans from one text representation to another.
Rebuild tagged text with nested annotations.
Merge result objects containing span annotations.
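
As an illustration of the first feature, boundary sanitizing can be pictured as shrinking a span until it no longer starts or ends with whitespace or punctuation. The sketch below is not the library's implementation; `sanitize_span` and `STRIP_CHARS` are hypothetical names, and the set of stripped characters is an assumption.

```python
import string

# Hypothetical set of characters a span should not start or end with.
STRIP_CHARS = set(string.whitespace) | set(".,;:!?")

def sanitize_span(text, start, end, strip_chars=STRIP_CHARS):
    """Shrink [start, end) until neither edge is a stripped character."""
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end

text = "The committee met in central park, today."
# The raw span 20..34 covers " central park," (leading space, trailing comma).
start, end = sanitize_span(text, 20, 34)
print((start, end))      # (21, 33)
print(text[start:end])   # central park
```

A span made only of stripped characters collapses to an empty range, which a caller would likely want to discard.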

Installation

Install from PyPI:

pip install span-aligner

Usage

Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

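# Assumed import path, inferred from the package name; adjust if it differs.
from span_aligner import SpanAligner

# The outer <motivation> tag is what produces the MOTIVATION span in the output.
tagged_input = "<motivation><administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>.</motivation>"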

ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION"
}

span_map = {
    "motivation": "MOTIVATION"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map=ner_map,
    span_map=span_map
)

print(annotations["entities"])
# Output:
#[
#    {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
#    {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
#    {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
#]

print(annotations["spans"])
# Output:
#[
#    {'start': 0, 'end': 76, 'text': 'Environmental Committee discussed the central park renovation on 2025-12-15.', 'labels': ['MOTIVATION']}
#]


print(annotations["plain_text"])
# Output: "Environmental Committee discussed the central park renovation on 2025-12-15."
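
To make the offsets above concrete, here is a minimal sketch of how inline tags could be stripped while tracking character positions in the plain text. It is not the library's implementation: `extract_annotations` and `TAG_RE` are hypothetical names, and the sketch assumes well-formed, properly nested tags.

```python
import re

# Matches <tag> and </tag>; attributes are not supported in this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")

def extract_annotations(tagged, label_map):
    """Strip inline tags; return (plain_text, annotations with plain-text offsets)."""
    parts = []       # text chunks found between tags
    plain_len = 0    # length of the plain text emitted so far
    open_tags = []   # stack of (tag_name, start_offset)
    found = []
    pos = 0
    for m in TAG_RE.finditer(tagged):
        chunk = tagged[pos:m.start()]
        parts.append(chunk)
        plain_len += len(chunk)
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if not closing:
            open_tags.append((name, plain_len))
        elif open_tags and open_tags[-1][0] == name:
            _, start = open_tags.pop()
            found.append({"start": start, "end": plain_len, "tag": name})
        # A closing tag that does not match the stack top is dropped silently.
    parts.append(tagged[pos:])
    plain = "".join(parts)
    annotations = [
        {"start": f["start"], "end": f["end"],
         "text": plain[f["start"]:f["end"]],
         # Tags without a mapping keep their tag name as the label.
         "labels": [label_map.get(f["tag"], f["tag"])]}
        for f in sorted(found, key=lambda f: f["start"])
    ]
    return plain, annotations

tagged = ("<administrative_body>Environmental Committee</administrative_body>"
          " discussed the <impact_location>central park</impact_location> renovation.")
plain, annotations = extract_annotations(
    tagged,
    {"administrative_body": "ADMINISTRATIVE BODY", "impact_location": "PRIMARY LOCATION"},
)
print(plain)  # Environmental Committee discussed the central park renovation.
for a in annotations:
    print(a["start"], a["end"], a["text"], a["labels"])
# 0 23 Environmental Committee ['ADMINISTRATIVE BODY']
# 38 50 central park ['PRIMARY LOCATION']
```

Offsets are counted in the tag-free text, which is why each reported `text` equals `plain[start:end]`.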

Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

text = "On 2026-01-12, the Budget Committee finalized the annual report."
# Spans corresponding to 'MOTIVATION' label, mapped to 'motivation' tag
spans = [{"start": 0, "end": 64, "labels": ["motivation"]}]
# Entities corresponding to 'ADMINISTRATIVE BODY' label, mapped to 'administrative_body' tag
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <motivation>On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.</motivation>
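
The reverse direction can be sketched as inserting opening and closing markers at the annotated offsets, ordered so that nested spans stay well formed. This is an illustrative stand-in rather than the library's code: `rebuild` is a hypothetical name, and the sketch assumes non-empty spans that nest properly and do not partially overlap.

```python
def rebuild(text, annotations):
    """Insert <tag>...</tag> markers around each annotated range."""
    inserts = []
    for a in annotations:
        tag = a["labels"][0]
        length = a["end"] - a["start"]
        # Sort key: offset first, then closing markers (0) before opening
        # markers (1). At the same offset, longer spans open first and
        # shorter spans close first, which keeps the nesting valid.
        inserts.append((a["start"], 1, -length, f"<{tag}>"))
        inserts.append((a["end"], 0, length, f"</{tag}>"))
    inserts.sort()

    out, pos = [], 0
    for offset, _, _, markup in inserts:
        out.append(text[pos:offset])
        out.append(markup)
        pos = offset
    out.append(text[pos:])
    return "".join(out)

text = "The Budget Committee approved the plan."
annotations = [
    {"start": 0, "end": 39, "labels": ["motivation"]},
    {"start": 4, "end": 20, "labels": ["administrative_body"]},
]
print(rebuild(text, annotations))
# <motivation>The <administrative_body>Budget Committee</administrative_body> approved the plan.</motivation>
```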

Rebuild Tagged Text from Task

Generate tagged text directly from a Label Studio task object.

# Assuming 'task' is a Label Studio task object (or similar structure)
# with .data['text'] and .annotations attributes
mapping = {
    "DECISION": "decision",
    "LEGAL FRAMEWORK": "legal_framework",
    "EXPIRATION DATE": "expiry_date"
}

tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)
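
For readers without a Label Studio project at hand, the sketch below shows one plausible shape for such a task and how a label-to-tag mapping could be applied to it. The structure is an assumption modelled on Label Studio's exported `result` regions, the stand-in object is built with `SimpleNamespace`, and `spans_from_task` is a hypothetical helper, not part of the library.

```python
from types import SimpleNamespace

def spans_from_task(task, mapping):
    """Collect labeled regions from a task-like object, renaming labels to tag names."""
    spans = []
    for annotation in task.annotations:
        for region in annotation.get("result", []):
            value = region.get("value", {})
            for label in value.get("labels", []):
                if label in mapping:  # labels without a tag mapping are skipped
                    spans.append({"start": value["start"],
                                  "end": value["end"],
                                  "labels": [mapping[label]]})
    return spans

# Hypothetical stand-in for a task with .data['text'] and .annotations.
task = SimpleNamespace(
    data={"text": "Permit granted until 2027-01-01."},
    annotations=[{"result": [
        {"type": "labels", "value": {"start": 0, "end": 14, "labels": ["DECISION"]}},
        {"type": "labels", "value": {"start": 21, "end": 31, "labels": ["EXPIRATION DATE"]}},
    ]}],
)
mapping = {"DECISION": "decision", "EXPIRATION DATE": "expiry_date"}

for span in spans_from_task(task, mapping):
    print(span["labels"][0], repr(task.data["text"][span["start"]:span["end"]]))
# decision 'Permit granted'
# expiry_date '2027-01-01'
```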

Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text. The output preserves the original wording exactly, including any typos and whitespace.

original_text = "Budget Budget Committee met on 2026-01-12 to view\n\n the central park prject."
# The tagged text is a slightly modified (for example corrected or translated) version of the original
tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
print(mapped_tagged_text)
# Output might look like: "Budget <administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view\n\n the <impact_location>central park</impact_location> prject."
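
The role of `min_ratio` can be illustrated with a small fuzzy search built on the standard library's `difflib.SequenceMatcher`. This is a sketch of the general technique, not the matching strategy the library actually uses: `best_fuzzy_match` is a hypothetical name, and the fixed-width sliding window is a simplifying assumption.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(segment, document, min_ratio=0.7):
    """Return (start, end, ratio) of the best window, or None if below min_ratio."""
    if not segment:
        return None
    # Try an exact match first; it is cheap and unambiguous.
    idx = document.find(segment)
    if idx != -1:
        return idx, idx + len(segment), 1.0

    n = len(segment)
    best = None
    # Slide a window the size of the segment over the document and score each one.
    for start in range(len(document) - n + 1):
        window = document[start:start + n]
        ratio = SequenceMatcher(None, segment, window, autojunk=False).ratio()
        if best is None or ratio > best[2]:
            best = (start, start + n, ratio)
    if best is None or best[2] < min_ratio:
        return None
    return best

document = "to view\n\n the central park prject."
match = best_fuzzy_match("central park project", document, min_ratio=0.7)
if match is not None:
    start, end, ratio = match
    print(repr(document[start:end]), round(ratio, 2))

print(best_fuzzy_match("zzzzzz", document))  # None
```

Because the window has a fixed width, the reported boundaries can be off by a character or two when the original is shorter or longer than the segment, so a boundary clean-up pass is usually applied afterwards.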

Map Tags to Original and Get Positions

Combine the two steps: map tags onto the original text, then extract entities with the correct labels and positions.

original_text = "Legal basis: Art. 5. The Env. Committee met on 2026-01-12."
tagged_text = "Legal basis: <article>Art. 5</article>. The <administrative_body>Environmental Committee</administrative_body> met on <session_date>2026-01-12</session_date>."

# 1. Map tags to the noisy original text
mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)

# 2. Extract annotations using the mapping
ner_label_mapping = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "session_date": "SESSION DATE",
    "article": "ARTICLE"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    mapped_tagged_text,
    ner_map=ner_label_mapping
)

print(annotations["entities"])
# Output:
# [
#  {'start': 13, 'end': 19, 'text': 'Art. 5', 'labels': ['ARTICLE']},
#  {'start': 47, 'end': 57, 'text': '2026-01-12', 'labels': ['SESSION DATE']}
# ]
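
A quick sanity check for results like these is to confirm that each entity's offsets slice the original text to exactly its reported `text`. The snippet below uses only the values shown above; `misaligned` is a hypothetical helper name.

```python
original_text = "Legal basis: Art. 5. The Env. Committee met on 2026-01-12."
entities = [
    {"start": 13, "end": 19, "text": "Art. 5", "labels": ["ARTICLE"]},
    {"start": 47, "end": 57, "text": "2026-01-12", "labels": ["SESSION DATE"]},
]

def misaligned(text, entities):
    """Return the entities whose offsets do not reproduce their text."""
    return [e for e in entities if text[e["start"]:e["end"]] != e["text"]]

print(misaligned(original_text, entities))  # []
```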

Project details

Release history

0.3.2: Feb 26, 2026
0.3.1: Feb 25, 2026
0.3.0: Feb 25, 2026
0.2.4: Feb 6, 2026
0.2.3: Feb 6, 2026
0.2.2: Feb 6, 2026
0.2.1: Feb 6, 2026
0.2.0: Feb 6, 2026
0.1.2: Jan 12, 2026 (this version)
0.1.0: Jan 12, 2026

Download files

Source Distribution

span_aligner-0.1.2.tar.gz (18.3 kB)

Uploaded Jan 12, 2026 Source

Built Distribution

span_aligner-0.1.2-py3-none-any.whl (15.8 kB)

Uploaded Jan 12, 2026 Python 3

File details

Details for the file span_aligner-0.1.2.tar.gz.

File metadata

Download URL: span_aligner-0.1.2.tar.gz
Upload date: Jan 12, 2026
Size: 18.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for span_aligner-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`15f210343b8d71b39f9c2f9516e147d8e2e366ccb2ee3344856a21385c424187`
MD5	`b984f3a891783e9910806f7cc386da6e`
BLAKE2b-256	`a42ff631ffc7268d14f48e6b74b991d950ba2fed0c3452a9e33c71cce958c826`

File details

Details for the file span_aligner-0.1.2-py3-none-any.whl.

File metadata

Download URL: span_aligner-0.1.2-py3-none-any.whl
Upload date: Jan 12, 2026
Size: 15.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for span_aligner-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`17c6cc8ac7021dfa8b8aa08b2b8eed4f8f87f73dfcbca5887593e2ac3ff12e22`
MD5	`aa78df5ff476b97dfa1c74db8fbddeaa`
BLAKE2b-256	`9db7c38b5ff377f52054d80b58996493014470735447549c895ceef337f1a5fb`
