Span Aligner

A utility for aligning and mapping text spans between different text representations, particularly useful for Label Studio annotation compatibility.

Features

  • Sanitize span boundaries to avoid special characters.
  • Find exact and fuzzy matches of text segments in original documents.
  • Map spans from one text representation to another.
  • Rebuild tagged text with nested annotations.
  • Merge result objects containing span annotations.

Installation

Install from PyPI:

pip install span-aligner

Usage

Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

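# Import path assumed from the package name; adjust if your installation differs.
from span_aligner import SpanAligner

# The outer <motivation> tag is what produces the MOTIVATION span shown below.
tagged_input = "<motivation><administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>.</motivation>"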

ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION"
}

span_map = {
    "motivation": "MOTIVATION"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map=ner_map,
    span_map=span_map
)

print(annotations["entities"])
# Output:
#[
#    {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
#    {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
#    {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
#]

print(annotations["spans"])
# Output:
#[
#    {'start': 0, 'end': 76, 'text': 'Environmental Committee discussed the central park renovation on 2025-12-15.', 'labels': ['MOTIVATION']}
#]


print(annotations["plain_text"])
# Output: "Environmental Committee discussed the central park renovation on 2025-12-15."
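To make the offsets above easier to reason about, here is a minimal standard-library sketch of how inline tags can be stripped while recording character positions in the resulting plain text. It is an illustration only, not the library's implementation; the function name `extract_tagged_spans` and the `tag` key are hypothetical, and it assumes well-nested tags without attributes.

```python
import re

# Matches <name> and </name>; attributes are not supported in this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")


def extract_tagged_spans(tagged):
    """Return (plain_text, spans) for a string with well-nested inline tags."""
    plain_parts = []
    spans = []
    open_tags = []  # stack of (tag name, start offset in the plain text)
    pos = 0         # read position in the tagged string
    offset = 0      # length of the plain text emitted so far

    for match in TAG_RE.finditer(tagged):
        chunk = tagged[pos:match.start()]
        plain_parts.append(chunk)
        offset += len(chunk)
        pos = match.end()

        closing, name = match.group(1), match.group(2)
        if not closing:
            open_tags.append((name, offset))
            continue
        if not open_tags or open_tags[-1][0] != name:
            raise ValueError(f"unbalanced closing tag: {name}")
        _, start = open_tags.pop()
        spans.append({"start": start, "end": offset, "tag": name})

    if open_tags:
        raise ValueError(f"unclosed tag: {open_tags[-1][0]}")

    plain_parts.append(tagged[pos:])
    plain = "".join(plain_parts)
    for span in spans:
        span["text"] = plain[span["start"]:span["end"]]
    # Outer spans first when two spans share a start offset.
    spans.sort(key=lambda s: (s["start"], -s["end"]))
    return plain, spans


plain, spans = extract_tagged_spans(
    "<motivation><administrative_body>Environmental Committee</administrative_body>"
    " discussed the <impact_location>central park</impact_location> renovation.</motivation>"
)
for span in spans:
    print(span)
```

Mapping the raw tag names to display labels, as `ner_map` and `span_map` do above, would then be a dictionary lookup on each span's `tag`.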

Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

text = "On 2026-01-12, the Budget Committee finalized the annual report."
# Spans corresponding to 'MOTIVATION' label, mapped to 'motivation' tag
spans = [{"start": 0, "end": 64, "labels": ["motivation"]}]
# Entities corresponding to 'ADMINISTRATIVE BODY' label, mapped to 'administrative_body' tag
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <motivation>On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.</motivation>
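The reverse direction can be sketched with the standard library as well. The helper below is hypothetical (`insert_tags` is not part of the package) and assumes non-empty, properly nested annotations whose first label is the tag name; it shows one way to order opening and closing tags so that nesting stays valid.

```python
def insert_tags(text, annotations):
    """Wrap each {'start', 'end', 'labels'} annotation in <label>...</label>."""
    events = []
    for index, ann in enumerate(annotations):
        tag = ann["labels"][0]
        start, end = ann["start"], ann["end"]
        # At the same position: closing tags (kind 0) go before opening tags (kind 1).
        # Opening tags: the span that ends later opens first (outer before inner).
        # Closing tags: the span that started later closes first (inner before outer).
        events.append((start, 1, -end, index, f"<{tag}>"))
        events.append((end, 0, -start, -index, f"</{tag}>"))
    events.sort()

    out = []
    pos = 0
    for position, _kind, _tiebreak, _index, tag in events:
        out.append(text[pos:position])
        out.append(tag)
        pos = position
    out.append(text[pos:])
    return "".join(out)


text = "On 2026-01-12, the Budget Committee finalized the annual report."
annotations = [
    {"start": 0, "end": 64, "labels": ["motivation"]},
    {"start": 19, "end": 35, "labels": ["administrative_body"]},
]
rebuilt = insert_tags(text, annotations)
print(rebuilt)
```

Overlapping (non-nested) annotations would produce crossing tags in this sketch; handling them needs an explicit policy such as splitting or dropping one of the spans.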

Rebuild Tagged Text from Task

Generate tagged text directly from a Label Studio task object.

# Assuming 'task' is a Label Studio task object (or similar structure)
# with .data['text'] and .annotations attributes
mapping = {
    "DECISION": "decision",
    "LEGAL FRAMEWORK": "legal_framework",
    "EXPIRATION DATE": "expiry_date"
}

tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)
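For orientation, the sketch below flattens a task written in Label Studio's JSON export layout (`data`, `annotations`, `result`, `value`) into the span dictionaries used elsewhere in this README, applying the label-to-tag mapping along the way. Whether `rebuild_tagged_text_from_task` accepts exactly this dictionary shape is an assumption; the helper name `task_to_tag_spans` and the sample task are made up for illustration.

```python
def task_to_tag_spans(task, mapping):
    """Collect labeled regions from a Label Studio style task dict.

    Labels missing from `mapping` are skipped.
    """
    spans = []
    for annotation in task.get("annotations", []):
        for region in annotation.get("result", []):
            value = region.get("value", {})
            for label in value.get("labels", []):
                if label in mapping:
                    spans.append({
                        "start": value["start"],
                        "end": value["end"],
                        "labels": [mapping[label]],
                    })
    return spans


# Hypothetical task with a single labeled region.
task = {
    "data": {"text": "The permit expires on 2027-06-30."},
    "annotations": [{
        "result": [{
            "type": "labels",
            "from_name": "label",
            "to_name": "text",
            "value": {
                "start": 22,
                "end": 32,
                "text": "2027-06-30",
                "labels": ["EXPIRATION DATE"],
            },
        }],
    }],
}

mapping = {"EXPIRATION DATE": "expiry_date"}
tag_spans = task_to_tag_spans(task, mapping)
print(tag_spans)
```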

Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text. The original text is preserved verbatim, including its typos and whitespace; only the tags are transferred.

original_text = "Budget Budget Committee met on 2026-01-12 to view\n\n the central park prject."
# The tagged text may differ slightly from the original (e.g. corrected or translated), but it carries the tags
tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
print(mapped_tagged_text)
# Output might look like: "Budget <administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view\n\n the <impact_location>central park</impact_location> prject."
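The role of a threshold such as `min_ratio` can be illustrated with `difflib` from the standard library. This is a sketch of the general idea (slide a window over the original and keep the most similar one), not a description of how the package matches text internally; `best_fuzzy_window` is a hypothetical helper, and the fixed window length is a simplification.

```python
from difflib import SequenceMatcher


def best_fuzzy_window(original, segment, min_ratio=0.7):
    """Return (start, end, ratio) of the window most similar to `segment`.

    Windows have the same length as `segment`. Returns None when no window
    reaches `min_ratio`. Brute force, so only suitable for short texts.
    """
    best = None
    width = len(segment)
    for start in range(max(1, len(original) - width + 1)):
        window = original[start:start + width]
        ratio = SequenceMatcher(None, window, segment, autojunk=False).ratio()
        if best is None or ratio > best[2]:
            best = (start, start + len(window), ratio)
    if best is None or best[2] < min_ratio:
        return None
    return best


original = "Budget Budget Committee met on 2026-01-12 to view\n\n the central park prject."

exact = best_fuzzy_window(original, "central park")
fuzzy = best_fuzzy_window(original, "project.")
missing = best_fuzzy_window(original, "zzzzqqqq")

print(exact, fuzzy, missing)
```

Raising `min_ratio` makes the match stricter, so a misspelled segment is more likely to come back as `None`; lowering it accepts noisier matches at the risk of anchoring a tag in the wrong place.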

Map Tags to Original and Get Positions

Combine the two steps: map the tags onto the original text, then extract the entities with their labels and positions.

original_text = "Legal basis: Art. 5. The Env. Committee met on 2026-01-12."
tagged_text = "Legal basis: <article>Art. 5</article>. The <administrative_body>Environmental Committee</administrative_body> met on <session_date>2026-01-12</session_date>."

# 1. Map tags to the noisy original text
mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)

# 2. Extract annotations using the mapping
ner_label_mapping = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "session_date": "SESSION DATE",
    "article": "ARTICLE"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    mapped_tagged_text,
    ner_map=ner_label_mapping
)

print(annotations["entities"])
# Output:
# [
#  {'start': 13, 'end': 19, 'text': 'Art. 5', 'labels': ['ARTICLE']},
#  {'start': 47, 'end': 57, 'text': '2026-01-12', 'labels': ['SESSION DATE']}
# ]
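A quick way to confirm that extracted offsets really point into the original text is to slice the text with each annotation and compare the result against the stored `text`. The helper below, `misaligned`, is a hypothetical convenience written for this README and uses the entity output shown above as its input.

```python
original_text = "Legal basis: Art. 5. The Env. Committee met on 2026-01-12."

entities = [
    {"start": 13, "end": 19, "text": "Art. 5", "labels": ["ARTICLE"]},
    {"start": 47, "end": 57, "text": "2026-01-12", "labels": ["SESSION DATE"]},
]


def misaligned(text, annotations):
    """Return the annotations whose offsets do not reproduce their 'text'."""
    return [a for a in annotations if text[a["start"]:a["end"]] != a["text"]]


problems = misaligned(original_text, entities)
print(problems)
```

An empty list means every annotation can be imported against the original document without shifting.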
