Span Aligner

A utility for aligning and mapping text spans between different text representations, particularly useful for Label Studio annotation compatibility.

Features

  • Sanitize span boundaries to avoid special characters.
  • Find exact and fuzzy matches of text segments in original documents.
  • Map spans from one text representation to another.
  • Rebuild tagged text with nested annotations.
  • Merge result objects containing span annotations.
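
As an illustration of the first feature, boundary sanitizing can be pictured as shrinking a span until neither edge sits on whitespace or punctuation. The sketch below is a minimal stand-alone version under that assumption; `sanitize_span` and `STRIP_CHARS` are hypothetical names and are not part of the package's documented API.

```python
import string

# Hypothetical helper for illustration; not span_aligner's own implementation.
STRIP_CHARS = set(string.whitespace + string.punctuation)

def sanitize_span(text, start, end, strip_chars=STRIP_CHARS):
    """Shrink [start, end) until neither edge is a character in strip_chars."""
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end

text = "Hello, World!"
start, end = sanitize_span(text, 5, 13)  # the raw span covers ", World!"
print(text[start:end])  # World
```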

Installation

Install from source:

pip install .

For development:

pip install -e ".[dev]"

Usage

from span_aligner import SpanAligner

original = "Hello, World!"
result_obj = {
    "spans": [{"start": 0, "end": 5, "text": "Hello", "labels": ["greeting"]}],
    "entities": [],
    "task": {"data": {"text": ""}}
}

success, mapped = SpanAligner.map_spans_to_original(original, result_obj)
print(mapped)

Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text. The output keeps the original text exactly as written, including any mistakes it contains.

original_text = "The quick brown fox jumps\n\n over the dog."
# Imagine the text was slightly modified or translated, but tags are present
tagged_text = "The <adj>quick</adj> brown fox jumps over the <animal>dog</animal>."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.8
)
print(mapped_tagged_text)
# Output might look like: "The <adj>quick</adj> brown fox jumps\n\n over the <animal>dog</animal>."
# (If original text differed slightly, tags would be placed on best matching spans)
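
To give a feel for what a similarity threshold such as `min_ratio` can mean, here is a rough sketch of fuzzy span location built on the standard library's `difflib.SequenceMatcher`. It assumes a brute-force sliding window and is not the algorithm used by the package; `find_best_span` is a hypothetical name.

```python
from difflib import SequenceMatcher

def find_best_span(original, segment, min_ratio=0.8):
    """Return (start, end, ratio) for the window of `original` most similar
    to `segment`, or None if no window reaches min_ratio."""
    exact = original.find(segment)
    if exact != -1:
        return exact, exact + len(segment), 1.0

    best = None
    n = len(segment)
    # Try windows a little shorter and a little longer than the segment.
    for size in range(max(1, n - 2), n + 3):
        for start in range(len(original) - size + 1):
            window = original[start:start + size]
            ratio = SequenceMatcher(None, segment, window, autojunk=False).ratio()
            if best is None or ratio > best[2]:
                best = (start, start + size, ratio)

    if best is not None and best[2] >= min_ratio:
        return best
    return None

original = "The quick brown fox jumps\n\n over the dog."
match = find_best_span(original, "jumps over")
print(match[:2])  # (20, 32)
print(repr(original[match[0]:match[1]]))  # 'jumps\n\n over'
```

The brute-force scan is quadratic in the text length, so it only suits short documents; it is meant to show the idea of a ratio cut-off, not to be efficient.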

Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

text = "Hello World"
spans = [{"start": 0, "end": 11, "labels": ["sentence"]}]
entities = [{"start": 6, "end": 11, "labels": ["location"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, spans, entities)
print(tagged)
# Output: <sentence>Hello <location>World</location></sentence>
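
For readers curious how such a reconstruction can work, the following is a minimal sketch that does not use the package. It assumes annotations nest properly (no partial overlaps) and uses only the first label of each annotation; `rebuild_tagged` is a hypothetical name.

```python
def rebuild_tagged(text, annotations):
    """Wrap each annotation in <label>...</label>.
    Assumes spans are properly nested and never partially overlap."""
    opens, closes = {}, {}
    # Sort so that outer (longer) spans come first at a shared start offset.
    for ann in sorted(annotations, key=lambda a: (a["start"], -a["end"])):
        label = ann["labels"][0]
        opens.setdefault(ann["start"], []).append(f"<{label}>")
        # Prepend so that inner spans close before outer ones.
        closes.setdefault(ann["end"], []).insert(0, f"</{label}>")

    out = []
    for i in range(len(text) + 1):
        out.extend(closes.get(i, []))  # close tags ending here first
        out.extend(opens.get(i, []))   # then open tags starting here
        if i < len(text):
            out.append(text[i])
    return "".join(out)

annotations = [
    {"start": 0, "end": 11, "labels": ["sentence"]},
    {"start": 6, "end": 11, "labels": ["location"]},
]
print(rebuild_tagged("Hello World", annotations))
# <sentence>Hello <location>World</location></sentence>
```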

Rebuild Tagged Text from Task

Generate tagged text directly from a Label Studio task object.

# Assuming 'task' is a Label Studio task object (or similar structure)
# with .data['text'] and .annotations attributes
mapping = {"Location": "loc", "Person": "per"}

tagged_output = SpanAligner.rebuild_tagged_text_from_task(task, mapping)
print(tagged_output)

Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

tagged_input = "Visit <loc>Paris</loc> and see the <landmark>Eiffel Tower</landmark>."

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map={"loc": "Location", "landmark": "Location"}
)

print(annotations["entities"])
# Output: 
# [
#   {"start": 6, "end": 11, "text": "Paris", "labels": ["Location"]},
#   {"start": 24, "end": 36, "text": "Eiffel Tower", "labels": ["Location"]}
# ]
print(annotations["plain_text"])
# Output: "Visit Paris and see the Eiffel Tower."
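
The offsets in the output above refer to the plain text, that is, the string with all tags removed. A minimal sketch of that bookkeeping with the standard `re` module is shown below; it assumes simple attribute-free tags, and `extract_annotations` and `TAG_RE` are hypothetical names rather than the package's implementation.

```python
import re

# Matches simple tags such as <loc> or </loc> (no attributes).
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w-]*)>")

def extract_annotations(tagged, label_map=None):
    """Strip inline tags and return plain text plus offset-based entities."""
    label_map = label_map or {}
    plain_parts, stack, found = [], [], []
    pos = 0     # read position in the tagged string
    offset = 0  # length of plain text emitted so far
    for m in TAG_RE.finditer(tagged):
        chunk = tagged[pos:m.start()]
        plain_parts.append(chunk)
        offset += len(chunk)
        pos = m.end()
        closing, tag = m.group(1), m.group(2)
        if not closing:
            stack.append((tag, offset))
        elif stack and stack[-1][0] == tag:
            _, start = stack.pop()
            found.append({"start": start, "end": offset,
                          "labels": [label_map.get(tag, tag)]})
        # Closing tags that do not match the open tag are dropped in this sketch.
    plain_parts.append(tagged[pos:])
    plain = "".join(plain_parts)
    for ann in found:
        ann["text"] = plain[ann["start"]:ann["end"]]
    found.sort(key=lambda a: (a["start"], -a["end"]))
    return {"plain_text": plain, "entities": found}

result = extract_annotations(
    "Visit <loc>Paris</loc> and see the <landmark>Eiffel Tower</landmark>.",
    label_map={"loc": "Location", "landmark": "Location"},
)
print(result["plain_text"])  # Visit Paris and see the Eiffel Tower.
print([(e["start"], e["end"], e["text"]) for e in result["entities"]])
# [(6, 11, 'Paris'), (24, 36, 'Eiffel Tower')]
```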

Download files

Source Distribution

span_aligner-0.1.0.tar.gz (17.2 kB)

Built Distribution

span_aligner-0.1.0-py3-none-any.whl (15.4 kB)

File details

Details for the file span_aligner-0.1.0.tar.gz.

File metadata

  • Download URL: span_aligner-0.1.0.tar.gz
  • Upload date:
  • Size: 17.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for span_aligner-0.1.0.tar.gz
Algorithm Hash digest
SHA256 83e478f7e9489fe25b83654c5327d9f4a431699ad590f796e20b435bd5c0437e
MD5 edf335dd9632579dfe2e647985f602b4
BLAKE2b-256 4f7e388403905ba688d2c9ec5ba0228bce5ee4511b8f67711eec8c239b24e927

File details

Details for the file span_aligner-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: span_aligner-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 15.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for span_aligner-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e8147c1885f91f9d9d8f1e43dda06c63e7d96b30288db77fc9cee1bc9499a766
MD5 ae2bb2b42da4dcb78950066a9388473b
BLAKE2b-256 0a5764fd623ebf44feee7cbfa93bfa216af982aafa84ca455a3a4962857452b6
