
A utility for aligning and mapping text spans between different text representations.

Project description

span-aligner

PyPI: span-aligner

Span Projecting & Alignment

span-aligner is a Python utility for aligning and mapping text spans between different versions of a text, and for projecting annotations across languages using semantic alignment. It supports both monolingual (fuzzy/regex) and cross-lingual (embedding-based) span mapping, making it ideal for tasks like annotation transfer, text comparison, and multilingual NLP workflows.

Features

The library consists of three main classes:

  1. SentenceAligner: Handles the low-level token alignment between two strings. It computes similarity matrices using embeddings and applies matching algorithms (like Max Weight Matching) to find the best token correspondence.
  2. SpanAligner: A utility class for converting between annotation formats. It handles the conversion between Label Studio-style JSON annotations and XML-tagged text strings (e.g., <label>text</label>).
  3. SpanProjector: The high-level orchestrator that uses SentenceAligner to project spans from source to target. It handles token clustering and span boundary reconstruction.

Installation

Install:

pip install span-aligner

Quick Start: Span Projection

The SpanProjector is the primary class for projecting annotations. It uses semantic alignment based on transformer embeddings (BERT, XLM-R) to map tokens between sentences.

from span_aligner import SpanProjector
from IPython.display import display, HTML

# Initialize the projector
# model: 'bert' (multilingual cased) or 'xlmr' / 'xlmr-large'
# token_type: 'bpe' (recommended for transformers) or 'word'
projector = SpanProjector(
    src_lang="en", 
    tgt_lang="nl", 
    model="bert", 
    token_type="bpe",
    matching_method="inter"
)

text_src = """This is a long and winded sentence to test the mapping of 'Ghent, Belgium' to the target text in Dutch.
With some additional context to make it more complex and realistic for the span projection process."""

text_tgt = """Dit is een lange en omslachtige zin om de mapping van 'Gent, België' naar de doeltaal in het Nederlands te testen.
Met wat extra context om het complexer en realistischer te maken voor het spanprojectieproces."""

spans = [
    {"start": 59, "end": 73, "text": "Ghent, Belgium", "labels": ["LOC"]},
]

# 1. Project Spans with Visualization (for Jupyter Notebooks)
# Returns both the projected spans and a list of HTML strings visualizing the alignment process
projected_spans, renderings = projector.project_spans_with_renderings(text_src, text_tgt, spans)

print("Projected Spans:")
print(projected_spans)

# In a notebook, use display(HTML(...)) to see the alignment details
if renderings:
    display(HTML(renderings[0]))

Output (Console):

Projected Spans:
[{'start': 55, 'end': 67, 'text': 'Gent, België', 'labels': ['LOC']}]

Output (Visualization):

The method project_spans_with_renderings returns rich HTML that visualizes the alignment candidates and the chosen path.

Each cell lists the target token index, the token, its similarity score, and the candidate paths it belongs to:

Rank \ Src Idx   0 "Ghent"                          1 ","                           2 "Belgium"
0                12 Gent (0.88) [P0,P1,P2,P3,P4]    13 , (0.94) [P0,P1,P2,P3,P4]    14 België (0.95) [P0]
1                14 België (0.85)                   -                               12 Gent (0.86) [P1]
2                13 , (0.84)                        -                               13 , (0.83) [P2]
3                15 ' (0.79)                        -                               21 Nederlands (0.79) [P4]
4                9 mapping (0.78)                   -                               15 ' (0.78) [P3]

Top Paths:

  • P0 Score: 19.19
    12(s:0,r:0) → 13(s:1,r:0) → 14(s:2,r:0)
  • P1 Score: 18.34
    12(s:0,r:0) → 13(s:1,r:0) → 12(s:2,r:1)
  • P2 Score: 16.99
    12(s:0,r:0) → 13(s:1,r:0) → 13(s:2,r:2)
  • P3 Score: 16.50
    12(s:0,r:0) → 13(s:1,r:0) → 15(s:2,r:4)
  • P4 Score: 10.59
    12(s:0,r:0) → 13(s:1,r:0) → 21(s:2,r:3)

For standard projection without visualization, you can use projector.project_spans(text_src, text_tgt, spans).

Usage

The package span_aligner provides two user-facing classes, SpanAligner and SpanProjector, built on top of the low-level SentenceAligner.

  • SpanAligner: Uses regex and fuzzy search. It is highly efficient but restricted to monolingual tasks (same language). It serves as a strong baseline for correcting boundary offsets or mapping annotations between slightly different versions of a text.

  • SpanProjector: Uses word embeddings (Transformers) to align tokens semantically. It supports cross-lingual projection and handles significant paraphrasing. However, it is computationally more expensive.

    • Complexity: The mwmf (Max Weight Matching) algorithm runs in O(n³), so execution time grows cubically with text length. The default inter method is much faster.
    • Use Case: Use when languages differ or when textual differences are too great for fuzzy matching.

Optimization & Best Practices

To achieve the best results while managing computational cost, follow these guidelines:

1. Choose the Right Tool for the Job

If the source and target texts are in the same language, always start with SpanAligner. It is significantly faster and creates precise splits. Only switch to SpanProjector if fuzzy matching fails due to low textual overlap.

2. Manage Text Length (Chunking)

The SpanProjector (specifically with mwmf) struggles with very long sequences.

  • Split Texts: Break documents into logical segments (e.g., paragraphs, decisions, list items) before projection.
  • Project Locally: Align spans within their corresponding segments rather than projecting a small span against an entire document.
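The chunking step itself is plain string work. Below is a minimal, self-contained sketch of the bookkeeping it requires; the helper names (`split_segments`, `localize_spans`) and the blank-line splitting convention are illustrative, not part of the library:

```python
def split_segments(text):
    """Split a document into (start_offset, segment) pairs on blank lines."""
    segments, pos = [], 0
    for seg in text.split("\n\n"):
        segments.append((pos, seg))
        pos += len(seg) + 2  # account for the "\n\n" separator
    return segments

def localize_spans(spans, segments):
    """Assign each document-level span to its segment, converting to local offsets."""
    local = []
    for span in spans:
        for offset, seg in segments:
            if offset <= span["start"] and span["end"] <= offset + len(seg):
                local.append({**span,
                              "segment_offset": offset,
                              "start": span["start"] - offset,
                              "end": span["end"] - offset})
                break
    return local
```

Each localized span can then be projected against its corresponding target segment (e.g. with projector.project_spans), and the result shifted back by the target segment's offset.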

3. Select the Appropriate Algorithm

  • mwmf (Max Weight Matching): The gold standard. Finds the globally optimal alignment but is slow. Use for final, high-quality output on segmented text.
  • inter (Intersection): Much faster. Works excellently for short, distinct spans (e.g., named entities like persons, locations, dates) where context is less critical.
  • itermax: A balanced heuristic that offers better speed than mwmf with comparable quality for many tasks.

4. Translation-Assisted Projection (Hybrid Approach)

If direct cross-lingual projection yields subpar results, consider an intermediate translation step to simplify the alignment task:

  1. Translate Source: Use an LLM or NMT model to translate the annotated source text (or just the spans) into the target language.
  2. Align Locally: Use SpanAligner (or SpanProjector with inter) to map the translated spans onto the actual target text.

Tip: The translation should mimic the vocabulary of the target text as closely as possible.

  • Workflow: annotated_source + target_text → LLM → rough_translated_source → SpanAligner → final_annotated_target
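That workflow is essentially function composition around two external steps. A minimal sketch, where `translate_fn` and `align_fn` are stand-ins for the NMT/LLM call and the SpanAligner mapping (both names are illustrative, not library API):

```python
def hybrid_project(annotated_source, target_text, translate_fn, align_fn):
    """Translation-assisted projection: translate first, then align locally.

    translate_fn and align_fn are placeholders for an external NMT/LLM call
    and a SpanAligner-style fuzzy mapping, respectively.
    """
    rough_translation = translate_fn(annotated_source, target_text)
    return align_fn(rough_translation, target_text)
```

Passing the target text into the translation step lets the prompt steer the translation toward the target's vocabulary, per the tip above.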

Span Aligner

Utilities for exact and fuzzy span mapping.

Get Annotations from Tagged Text

Extract structured spans and entities from a string with inline tags.

from span_aligner import SpanAligner

tagged_input = "<administrative_body>Environmental Committee</administrative_body> discussed the <impact_location>central park</impact_location> renovation on <publication_date>2025-12-15</publication_date>."

ner_map = {
    "administrative_body": "ADMINISTRATIVE BODY",
    "publication_date": "PUBLICATION DATE",
    "impact_location": "PRIMARY LOCATION"
}

span_map = {
    "motivation": "MOTIVATION"
}

annotations = SpanAligner.get_annotations_from_tagged_text(
    tagged_input,
    ner_map=ner_map,
    span_map=span_map
)

print(annotations["entities"])
# Output:
#[
#    {'start': 0, 'end': 23, 'text': 'Environmental Committee', 'labels': ['ADMINISTRATIVE BODY']},
#    {'start': 38, 'end': 50, 'text': 'central park', 'labels': ['PRIMARY LOCATION']},
#    {'start': 65, 'end': 75, 'text': '2025-12-15', 'labels': ['PUBLICATION DATE']}
#]
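Conceptually, this kind of extraction boils down to walking the tagged string and recording plain-text offsets as the tags are stripped. A self-contained sketch of that idea (not the library's implementation):

```python
import re

TAG_RE = re.compile(r"<(/?)([a-z_]+)>")

def extract_spans(tagged):
    """Strip inline tags, recording plain-text offsets for each tagged region."""
    plain, spans, open_spans, pos, last = [], [], {}, 0, 0
    for m in TAG_RE.finditer(tagged):
        chunk = tagged[last:m.start()]
        plain.append(chunk)
        pos += len(chunk)
        closing, name = m.group(1), m.group(2)
        if not closing:
            open_spans[name] = pos       # remember where this tag opened
        else:
            start = open_spans.pop(name)
            spans.append({"start": start, "end": pos,
                          "text": "".join(plain)[start:pos], "labels": [name]})
        last = m.end()
    plain.append(tagged[last:])
    return "".join(plain), spans
```

Note that get_annotations_from_tagged_text additionally applies the ner_map/span_map label renaming, which this sketch omits.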

Rebuild Tagged Text

Reconstruct a string with XML-like tags from raw text and span/entity lists.

from span_aligner import SpanAligner

text = "On 2026-01-12, the Budget Committee finalized the annual report."
# Entity labeled 'administrative_body'; offsets 19-35 cover "Budget Committee" (skipping "the ")
entities = [{"start": 19, "end": 35, "labels": ["administrative_body"]}]

tagged, stats = SpanAligner.rebuild_tagged_text(text, entities=entities)
print(tagged)
# Output: On 2026-01-12, the <administrative_body>Budget Committee</administrative_body> finalized the annual report.
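Rebuilding is the inverse operation: insert tag pairs at the recorded offsets, working right to left so earlier insertions do not shift later offsets. A minimal sketch of that idea, independent of the library's actual implementation:

```python
def insert_tags(text, entities):
    """Wrap each annotated region in <label>...</label> tags.

    Entities are processed right to left so insertions never invalidate
    the offsets of entities that come earlier in the text.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["labels"][0]
        text = (text[:ent["start"]]
                + f"<{label}>" + text[ent["start"]:ent["end"]] + f"</{label}>"
                + text[ent["end"]:])
    return text
```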

Map Tags to Original

Align annotated spans from a tagged string back to their positions in the original text, allowing for noisy text or translation differences.

from span_aligner import SpanAligner

original_text = "Budget Committee met on 2026-01-12 to view\n\n the central park prject."
tagged_text = "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to review the <impact_location>central park</impact_location> project."

mapped_tagged_text = SpanAligner.map_tags_to_original(
    original_text=original_text,
    tagged_text=tagged_text,
    min_ratio=0.7
)
print(mapped_tagged_text)
# Output preserves original text errors:
# "<administrative_body>Budget Committee</administrative_body> met on <publication_date>2026-01-12</publication_date> to view
#  the <impact_location>central park</impact_location> prject."
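The fuzzy step can be approximated with the standard library alone: slide a window over the original text and keep the best difflib.SequenceMatcher ratio at or above min_ratio. This is a simplified sketch of the idea, not the library's algorithm:

```python
from difflib import SequenceMatcher

def fuzzy_find(original, span_text, min_ratio=0.7):
    """Return (start, end) of the window in `original` most similar to `span_text`,
    or None if no window reaches min_ratio."""
    n = len(span_text)
    best_start, best_ratio = None, 0.0
    for i in range(len(original) - n + 1):
        ratio = SequenceMatcher(None, original[i:i + n], span_text).ratio()
        if ratio > best_ratio:
            best_start, best_ratio = i, ratio
    if best_ratio >= min_ratio:
        return best_start, best_start + n
    return None
```

The quadratic scan is fine for short spans; a production version would prune candidate windows first.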

Span Projector

Project annotations from one text to another using semantic alignment (e.g., cross-lingual projection).

The process begins by generating embeddings for both source and target texts, creating a similarity matrix, and finding the optimal set of alignment pairs. Several algorithms are implemented for this matching phase, including mwmf, inter, itermax, fwd, rev, greedy, and threshold.
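The core of the matching phase can be illustrated without any model: given token vectors, build a cosine-similarity matrix and extract one-to-one pairs. The sketch below mirrors only the greedy strategy; mwmf, inter, and the others use different extraction rules:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def greedy_align(src_vecs, tgt_vecs):
    """Greedy one-to-one alignment over a cosine-similarity matrix."""
    sim = [[cosine(u, v) for v in tgt_vecs] for u in src_vecs]
    # Rank all (score, src, tgt) candidates from best to worst
    pairs = sorted(((sim[i][j], i, j)
                    for i in range(len(src_vecs))
                    for j in range(len(tgt_vecs))), reverse=True)
    used_src, used_tgt, alignment = set(), set(), []
    for score, i, j in pairs:
        if i not in used_src and j not in used_tgt:
            alignment.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(alignment)
```

Greedy matching finds a local optimum; mwmf solves the same problem globally at O(n³) cost.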

Project En -> En (Identity/Paraphrase)

Project annotations onto a similar text in the same language. This works like SpanAligner but with improved fuzzy matching.

from span_aligner import SpanProjector

# Initialize projector (uses BERT embeddings by default)
projector = SpanProjector(src_lang="en", tgt_lang="en")

src_text = "The <ent>cat</ent> sat on the mat."
tgt_text = "The cat sat\n\n on th.e mat."

tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
print(tagged_tgt)
# Output: The <ent>cat</ent> sat\n\n on th.e mat.

Project En -> Nl (Cross-Lingual)

Project annotations from an English source text to a Dutch target translation.

from span_aligner import SpanProjector

# Initialize projector
projector = SpanProjector(src_lang="en", tgt_lang="nl")

src_text = """DECISION LIST <contextual_location>Municipality of Zele</contextual_location>
 <administrative_body>Standing Committee</administrative_body> | <contextual_date>June 28, 2021</contextual_date>
  <title>1. Acceptance of candidacies for the examination procedure coordinator of Welfare</title>
  <decision>Acceptance of candidacies for the examination procedure coordinator of Welfare</decision>
  <title>2. Establishment of valuation rules for the integrated entity Municipality and Public Social Welfare Center (OCMW)</title>
  <decision>Establishment of valuation rules for the integrated entity Municipality and OCMW</decision>"""

tgt_text = """BESLUITENLIJST Gemeente Zele Vast bureau | 28 juni 20211.
 1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
 Aanvaarden kandidaturen examenprocedure coördinator Welzijn
 2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW
 Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW"""

tagged_tgt, spans = projector.project_tagged_text(src_text, tgt_text)
print(tagged_tgt)
# Output: BESLUITENLIJST <contextual_location>Gemeente Zele</contextual_location>
# <administrative_body>Vast bureau</administrative_body> <contextual_date>| 28 juni 20211</contextual_date>.
# <title>1. Aanvaarden kandidaturen examenprocedure coördinator Welzijn
# Aanvaarden kandidaturen examenprocedure coördinator</title> Welzijn
# <title>2. Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</title>
# <decision>Vaststelling waarderingsregels geïntegreerde entiteit Gemeente en OCMW</decision>

Pluggable Embedding Backends

The alignment and projection utilities support pluggable embedding backends. This means you can choose between different embedding providers depending on your requirements, hardware, or available APIs. The embedding backend is responsible for converting tokens or words into vector representations used for alignment.

Available Embedding Providers

  • HuggingFace Transformers (default):

    • Use the TransformerEmbeddingProvider for BERT, XLM-R, and similar models.
    • Example: aligner = SentenceAligner(model="bert")
  • Sentence-Transformers:

    • Use the SentenceTransformerProvider for fast, high-quality sentence or token embeddings.
    • Example: aligner = SentenceAligner(embedding_provider=SentenceTransformerProvider(model="paraphrase-multilingual-MiniLM-L12-v2"))
  • Ollama API:

    • Use the OllamaEmbeddingProvider to get embeddings from a local Ollama server (e.g., with embeddinggemma).
    • Example: aligner = SentenceAligner(embedding_provider=OllamaEmbeddingProvider(model="embeddinggemma"))

You can also implement your own embedding provider by subclassing EmbeddingProvider and implementing the required methods.

How to Use a Custom Embedding Provider

from span_aligner.sentence_aligner import SentenceAligner, EmbeddingProvider

class MyEmbeddingProvider(EmbeddingProvider):
    def get_embeddings(self, tokens):
        # Return a numpy array of embeddings for the tokens
        ...

    def get_subword_to_word_map(self, words):
        # Return (subwords, word_map)
        ...

aligner = SentenceAligner(embedding_provider=MyEmbeddingProvider())

Switching Providers

You can switch embedding providers by passing the desired provider to SentenceAligner or SpanProjector via the embedding_provider argument. If not provided, the default is a HuggingFace transformer model.


Sentence Aligner

Low-level class for aligning tokens between two texts (sentences or paragraphs) using transformer embeddings. Based on the work of simalign but optimized for span mapping (partial alignment instead of full text) and customized for different embedding providers (Ollama, SaaS providers, Transformers, Sentence-Transformers).

Initialize Aligner

from span_aligner import SentenceAligner

# Use bert embeddings (default) with BPE tokenization
aligner = SentenceAligner(model="bert", token_type="bpe") 

text_src = "1. Approval of the minutes of the previous meeting"
text_tgt = "1. Goedkeuring notulen van de voorgaande vergadering"

Get Text Embeddings

Retrieve tokens and embedding vectors for a string.

tokens_src, vecs_src = aligner.get_text_embeddings(text_src)
print(f"Src tokens: {len(tokens_src)}, Vectors: {vecs_src.shape}")
# Output: Src tokens: 10, Vectors: (12, 768)

Align Partial Substring

Find the alignment of a specific substring from source to target.

# Align "simple test"
res_sub = aligner.align_texts_partial_substring(text_src, text_tgt, "minutes of the previous meeting")
print("==============================")
for src, tgt in res_sub.alignments.get("inter"):
    print(f"Aligned: '{src}' {res_sub.src_tokens[src].text}-> '{tgt}' {res_sub.tgt_tokens[tgt].text}")
# Output:
# ==============================
# Aligned: '0' - 'minutes'-> '3' - 'notulen'
# Aligned: '1' - 'of'-> '4' - 'van'
# Aligned: '2' - 'the'-> '5' - 'de'
# Aligned: '3' - 'previous'-> '6' - 'voorgaande'
# Aligned: '4' - 'meeting'-> '7' - 'vergadering'

Configuration & Advanced Usage

Embedding Models

The model parameter supports common transformer models:

  • "bert": bert-base-multilingual-cased (Default, robust multilingual performance)
  • "xlmr": xlm-roberta-base (Strong cross-lingual transfer)
  • "xlmr-large": xlm-roberta-large (Higher accuracy, more resource intensive)
# Use xlm-roberta-base
projector = SpanProjector(model="xlmr")

Matching Algorithms

The matching_method parameter controls how the token similarity matrix is converted into an alignment.

  • "mwmf" (Max Weight Matching): Finds the global optimal independent edge set. Best quality, O(n³) complexity.
  • "inter" (Intersection): Intersection of forward and backward attention. High precision, lower recall, very fast.
  • "itermax" (Iterative Max): Heuristic iterative maximization. Good speed/quality balance.
  • "greedy" (Greedy): Selects best matches greedily. Fast but local optimum.
# Trade accuracy for speed with 'inter'
projector = SpanProjector(matching_method="inter")

Tokenization: BPE vs Word

  • token_type="bpe" (Recommended): Uses the transformer's subword tokenizer (e.g. WordPiece). Handles rare words better and aligns closer to the model's internal representation.
  • token_type="word": Splits by whitespace/punctuation. Simpler, but can result in [UNK] tokens for transformers.
