Semantic alignment between two lists of strings

Semantic Text Aligner

This project aligns two lists of strings using embedding-based dynamic time warping (DTW) and stitches chunked alignment outputs back into a single global alignment. It returns a list[tuple[str | None, str | None]] where each row pairs items from the left/right lists or None when a gap is inserted.

What it solves

  • Sequence alignment: core.py embeds tokens with litellm and runs a DTW-style dynamic program with a gap penalty to align arbitrary lists of strings.
  • Chunk stitching: stitcher.py reconstructs a full alignment from overlapping chunks. It uses a canonical anchor (the first row in which both sides are non-None) to synchronize overlaps, merges compatible gap blocks, and upgrades gap-only rows when the corresponding token appears in the new chunk.
  • Degenerate regions: gap-only rows can appear in different orders across chunks; anchor-based overlap matching normalizes them.

Key concepts

  • Aligned row: (left_item, right_item) where each element is str | None.
  • Chunk: List of aligned rows for a span of the inputs, extracted with an overlap window.
  • Overlap window: The O rows shared between consecutive chunks; the stitcher searches at most W_max = 2 * O rows on each side of a chunk boundary.
  • Canonical anchor: First row in a new chunk with both sides non-None; used to re-synchronize.
  • Permutable gap block: A run of consecutive rows that each contain exactly one None. The rows may appear in a different order in different chunks, but the items within each column keep their relative order.

APIs

  • One-shot or chunked alignment:
    semantic_text_aligner.core.align_string_lists(input_data, chunk_size=None, overlap_size=None, gap_penalty=0.1, model="ollama/nomic-embed-text")
    • chunk_size=None runs a single full alignment.
    • When chunked, overlap_size=None defaults to min(4, chunk_size // 2) and stitched output is returned.
  • Low-level alignment (no stitching): semantic_text_aligner.aligner.align_sequences(...)
  • Stitching helpers: stitch_two_chunks (returns rows to append) and stitch_all_chunks.

How stitching works

  1. Take tail_window = prev_tail[-W_max:] and head_window = new_chunk[:W_max] where W_max = 2 * overlap_size.
  2. Find the canonical anchor in head_window. If there is none, append new_chunk as-is.
  3. Locate a compatible anchor in tail_window (exact match or mergeable gap/full rows).
  4. Merge compatible overlap rows, upgrading gaps when the other chunk supplies the missing token, stopping on conflicts.
  5. Rewrite any affected tail prefix, drop duplicate head rows already absorbed, trim the accumulator tail as needed, and append the stitched rows.

CLI usage

Align two newline-delimited files (with optional chunking):

python -m semantic_text_aligner.core left.txt right.txt --chunk-size 6 --overlap-size 3

Example

file1.txt:

morning espresso at blue bottle
pay electricity bill online
email project update to sara
weekly team sync meeting
order groceries from citymarket
30-min treadmill run
cook veggie stir fry for dinner
backup laptop to external drive
review quarterly budget spreadsheet
call mom about weekend plans
read 20 pages of novel before bed
update personal to-do list in notion

file2.txt:

morning espresso @ blue bottle
pay electric bill online
sync weekly team meeting
order grocery delivery from city market
warmup walk then 30 min treadmill
prep veggies for stir-fry dinner
stir fry tofu and veggies for dinner
full system backup of laptop
review quarterly budget spreadsheet
call parents about weekend trip
scroll news instead of reading book
clean up and reorganize notion tasks

Command:

python -m semantic_text_aligner.core file1.txt file2.txt

Output:

 1. morning espresso at blue bottle      | morning espresso @ blue bottle
 2. pay electricity bill online          | pay electric bill online
 3. email project update to sara         |
 4. weekly team sync meeting             | sync weekly team meeting
 5. order groceries from citymarket      | order grocery delivery from city market
 6. 30-min treadmill run                 | warmup walk then 30 min treadmill
 7. cook veggie stir fry for dinner      | prep veggies for stir-fry dinner
 8.                                      | stir fry tofu and veggies for dinner
 9.                                      | full system backup of laptop
10. backup laptop to external drive      |
11. review quarterly budget spreadsheet  | review quarterly budget spreadsheet
12. call mom about weekend plans         | call parents about weekend trip
13.                                      | scroll news instead of reading book
14.                                      | clean up and reorganize notion tasks
15. read 20 pages of novel before bed    |
16. update personal to-do list in notion |

Quick one-off alignment inline:

python - <<'PY'
from semantic_text_aligner.core import align_string_lists
left = ["a", "b", "c"]
right = ["a", "c", "d"]
print(align_string_lists((left, right)))
PY

Stitch multiple pre-aligned chunks (uses fixtures from tests/fixtures/case_mixed1.py):

python -m semantic_text_aligner.stitcher --case-indices 0,1,2 --overlap-size 2

Programmatic example

from semantic_text_aligner.core import align_string_lists

left = ["dog", "pizza", "house", "balloon"]
right = ["cat", "mouse", "pizza pie", "home"]

# Full alignment in one shot
full = align_string_lists((left, right))

# Or align in overlapping chunks for memory control
chunked = align_string_lists((left, right), chunk_size=3, overlap_size=None)

Testing

Run the unit suite (covers the four canonical GOAL1 cases plus overlap/gap upgrades and edge cases):

python -m unittest discover -s tests -p 'test*.py'


Development notes

  • Stitching is append-oriented: stitch_two_chunks returns only the rows to append; callers manage the accumulator.
  • The gap-upgrade logic treats compatible rows as equal if one side is None and the other provides the same token; conflicts stop overlap growth.
  • Overlap trimming is bounded to the tail window to keep memory use predictable.
