semantic alignment between 2 lists of strings
Semantic Text Aligner
This project aligns two lists of strings using embedding-based dynamic time warping (DTW) and stitches chunked alignment outputs back into a single global alignment. It returns a list[tuple[str | None, str | None]] where each row pairs items from the left/right lists or None when a gap is inserted.
What it solves
- Sequence alignment: `core.py` embeds tokens with `litellm` and runs a DTW-style dynamic program with a gap penalty to align arbitrary lists of strings.
- Chunk stitching: `stitcher.py` reconstructs a full alignment from overlapping chunks. It uses a canonical anchor (the first row with both sides non-`None`) to synchronize overlaps, merges compatible gap blocks, and upgrades gap-only rows when the corresponding token appears in the new chunk.
- Degenerate regions: gap-only rows can appear in different orders across chunks; the stitcher normalizes them via anchor-based overlap matching.
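To make the sequence-alignment idea concrete, here is a minimal sketch of a gap-penalized dynamic program of the same general kind. It is not the package's implementation: `toy_similarity` and `align_with_gaps` are hypothetical names, and a character-level `difflib` ratio stands in for the embedding similarity that `core.py` obtains through `litellm`.

```python
from difflib import SequenceMatcher
from typing import Optional

# Equivalent to tuple[str | None, str | None]
Row = tuple[Optional[str], Optional[str]]


def toy_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a, b).ratio()


def align_with_gaps(left: list[str], right: list[str], gap_penalty: float = 0.1) -> list[Row]:
    """Globally align two lists, paying gap_penalty for each unmatched item."""
    n, m = len(left), len(right)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    # Borders: everything seen so far is aligned against gaps.
    for i in range(1, n + 1):
        cost[i][0], move[i][0] = cost[i - 1][0] + gap_penalty, "up"
    for j in range(1, m + 1):
        cost[0][j], move[0][j] = cost[0][j - 1] + gap_penalty, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair_cost = 1.0 - toy_similarity(left[i - 1], right[j - 1])
            options = [
                (cost[i - 1][j - 1] + pair_cost, "diag"),  # pair the two items
                (cost[i - 1][j] + gap_penalty, "up"),      # left item gets a gap
                (cost[i][j - 1] + gap_penalty, "left"),    # right item gets a gap
            ]
            cost[i][j], move[i][j] = min(options, key=lambda o: o[0])
    # Follow the recorded moves back from the bottom-right corner.
    rows: list[Row] = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            rows.append((left[i - 1], right[j - 1]))
            i, j = i - 1, j - 1
        elif step == "up":
            rows.append((left[i - 1], None))
            i -= 1
        else:
            rows.append((None, right[j - 1]))
            j -= 1
    rows.reverse()
    return rows
```

With a small gap penalty, dissimilar items are left unpaired instead of being forced together, which is the behaviour the gap rows in the output reflect.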
Key concepts
- Aligned row: `(left_item, right_item)` where each element is `str | None`.
- Chunk: List of aligned rows for a span of the inputs, extracted with an overlap window.
- Overlap window: Size `O`; maximum search window `W_max = 2 * O`.
- Canonical anchor: First row in a new chunk with both sides non-`None`; used to re-synchronize.
- Permutable gap block: Consecutive rows with exactly one `None`; these may reorder between chunks but must preserve column order.
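The concepts above can be expressed as a few small predicates. This is an illustrative sketch only; the helper names (`is_full`, `is_gap`, `canonical_anchor_index`, `gap_blocks`, `blocks_equivalent`) are hypothetical and are not taken from the package.

```python
from typing import Optional

# Equivalent to tuple[str | None, str | None]
Row = tuple[Optional[str], Optional[str]]


def is_full(row: Row) -> bool:
    """Both sides are present."""
    return row[0] is not None and row[1] is not None


def is_gap(row: Row) -> bool:
    """Exactly one side is None."""
    return (row[0] is None) != (row[1] is None)


def canonical_anchor_index(chunk: list[Row]) -> Optional[int]:
    """Index of the first full row in the chunk, or None if there is none."""
    return next((i for i, row in enumerate(chunk) if is_full(row)), None)


def gap_blocks(chunk: list[Row]) -> list[tuple[int, int]]:
    """Half-open (start, end) index ranges of maximal runs of gap rows."""
    blocks, start = [], None
    for idx, row in enumerate(chunk):
        if is_gap(row):
            if start is None:
                start = idx
        elif start is not None:
            blocks.append((start, idx))
            start = None
    if start is not None:
        blocks.append((start, len(chunk)))
    return blocks


def blocks_equivalent(a: list[Row], b: list[Row]) -> bool:
    """Two gap blocks match if each column lists the same tokens in the same order."""
    def column(block: list[Row], col: int) -> list[str]:
        return [row[col] for row in block if row[col] is not None]
    return column(a, 0) == column(b, 0) and column(a, 1) == column(b, 1)
```

Under this reading, `[("a", None), (None, "x")]` and `[(None, "x"), ("a", None)]` are the same block up to permutation, while swapping two left-only rows is not allowed because it changes the left column's order.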
APIs
- One-shot or chunked alignment: `semantic_text_aligner.core.align_string_lists(input_data, chunk_size=None, overlap_size=None, gap_penalty=0.1, model="ollama/nomic-embed-text")`
  - `chunk_size=None` runs a single full alignment.
  - When chunked, `overlap_size=None` defaults to `min(4, chunk_size // 2)` and stitched output is returned.
- Low-level alignment (no stitching): `semantic_text_aligner.aligner.align_sequences(...)`
- Stitching helpers: `stitch_two_chunks` (returns rows to append) and `stitch_all_chunks`.
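The default overlap rule is simple enough to spell out. The sketch below reproduces `min(4, chunk_size // 2)` and shows one plausible way overlapping spans could be laid out over a single list. `resolve_overlap` and `chunk_spans` are hypothetical helpers, and how the package actually slices the two inputs is an assumption here.

```python
from typing import Optional


def resolve_overlap(chunk_size: int, overlap_size: Optional[int] = None) -> int:
    """Apply the documented default when overlap_size is None."""
    if overlap_size is None:
        return min(4, chunk_size // 2)
    return overlap_size


def chunk_spans(n_items: int, chunk_size: int, overlap_size: Optional[int] = None) -> list[tuple[int, int]]:
    """Half-open index spans of length chunk_size that share `overlap` items (assumed layout)."""
    overlap = resolve_overlap(chunk_size, overlap_size)
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, n_items)
        spans.append((start, end))
        if end >= n_items:
            break
        start += step
    return spans
```

For example, `chunk_size=6` gives a default overlap of 3, and twelve items would be covered by the spans `(0, 6)`, `(3, 9)`, and `(6, 12)` under this layout.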
How stitching works
- Take `tail_window = prev_tail[-W_max:]` and `head_window = new_chunk[:W_max]` where `W_max = 2 * overlap_size`.
- Find canonical anchor in `head_window`. If none, append `new_chunk` as-is.
- Locate a compatible anchor in `tail_window` (exact match or mergeable gap/full rows).
- Merge compatible overlap rows, upgrading gaps when the other chunk supplies the missing token, stopping on conflicts.
- Rewrite any affected tail prefix, drop duplicate head rows already absorbed, trim the accumulator tail as needed, and append the stitched rows.
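The steps above can be sketched in simplified form. This is not the code in `stitcher.py`: `stitch_sketch`, `compatible`, and `merge` are hypothetical names, the anchor search only handles exact matches, head rows before the anchor are assumed to be duplicates, and the function returns the whole stitched list rather than only the rows to append.

```python
from typing import Optional

# Equivalent to tuple[str | None, str | None]
Row = tuple[Optional[str], Optional[str]]


def compatible(a: Row, b: Row) -> bool:
    """No column disagrees, and at least one column carries the same token."""
    no_conflict = all(x is None or y is None or x == y for x, y in zip(a, b))
    shares_token = any(x is not None and x == y for x, y in zip(a, b))
    return no_conflict and shares_token


def merge(a: Row, b: Row) -> Row:
    """Fill each None from the other row (a gap upgrade)."""
    return (a[0] if a[0] is not None else b[0], a[1] if a[1] is not None else b[1])


def stitch_sketch(accumulated: list[Row], new_chunk: list[Row], overlap_size: int) -> list[Row]:
    w_max = 2 * overlap_size
    tail_start = max(0, len(accumulated) - w_max)
    head_window = new_chunk[:w_max]
    # Canonical anchor: first row in the head window with both sides present.
    head_idx = next(
        (i for i, r in enumerate(head_window) if r[0] is not None and r[1] is not None),
        None,
    )
    if head_idx is None:
        return accumulated + new_chunk
    anchor = head_window[head_idx]
    # Simplification: look only for an exact copy of the anchor in the tail window.
    tail_idx = next(
        (i for i in range(tail_start, len(accumulated)) if accumulated[i] == anchor),
        None,
    )
    if tail_idx is None:
        return accumulated + new_chunk
    # Walk both sides forward from the anchor, merging until a conflict appears.
    merged: list[Row] = []
    i, j = tail_idx, head_idx
    while i < len(accumulated) and j < len(new_chunk) and compatible(accumulated[i], new_chunk[j]):
        merged.append(merge(accumulated[i], new_chunk[j]))
        i, j = i + 1, j + 1
    # Assumption: unmerged tail rows are kept, then the rest of the new chunk follows.
    return accumulated[:tail_idx] + merged + accumulated[i:] + new_chunk[j:]
```

Requiring a shared token in `compatible` prevents two unrelated gap rows such as `("a", None)` and `(None, "b")` from being fused into a pairing that neither chunk produced.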
CLI usage
Align two newline-delimited files (with optional chunking):
python -m semantic_text_aligner.core left.txt right.txt --chunk-size 6 --overlap-size 3
Example
file1.txt:
morning espresso at blue bottle
pay electricity bill online
email project update to sara
weekly team sync meeting
order groceries from citymarket
30-min treadmill run
cook veggie stir fry for dinner
backup laptop to external drive
review quarterly budget spreadsheet
call mom about weekend plans
read 20 pages of novel before bed
update personal to-do list in notion
file2.txt:
morning espresso @ blue bottle
pay electric bill online
sync weekly team meeting
order grocery delivery from city market
warmup walk then 30 min treadmill
prep veggies for stir-fry dinner
stir fry tofu and veggies for dinner
full system backup of laptop
review quarterly budget spreadsheet
call parents about weekend trip
scroll news instead of reading book
clean up and reorganize notion tasks
python -m semantic_text_aligner.core file1.txt file2.txt
Output:
1. morning espresso at blue bottle | morning espresso @ blue bottle
2. pay electricity bill online | pay electric bill online
3. email project update to sara |
4. weekly team sync meeting | sync weekly team meeting
5. order groceries from citymarket | order grocery delivery from city market
6. 30-min treadmill run | warmup walk then 30 min treadmill
7. cook veggie stir fry for dinner | prep veggies for stir-fry dinner
8. | stir fry tofu and veggies for dinner
9. | full system backup of laptop
10. backup laptop to external drive |
11. review quarterly budget spreadsheet | review quarterly budget spreadsheet
12. call mom about weekend plans | call parents about weekend trip
13. | scroll news instead of reading book
14. | clean up and reorganize notion tasks
15. read 20 pages of novel before bed |
16. update personal to-do list in notion |
Quick one-off alignment inline:
python - <<'PY'
from semantic_text_aligner.core import align_string_lists
left = ["a", "b", "c"]
right = ["a", "c", "d"]
print(align_string_lists((left, right)))
PY
Stitch multiple pre-aligned chunks (uses fixtures from tests/fixtures/case_mixed1.py):
python -m semantic_text_aligner.stitcher --case-indices 0,1,2 --overlap-size 2
Programmatic example
from semantic_text_aligner.core import align_string_lists
left = ["dog", "pizza", "house", "balloon"]
right = ["cat", "mouse", "pizza pie", "home"]
# Full alignment in one shot
full = align_string_lists((left, right))
# Or align in overlapping chunks for memory control
chunked = align_string_lists((left, right), chunk_size=3, overlap_size=None)
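Because every row is a `(left, right)` pair with `None` marking a gap, the result is easy to post-process. The sketch below separates matched pairs from unmatched items. `split_rows` is a hypothetical helper, and the sample rows are copied from the example output earlier in this README instead of being computed.

```python
from typing import Optional

# Equivalent to tuple[str | None, str | None]
Row = tuple[Optional[str], Optional[str]]


def split_rows(rows: list[Row]) -> tuple[list[tuple[str, str]], list[str], list[str]]:
    """Return (matched pairs, left-only items, right-only items)."""
    matched = [(l, r) for l, r in rows if l is not None and r is not None]
    left_only = [l for l, r in rows if l is not None and r is None]
    right_only = [r for l, r in rows if l is None and r is not None]
    return matched, left_only, right_only


# Rows taken from the example output (not computed here).
sample_rows: list[Row] = [
    ("pay electricity bill online", "pay electric bill online"),
    ("email project update to sara", None),
    ("weekly team sync meeting", "sync weekly team meeting"),
    (None, "stir fry tofu and veggies for dinner"),
]

matched, left_only, right_only = split_rows(sample_rows)
print(len(matched), left_only, right_only)
```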
Testing
-------
Run the unit suite (covers the four canonical GOAL1 cases plus overlap/gap upgrades and edge cases):
python -m unittest discover -s tests -p 'test*.py'
Development notes
-----------------
- Stitching is append-oriented: `stitch_two_chunks` returns only the rows to append; callers manage the accumulator.
- The gap-upgrade logic treats compatible rows as equal if one side is `None` and the other provides the same token; conflicts stop overlap growth.
- Overlap trimming is bounded to the tail window to keep memory use predictable.
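
The append-oriented pattern in the first and last notes can be illustrated with a small driver loop. Everything here is an assumption made for illustration: `stitch_chunks` and `drop_seen_rows` are hypothetical, the `(tail_window, new_chunk) -> rows_to_append` signature is not the documented signature of `stitch_two_chunks`, and the toy stitcher only removes rows that literally repeat the end of the tail.

```python
from typing import Callable, Optional

# Equivalent to tuple[str | None, str | None]
Row = tuple[Optional[str], Optional[str]]
# Hypothetical signature: (tail_window, new_chunk) -> rows to append.
StitchFn = Callable[[list[Row], list[Row]], list[Row]]


def drop_seen_rows(tail: list[Row], new_chunk: list[Row]) -> list[Row]:
    """Toy stitcher: skip leading rows of new_chunk that already end the tail."""
    for k in range(min(len(tail), len(new_chunk)), 0, -1):
        if tail[-k:] == new_chunk[:k]:
            return new_chunk[k:]
    return list(new_chunk)


def stitch_chunks(chunks: list[list[Row]], overlap_size: int,
                  stitch_fn: StitchFn = drop_seen_rows) -> list[Row]:
    """The caller owns the accumulator; the stitcher sees only a bounded tail window."""
    w_max = 2 * overlap_size
    accumulator: list[Row] = []
    for chunk in chunks:
        # Guard w_max == 0: accumulator[-0:] would return the whole list.
        tail = accumulator[-w_max:] if w_max > 0 else []
        accumulator.extend(stitch_fn(tail, chunk))
    return accumulator
```

Passing only the last `W_max` rows keeps the work per chunk independent of how long the accumulated alignment has grown.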