Library for multiple asymmetric alignments on different alphabets
MAlign
MAlign is a Python library for multiple sequence alignment with asymmetric scoring matrices across different domains. Unlike standard alignment tools, which assume symmetric substitution costs, MAlign supports directional scoring: the cost of aligning symbol A with symbol B can differ from the cost of aligning B with A.
While designed primarily for computational linguistics (e.g., historical phonology, cognate detection), MAlign works with any hashable Python objects and is suitable for general-purpose sequence alignment tasks.
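To make direction-dependent costs concrete, here is a minimal, self-contained sketch of a pairwise Needleman-Wunsch alignment whose score table is keyed by ordered symbol pairs. It does not use MAlign and is not its implementation: the function name `asymmetric_nw`, the dictionary-based score table, and the example scores are all assumptions made for illustration.

```python
GAP = "-"


def asymmetric_nw(seq_a, seq_b, scores, default=-1.0):
    """Globally align two sequences with a direction-dependent score table.

    `scores` maps ordered pairs (symbol_from_a, symbol_from_b) to a score.
    Pairs missing from the table, including gap pairs, fall back to `default`.
    """

    def s(a, b):
        return scores.get((a, b), default)

    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + s(seq_a[i - 1], GAP)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + s(GAP, seq_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + s(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + s(seq_a[i - 1], GAP),
                dp[i][j - 1] + s(GAP, seq_b[j - 1]),
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + s(seq_a[i - 1], seq_b[j - 1]):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + s(seq_a[i - 1], GAP):
            row_a.append(seq_a[i - 1])
            row_b.append(GAP)
            i -= 1
        else:
            row_a.append(GAP)
            row_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], row_a[::-1], row_b[::-1]


# Assumed example scores: t -> tʃ is cheap, tʃ -> t is expensive.
scores = {(x, x): 2.0 for x in ["n", "o", "t", "tʃ", "e"]}
scores[("t", "tʃ")] = 1.0
scores[("tʃ", "t")] = -3.0

old = ["n", "o", "t", "e"]
new = ["n", "o", "tʃ", "e"]
forward = asymmetric_nw(old, new, scores)
backward = asymmetric_nw(new, old, scores)
print(forward)
print(backward)
```

With these assumed scores, swapping the argument order changes the result: the forward alignment pairs `t` with `tʃ` for a total of 7.0, while the backward alignment scores 4.0 and prefers two gaps over pairing `tʃ` with `t`.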
Key Features
- Asymmetric scoring: Direction-dependent alignment costs, with a `from_substitution_counts()` factory for log-odds matrices from observed sound change frequencies
- True multi-alignment: N-dimensional alignment for up to 4 sequences (via YenKSP on N-dim graphs), with automatic UPGMA progressive fallback for larger sets
- Multiple algorithms: Needleman-Wunsch (`anw`) and Yen's k-shortest paths (`yenksp`)
- k-best alignments: Return the top-k optimal alignments, not just the best one
- Matrix learning: Supervised (EM, gradient descent) and unsupervised (`bootstrap_matrix`) from sequence pairs
- Prior-guided learning: Blend phonological feature priors with data-driven scores via linearly-decaying regularization
- Block detection: Detect and merge complementary-gap patterns (diphthongization, metathesis) into compound symbols
- Feature-based scoring: Build matrices from phonological feature distances (via distfeat)
- Matrix imputation: Fill sparse matrices using sklearn-based methods
- Evaluation metrics: Accuracy, precision, recall, and F1 for alignment quality
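The log-odds idea behind scoring from substitution counts can be illustrated without the library. The sketch below shows the general technique only and makes no claim about how MAlign's factory is implemented or what arguments it takes: the function name `log_odds_scores`, the pseudocount smoothing, and the counts themselves are assumptions. Because counts are keyed by ordered (source, target) pairs, the resulting scores are naturally asymmetric.

```python
import math


def log_odds_scores(counts, pseudocount=0.5):
    """Turn directed substitution counts into log-odds scores (base 2).

    score(a, b) = log2( P(a, b) / (P_source(a) * P_target(b)) )
    """
    sources = sorted({a for a, _ in counts})
    targets = sorted({b for _, b in counts})
    # Add a pseudocount to every cell so unseen pairs get a finite score.
    smoothed = {
        (a, b): counts.get((a, b), 0) + pseudocount
        for a in sources
        for b in targets
    }
    total = sum(smoothed.values())
    p_source = {a: sum(smoothed[(a, b)] for b in targets) / total for a in sources}
    p_target = {b: sum(smoothed[(a, b)] for a in sources) / total for b in targets}
    return {
        (a, b): math.log2((value / total) / (p_source[a] * p_target[b]))
        for (a, b), value in smoothed.items()
    }


# Invented counts: t -> tʃ is attested far more often than tʃ -> t.
counts = {("t", "t"): 40, ("t", "tʃ"): 10, ("tʃ", "tʃ"): 30, ("tʃ", "t"): 1}
scores = log_odds_scores(counts)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 2))
```

A pair scores above zero when it occurs more often than its marginal frequencies would predict, and below zero when it occurs less often.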
Installation
pip install malign
For phonological feature-based scoring matrices:
pip install malign[features]
Quick Start
Basic Alignment
import malign
alms = malign.align(["ATTCGGAT", "TACGGATTT"], k=2)
print(malign.tabulate_alms(alms))
Custom Scoring Matrix
matrix = malign.ScoringMatrix.from_sequences(
sequences=[["A", "C", "G", "T"], ["A", "C", "G", "T"]],
match=2.0, mismatch=-1.0, gap_score=-1.5,
)
alms = malign.align(["ACGT", "AGT"], k=1, matrix=matrix)
Full Pipeline: Features to Evaluation
This example shows the complete workflow for linguistic alignment -- building a scoring matrix from phonological feature distances, aligning cognate pairs, and evaluating the results:
import malign
# Build a scoring matrix from phonological feature distances
matrix = malign.ScoringMatrix.from_distfeat(
sequences=[["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
gap="-", gap_score=-1.0,
)
# Align cognate sequences
alms = malign.align(
[["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
k=3, matrix=matrix, method="anw",
)
print(malign.tabulate_alms(alms[:2]))
# Evaluate against gold standard
gold = malign.Alignment(
[("n", "o", "t", "e"), ("n", "o", "tʃ", "e")], score=0.0,
)
print(f"Accuracy: {malign.alignment_accuracy(alms[0], gold):.2%}")
print(f"F1: {malign.alignment_f1(alms[0], gold):.2%}")
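For readers unfamiliar with alignment metrics, the following standalone sketch shows one common way to compute precision, recall, and F1 for a pairwise alignment: by comparing the sets of aligned position pairs. This is an assumed definition for illustration; MAlign's own metric functions may define these quantities differently, and the helper names `aligned_pairs` and `precision_recall_f1` are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (index_in_a, index_in_b) pairs aligned to each other."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        # Advance each position counter only past real (non-gap) symbols.
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def precision_recall_f1(predicted, gold, gap="-"):
    """Compare two pairwise alignments, each given as a (row_a, row_b) tuple."""
    pred_pairs = aligned_pairs(*predicted, gap=gap)
    gold_pairs = aligned_pairs(*gold, gap=gap)
    hits = len(pred_pairs & gold_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold = (["n", "o", "t", "e"], ["n", "o", "tʃ", "e"])
predicted = (["n", "o", "t", "-", "e"], ["n", "o", "-", "tʃ", "e"])
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the predicted alignment splits `t` and `tʃ` into separate gapped columns, so every pair it proposes is correct (precision 1.0) but it recovers only three of the four gold pairs (recall 0.75).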
Matrix Learning from Cognates
cognate_sets = [
[["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
[["f", "a", "t", "o"], ["h", "a", "d", "o"]],
]
matrix = malign.learn_matrix(cognate_sets, method="em", max_iter=10)
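# Optionally regularize with a phonological prior. `prior` must be an existing
# ScoringMatrix, e.g. one built with ScoringMatrix.from_distfeat() as in the
# bootstrap example below.
matrix = malign.learn_matrix(
cognate_sets, method="em", max_iter=10, prior_matrix=prior,
)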
Unsupervised Bootstrap Learning
# No clustering needed -- just pairs of related sequences
pairs = [
(["p", "a", "t", "a"], ["b", "a", "d", "a"]),
(["t", "a", "p", "a"], ["d", "a", "b", "a"]),
(["k", "a", "t", "a"], ["g", "a", "d", "a"]),
]
matrix = malign.bootstrap_matrix(pairs, max_iter=20)
# Optionally blend with a phonological prior
prior = malign.ScoringMatrix.from_distfeat(
sequences=[["p", "t", "k", "b", "d", "g"], ["p", "t", "k", "b", "d", "g"]],
)
matrix = malign.bootstrap_matrix(pairs, max_iter=20, prior_matrix=prior)
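The feature list describes prior-guided learning as blending a prior with data-driven scores via linearly-decaying regularization. The sketch below illustrates what such a schedule could look like in plain Python. The exact schedule MAlign uses is not documented here, so the formula, the function names `prior_weight` and `blend`, and the choice to fall back to the learned score for pairs missing from the prior are all assumptions.

```python
def prior_weight(iteration, max_iter, initial_weight=1.0):
    """Weight given to the prior: starts at initial_weight, decays linearly to 0."""
    if max_iter <= 0:
        return 0.0
    return initial_weight * max(0.0, 1.0 - iteration / max_iter)


def blend(prior, learned, weight):
    """Mix prior and learned scores; pairs absent from the prior keep the learned score."""
    return {
        pair: weight * prior.get(pair, score) + (1.0 - weight) * score
        for pair, score in learned.items()
    }


# Invented score tables keyed by ordered symbol pairs.
prior_scores = {("p", "b"): 2.0}
learned_scores = {("p", "b"): 0.0, ("t", "d"): 1.0}

for iteration in (0, 10, 20):
    weight = prior_weight(iteration, max_iter=20)
    print(iteration, weight, blend(prior_scores, learned_scores, weight))
```

Early iterations lean on the prior, which helps when data is sparse; later iterations let the observed correspondences dominate.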
Block Detection (Diphthongization / Metathesis)
# Merge complementary-gap columns into compound symbols
alms = malign.align([["a"], ["j", "e"]], k=1, merge_blocks=True)
# Sequence 2 gets compound symbol ("j", "e") instead of separate columns
Algorithms
| Method | Description | Best for |
|---|---|---|
| `anw` (default) | Asymmetric Needleman-Wunsch | Pairwise alignment, small k |
| `yenksp` | Yen's k-shortest paths on alignment graph | Large k, diverse alignments |
| `dumb` | Gap-padding baseline | Testing and comparison |
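As a point of reference for what a gap-padding baseline means, the sketch below pads every sequence with trailing gaps up to the length of the longest one. It is only an illustration of the general idea: where MAlign's `dumb` method actually places its gaps is not stated here, and the function name `pad_align` is hypothetical.

```python
def pad_align(sequences, gap="-"):
    """Trivial 'alignment': right-pad each sequence with gaps to a common length."""
    width = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [gap] * (width - len(seq)) for seq in sequences]


rows = pad_align(["ACGT", "AGT"])
for row in rows:
    print(" ".join(row))
```

A baseline like this ignores the scoring matrix entirely, which is what makes it useful as a lower bound when comparing real alignment methods.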
Requirements
- Python >= 3.12
- numpy, scipy, scikit-learn, tabulate, PyYAML
- Optional: distfeat for feature-based scoring
Documentation
Community
Contributions, bug reports, and feature requests are welcome via GitHub issues and pull requests.
Author and Citation
Developed by Tiago Tresoldi (tiago.tresoldi@lingfil.uu.se).
The author has received funding from the Riksbankens Jubileumsfond (grant agreement ID: MXM19-1087:1, Cultural Evolution of Texts).
During the first stages of development, the author received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 715618, Computer-Assisted Language Comparison).
If you use malign, please cite it as:
Tresoldi, Tiago (2026). MALIGN, a library for multiple asymmetric alignments on different domains. Version 0.5. Uppsala: Department of Linguistics and Philology, Uppsala University.
In BibTeX:
@misc{Tresoldi2026malign,
author = {Tresoldi, Tiago},
title = {MALIGN, a library for multiple asymmetric alignments on different domains. Version 0.5},
howpublished = {\url{https://github.com/tresoldi/malign}},
address = {Uppsala},
publisher = {Department of Linguistics and Philology, Uppsala University},
year = {2026},
}
License
MIT License. See LICENSE for details.