Library for multiple asymmetric alignments on different alphabets

MAlign

MAlign is a Python library for multiple sequence alignment with asymmetric scoring matrices across different domains. Unlike standard alignment tools, which assume symmetric substitution costs, MAlign supports directional scoring: the cost of aligning symbol A with symbol B can differ from the cost of aligning B with A.

Although designed primarily for computational linguistics (e.g., historical phonology, cognate detection), MAlign works on sequences of any hashable Python objects and is suitable for general-purpose sequence alignment tasks.

Key Features

  • Asymmetric scoring: Direction-dependent alignment costs, with a from_substitution_counts() factory that builds log-odds matrices from observed sound change frequencies
  • True multi-alignment: N-dimensional alignment for up to 4 sequences (via YenKSP on N-dim graphs), with automatic UPGMA progressive fallback for larger sets
  • Multiple algorithms: Needleman-Wunsch (anw) and Yen's k-shortest paths (yenksp)
  • k-best alignments: Return the k best-scoring alignments, not just the single best one
  • Matrix learning: Supervised (EM, gradient descent) and unsupervised (bootstrap_matrix) from sequence pairs
  • Prior-guided learning: Blend phonological feature priors with data-driven scores via linearly-decaying regularization
  • Block detection: Detect and merge complementary-gap patterns (diphthongization, metathesis) into compound symbols
  • Feature-based scoring: Build matrices from phonological feature distances (via distfeat)
  • Matrix imputation: Fill sparse matrices using sklearn-based methods
  • Evaluation metrics: Accuracy, precision, recall, and F1 for alignment quality
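
To illustrate the idea behind asymmetric log-odds scoring from observed change frequencies, here is a minimal, library-independent sketch. It does not reproduce the internals of from_substitution_counts(); the function name log_odds_scores, the pseudocount smoothing, and the toy counts are all assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_scores(counts, pseudocount=0.5):
    """Turn directional counts {(source, target): n} into log-odds scores.

    Hypothetical helper for illustration only, not part of MAlign.
    """
    sources = sorted({a for a, _ in counts})
    targets = sorted({b for _, b in counts})
    # Smooth every (source, target) cell so unseen pairs get a finite score.
    smoothed = {
        (a, b): counts.get((a, b), 0) + pseudocount
        for a in sources
        for b in targets
    }
    total = sum(smoothed.values())
    # Marginal probabilities of each source and each target symbol.
    p_src, p_tgt = Counter(), Counter()
    for (a, b), c in smoothed.items():
        p_src[a] += c / total
        p_tgt[b] += c / total
    # Log-odds: observed joint probability versus independence.
    return {
        (a, b): math.log2((c / total) / (p_src[a] * p_tgt[b]))
        for (a, b), c in smoothed.items()
    }


# Toy counts: "p" often becomes "b", but "b" rarely becomes "p".
counts = {("p", "p"): 40, ("p", "b"): 10, ("b", "b"): 45, ("b", "p"): 1}
scores = log_odds_scores(counts)
print(scores[("p", "b")], scores[("b", "p")])
```

Because the counts are directional, the resulting score for ("p", "b") is higher than the one for ("b", "p"), which is the kind of asymmetry a symmetric substitution matrix cannot express.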

Installation

pip install malign

For phonological feature-based scoring matrices:

pip install "malign[features]"

Quick Start

Basic Alignment

import malign

alms = malign.align(["ATTCGGAT", "TACGGATTT"], k=2)
print(malign.tabulate_alms(alms))

Custom Scoring Matrix

matrix = malign.ScoringMatrix.from_sequences(
    sequences=[["A", "C", "G", "T"], ["A", "C", "G", "T"]],
    match=2.0, mismatch=-1.0, gap_score=-1.5,
)
alms = malign.align(["ACGT", "AGT"], k=1, matrix=matrix)

Full Pipeline: Features to Evaluation

This example shows the complete workflow for linguistic alignment -- building a scoring matrix from phonological feature distances, aligning cognate pairs, and evaluating the results:

import malign

# Build a scoring matrix from phonological feature distances
matrix = malign.ScoringMatrix.from_distfeat(
    sequences=[["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
    gap="-", gap_score=-1.0,
)

# Align cognate sequences
alms = malign.align(
    [["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
    k=3, matrix=matrix, method="anw",
)
print(malign.tabulate_alms(alms[:2]))

# Evaluate against gold standard
gold = malign.Alignment(
    [("n", "o", "t", "e"), ("n", "o", "tʃ", "e")], score=0.0,
)
print(f"Accuracy: {malign.alignment_accuracy(alms[0], gold):.2%}")
print(f"F1: {malign.alignment_f1(alms[0], gold):.2%}")
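
For readers unfamiliar with alignment evaluation, the following sketch shows one common way to compute precision, recall, and F1 between a predicted and a gold alignment: each column is turned into a tuple of per-sequence positions, and the two column sets are compared. This is an assumption about how such metrics can be defined, not a description of how malign.alignment_f1 is implemented; the helper names are hypothetical.

```python
def columns_as_indices(rows, gap="-"):
    """Represent each alignment column by the positions it aligns (None = gap)."""
    counters = [0] * len(rows)
    columns = set()
    for column in zip(*rows):
        key = []
        for row_idx, symbol in enumerate(column):
            if symbol == gap:
                key.append(None)
            else:
                key.append(counters[row_idx])
                counters[row_idx] += 1
        columns.add(tuple(key))
    return columns


def precision_recall_f1(predicted_rows, gold_rows, gap="-"):
    """Compare two alignments of the same sequences column by column."""
    pred = columns_as_indices(predicted_rows, gap)
    gold = columns_as_indices(gold_rows, gap)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold_rows = [["f", "a", "t", "o"], ["h", "a", "d", "o"]]
predicted_rows = [["f", "a", "t", "-", "o"], ["h", "a", "-", "d", "o"]]
precision, recall, f1 = precision_recall_f1(predicted_rows, gold_rows)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the prediction splits the gold column t/d into two gapped columns, so three of its five columns are correct and three of the four gold columns are recovered.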

Matrix Learning from Cognates

cognate_sets = [
    [["n", "o", "t", "e"], ["n", "o", "tʃ", "e"]],
    [["f", "a", "t", "o"], ["h", "a", "d", "o"]],
]
matrix = malign.learn_matrix(cognate_sets, method="em", max_iter=10)

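# Optionally regularize with a phonological prior
# (built here from feature distances; needs the optional distfeat dependency)
prior = malign.ScoringMatrix.from_distfeat(
    sequences=[["n", "o", "t", "e", "f", "a"], ["n", "o", "tʃ", "e", "h", "a", "d"]],
)
matrix = malign.learn_matrix(
    cognate_sets, method="em", max_iter=10, prior_matrix=prior,
)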

Unsupervised Bootstrap Learning

# No clustering needed -- just pairs of related sequences
pairs = [
    (["p", "a", "t", "a"], ["b", "a", "d", "a"]),
    (["t", "a", "p", "a"], ["d", "a", "b", "a"]),
    (["k", "a", "t", "a"], ["g", "a", "d", "a"]),
]
matrix = malign.bootstrap_matrix(pairs, max_iter=20)

# Optionally blend with a phonological prior
prior = malign.ScoringMatrix.from_distfeat(
    sequences=[["p", "t", "k", "b", "d", "g"], ["p", "t", "k", "b", "d", "g"]],
)
matrix = malign.bootstrap_matrix(pairs, max_iter=20, prior_matrix=prior)
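
The feature list describes prior-guided learning as blending prior and data-driven scores with linearly-decaying regularization. The sketch below shows what such a blend could look like on plain dictionaries. The exact schedule used by MAlign is not documented here, so the function blend_with_prior, the initial_weight parameter, and the toy scores are assumptions for illustration.

```python
def blend_with_prior(data_scores, prior_scores, iteration, max_iter, initial_weight=0.5):
    """Mix data-driven scores with a prior whose weight shrinks linearly to zero.

    Hypothetical helper for illustration only, not part of MAlign.
    """
    # The prior weight starts at initial_weight and reaches 0 at max_iter.
    weight = initial_weight * max(0.0, 1.0 - iteration / max_iter)
    return {
        pair: weight * prior_scores.get(pair, 0.0) + (1.0 - weight) * score
        for pair, score in data_scores.items()
    }


data = {("p", "b"): 2.0, ("p", "k"): -3.0}
prior_scores = {("p", "b"): 0.0, ("p", "k"): -1.0}

early = blend_with_prior(data, prior_scores, iteration=0, max_iter=20)
late = blend_with_prior(data, prior_scores, iteration=20, max_iter=20)
print(early, late)
```

Early iterations lean on the prior, which helps when few pairs have been observed; by the final iteration the scores are purely data-driven.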

Block Detection (Diphthongization / Metathesis)

# Merge complementary-gap columns into compound symbols
alms = malign.align([["a"], ["j", "e"]], k=1, merge_blocks=True)
# Sequence 2 gets compound symbol ("j", "e") instead of separate columns

Algorithms

Method        | Description                               | Best for
anw (default) | Asymmetric Needleman-Wunsch               | Pairwise alignment, small k
yenksp        | Yen's k-shortest paths on alignment graph | Large k, diverse alignments
dumb          | Gap-padding baseline                      | Testing and comparison
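
To make the idea of an asymmetric Needleman-Wunsch concrete, here is a minimal pairwise sketch in pure Python. It is not MAlign's anw implementation (which also handles k-best output and more than two sequences); the function asymmetric_nw, the dictionary-based scores, and the toy values are assumptions. The only change from the textbook algorithm is that the score is looked up as scores[(a, b)], so direction matters.

```python
GAP = "-"


def asymmetric_nw(seq_a, seq_b, scores, gap_score=-1.0):
    """Global alignment where scores[(a, b)] may differ from scores[(b, a)]."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap_score
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + scores[(seq_a[i - 1], seq_b[j - 1])],
                dp[i - 1][j] + gap_score,  # gap in seq_b
                dp[i][j - 1] + gap_score,  # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + scores[(seq_a[i - 1], seq_b[j - 1])]:
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap_score:
            row_a.append(seq_a[i - 1])
            row_b.append(GAP)
            i -= 1
        else:
            row_a.append(GAP)
            row_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], row_a[::-1], row_b[::-1]


symbols = ["n", "o", "t", "tʃ", "e"]
scores = {(a, b): (2.0 if a == b else -1.0) for a in symbols for b in symbols}
scores[("t", "tʃ")] = 1.0    # t -> tʃ is treated as a plausible change
scores[("tʃ", "t")] = -1.5   # the reverse direction is penalized

forward = asymmetric_nw(["n", "o", "t", "e"], ["n", "o", "tʃ", "e"], scores)
backward = asymmetric_nw(["n", "o", "tʃ", "e"], ["n", "o", "t", "e"], scores)
print(forward[0], backward[0])
```

Swapping the two sequences changes the optimal score, because the t/tʃ column is scored differently in each direction.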

Requirements

  • Python >= 3.12
  • numpy, scipy, scikit-learn, tabulate, PyYAML
  • Optional: distfeat for feature-based scoring

Documentation

Community

Contributions, bug reports, and feature requests are welcome via GitHub issues and pull requests.

Author and Citation

Developed by Tiago Tresoldi (tiago.tresoldi@lingfil.uu.se).

The author has received funding from Riksbankens Jubileumsfond (grant agreement ID: MXM19-1087:1, Cultural Evolution of Texts).

During the first stages of development, the author received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 715618, Computer-Assisted Language Comparison).

If you use malign, please cite it as:

Tresoldi, Tiago (2026). MALIGN, a library for multiple asymmetric alignments on different domains. Version 0.5. Uppsala: Department of Linguistics and Philology, Uppsala University.

In BibTeX:

@misc{Tresoldi2026malign,
  author = {Tresoldi, Tiago},
  title = {MALIGN, a library for multiple asymmetric alignments on different domains. Version 0.5},
  howpublished = {\url{https://github.com/tresoldi/malign}},
  address = {Uppsala},
  publisher = {Department of Linguistics and Philology, Uppsala University},
  year = {2026},
}

License

MIT License. See LICENSE for details.
