Multiple Sequence Alignment using Dynamic Programming
Project description
huseyinmsa
A Python library for Multiple Sequence Alignment (MSA) implemented from scratch using Dynamic Programming.
Installation
pip install huseyinmsa
Quick start
from huseyinmsa import MSA, print_alignment
sequences = {
"Seq1": "ACGTACGT",
"Seq2": "ACGACGT",
"Seq3": "ACGT",
}
msa = MSA()
alignment = msa.align(sequences)
print_alignment(alignment)
print("SP Score:", msa.sp_score(alignment))
Reading / writing FASTA files
from huseyinmsa import MSA, read_fasta, write_fasta, print_alignment
sequences = read_fasta("sequences.fasta")
msa = MSA(match=1, mismatch=-1, gap=-2)
alignment = msa.align(sequences)
print_alignment(alignment)
write_fasta(alignment, "aligned.fasta")
Algorithm
The library implements progressive multiple sequence alignment driven entirely by dynamic programming.
Stage 1 — Pairwise distances (Needleman-Wunsch)
Every pair of input sequences is globally aligned using the classic Needleman-Wunsch O(n·m) DP algorithm.
dp[0][j] = j × gap
dp[i][0] = i × gap
dp[i][j] = max(
dp[i-1][j-1] + sub(seq1[i], seq2[j]), # match or mismatch
dp[i-1][j] + gap, # delete from seq1
dp[i][j-1] + gap, # insert into seq1
)
After alignment, sequence identity (identical columns / total columns)
is used to convert the score into a distance: distance = 1 − identity.
Stage 2 — Guide tree (UPGMA)
The n×n distance matrix is clustered with UPGMA:
- Each sequence starts as its own cluster.
- The two closest clusters are merged; their distance to every other cluster is recalculated as the size-weighted average.
- Repeat until a single root cluster remains.
The result is a binary tree (e.g. (('A','B'),'C')) that tells the aligner
which sequences to merge first — always the most similar ones.
Stage 3 — Progressive alignment (profile–profile DP)
Sequences are merged bottom-up along the guide tree. At each merge step, two blocks of already-aligned sequences (profiles) are aligned with a second DP pass. The score between two columns is the average substitution score over all non-gap character pairs; gap penalties are scaled by the fraction of real characters in a column.
Parameters
| Parameter | Default | Meaning |
|---|---|---|
match |
1 |
Score for identical characters |
mismatch |
-1 |
Penalty for mismatched characters |
gap |
-2 |
Linear gap penalty |
Scoring
After alignment, the Sum-of-Pairs (SP) score is available:
score = msa.sp_score(alignment)
It sums the pairwise substitution / gap score across every aligned column for every pair of sequences. Higher values indicate a better alignment.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file huseyin_cakir-0.1.0.tar.gz.
File metadata
- Download URL: huseyin_cakir-0.1.0.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e71c6fd5dbc137f4c9209f725e10c6109a522fff04239a478009aab26957093e
|
|
| MD5 |
abb518d026039ab2262fbae05e7d945d
|
|
| BLAKE2b-256 |
9017e2ae39db29748dfc629484405a7819b179e92d72c94cb877a70c33246006
|
File details
Details for the file huseyin_cakir-0.1.0-py3-none-any.whl.
File metadata
- Download URL: huseyin_cakir-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d8935e82451d7bffbe34e080ae5aceaa660c1b11b6e0d040bb4c8f616916c43c
|
|
| MD5 |
89196b00aa73dfef3c8bcd51825b4476
|
|
| BLAKE2b-256 |
b1eed83dc37dae0fe4800021ec0c33eae0f90cc296843a8c8164ddd0fca3b672
|