Skip to main content

Multiple Sequence Alignment using Dynamic Programming

Project description

huseyinmsa

A Python library for Multiple Sequence Alignment (MSA) implemented from scratch using Dynamic Programming.

Installation

pip install huseyinmsa

Quick start

from huseyinmsa import MSA, print_alignment

sequences = {
    "Seq1": "ACGTACGT",
    "Seq2": "ACGACGT",
    "Seq3": "ACGT",
}

msa = MSA()
alignment = msa.align(sequences)
print_alignment(alignment)
print("SP Score:", msa.sp_score(alignment))

Reading / writing FASTA files

from huseyinmsa import MSA, read_fasta, write_fasta, print_alignment

sequences = read_fasta("sequences.fasta")
msa = MSA(match=1, mismatch=-1, gap=-2)
alignment = msa.align(sequences)

print_alignment(alignment)
write_fasta(alignment, "aligned.fasta")

Algorithm

The library implements progressive multiple sequence alignment driven entirely by dynamic programming.

Stage 1 — Pairwise distances (Needleman-Wunsch)

Every pair of input sequences is globally aligned using the classic Needleman-Wunsch O(n·m) DP algorithm.

dp[0][j] = j × gap
dp[i][0] = i × gap

dp[i][j] = max(
    dp[i-1][j-1] + sub(seq1[i], seq2[j]),   # match or mismatch
    dp[i-1][j]   + gap,                      # delete from seq1
    dp[i][j-1]   + gap,                      # insert into seq1
)

After alignment, sequence identity (identical columns / total columns) is used to convert the score into a distance: distance = 1 − identity.

Stage 2 — Guide tree (UPGMA)

The n×n distance matrix is clustered with UPGMA:

  1. Each sequence starts as its own cluster.
  2. The two closest clusters are merged; their distance to every other cluster is recalculated as the size-weighted average.
  3. Repeat until a single root cluster remains.

The result is a binary tree (e.g. (('A','B'),'C')) that tells the aligner which sequences to merge first — always the most similar ones.

Stage 3 — Progressive alignment (profile–profile DP)

Sequences are merged bottom-up along the guide tree. At each merge step, two blocks of already-aligned sequences (profiles) are aligned with a second DP pass. The score between two columns is the average substitution score over all non-gap character pairs; gap penalties are scaled by the fraction of real characters in a column.

Parameters

Parameter Default Meaning
match 1 Score for identical characters
mismatch -1 Penalty for mismatched characters
gap -2 Linear gap penalty

Scoring

After alignment, the Sum-of-Pairs (SP) score is available:

score = msa.sp_score(alignment)

It sums the pairwise substitution / gap score across every aligned column for every pair of sequences. Higher values indicate a better alignment.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

huseyin_cakir-0.1.0.tar.gz (9.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

huseyin_cakir-0.1.0-py3-none-any.whl (10.5 kB view details)

Uploaded Python 3

File details

Details for the file huseyin_cakir-0.1.0.tar.gz.

File metadata

  • Download URL: huseyin_cakir-0.1.0.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for huseyin_cakir-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e71c6fd5dbc137f4c9209f725e10c6109a522fff04239a478009aab26957093e
MD5 abb518d026039ab2262fbae05e7d945d
BLAKE2b-256 9017e2ae39db29748dfc629484405a7819b179e92d72c94cb877a70c33246006

See more details on using hashes here.

File details

Details for the file huseyin_cakir-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: huseyin_cakir-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for huseyin_cakir-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d8935e82451d7bffbe34e080ae5aceaa660c1b11b6e0d040bb4c8f616916c43c
MD5 89196b00aa73dfef3c8bcd51825b4476
BLAKE2b-256 b1eed83dc37dae0fe4800021ec0c33eae0f90cc296843a8c8164ddd0fca3b672

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page