Pure Python Clustal Omega Multiple Sequence Alignment implementation
BunyaminMSA
Pure Python implementation of the Clustal Omega Multiple Sequence Alignment (MSA) algorithm.
Bioinformatics Final Project — Bunyamin Arpc
Installation
pip install BunyaminMSA
Or from source:
git clone https://github.com/bunyaminarpc/BunyaminMSA.git
cd BunyaminMSA
pip install -e .
Quick Start
from bunyaminmsa import ClustalOmega
msa = ClustalOmega()
sequences = ["ACGTACGT", "ACGGACGT", "TTTTACGT"]
names = ["Human", "Mouse", "Zebrafish"]
result = msa.align(sequences, names=names)
print(result["alignment_str"])
FASTA Input
fasta = """
>seq1
ACGTACGTACGT
>seq2
ACGGACGTACGG
>seq3
TTTTACGTATTT
"""
result = msa.align_from_fasta(fasta)
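For readers unfamiliar with the format, the following is a minimal sketch of how FASTA text like the string above can be split into names and sequences. The function name `parse_fasta` is hypothetical and this is not necessarily how `align_from_fasta` parses its input internally; it only illustrates the structure of the format.

```python
def parse_fasta(text):
    """Split FASTA text into parallel lists of names and sequences.

    Illustrative helper only; not part of the bunyaminmsa API.
    """
    names, sequences = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header line starts a new record; keep the full header as the name.
            names.append(line[1:].strip())
            sequences.append("")
        elif names:
            # A record may span several lines, so append to the current one.
            sequences[-1] += line
    return names, sequences


fasta = """
>seq1
ACGTACGTACGT
>seq2
ACGGACGTACGG
>seq3
TTTTACGTATTT
"""
names, sequences = parse_fasta(fasta)
```

The two lists have the same shape as the `sequences` and `names` arguments used in Quick Start.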
Command Line
bunyaminmsa --fasta input.fasta
bunyaminmsa --seqs ACGT ACGG TTTT --names s1 s2 s3
bunyaminmsa --fasta input.fasta --output alignment.aln
Algorithm Overview
Clustal Omega performs MSA in three main stages:
1. Pairwise Distance Calculation (k-mer based)
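Every pair of sequences is compared by the cosine distance between their k-mer frequency profiles. This avoids running a full dynamic-programming alignment for each pair, so the cost per pair grows roughly linearly with sequence length rather than quadratically.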
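The idea can be sketched in a few lines of standard-library Python. This is an illustration of k-mer counting plus cosine distance, not the package's own code: the function names and the default `k=3` are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Count every overlapping k-mer in seq (k=3 is an assumed default)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(p, q):
    """Return 1 - cosine similarity of two k-mer count profiles."""
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty profile (sequence shorter than k) shares nothing
    return 1.0 - dot / (norm_p * norm_q)


def distance_matrix(sequences, k=3):
    """Build the symmetric n x n matrix of pairwise cosine distances."""
    profiles = [kmer_profile(s, k) for s in sequences]
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(profiles[i], profiles[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


matrix = distance_matrix(["ACGTACGT", "ACGGACGT", "TTTTACGT"])
```

With the Quick Start sequences and `k=3`, the first two sequences come out closest to each other, which is what the guide tree stage relies on.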
2. Guide Tree Construction (UPGMA)
The pairwise distance matrix is used to build a binary guide tree using UPGMA (Unweighted Pair Group Method with Arithmetic mean). Closely related sequences are merged first.
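A minimal UPGMA sketch is shown below. It returns the tree as nested tuples, which is a representation chosen for this example (the package documents its guide tree as a string), and the function name `upgma` and the distance values are hypothetical.

```python
def upgma(dist, names):
    """Build a nested-tuple guide tree from a symmetric distance matrix."""
    # Active clusters: id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters first.
        a, b = min(d, key=d.get)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # leaf-count-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop stale entries that mention a or b, then add the new ones.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in new_d.items():
            d[(c, next_id)] = v  # existing ids are always smaller than next_id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Illustrative distances, not output of the package.
dist = [[0.0, 0.2, 0.6],
        [0.2, 0.0, 0.8],
        [0.6, 0.8, 0.0]]
tree = upgma(dist, ["Human", "Mouse", "Zebrafish"])
# tree == ('Zebrafish', ('Human', 'Mouse'))
```

Human and Mouse have the smallest distance, so they are joined first and Zebrafish is attached to that pair at the root.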
3. Progressive Alignment
Sequences are aligned following the guide tree (post-order traversal):
- Leaf–Leaf: Needleman-Wunsch global alignment with affine gap penalties
- Profile–Profile: Frequency profiles are built for each aligned group; alignment proceeds between profiles column-by-column
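The leaf-to-leaf step can be illustrated with a compact Needleman-Wunsch sketch. To keep it short, this version uses a single linear gap penalty rather than the affine penalties described above, and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name are assumptions for the example, not the package's defaults.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b with a linear gap penalty.

    Returns (gapped_a, gapped_b, score). Simplified illustration only.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap in b
                score[i][j - 1] + gap,      # b[j-1] against a gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
# aligned_a == "ACGT", aligned_b == "A-GT", total == 1
```

An affine scheme would track separate matrices for gap opening and gap extension; the profile-to-profile step follows the same recurrence but scores columns of frequencies instead of single characters.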
API Reference
ClustalOmega
| Method | Description |
|---|---|
| align(sequences, names=None) | Align list of sequences |
| align_from_fasta(fasta_text) | Parse FASTA string and align |
| get_distance_matrix() | Return last computed distance matrix |
| get_guide_tree() | Return last computed guide tree |
Result Dictionary
| Key | Type | Description |
|---|---|---|
| names | list[str] | Sequence names |
| aligned | list[str] | Aligned sequences (with gaps) |
| alignment_str | str | CLUSTAL-format alignment |
| distance_matrix | list[list[float]] | n×n pairwise distances |
| sequence_type | str | 'dna' or 'protein' |
| guide_tree | str | String representation of UPGMA tree |
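As one way to consume the result dictionary, the sketch below writes the `names` and `aligned` entries out as gapped FASTA. The helper `result_to_fasta` is hypothetical and not part of the package, and the `example` dictionary is a hand-written stand-in with only the two keys the helper reads, not real output of `align()`.

```python
def result_to_fasta(result, width=60):
    """Format the 'names' and 'aligned' entries of a result dict as FASTA.

    Hypothetical helper; relies only on the documented keys.
    """
    lines = []
    for name, seq in zip(result["names"], result["aligned"]):
        lines.append(f">{name}")
        # Wrap long aligned sequences at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Illustrative stand-in for a real result dictionary.
example = {"names": ["s1", "s2"], "aligned": ["ACGT-A", "AC-TTA"]}
fasta_text = result_to_fasta(example)
```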
Running Tests
python tests/test_clustal_omega.py
# or
pytest tests/
License
MIT License — Bunyamin Arpc