Pure-Python Clustal Omega multiple sequence alignment implementation

EmreTasdemirClustalOmega

A pure-Python implementation of the Clustal Omega multiple sequence alignment (MSA) algorithm for DNA sequences. No external dependencies required — only the Python standard library.


Features

  • k-tuple distance matrix — fast pairwise sequence comparison using shared k-mers
  • mBed embedding — projects sequences into Euclidean space via reference-sequence distances
  • Bisecting k-means clustering — groups sequences before tree construction to reduce complexity
  • UPGMA guide tree — builds a hierarchical guide tree both with and without k-means pre-clustering
  • Profile HMM alignment — progressive alignment along the guide tree using profile Hidden Markov Models and Viterbi decoding
  • Iterative refinement — improves the MSA score via repeated leave-one-out realignment (HHAlign style)
  • FASTA input — reads standard .fasta / .fa files; also supports interactive manual input
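The first two features can be illustrated with a short standard-library sketch. This is not the package's internal code: the function names (`kmer_set`, `ktuple_distance`, `distance_matrix`, `embed`) and the exact distance formula (one minus the fraction of shared k-mers) are assumptions chosen for illustration, and the real implementation may count k-tuple matches differently.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ktuple_distance(a, b, k=3):
    """Distance in [0, 1]: 1 minus the fraction of shared k-mers.

    Assumed formula for illustration only.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 1.0  # sequences shorter than k share nothing
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def distance_matrix(seqs, k=3):
    """Symmetric pairwise k-tuple distance matrix as a list of lists."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = ktuple_distance(seqs[i], seqs[j], k)
    return d


def embed(seqs, ref_indices, k=3):
    """mBed-style embedding: coordinate t of a sequence is its
    distance to the t-th reference sequence."""
    return [[ktuple_distance(s, seqs[r], k) for r in ref_indices] for s in seqs]


seqs = ["ATGCTAGCTAGCT", "ATGCTAGCTAGCC", "ATGCTTGCTAGCT", "TTGCTAGCTATCT"]
matrix = distance_matrix(seqs, k=2)
vectors = embed(seqs, ref_indices=[0, 3], k=2)  # two reference sequences
```

The point of the embedding is that each sequence is compared only against a small set of references rather than against every other sequence, so the clustering step can work on short vectors instead of a full N x N matrix.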

Installation

pip install EmreTasdemirClustalOmega

Requires Python 3.8 or later.


Quick Start

As a library

from clustalomega import align

sequences = [
    ("seq1", "ATGCTAGCTAGCT"),
    ("seq2", "ATGCTAGCTAGCC"),
    ("seq3", "ATGCTTGCTAGCT"),
    ("seq4", "TTGCTAGCTATCT"),
]

aligned_blocks, names = align(sequences, k=2)

for name, block in zip(names, aligned_blocks):
    print(f"{name:<10} {block}")

Output:

seq1       ATGCTAGCTAGCT
seq2       ATGCTAGCTAGCC
seq3       ATGCTTGCTAGCT
seq4       TTGCTAGCTATCT

Parameters

Parameter           Type                   Default  Description
sequences           list[tuple[str, str]]  -        List of (name, sequence) tuples
k                   int                    3        k-tuple length for distance calculation
seed                int                    42       Random seed for reproducibility
print_ile_yazdirma  bool                   False    Print step-by-step output to stdout

Returns: a tuple (aligned_blocks, names). Both are lists of strings in the same order, so aligned_blocks[i] is the aligned row for names[i].
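Because the two returned lists are parallel, they are easy to write back out as an aligned FASTA file. The helper below is a hypothetical sketch, not part of the package: `to_aligned_fasta` is an invented name, and the sample rows (including the use of "-" as the gap character) are made up for illustration.

```python
def to_aligned_fasta(names, aligned_blocks, width=60):
    """Format an alignment as FASTA text, wrapping each row at `width` columns."""
    if len(names) != len(aligned_blocks):
        raise ValueError("names and aligned_blocks must have the same length")
    lines = []
    for name, row in zip(names, aligned_blocks):
        lines.append(f">{name}")
        # Split the aligned row into fixed-width chunks.
        lines.extend(row[i:i + width] for i in range(0, len(row), width))
    return "\n".join(lines) + "\n"


# Made-up example rows; real rows would come from align().
names = ["seq1", "seq2"]
aligned_blocks = ["ATG-CTA", "ATGCCTA"]
fasta_text = to_aligned_fasta(names, aligned_blocks)
print(fasta_text, end="")
```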


Verbose mode

aligned_blocks, names = align(sequences, k=2, print_ile_yazdirma=True)

This prints the full pipeline output: distance matrix, embedding vectors, clustering steps, guide tree, initial alignment, and refinement progress.


From a FASTA file

from clustalomega._io_6_9 import fasta_oku
from clustalomega import align

sequences = fasta_oku("my_sequences.fasta")
aligned_blocks, names = align(sequences, k=3)

for name, block in zip(names, aligned_blocks):
    print(f"{name:<12} {block}")
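For readers who want to see what a list of (name, sequence) tuples built from FASTA text looks like, here is a minimal stand-in parser. It is not the package's `fasta_oku`; `parse_fasta` is a hypothetical name, and details such as taking the first word of the header as the name and upper-casing the sequence are assumptions.

```python
def parse_fasta(lines):
    """Parse an iterable of FASTA lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""  # first word of the header
            chunks = []
        elif name is not None:
            chunks.append(line.upper())  # sequence data may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = [">seq1 sample record\n", "atgcta\n", "GCTAGCT\n", "\n", ">seq2\n", "ATGCTAGCTAGCC\n"]
records = parse_fasta(example)
```

A file object can be passed directly, for example `with open("my_sequences.fasta") as handle: records = parse_fasta(handle)`.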

Command-Line Interface

After installation, run the interactive CLI:

clustalomega

The CLI then:

  1. Asks you to choose an input method (manual entry or FASTA file)
  2. Asks for the k-tuple length
  3. Runs the full pipeline and prints all intermediate results

Algorithm Overview

The pipeline mirrors the original Clustal Omega algorithm:

1. k-tuple distance matrix
        ↓
2. mBed embedding (reference-based Euclidean projection)
        ↓
3. Bisecting k-means clustering  (⌈√N⌉ clusters)
        ↓
4. UPGMA guide tree
        ├─ per-cluster sub-trees (k-means UPGMA)
        └─ centroid-level super-tree
        ↓
5. Progressive alignment  (Profile HMM + Viterbi)
        ↓
6. Iterative refinement   (HHAlign-style, max 3 rounds)
        ↓
   Final MSA
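Step 4 can be sketched in a few lines. The following is a generic textbook UPGMA over a symmetric distance matrix, returning the tree as nested tuples. It is an illustrative assumption, not the package's `_guide_tree_33_43.py` code, and it omits branch lengths and the per-cluster and centroid-level split shown in the diagram.

```python
def upgma(dist, labels):
    """Build a UPGMA tree as nested tuples from a symmetric distance matrix."""
    n = len(labels)
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of active clusters
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps the distance equal to the mean
            # over all leaf pairs between the two clusters.
            d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        del d[(a, b)]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]


dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(dist, ["A", "B", "C", "D"])
print(tree)  # ('D', ('C', ('A', 'B')))
```

Progressive alignment then walks such a tree from the leaves upward, aligning the two children of each internal node before moving to its parent.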

Example: 30-sequence dataset

SP score before refinement : -16487
SP score after  refinement : -12210   (gain: +4277)
Alignment length           : 37 columns

Project Structure

clustalomega/
├── __init__.py              # Public API: align()
├── cli.py                   # Interactive command-line entry point
├── _math_utils_1_5.py       # Math helpers (rounding, logarithm, padding)
├── _io_6_9.py               # FASTA parser and manual input
├── _distance_10_15.py       # k-tuple distance matrix
├── _embedding_16_22.py      # mBed embedding
├── _clustering_23_32.py     # Bisecting k-means
├── _guide_tree_33_43.py     # UPGMA guide tree
├── _alignment_44_53.py      # Profile HMM + Viterbi alignment
└── _refinement_54_60.py     # Iterative refinement + SP scoring

License

MIT License — see LICENSE for details.


Author

Emre Taşdemir (emre1.tasdemir.58@gmail.com)
