MAGUS

Multiple Sequence Alignment using Graph Clustering


Purpose and Functionality

MAGUS is a tool for piecewise large-scale multiple sequence alignment.
The dataset is divided into subsets, which are independently aligned with a base method (currently MAFFT -linsi). These subalignments are merged together with the Graph Clustering Merger (GCM). GCM builds the final alignment by clustering an alignment graph, which is constructed from a set of backbone alignments. This process allows MAGUS to scale MAFFT -linsi to datasets of over a million sequences.

The basic procedure is outlined below. Steps 4-7 are GCM.

  1. The input is a set of unaligned sequences. Alternatively, the user can provide a set of multiple sequence alignments and skip the next two steps.
  2. The dataset is decomposed into subsets.
  3. The subsets are aligned with MAFFT -linsi.
  4. A set of backbone alignments is generated with MAFFT -linsi (or provided by the user).
  5. The backbones are compiled into an alignment graph.
  6. The graph is clustered with MCL.
  7. The clusters are resolved into a final alignment.
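
To make step 5 more concrete, here is a toy sketch of how backbones could be compiled into a weighted graph. It rests on assumptions that this README does not spell out: each node is taken to be one column of one subalignment, and an edge weight counts how many backbone-aligned residue pairs link two columns. The function names and the dict-based alignment format are hypothetical and are not MAGUS's internals.

```python
from collections import Counter
from itertools import combinations


def subalignment_columns(subalignments):
    """Map (sequence name, residue index) -> (subalignment id, column index)."""
    node_of = {}
    for sub_id, alignment in enumerate(subalignments):
        for name, row in alignment.items():
            residue = 0
            for column, char in enumerate(row):
                if char != "-":
                    node_of[(name, residue)] = (sub_id, column)
                    residue += 1
    return node_of


def build_alignment_graph(subalignments, backbones):
    """Return a Counter mapping (node_a, node_b) -> edge weight.

    Alignments are dicts of sequence name -> gapped string (assumed format).
    """
    node_of = subalignment_columns(subalignments)
    weights = Counter()
    for backbone in backbones:
        position = {name: 0 for name in backbone}  # next residue index per sequence
        length = len(next(iter(backbone.values())))
        for column in range(length):
            nodes = []
            for name, row in backbone.items():
                if row[column] != "-":
                    nodes.append(node_of[(name, position[name])])
                    position[name] += 1
            # Every pair of residues sharing a backbone column supports an edge.
            for a, b in combinations(nodes, 2):
                if a != b:
                    weights[tuple(sorted((a, b)))] += 1
    return weights


subalignments = [
    {"s1": "AC", "s2": "AC"},
    {"s3": "A-C", "s4": "AGC"},
]
backbone = {"s1": "AC", "s3": "AC"}
graph = build_alignment_graph(subalignments, [backbone, backbone])
```

In this sketch, edges supported by several backbones get higher weights, which is the kind of signal a clustering step such as MCL could then use.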

Installing MAGUS

Deepest thanks to Baqiao Liu for setting up the MAGUS PyPI package (https://pypi.org/project/magus-msa/).
This is currently the easiest way to get started with MAGUS.
The package can be installed with

pip3 install magus-msa

and executed with

magus <arguments>

Alternatively, you can download and extract the code from this repository to a directory of your choice.
Then, you can run MAGUS with

python3 <directory_path>/magus.py


Dependencies

MAGUS requires

  • Python 3
  • MAFFT (linux version is included)
  • MCL (linux version is included)
  • FastTree and Clustal Omega are needed if using these guide trees (linux versions included)

If you would like to use some other version of MAFFT and/or MCL (for instance, if you're on macOS), you will need to edit the MAFFT/MCL paths in configuration.py.
(I'll pull these out into a separate config file to make it simpler.)
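
Before editing the paths, it can help to check where your own MAFFT and MCL binaries live. The snippet below is a small sketch that only searches your PATH with the standard library; it does not touch configuration.py, and the helper name locate_tools is made up for this example.

```python
import shutil


def locate_tools(names=("mafft", "mcl")):
    """Return {tool name: absolute path, or None if it is not on PATH}."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in locate_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

The printed paths are candidates for the entries you would set in configuration.py.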


Getting Started

Please navigate your terminal to the "example" directory to get started with some sample data.
A few basic ways of running MAGUS are shown below.
Run "python3 ../magus.py -h" to view the full list of arguments.

Align a set of unaligned sequences from scratch
python3 ../magus.py -d outputs -i unaligned_sequences.txt -o magus_result.txt

-o specifies the output alignment path
-d (optional) specifies the working directory for GCM's intermediate files, like the graph, clusters, log, etc.

Merge a prepared set of alignments
python3 ../magus.py -d outputs -s subalignments -o magus_result.txt

-s specifies the directory with subalignment files. Alternatively, you can pass a list of file paths.
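
If you want to drive these two modes from a script, a minimal sketch could look like the following. It only uses the -i, -s, -d and -o flags described above; the helper name magus_command and its keyword arguments are hypothetical, and it assumes the pip-installed "magus" executable is on your PATH.

```python
def magus_command(output, unaligned=None, subalignments=None,
                  workdir=None, executable="magus"):
    """Build an argument list for one MAGUS run (align from scratch OR merge)."""
    if (unaligned is None) == (subalignments is None):
        raise ValueError("provide exactly one of unaligned or subalignments")
    command = [executable]
    if workdir is not None:
        command += ["-d", workdir]          # working directory for intermediate files
    if unaligned is not None:
        command += ["-i", unaligned]        # align unaligned sequences from scratch
    else:
        command += ["-s", subalignments]    # merge a prepared set of alignments
    command += ["-o", output]               # output alignment path
    return command


# To actually launch the run (requires MAGUS to be installed):
#   import subprocess
#   subprocess.run(magus_command("magus_result.txt",
#                                unaligned="unaligned_sequences.txt",
#                                workdir="outputs"), check=True)
```

Building the command as a list avoids shell quoting problems with paths that contain spaces.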


Controlling the pipeline

Specify subset decomposition behavior
python3 ../magus.py -d outputs -i unaligned_sequences.txt -t fasttree --maxnumsubsets 100 --maxsubsetsize 50 -o magus_result.txt

-t specifies the guide tree method to use, and is the main way to set the decomposition strategy.
Available options are fasttree (default), parttree, clustal (recommended for very large datasets), and random.
--maxnumsubsets sets the desired number of subsets to decompose into (default 25).
--maxsubsetsize sets the size threshold below which subsets are no longer decomposed (default 50).
Decomposition proceeds until maxnumsubsets is reached OR all subsets are below maxsubsetsize.
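
The stopping rule can be illustrated with a toy sketch. Note the assumptions: MAGUS splits along a guide tree, whereas this example simply halves the largest subset, and it treats a subset of exactly maxsubsetsize sequences as small enough. The function name decompose is hypothetical.

```python
def decompose(sequences, max_num_subsets=25, max_subset_size=50):
    """Split until max_num_subsets is reached OR every subset is small enough."""
    subsets = [list(sequences)]
    while len(subsets) < max_num_subsets:
        index = max(range(len(subsets)), key=lambda i: len(subsets[i]))
        if len(subsets[index]) <= max_subset_size:
            break  # the largest subset is small enough, so all of them are
        largest = subsets.pop(index)
        middle = len(largest) // 2
        # Stand-in for a guide-tree split: cut the largest subset in half.
        subsets += [largest[:middle], largest[middle:]]
    return subsets


by_count = decompose(range(1000), max_num_subsets=25, max_subset_size=50)
by_size = decompose(range(1000), max_num_subsets=100, max_subset_size=50)
```

With 1000 sequences, the first call stops because the subset count is reached, while the second stops early because no subset exceeds the size threshold.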

Specify backbones for alignment graph
python3 ../magus.py -d outputs -i unaligned_sequences.txt -r 10 -m 200 -o magus_result.txt
python3 ../magus.py -d outputs -s subalignments -b backbones -o magus_result.txt

-r and -m specify the number of MAFFT backbones and their maximum size, respectively. The defaults are 10 and 200.
Alternatively, you can provide your own backbones; -b can be used to provide a directory or a list of files.

Specify graph trace method
python3 ../magus.py -d outputs -i unaligned_sequences.txt --graphtracemethod mwtgreedy -o magus_result.txt

--graphtracemethod is the flag that governs the graph trace method. Options are minclusters (default and recommended), fm, mwtgreedy (recommended for very large graphs), rg, or mwtsearch.

Unconstrained alignment
python3 ../magus.py -d outputs -i unaligned_sequences.txt -c false -o magus_result.txt

By default, MAGUS constrains the merged alignment to induce all subalignments. This constraint can be disabled with -c false.
Disabling the constraint drastically slows MAGUS and is strongly discouraged above 200 sequences.


Things to Keep in Mind

  • MAGUS will not overwrite existing backbone, graph and cluster files.
    Please delete them or specify a different working directory to perform a clean run.
  • Related issue: if MAGUS is stopped while running MAFFT, MAFFT's output backbone files will be empty.
    This will cause errors if MAGUS reruns and finds these empty files.
  • A large number of subalignments (>100) will start to significantly slow down the ordering phase, especially for very heterogeneous data.
    I would generally advise against using more than 100 subalignments, unless the data is expected to be well-behaved.
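
For the empty-file issue above, a quick check before rerunning can save a confusing error. This is a minimal sketch that lists zero-byte files anywhere under a working directory; it makes no assumption about how MAGUS names or lays out its intermediate files, and the helper name find_empty_files is made up for this example.

```python
from pathlib import Path


def find_empty_files(workdir):
    """Return a sorted list of zero-byte files anywhere under workdir."""
    return sorted(
        path for path in Path(workdir).rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__":
    for path in find_empty_files("outputs"):
        print(f"empty file, consider deleting before rerunning: {path}")
```

The script only reports the files; deleting them is left to you so that nothing is removed by accident.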

Related Publications
