RDMCL recursively clusters groups of homologous sequences into orthogroups.

Project Description


Recursive Dynamic Markov Clustering

A method for identifying hierarchical orthogroups among homologous sequences

‘Orthology’ is a term coined to describe ‘homology via speciation’¹, a concept now broadly used as a predictor of shared gene function among species²⁻³. From a systematics perspective, orthology also represents a natural schema for classifying and naming genes coherently. As we approach a future in which the genomes of all species on the planet have been sequenced, it will be important to catalog the evolutionary history of all genes and label them in a rational way. Considerable effort has been made to programmatically identify orthologs, leading to excellent software solutions and several large public databases for genome-scale predictions. What is currently missing, however, is a convenient method for fine-grained analysis of specific gene families.

In essence, RD-MCL is an extension of conventional Markov clustering-based orthogroup prediction algorithms like OrthoMCL, with three key differences:

  1. The similarity metric used to describe the relatedness of sequences is based on multiple sequence alignments, not pair-wise sequence alignments or BLAST. This significantly improves the quality of the information available to the clustering algorithm.
  2. The appropriate granularity of the Markov clustering algorithm, which is controlled by the ‘inflation factor’ and the ‘edge similarity threshold’, is determined on the fly. This is in contrast to almost all other methods, where default parameters are selected at the outset and imposed indiscriminately on all datasets.
  3. Differences in evolutionary rates among orthologous groups of sequences are accounted for by recursive rounds of clustering.
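
For readers unfamiliar with the two parameters named in point 2, here is a minimal pure-Python sketch of conventional Markov clustering. It is not RD-MCL's own code: the function names, the toy similarity matrix, and the fixed iteration count are all assumptions made for illustration. It only shows how an edge similarity threshold and an inflation factor change the granularity of the resulting clusters. RD-MCL chooses these values on the fly rather than taking them as fixed arguments.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    size = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [[matrix[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(size)] for r in range(size)]


def expand(matrix):
    """Expansion step: square the matrix (a two-step random walk)."""
    size = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(size))
             for c in range(size)] for r in range(size)]


def inflate(matrix, inflation):
    """Inflation step: raise each entry to a power, then re-normalize."""
    powered = [[value ** inflation for value in row] for row in matrix]
    return normalize_columns(powered)


def markov_cluster(similarity, inflation=2.0, threshold=0.0, iterations=50):
    """Cluster items from a symmetric similarity matrix (illustrative only)."""
    size = len(similarity)
    # Discard edges weaker than the threshold and add a self-loop to each node.
    matrix = [[1.0 if r == c
               else (similarity[r][c] if similarity[r][c] >= threshold else 0.0)
               for c in range(size)] for r in range(size)]
    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Each row that still holds probability mass defines one cluster:
    # the columns with non-zero entries are its members.
    clusters = set()
    for row in matrix:
        members = frozenset(c for c, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Toy data (hypothetical): items 0-2 resemble each other, as do items 3-4.
similarity = [
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 0.0],
]

print(markov_cluster(similarity, threshold=0.5))   # [[0, 1, 2], [3, 4]]
print(markov_cluster(similarity, threshold=0.95))  # [[0], [1], [2], [3], [4]]
```

Raising the threshold past every edge weight leaves only singletons, while a moderate threshold recovers the two groups. This sensitivity is why a single default setting is unlikely to suit every dataset.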

Getting started

RD-MCL is hosted on the Python Package Index, so the easiest way to get the software and most dependencies is via pip:

$: pip install rdmcl
$: rdmcl -setup

The program will complain if you don’t run ‘-setup’ before the first time you use it, so make sure you do that.

The input for RD-MCL is a sequence file in any of the many supported formats, where the name of each sequence is prefixed with an organism identifier. For example:

>ath-At4g02970
MNVYIDTETGSSFSITIDFGETVLEIKEKIEKSQGIPVSKQILYLDGKALEDDLHKIDYM
ILFESRLLLRISPDADPNQSNEQTEQSKQIDDKKQEFCGIQDSSESKKITRVMARRVHNI
YSSLPAYSLDELLGPKYSATVAVGGRTNQVVQPTEQASTSGTAKEVLRDSDSPVEKKIKT
NPMKFTVHVKPYQEDTRMIHVEVNADDNVEELRKELVKMQERGELNLPHEAFHLLGLGSS
ETCPHQNRSEEPNQCPTILMSPHGLQAIVT
>cel-CE08215_2
QIFVKVLGVSYAFKIHREDTVFDIKNDIEHRHDIPQHSYWLSFSGKRLEDHCSIGDYNIQ
KSSTITMYFRSG
>cel-CE16986
MKATTVKENEVKDDRKLSLNEMLRKRCLQVKNTKMKNSSMPKFQYFVRLNGKTRTLNVNA
SDTVEQGKMQLCHNARSTRMSYGGKPLSDQITFGEYNISNNSTMDLHFRI
>hsa-Hs20473312
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQGKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLLVVLRKGRRSLTPLPRRISTRERRLSWLS
>sce-YDR139c
MIVKVKTLTGKEISVELKESDLVYHIKELLEEKEGIPPSQQRLIFQGKQIDDKLTVTDAH
LVEGMQLHLVLTLRGGN

The above are a few sequences from KOG0001, coming from Arabidopsis (ath), C. elegans (cel), human (hsa), and yeast (sce). Note the hyphen (-) separating each organism identifier from the gene name. This is important! Make sure there are no spurious hyphens in any of the gene names, and if you can’t use a hyphen for some reason, set the delimiting character with the -ts flag.
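
If you want to check your sequence names before a run, something like the following sketch may help. It is not part of RD-MCL, and the function name and its ‘delimiter’ argument are made up for this example. It assumes a valid name contains exactly one delimiter, with a non-empty organism identifier before it and a non-empty gene name after it, which is my reading of the naming rule described above.

```python
def check_sequence_names(fasta_text, delimiter="-"):
    """Return (organisms, problems) for the header lines of a FASTA string.

    organisms maps each organism identifier to its list of gene names;
    problems lists the sequence names that break the naming rule.
    """
    organisms = {}
    problems = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # Skip sequence lines, only headers matter here
        fields = line[1:].split()
        name = fields[0] if fields else ""
        # Exactly one delimiter is expected: organism<delimiter>gene
        if name.count(delimiter) != 1:
            problems.append(name)
            continue
        organism, gene = name.split(delimiter)
        if not organism or not gene:
            problems.append(name)
            continue
        organisms.setdefault(organism, []).append(gene)
    return organisms, problems


# Headers from the example above, plus one deliberately bad (hypothetical) name
example = """>ath-At4g02970
MNVYIDTETGSSFSITIDFGETVLEIKEKIEKSQGIPVSKQILYLDGKALEDDLHKIDYM
>cel-CE08215_2
QIFVKVLGVSYAFKIHREDTVFDIKNDIEHRHDIPQHSYWLSFSGKRLEDHCSIGDYNIQ
>cel-CE16986
MKATTVKENEVKDDRKLSLNEMLRKRCLQVKNTKMKNSSMPKFQYFVRLNGKTRTLNVNA
>hsa-Hs20473312
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQGKEGIPPDQQRLIFAGKQLEDGRTLSDYN
>sce-YDR139c
MIVKVKTLTGKEISVELKESDLVYHIKELLEEKEGIPPSQQRLIFQGKQIDDKLTVTDAH
>hsa-Hs-20473312
MQIFVKTLTGK
"""

organisms, problems = check_sequence_names(example)
print(sorted(organisms))  # ['ath', 'cel', 'hsa', 'sce']
print(problems)           # ['hsa-Hs-20473312']
```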

Once you have your sequences named correctly, simply pass the file to rdmcl:

$: rdmcl your_seq_file.fa

A new directory will be created containing all of the files associated with the run, including ‘final_clusters.txt’, which is the result you’ll probably be most interested in.

There are several parameters you can modify; run ‘rdmcl -h’ to get a listing of them. Things are still under development so I haven’t written a wiki yet, but I’d be overjoyed to get feedback if you are confused by anything. Please do not hesitate to email me!

Contact

Any comments you have would be really appreciated. Please feel free to add issues in the GitHub issue tracker or contact Steve Bond (lead developer) directly at steve.bond@nih.gov.

Release History

1.0.3 (this version)
1.0.2
1.0.1
1.0.0

Download Files

File Name                      File Type   Upload Date
rdmcl-1.0.3.tar.gz (45.0 kB)   Source      May 16, 2017
