Dynamic core ortholog compilation tool
Project description
dcc2
dcc2 is a tool for compiling core set data for h1s using predicted orthologs from OMA, both OMA-browser and OMA-standalone. Outputs of this tool are 3 (optional 4) folders required for a HaMStR run, including (1) core_orthologs (comprises of OMA orthologous group - OG, or OMA pairs - OP. Each OG/OP has its own directory, where a multiple fasta file and a corresponding profile HMM can be found), (2) genome_dir (contains gene sets of taxa, from which the orthologs are originated), (3) blast_dir (holds the blast databases of those gene sets within genome_dir), and an optional (4) weight_dir (contains feature architecure annotations of all gene sets).
Table of Contents
How to install
dcc2 is distributed as a python package called dcc2. It is compatible with Python ≥ v3.7.
Install the dcc2 package
You can install dcc2 using pip
:
python3 -m pip install dcc2
or, in case you do not have admin rights, and don't use package systems like Anaconda to manage environments you need to use the --user
option:
python3 -m pip install --user dcc2
and then add the following line to the end of your ~/.bashrc or ~/.bash_profile file, restart the current terminal to apply the change (or type source ~/.bashrc
):
export PATH=$HOME/.local/bin:$PATH
Setup dcc2
After installing dcc2, you need to run the prepare script to download and parse required OMA browser data.
You can do it by running this command
prepareDcc -o /output/path/for/oma/data
Usage
For parsing OMA orthologs by using an OMA group ID:
parseOmaById -g 1 -n HUMAN,THEAM,DESM0 -o /output/path/ -j jobName --cpus 8
Or using list of OMA taxa:
parseOmaBySpec -n HUMAN,ECOLI,YEAST -o /output/path/ -j jobName --annoFas --cpus 8
If only 2 OMA taxa are given, you can choose to use OMA pairs instead of OMA groups:
parseOmaBySpec -n HUMAN,ECOLI -t pair -o /output/path/ -j jobName --annoFAS --cpus 8
For parsing an output from OMA-standalone, dcc2 requires:
- the output orthoXML file from OMA,
- a taxon mapping file in tab-delimited format containing 3 columns
<NCBI taxon ID> <Original taxon name> <Abbr. taxon name>
, where original taxon name is the name that is written in the orthoXML input file and abbr. taxon name is its abbreviation spcide code (for example: HOMSA for Homo sapiens) - protein set of included taxa. This can be either given as a folder, or automatically downloaded from OMA database
parseOrthoxml -i input.orthoxml -m mapping_file.txt -g /path/to/gene/set -o /output/path/ -j jobName --annoFas --cpus 8
Bugs
Any bug reports or comments, suggestions are highly appreciated. Please open an issue on GitHub or be in touch via email.
Contributors
Contact
For further support or bug reports please contact: tran@bio.uni-frankfurt.de
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.