
A fungal trophy classifier based on CAZymes


CATAStrophy

CATAStrophy is a classification method for describing lifestyles/trophic characteristics of filamentous plant pathogens using carbohydrate-active enzymes (CAZymes). The name CATAStrophy is a backronym portmanteau hybrid where "CATAS" means CAZyme Assisted Training And Sorting.

CATAStrophy takes HMMER3 files from searches against dbCAN as input and returns pseudo-probabilities (See details) of trophic class memberships for each file.

Installing

CATAStrophy is a python program which can be used as a module or via a command-line interface.

NOTE: Because the repository is currently private, the following pip command won't work. For now, use the methods for installing from git instead.

You can install from PyPI using pip:

pip3 install --user catastrophy

You can also install directly from the git repository.

pip3 install --user git+git@bitbucket.org:ccdm-curtin/catastrophy.git
# Or clone the repository and install from the local copy.
git clone https://<your_username>@bitbucket.org/ccdm-curtin/catastrophy.git ./catastrophy
cd catastrophy
pip install --user .
# Or use pip install -e . if you want to edit the modules.

CATAStrophy is tested with Python 3.5+ and depends on numpy. The pip commands above should install the dependencies for you automatically, but if you use any of these packages yourself it's a good idea to install CATAStrophy in a Python virtual environment (you should probably use one when installing most Python packages).

Using virtualenv is pretty easy. Here's a basic rundown of the workflow.

# If it isn't installed already run one of these
# Try to use the system package managers if possible to avoid mixing up system dependencies.
sudo pip3 install virtualenv
sudo apt install python3-virtualenv # Ubuntu and probably Debian
sudo dnf install python3-virtualenv # Fedora 24+

# Change dir to where you want the env to live (usually a project dir).
cd my_project

# Create a virtualenv in a folder ./env
# python3.7 can be substituted with your version of python.
python3.7 -m venv env

Now that the virtualenv is set up, you can load it and install CATAStrophy.

# Loads the virtualenv (essentially changes PATH and some other env variables).
source env/bin/activate

pip3 install catastrophy
# or
pip3 install git+https://<your_username>@bitbucket.org/ccdm-curtin/catastrophy.git
# or
git clone https://<your_username>@bitbucket.org/ccdm-curtin/catastrophy.git ./catastrophy
pip install ./catastrophy
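
If you're not sure whether the environment is actually active before installing, a small check like the one below can help. It is only a sketch: it relies on the standard library behaviour that sys.prefix differs from sys.base_prefix inside a venv, and the function name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside an environment sys.prefix points at the env directory, while
    # sys.base_prefix still points at the interpreter it was created from.
    # Very old interpreters may lack base_prefix, so fall back to sys.prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run it with the same python executable that you intend to use with pip.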

Using CATAStrophy

The command line interface is pretty simple: you just need to supply the input files and where to put the output. The input files should be the output from HMMER3 hmmscan, as either the raw HMMER3 text output or the "domain table" output provided by the --domtblout flag. Parsing the domain table output is about twice as fast as parsing the regular text output, so if you have lots of files to run it might be worth saving the domain tables.

The easiest way to get a file like this is to annotate your proteome using the dbCAN online tool at http://csbl.bmb.uga.edu/dbCAN/annotate.php, and save the HMMER3 raw text results locally. Assuming that you have this file locally you can run CATAStrophy like so:

catastrophy -i my_dbcan_results.txt -f hmmer -o my_catastrophy_results.csv

The output will be a tab-delimited file (which you can open in Excel) with the first row containing column headers and subsequent rows containing a label and the pseudo-probabilities of membership of each trophic class. The -f/--format flag is optional and defaults to hmmer, but if you want to use domain table output you should include the flag -f domtab (run catastrophy --help for more options).
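
If you would rather inspect the results in Python than in a spreadsheet, something like the following sketch could work. It assumes only what is described above: a tab-delimited file with a header row, a label in the first column and one numeric column per class. The class names class_a and class_b and the numbers in the example are invented placeholders, not real CATAStrophy output.

```python
import csv
import io


def best_class_per_label(handle):
    """Map each label to the (class name, score) pair with the highest score."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    class_names = header[1:]  # assumption: column 0 holds the label
    best = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        label = row[0]
        scores = [float(value) for value in row[1:]]
        top = max(range(len(scores)), key=scores.__getitem__)
        best[label] = (class_names[top], scores[top])
    return best


# Invented example data; real class names and values will differ.
example = "label\tclass_a\tclass_b\nmy_dbcan_results.txt\t0.25\t0.75\n"
result = best_class_per_label(io.StringIO(example))
print(result)
```

For a real file, pass open("my_catastrophy_results.csv", newline="") instead of the io.StringIO object.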

NOTE: In this document I use the .csv extension to mean any plain text tabular format, because Excel doesn't recognise alternate extensions like .tsv. The domain table output is actually space-delimited and the catastrophy output is tab-delimited.

By default the filenames are used as the label but you can explicitly specify a label using the -l/--label flag. The output from the command above will have two lines, one containing the column headers and the other containing results for the file my_dbcan_results.txt which will have the label "my_dbcan_results.txt".

To give it a nicer label, run:

catastrophy -i my_dbcan_results.txt -l prettier_label -o my_catastrophy_results.csv

This gives the output line for my_dbcan_results.txt the label "prettier_label". Unfortunately, labels cannot contain spaces unless you explicitly escape them (quotes won't work).

If you want to run multiple files at the same time you just need to separate the files by spaces, like this:

catastrophy -i dbcan_1.txt dbcan_2.txt -o my_catastrophy_results.csv

# Or equivalently
catastrophy -i dbcan_*.txt -o my_catastrophy_results.csv

The output from this will contain three rows: one containing the headers and the other two containing the results for the files dbcan_1.txt and dbcan_2.txt, which will be labelled by their filenames. Note that standard bash "globbing" patterns expand into a space-delimited list, so you can easily use "*" or subshells if you like (e.g. $(find . -type f -name '*.txt'); quote the pattern so that find receives it rather than the shell expanding it first). To explicitly label these files you can again supply the label flag with space-separated labels.

catastrophy -i dbcan_1.txt dbcan_2.txt -l label1 label2 -o my_catastrophy_results.csv

Note that if you do use the label flag, the number of labels must be the same as the number of input files.
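
When the file list is generated by a script, it is easy to end up with mismatched labels. The sketch below builds the argument list in Python and checks the count before anything is run. It uses only the -i, -l and -o flags described above; the build_command helper and the "file name without extension" labelling convention are made up for this example.

```python
from pathlib import Path


def build_command(inputs, output, labels=None):
    """Build a catastrophy argument list using only the -i, -l and -o flags."""
    if labels is None:
        # Hypothetical convention: use the file name without its extension.
        labels = [Path(path).stem for path in inputs]
    if len(labels) != len(inputs):
        raise ValueError("need exactly one label per input file")
    if any(" " in label for label in labels):
        raise ValueError("labels should not contain spaces")
    return ["catastrophy", "-i", *inputs, "-l", *labels, "-o", output]


cmd = build_command(["dbcan_1.txt", "dbcan_2.txt"], "my_catastrophy_results.csv")
print(" ".join(cmd))
# To actually run it (requires catastrophy to be installed and on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Rejecting labels with spaces is a conservative choice here, so the script fails early instead of producing a table with confusing labels.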

Both the input and output flags support standard input/output (they are actually the default values). So you could change the single file commands from above to:

cat my_dbcan_results.txt | catastrophy -l prettier_label > my_catastrophy_results.csv

# or using the convention for "-" representing stdin/stdout

cat my_dbcan_results.txt | catastrophy -i - -l prettier_label -o - > my_catastrophy_results.csv

If you don't specify a label for stdin input, the label will be "".

If you really want to you could also mix and match stdin and filepaths using "-" to specify stdin.

cat dbcan_2.txt | catastrophy -i dbcan_1.txt - -o my_catastrophy_results.csv

So the second result row in the output would come from stdin. Of course, if you cat multiple files into catastrophy they will all be treated as a single file so it doesn't usually make sense to use stdin with multiple inputs.

Finally, because dbCAN is updated as new CAZyme classes are created, merged, or split, catastrophy has a final parameter that allows you to select the model trained on a specific dbCAN version (starting from version 5).

To specify the version of the model to use, just include the -m/--model flag with one of the valid options (see catastrophy -h for the options).

catastrophy -m v5 -i my_dbcan_results.txt -o my_catastrophy_results.csv

The model versions just reflect the version of dbCAN that the model was trained against.

Running dbCAN locally

If you have lots of proteomes to run (or you're a command-line snob like me) then you probably don't want to use the web interface. In that case you can run the dbCAN pipeline locally using HMMER.

The instructions for running HMMER and the dbCAN parser can be found at http://csbl.bmb.uga.edu/dbCAN/download.php in the readme.txt file. It isn't the most friendly documentation though, so I'll repeat it here (assuming that you've installed HMMER and are using a unix-like OS).

First download the HMMs and the parser script.

cd <a directory that you can work in>

mkdir -p ./data
wget -qc -P ./data http://csbl.bmb.uga.edu/dbCAN/download/dbCAN-fam-HMMs.txt.v5

# Optional, useful for summarising your dbCAN 
# results but not necessary for CATAStrophy.
wget -qc -P ./data http://csbl.bmb.uga.edu/dbCAN/download/hmmscan-parser.sh

Note that I'm downloading a specific version of the database rather than just the latest one. Now we can convert the file containing HMM definitions into a HMMER database.

hmmpress ./data/dbCAN-fam-HMMs.txt.v5

Now we can run HMMER to find matches to the dbCAN HMMs. For demonstration, we'll save both outputs.

hmmscan --domtblout my_fasta_hmmer.csv ./data/dbCAN-fam-HMMs.txt.v5 my_fasta.fasta > my_fasta_hmmer.txt

The domain table is now in the file my_fasta_hmmer.csv and the plain HMMER text output is in my_fasta_hmmer.txt. Either of these files is appropriate for use with CATAStrophy (just remember to specify the --format flag). In practice, you'll probably only need the domain table output, in which case you can redirect the standard output to /dev/null to discard it.
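
If you have many proteomes, you may want to generate the hmmscan commands from a script. The following is a minimal sketch that only builds the argument list and does not run anything. It uses hmmscan's -o option to send the plain text report to /dev/null and keeps the domain table; the hmmscan_command helper, the results directory and the _hmmer.csv naming are assumptions made for this example.

```python
from pathlib import Path


def hmmscan_command(fasta, hmm_db, out_dir):
    """Return (argv, domtab_path) for one proteome; nothing is executed here."""
    domtab = Path(out_dir) / (Path(fasta).stem + "_hmmer.csv")
    # -o /dev/null discards the plain text report; --domtblout keeps the table.
    argv = [
        "hmmscan",
        "-o", "/dev/null",
        "--domtblout", str(domtab),
        str(hmm_db),
        str(fasta),
    ]
    return argv, domtab


argv, domtab = hmmscan_command(
    "my_fasta.fasta", "./data/dbCAN-fam-HMMs.txt.v5", "results"
)
print(" ".join(argv))
# To actually run it (requires HMMER and an existing output directory):
# import subprocess; subprocess.run(argv, check=True)
```

The resulting domain tables can then be passed to catastrophy with -f domtab.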

If you want to look at the dbCAN matches, you can use the summary script from dbCAN. This script takes the domain table output from hmmscan as input and returns a new tabular file.

bash ./data/hmmscan-parser.sh my_fasta_hmmer.csv > my_fasta_dbcan.csv

And that's it!

Details

Some extra details about the CATAStrophy method, including the classes used and the calculation of the RCD.
