Automatic detection and subtyping of CRISPR-Cas operons

CasPredict

Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence.

CasPredict and RepeatType are also available through a webserver

This software finds Cas genes with a large suite of HMMs, groups the matched genes into operons, and predicts the subtype of each operon with a scoring scheme. It also finds CRISPR arrays with minced and predicts the subtype of each array from its consensus repeat, using a kmer-based machine learning approach (extreme gradient boosting trees). It then connects the Cas operons and CRISPR arrays, producing as output:

CRISPR-Cas loci, with consensus subtype prediction based on both Cas genes (mostly) and CRISPR consensus repeats
Orphan Cas operons, and their predicted subtype
Orphan CRISPR arrays, and their predicted associated subtype

It includes the following subtypes:

All subtypes in the most recent Nature Reviews Microbiology classification (Makarova et al. 2020): Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants
Updated type IV subtypes and variants based on: Type IV CRISPR–Cas systems are highly diverse and involved in competition between plasmids
Type V-K: RNA-guided DNA insertion with CRISPR-associated transposases
Transposon associated type I-F: Transposon-encoded CRISPR–Cas systems direct RNA-guided DNA integration

It can automatically draw gene maps of CRISPR-Cas systems and orphan Cas operons and CRISPR arrays

Citation

Coming soon...

Quick start
Installation
CasPredict - How to
- Plotting
RepeatType - How to
RepeatType - Train

Quick start

conda create -n caspredict -c conda-forge -c bioconda -c russel88 caspredict
conda activate caspredict
caspredict my.fasta my_output

Installation

CasPredict can be installed either through conda or pip.

Conda is the recommended route, since it installs CasPredict and all dependencies, and downloads the database, in one go.

Conda

Use miniconda or anaconda to install.

Create the environment with caspredict and all dependencies and database

conda create -n caspredict -c conda-forge -c bioconda -c russel88 caspredict

pip

If you have the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, grep, sed) in your PATH you can install with pip

python -m pip install caspredict

When installing with pip, you need to download the database manually:

# Download and unpack
svn checkout https://github.com/Russel88/CasPredict/trunk/data
tar -xvzf data/Profiles.tar.gz
mv Profiles/ data/
rm data/Profiles.tar.gz

# Tell CasPredict where the data is:
# either by setting an environment variable (has to be done for each terminal session, or added to .bashrc):
export CASPREDICT_DB="/path/to/data/"
# or by using the --db argument each time you run CasPredict:
caspredict input.fa output --db /path/to/data/
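
After a manual download it can be useful to confirm that the database folder is where you think it is. The following is a minimal sketch, not part of CasPredict: the helper names `resolve_db` and `missing_entries` are hypothetical, the list of expected entries only contains names mentioned in this README (the real database may hold more), and giving an explicit `--db` value precedence over `CASPREDICT_DB` is an assumption.

```python
import os
from pathlib import Path

# Names taken from this README only; the real database may contain more entries.
EXPECTED = ("Profiles", "type_dict.tab", "xgb_repeats.model")


def resolve_db(cli_db=None, environ=os.environ):
    """Return the database path; an explicit value wins over CASPREDICT_DB (assumed)."""
    path = cli_db or environ.get("CASPREDICT_DB")
    if not path:
        raise ValueError("No database given: set CASPREDICT_DB or pass --db")
    return Path(path)


def missing_entries(db_dir):
    """List the expected entries that are absent from db_dir."""
    return [name for name in EXPECTED if not (Path(db_dir) / name).exists()]
```

Calling `missing_entries(resolve_db())` in a session where `CASPREDICT_DB` is set should return an empty list if the unpacking steps above completed.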

CasPredict - How to

CasPredict takes as input a nucleotide fasta, and produces outputs with CRISPR-Cas predictions

Activate environment

conda activate caspredict

Run with a nucleotide fasta as input

caspredict genome.fa my_output

Use multiple threads

caspredict genome.fa my_output -t 20

Check the different options

caspredict -h
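
To run CasPredict on several genomes, the commands above can be wrapped in a small script. This is a sketch under assumptions: `build_command` and `run_all` are hypothetical helpers, only the positional arguments and the `-t` flag shown above are used, input files are assumed to end in `.fa`, and each genome gets its own output folder named after the file. Whether CasPredict creates missing parent folders has not been checked here.

```python
import subprocess
from pathlib import Path


def build_command(fasta, outdir, threads=1):
    """Build the argument list for one CasPredict run, using only flags shown above."""
    return ["caspredict", str(fasta), str(outdir), "-t", str(threads)]


def run_all(fasta_dir, out_root, threads=1, dry_run=True):
    """List (and, if dry_run is False, execute) one command per *.fa file."""
    commands = []
    for fasta in sorted(Path(fasta_dir).glob("*.fa")):
        cmd = build_command(fasta, Path(out_root) / fasta.stem, threads)
        commands.append(cmd)
        if not dry_run:
            # check=True stops the batch at the first failing run
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` (the default) nothing is executed, so the returned list can be inspected before committing to a long batch.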

Output

CRISPR_Cas.tab: CRISPR-Cas loci, with consensus subtype prediction
- Contains a consensus prediction (Prediction), and the separate predictions for the Cas operon (Prediction_Cas) and CRISPR arrays (Prediction_CRISPRs)
cas_operons.tab: All certain Cas operons
- Contains a prediction of subtype (Prediction) and the subtype with the highest score (Best_type). If the score is high then Prediction = Best_type
crisprs_all.tab: All CRISPR arrays
- Contains a prediction of the associated subtype based on the repeat sequence (Prediction).
- The 'Subtype' column is the subtype with highest probability. Prediction = Subtype if Subtype_probability is high
crisprs_orphan.tab: Orphan CRISPRs (those not in CRISPR_Cas.tab)
- Same columns as crisprs_all.tab
cas_operons_orphan.tab: Orphan Cas operons (those not in CRISPR_Cas.tab)
- Same columns as cas_operons.tab
CRISPR_Cas_putative.tab: Putative CRISPR-Cas loci, often lone Cas genes next to a CRISPR array
- Same columns as CRISPR_Cas.tab
cas_operons_putative.tab: Putative Cas operons, mostly false positives, but also some ambiguous and partial systems
- Same columns as cas_operons.tab
spacers/*.fa: Fasta files with all spacer sequences
hmmer.tab: All HMM vs. ORF matches, unfiltered results
genes.tab: All genes and their positions
arguments.tab: File with arguments given to CasPredict
hmmer.log: Error messages from HMMER (only produced if any errors were encountered)

If run with `--keep_tmp` the following is also produced

prodigal.log: Log from prodigal
proteins.faa: Protein sequences
hmmer/*.tab: Alignment output from HMMER for each Cas HMM
minced.out: CRISPR array output from minced

Notes on output

Files are only created if there is any data. For example, the CRISPR_Cas.tab file is only created if there are any CRISPR-Cas loci.
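
Because output files may be absent, downstream scripts should not assume they exist. The sketch below shows one way to handle that. It assumes the `.tab` files are tab-delimited with a header row, which this README does not state explicitly; `read_tab` and `summarize` are hypothetical helper names, and `Prediction` is the only column used.

```python
import csv
from pathlib import Path


def read_tab(path):
    """Read a tab-delimited file with a header row; return [] if the file is absent."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def summarize(outdir):
    """Count rows per main output table and tally the consensus predictions."""
    outdir = Path(outdir)
    tables = ["CRISPR_Cas.tab", "cas_operons_orphan.tab", "crisprs_orphan.tab"]
    counts = {name: len(read_tab(outdir / name)) for name in tables}
    subtypes = {}
    for row in read_tab(outdir / "CRISPR_Cas.tab"):
        key = row.get("Prediction", "NA")
        subtypes[key] = subtypes.get(key, 0) + 1
    return counts, subtypes
```

A missing file is reported as zero rows, which matches the behaviour described above: no data, no file.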

Plotting

CasPredict will automatically plot a map of the CRISPR-Cas loci, orphan Cas operons, and orphan CRISPR arrays.

These maps can be expanded (--expand N) by adding unknown genes and genes with alignment scores below the thresholds. This can help identify potentially unannotated genes in operons. You can generate new plots without re-running the entire pipeline by adding --redo_typing to the command. This re-uses the mappings, re-types the operons, and re-makes the plot based on the new thresholds and plot parameters.

[Figure: example gene map, run with --expand 5]

Cas genes are in red.
Cas genes, with alignment scores below the thresholds, are in dark green
Unknown genes are in gray (the number matches the genes.tab file)
Arrays are in blue, with their predicted subtype association based on the consensus repeat sequence.

RepeatType - How to

Given a plain text file of CRISPR repeats (one per line), RepeatType predicts the subtype of each repeat from its kmer composition.
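
To make "kmer composition" concrete, here is a minimal sketch of how a repeat can be turned into a fixed-length vector of kmer counts. It is an illustration, not RepeatType's actual feature code: the value k=3 is arbitrary (this README does not state which k is used), reverse complements are not handled, and `kmer_counts` and `kmer_vector` are hypothetical names.

```python
from collections import Counter
from itertools import product


def kmer_counts(repeat, k=3):
    """Count overlapping kmers in a repeat (upper-case A/C/G/T assumed)."""
    return Counter(repeat[i:i + k] for i in range(len(repeat) - k + 1))


def kmer_vector(repeat, k=3):
    """One count per possible kmer, in alphabetical order (4**k entries)."""
    counts = kmer_counts(repeat, k)
    return [counts["".join(kmer)] for kmer in product("ACGT", repeat=k)]
```

A vector like this has the same length for every repeat, which is what a tree-based classifier needs as input.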

Activate environment

conda activate caspredict

Run with a simple textfile, containing only CRISPR repeats (in capital letters), one repeat per line.

repeatType repeats.txt

Output

The script prints:

Repeat sequence
Predicted subtype
Probability of prediction

Notes on output

Predictions with probabilities below 0.75 are uncertain, and should be taken with a grain of salt.
The classifier was only trained on the subtypes for which there were enough (>20) repeats. It can therefore only predict subtypes of repeats associated with the following subtypes:
- I-A, I-B, I-C, I-D, I-E, I-F, I-G
- II-A, II-B, II-C
- III-A, III-B, III-C, III-D
- IV-A1, IV-A2, IV-A3
- V-A
- VI-B
This is the accuracy per subtype (on an unseen test dataset):
- I-A 0.60
- I-B 0.90
- I-C 0.98
- I-D 0.47
- I-E 1.00
- I-F 0.99
- I-G 0.83
- II-A 0.94
- II-B 1.00
- II-C 0.89
- III-A 0.89
- III-B 0.49
- III-C 0.60
- III-D 0.28
- IV-A1 0.79
- IV-A2 0.78
- IV-A3 0.98
- V-A 0.77
- VI-B 1.00
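
The 0.75 cut-off mentioned above can be applied to saved RepeatType output with a few lines of Python. This sketch assumes each printed line holds the three fields listed under Output, separated by whitespace; that layout is an assumption, and `flag_uncertain` is a hypothetical helper.

```python
def flag_uncertain(lines, threshold=0.75):
    """Split 'repeat subtype probability' lines into confident and uncertain calls."""
    confident, uncertain = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        try:
            prob = float(fields[2])
        except ValueError:
            continue  # skip a header line, if one is present
        record = (fields[0], fields[1], prob)
        (confident if prob >= threshold else uncertain).append(record)
    return confident, uncertain
```

Given the per-subtype accuracies above, it may also be worth treating calls for subtypes such as III-D or I-D with extra caution even when the probability is high.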

RepeatType - Train

You can train the repeat classifier with your own set of subtyped repeats. Given a tab-delimited input in which the first column contains the subtypes and the second column contains the CRISPR repeat sequences, RepeatTrain trains a CRISPR repeat classifier that is directly usable by both RepeatType and CasPredict.

Train

repeatTrain typed_repeats.tab my_classifier
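
Before training, it can help to sanity-check the input table. The sketch below validates the two-column layout described above and reports subtypes with few examples. The default minimum of 21 mirrors the ">20 repeats" criterion mentioned for the shipped classifier; whether RepeatTrain enforces any minimum, and whether it requires upper-case A/C/G/T, are assumptions. `check_training_table` is a hypothetical helper.

```python
from collections import Counter


def check_training_table(lines, min_per_subtype=21):
    """Validate 'subtype<TAB>repeat' lines; return counts and under-represented subtypes."""
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 tab-separated columns")
        subtype, repeat = fields
        if set(repeat) - set("ACGT"):
            raise ValueError(f"line {number}: repeat has characters other than A, C, G, T")
        counts[subtype] += 1
    sparse = sorted(s for s, n in counts.items() if n < min_per_subtype)
    return counts, sparse
```

Passing an open file handle for `typed_repeats.tab` as `lines` works, since the function only iterates over lines.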

Use new model in RepeatType

repeatType repeats.txt --db my_classifier

Use new model in CasPredict

Save the original database files:

mv ${CASPREDICT_DB}/type_dict.tab ${CASPREDICT_DB}/type_dict_orig.tab
mv ${CASPREDICT_DB}/xgb_repeats.model ${CASPREDICT_DB}/xgb_repeats_orig.model

Move the new model into the database folder

mv my_classifier/* ${CASPREDICT_DB}/

CasPredict and RepeatType will now use the new model for repeat prediction!

Release history

0.5.4 (this version): Apr 15, 2020
0.5.3: Apr 13, 2020
0.5.2: Apr 1, 2020
0.5.1: Apr 1, 2020
0.5.0: Mar 27, 2020
0.4.3: Mar 23, 2020
0.4.2: Mar 20, 2020
0.4.1: Mar 20, 2020
0.4.0: Mar 19, 2020
0.3.6: Mar 18, 2020
0.3.5: Mar 18, 2020
0.3.4: Mar 17, 2020
0.3.2: Mar 16, 2020
0.3.1: Mar 16, 2020
0.3.0: Mar 16, 2020
0.2.2: Feb 21, 2020
0.2.1: Feb 20, 2020
0.2.0: Feb 20, 2020
