Automatic detection and subtyping of CRISPR-Cas operons

CasPredict

Detect CRISPR-Cas genes and arrays, and predict the subtype based on both Cas genes and CRISPR repeat sequence.

This software finds Cas genes with a large suite of HMMs, groups the matched genes into operons, and predicts the subtype of each operon with a scoring scheme. It also finds CRISPR arrays with minced and predicts the subtype of each array from its consensus repeat, using a k-mer-based machine learning approach (extreme gradient boosting trees). Finally, it connects the Cas operons and CRISPR arrays, producing as output:

  • CRISPR-Cas loci, with consensus subtype prediction based on both Cas genes (mostly) and CRISPR consensus repeats
  • Orphan Cas operons, and their predicted subtype
  • Orphan CRISPR arrays, and their predicted associated subtype
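
To make the k-mer idea concrete, here is a minimal sketch of how a repeat could be turned into a fixed-length k-mer count vector of the kind a tree-based classifier can consume. The function names, the choice of k=4, and the example sequence are assumptions made for illustration; the features CasPredict actually uses are not described here.

```python
from collections import Counter
from itertools import product


def kmer_counts(repeat, k=4):
    """Count overlapping k-mers in a repeat sequence (upper-cased first)."""
    seq = repeat.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(repeat, k=4):
    """Return one count per possible DNA k-mer, in a fixed (sorted) order."""
    counts = kmer_counts(repeat, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


# Arbitrary example sequence, used only to show the shape of the result.
repeat = "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC"
vec = kmer_vector(repeat, k=4)
print(len(vec), "features,", sum(vec), "k-mers counted")
```

Every repeat maps to a vector of the same length (4^k entries), whatever the repeat length, which is what makes this representation convenient for a classifier.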

It includes the following subtypes:

It can automatically draw gene maps of (putative) CRISPR-Cas systems, orphan Cas operons, and orphan CRISPR arrays:

  • Cas genes are in red.
  • Arrays are in blue, with their predicted subtype association based on the consensus repeat sequence.

Table of contents

  1. Quick start
  2. Installation
  3. CasPredict - How to
  4. RepeatType - How to

Quick start

conda create -n caspredict -c conda-forge -c bioconda -c russel88 caspredict
conda activate caspredict
caspredict my.fasta my_output

Installation

Conda

We recommend installing with miniconda or anaconda.

Create the environment with caspredict and all dependencies:

conda create -n caspredict -c conda-forge -c bioconda -c russel88 caspredict

pip

If the dependencies (Python >= 3.8, HMMER >= 3.2, Prodigal >= 2.6, grep, sed) are in your PATH, you can install with pip:

python -m pip install caspredict

When installing with pip, you need to download the database manually:

Coming soon...

CasPredict - How to

CasPredict takes a nucleotide FASTA file as input and writes its CRISPR-Cas predictions to an output directory.

Activate environment

conda activate caspredict

Run with a nucleotide fasta as input

caspredict genome.fa my_output

Use multiple threads

caspredict genome.fa my_output -t 20

Check the different options

caspredict -h
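
To process several genomes, the command above can be wrapped in a small script. The sketch below only uses the invocation shown in this section (`caspredict <fasta> <output> -t <threads>`); the helper names, the `*.fa` file pattern, and the one-output-directory-per-genome layout are assumptions, and it has not been checked whether CasPredict accepts an output directory that already exists.

```python
import subprocess
from pathlib import Path


def build_command(fasta, outdir, threads=1):
    """Build the caspredict command line as a list of arguments."""
    cmd = ["caspredict", str(fasta), str(outdir)]
    if threads > 1:
        cmd += ["-t", str(threads)]
    return cmd


def run_all(fasta_dir, out_root, threads=1):
    """Run caspredict once per *.fa file, one output directory per genome."""
    out_root = Path(out_root)
    out_root.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(fasta_dir).glob("*.fa")):
        # The per-genome directory is left for caspredict to create (assumption).
        outdir = out_root / fasta.stem
        subprocess.run(build_command(fasta, outdir, threads), check=True)


# Show the command that would be run for a single genome.
print(" ".join(build_command("genome.fa", "my_output", threads=20)))
```

Calling `run_all("genomes", "results", threads=20)` would then process every FASTA file in a hypothetical `genomes` directory and stop at the first failing run.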

Output

  • CRISPR_Cas.tab: CRISPR-Cas loci, with consensus subtype prediction
    • Contains a consensus prediction (Prediction), and the separate predictions for the Cas operon (Prediction_Cas) and CRISPR arrays (Prediction_CRISPRs)
  • cas_operons.tab: All certain Cas operons
    • Contains a prediction of subtype (Prediction) and the subtype with the highest score (Best_type). If the score is high then Prediction = Best_type
  • crisprs_all.tab: All CRISPR arrays
    • Contains a prediction of the associated subtype based on the repeat sequence (Prediction).
    • The 'Subtype' column is the subtype with highest probability. Prediction = Subtype if Subtype_probability is high
  • crisprs_orphan.tab: Orphan CRISPRs (those not in CRISPR_Cas.tab)
    • Same columns as crisprs_all.tab
  • cas_operons_orphan.tab: Orphan Cas operons (those not in CRISPR_Cas.tab)
    • Same columns as cas_operons.tab
  • CRISPR_Cas_putative.tab: Putative CRISPR-Cas loci, often a lone Cas gene next to a CRISPR array
    • Same columns as CRISPR_Cas.tab
  • cas_operons_putative.tab: Putative Cas operons, mostly false positives, but also some ambiguous and partial systems
    • Same columns as cas_operons.tab
  • spacers/*.fa: Fasta files with all spacer sequences
  • hmmer.tab: All HMM vs. ORF matches, raw unfiltered results
  • arguments.tab: File with arguments given to CasPredict

Notes on output

Files are only created if there is any data. For example, the crisprs_orphan.tab file is only created if there are any orphan CRISPR arrays.
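
Because output files may be absent, downstream scripts should check for them before reading. The sketch below assumes the .tab files are tab-separated with a header line containing the column names mentioned above (for example Prediction); the helper name `read_table` is made up for this example.

```python
import csv
from pathlib import Path


def read_table(outdir, name):
    """Return the rows of an output table as dicts, or [] if the file is absent."""
    path = Path(outdir) / name
    if not path.exists():
        return []
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Print the consensus and the separate predictions for each CRISPR-Cas locus.
for row in read_table("my_output", "CRISPR_Cas.tab"):
    print(row.get("Prediction"), row.get("Prediction_Cas"), row.get("Prediction_CRISPRs"))

# A missing file simply means there were no orphan arrays.
orphans = read_table("my_output", "crisprs_orphan.tab")
print(len(orphans), "orphan CRISPR arrays")
```

Returning an empty list for a missing file lets the same loop handle genomes with and without hits.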

RepeatType - How to

Given a plain text file of CRISPR repeats (one per line), RepeatType predicts the subtype of each repeat from its k-mer composition.

Activate environment

conda activate caspredict

Run with a plain text file containing only CRISPR repeats (in capital letters), one repeat per line:

repeatType repeats.txt
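
If your repeats come from another tool, it can help to normalise them before writing repeats.txt. The sketch below upper-cases each repeat, drops blank lines, and rejects anything other than A, C, G and T. Restricting the alphabet to ACGT is a conservative assumption, since it is not stated here whether RepeatType accepts ambiguity codes; the function name is made up for this example.

```python
import re

# Conservative assumption: only unambiguous DNA letters are allowed.
VALID = re.compile(r"[ACGT]+")


def clean_repeats(lines):
    """Upper-case repeats, skip blank lines, and fail on unexpected characters."""
    cleaned = []
    for number, line in enumerate(lines, start=1):
        seq = line.strip().upper()
        if not seq:
            continue
        if not VALID.fullmatch(seq):
            raise ValueError(f"line {number}: unexpected characters in {seq!r}")
        cleaned.append(seq)
    return cleaned


# Arbitrary example input: mixed case plus an empty line.
repeats = clean_repeats(["gttttagagcta\n", "\n", "CCGGAAT"])
print("\n".join(repeats))
```

Writing the printed lines to a file gives an input in the format described above.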

Output

The script prints:

  • Repeat sequence
  • Predicted subtype
  • Probability of prediction

Notes on output

  • Predictions with probabilities below 0.75 are uncertain, and should be taken with a grain of salt.
  • The classifier was trained only on subtypes for which there were enough (>20) repeats. It can therefore only assign repeats to the following subtypes:
    • I-A, I-B, I-C, I-D, I-E, I-F, I-G
    • II-A, II-B, II-C
    • III-A, III-B, III-C, III-D
    • IV-A1, IV-A2, IV-A3
    • V-A
    • VI-B
  • This is the accuracy per subtype (on an unseen test dataset):
    • I-A 0.60
    • I-B 0.90
    • I-C 0.98
    • I-D 0.47
    • I-E 1.00
    • I-F 0.99
    • I-G 0.83
    • II-A 0.94
    • II-B 1.00
    • II-C 0.89
    • III-A 0.89
    • III-B 0.49
    • III-C 0.60
    • III-D 0.28
    • IV-A1 0.79
    • IV-A2 0.78
    • IV-A3 0.98
    • V-A 0.77
    • VI-B 1.00
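
The notes above can be applied to RepeatType output with a short script. The sketch below assumes each printed line holds the repeat, the predicted subtype and the probability separated by whitespace; that layout is an assumption, and the two example lines are made up for illustration, not real output. The accuracy values are copied from the list above.

```python
# Per-subtype test accuracy, copied from the list above.
TEST_ACCURACY = {
    "I-A": 0.60, "I-B": 0.90, "I-C": 0.98, "I-D": 0.47, "I-E": 1.00,
    "I-F": 0.99, "I-G": 0.83, "II-A": 0.94, "II-B": 1.00, "II-C": 0.89,
    "III-A": 0.89, "III-B": 0.49, "III-C": 0.60, "III-D": 0.28,
    "IV-A1": 0.79, "IV-A2": 0.78, "IV-A3": 0.98, "V-A": 0.77, "VI-B": 1.00,
}


def flag_predictions(text, threshold=0.75):
    """Parse 'repeat subtype probability' lines and mark uncertain predictions."""
    results = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip lines that do not have the assumed three columns
        repeat, subtype, prob = fields
        try:
            probability = float(prob)
        except ValueError:
            continue  # e.g. a header line
        results.append({
            "repeat": repeat,
            "subtype": subtype,
            "probability": probability,
            "certain": probability >= threshold,
            "test_accuracy": TEST_ACCURACY.get(subtype),
        })
    return results


# Made-up lines in the assumed format, for illustration only.
example = "GTTTTAGAGCTATGCTGTTTTG II-A 0.97\nATTTCAATCCACTCACCCATGAAGG III-D 0.52\n"
for result in flag_predictions(example):
    print(result["subtype"], result["certain"], result["test_accuracy"])
```

Reporting the test accuracy next to each call makes it easier to spot predictions for subtypes the classifier handles poorly, such as III-D.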


