
DeepBGC: Biosynthetic Gene Cluster detection and classification

DeepBGC detects BGCs in bacterial and fungal genomes using deep learning. DeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network and a word2vec-like vector embedding of Pfam protein domains. The product class and activity of detected BGCs are predicted using a Random Forest classifier.


[Figure: DeepBGC architecture]

📌 News 📌

  • DeepBGC 0.1.23: Predicted BGCs can now be uploaded for visualization in antiSMASH using a JSON output file
    • Install and run DeepBGC as usual based on instructions below
    • Upload antismash.json from the DeepBGC output folder using "Upload extra annotations" on the antiSMASH page
    • Predicted BGC regions and their prediction scores will be displayed alongside antiSMASH BGCs

Publications

A deep learning genome-mining strategy for biosynthetic gene cluster prediction
Geoffrey D Hannigan, David Prihoda et al., Nucleic Acids Research, gkz654, https://doi.org/10.1093/nar/gkz654

Install using conda (recommended)

  • Install Bioconda by following Steps 1 and 2 from: https://bioconda.github.io/
  • Add conda-forge channel using conda config --add channels conda-forge
  • Run conda install deepbgc to install DeepBGC and all of its dependencies

Install using pip (if conda is not available)

If you don't mind installing the HMMER and Prodigal dependencies manually, you can also install DeepBGC using pip:

pip install deepbgc

Use DeepBGC

Download models and Pfam database

Before you can use DeepBGC, download the trained models and the Pfam database:

deepbgc download

You can display downloaded dependencies and models using:

deepbgc info

Detection and classification

[Figure: DeepBGC pipeline]

Detect and classify BGCs in a genomic sequence. Proteins and Pfam domains are detected automatically if they are not already annotated (HMMER and Prodigal are required).

# Show command help docs
deepbgc pipeline --help

# Detect and classify BGCs in mySequence.fa using DeepBGC detector.
deepbgc pipeline mySequence.fa

# Detect and classify BGCs in mySequence.fa using custom DeepBGC detector trained on your own data.
deepbgc pipeline --detector path/to/myDetector.pkl mySequence.fa

This will produce a mySequence directory with multiple files and a README.txt with file descriptions.

See Train DeepBGC on your own data section below for more information about training a custom detector or classifier.

Example output

See the DeepBGC Example Result Notebook. The data can be downloaded from the releases page.

[Figure: Detected BGC regions]

Train DeepBGC on your own data

You can train your own BGC detection and classification models, see deepbgc train --help for documentation and examples.

Training and validation data can be found in release 0.1.0 and release 0.1.5; see those release pages for the required files.

If you have any questions about using or training DeepBGC, feel free to submit an issue.

Preparing training data

Training examples need to be in Pfam TSV format, which can be generated from your sequence using deepbgc prepare.

First, you will need to manually add an in_cluster column containing 0 for Pfam domains outside a BGC and 1 for Pfam domains inside a BGC. We recommend preparing a separate negative TSV and a positive TSV file, where the column is all 0 or all 1, respectively.

Finally, you will need to manually add a sequence_id column, which identifies a contiguous sequence of Pfam domains from a single sample (a BGC or a negative sequence). Samples are shuffled during training to present the model with positive and negative samples in random order; Pfam domains with the same sequence_id value are kept together. For example, if your training set contains multiple BGCs, the sequence_id column should contain the BGC ID.
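As an illustration, both columns can be added with a short script. This is a minimal sketch assuming the TSV produced by deepbgc prepare is tab-delimited and already contains a pfam_id column; the function name and file names are hypothetical.

```python
import csv

def label_pfam_tsv(in_path, out_path, in_cluster, sequence_id):
    """Append in_cluster and sequence_id columns to a tab-delimited
    Pfam TSV, applying the same label to every row in the file."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = list(reader.fieldnames) + ["in_cluster", "sequence_id"]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            row["in_cluster"] = in_cluster    # 1 for BGC samples, 0 for negatives
            row["sequence_id"] = sequence_id  # groups rows from one sample
            writer.writerow(row)

# Label all rows of a hypothetical positive (BGC) file with in_cluster=1:
# label_pfam_tsv("bgc.pfam.tsv", "bgc.labeled.tsv", 1, "BGC0000001")
```

You would run this once per positive sample (with in_cluster=1 and a unique sequence_id per BGC) and once per negative sequence (with in_cluster=0), then concatenate the labeled files for training.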

! New in version 0.1.17 ! You can now prepare protein FASTA sequences into a Pfam TSV file using deepbgc prepare --protein.

JSON model training template files

DeepBGC uses JSON template files to define the model architecture and training parameters. All templates can be downloaded in release 0.1.0.

The JSON template for the DeepBGC LSTM detector with pfam2vec is structured as follows (the dash-prefixed comments annotate each field and are not part of the JSON itself):

{
  "type": "KerasRNN", - Model architecture (KerasRNN/DiscreteHMM/GeneBorderHMM)
  "build_params": { - Parameters for model architecture
    "batch_size": 16, - Number of training samples processed together in one batch
    "hidden_size": 128, - Size of vector storing the LSTM inner state
    "stateful": true - Remember previous sequence when training next batch
  },
  "fit_params": {
    "timesteps": 256, - Number of pfam2vec vectors trained in one batch
    "validation_size": 0, - Fraction of training data to use for validation (if validation data is not provided explicitly). Use 0.2 to hold out 20% of the data for validation.
    "verbose": 1, - Verbosity during training
    "num_epochs": 1000, - Number of passes over your training set during training. You probably want to use a lower number if not using early stopping on validation data.
    "early_stopping" : { - Stop model training when at certain validation performance
      "monitor": "val_auc_roc", - Use validation AUC ROC to observe performance
      "min_delta": 0.0001, - Stop training when the monitored metric improves by less than 0.0001 over the patience window
      "patience": 20, - How many of the last epochs to check for improvement
      "mode": "max" - Stop training when given metric stops increasing (use "min" for decreasing metrics like loss)
    },
    "shuffle": true, - Shuffle samples in each epoch. Will use "sequence_id" field to group pfam vectors belonging to the same sample and shuffle them together 
    "optimizer": "adam", - Optimizer algorithm
    "learning_rate": 0.0001, - Learning rate
    "weighted": true - Increase weight of less-represented class. Will give more weight to BGC training samples if the non-BGC set is larger.
  },
  "input_params": {
    "features": [ - Array of features to use in model, see deepbgc/features.py
      {
        "type": "ProteinBorderTransformer" - Add two binary flags for pfam domains found at beginning or at end of protein
      },
      {
        "type": "Pfam2VecTransformer", - Convert pfam_id field to pfam2vec vector using provided pfam2vec table
        "vector_path": "#{PFAM2VEC}" - PFAM2VEC variable is filled in using command line argument --config
      }
    ]
  }
}
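The #{PFAM2VEC} placeholder above is filled in from the command line via --config. Conceptually, that substitution amounts to a plain string replacement before the JSON is parsed, as in this minimal sketch (the function name is hypothetical, and it assumes the annotation comments have already been stripped so the text is valid JSON):

```python
import json

def fill_template(template_text, variables):
    """Substitute #{NAME} placeholders in a JSON model template,
    mimicking how --config NAME value fills them in, then parse it."""
    for name, value in variables.items():
        template_text = template_text.replace("#{%s}" % name, value)
    return json.loads(template_text)

# Hypothetical fragment of the detector template:
template = '{"type": "Pfam2VecTransformer", "vector_path": "#{PFAM2VEC}"}'
config = fill_template(template, {"PFAM2VEC": "pfam2vec.csv"})
# config["vector_path"] now points at the pfam2vec table you supplied
```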

The JSON template for the Random Forest classifier is structured as follows:

{
  "type": "RandomForestClassifier", - Type of classifier (RandomForestClassifier)
  "build_params": {
    "n_estimators": 100, - Number of trees in random forest
    "random_state": 0 - Random seed used to get same result each time
  },
  "input_params": {
    "sequence_as_vector": true, - Convert each sample into a single vector
    "features": [
      {
        "type": "OneHotEncodingTransformer" - Convert each sequence of Pfams into a single binary vector (Pfam set)
      }
    ]
  }
}
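With sequence_as_vector enabled, the OneHotEncodingTransformer collapses each candidate BGC into a fixed-length binary vector marking which Pfam domains are present, regardless of order or multiplicity. A minimal sketch of that idea (the vocabulary and helper name are hypothetical, not DeepBGC's internal API):

```python
def pfam_set_vector(pfam_ids, vocabulary):
    """Encode a sequence of Pfam IDs as a binary presence/absence
    vector over a fixed Pfam vocabulary (order-insensitive)."""
    present = set(pfam_ids)
    return [1 if pfam in present else 0 for pfam in vocabulary]

# Hypothetical four-domain vocabulary; duplicates in the input collapse to 1:
vocab = ["PF00001", "PF00067", "PF00501", "PF08242"]
vec = pfam_set_vector(["PF00501", "PF00501", "PF00067"], vocab)
# vec == [0, 1, 1, 0]
```

Because the whole sample becomes one vector, the Random Forest sees a single Pfam-set fingerprint per BGC rather than a per-domain sequence, which is why no recurrent model is needed for classification.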

Using your trained model

Since version 0.1.10 you can provide a direct path to the detector or classifier model like so:

deepbgc pipeline \
    mySequence.fa \
    --detector path/to/myDetector.pkl \
    --classifier path/to/myClassifier.pkl 
