
A Deep Learning Framework for Multi-class Secreted Effector Prediction in Gram-negative Bacteria.


DeepSecE

Implementation of the effector-specific Transformer model for secreted effector prediction in Gram-negative bacteria. DeepSecE achieves state-of-the-art performance in multi-class effector prediction by leveraging the pre-trained protein language model ESM-1b. An additional transformer layer improves the model's ability to capture secretion-related patterns. It also provides a rapid pipeline for identifying type I-IV and VI secretion systems together with their corresponding effectors.

Performance Comparison

We compare several model architectures that differ in pre-trained model and training strategy, and evaluate them with cross-validation (Valid) and on an independent test set (Test). Performance metrics are reported in the table.

Pre-trained Model  Strategy                        ACC            F1             AUPRC
                                                   Valid  Test    Valid  Test    Valid  Test
/                  PSSM+CNN                        0.799  0.822   0.712  0.724   0.752  0.774
TAPEBert           Linear probing                  0.816  0.838   0.764  0.770   0.802  0.822
ESM-1b             Linear probing                  0.876  0.870   0.841  0.810   0.880  0.871
ESM-1b             Finetuning                      0.878  0.850   0.846  0.808   0.887  0.883
ESM-1b             Effector-specific transformer   0.883  0.898   0.848  0.849   0.892  0.879
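
To compute comparable numbers on your own predictions, the sketch below implements accuracy and F1 in plain Python. The averaging scheme behind the F1 column is not stated here, so macro averaging is an assumption, and the label strings ("non", "T3SE", ...) are placeholders rather than the labels the package emits. AUPRC is left out because it needs per-class scores rather than hard labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels, not real DeepSecE output.
y_true = ["non", "T3SE", "T4SE", "T3SE", "non", "T6SE"]
y_pred = ["non", "T3SE", "T3SE", "T3SE", "non", "T6SE"]
print(f"ACC={accuracy(y_true, y_pred):.3f}  macro-F1={macro_f1(y_true, y_pred):.3f}")
```

On a small, imbalanced label set the two metrics can differ noticeably, which is one reason to report both.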

Set up

Requirements

  • python==3.9.7
  • torch==1.10.2
  • biopython==1.79
  • einops==0.4.1
  • fair-esm>=0.4.0
  • tqdm==4.64.0
  • numpy==1.21.2
  • scikit-learn==0.23.2
  • matplotlib==3.5.1
  • seaborn==0.11.0
  • tensorboardX==2.0
  • umap-learn==0.5.3
  • warmup-scheduler==0.3.2

While we have not tested other versions, any reasonably recent version of these requirements should work.
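
If you want to see how an existing environment compares with the list above, a minimal sketch using only the standard library is shown below. The version specifiers are copied from the list; the helper names are illustrative and not part of DeepSecE.

```python
from importlib import metadata

# Specifiers copied from the requirements list (the Python interpreter itself is not checked here).
REQUIRED = {
    "torch": "==1.10.2",
    "biopython": "==1.79",
    "einops": "==0.4.1",
    "fair-esm": ">=0.4.0",
    "tqdm": "==4.64.0",
    "numpy": "==1.21.2",
    "scikit-learn": "==0.23.2",
    "matplotlib": "==3.5.1",
    "seaborn": "==0.11.0",
    "tensorboardX": "==2.0",
    "umap-learn": "==0.5.3",
    "warmup-scheduler": "==0.3.2",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(required):
    """Return (name, specifier, installed version or None) for each requirement."""
    return [(name, spec, installed_version(name)) for name, spec in required.items()]


for name, spec, found in report(REQUIRED):
    status = "missing" if found is None else "installed"
    print(f"{name:18s} required{spec:10s} found={found} ({status})")
```

The sketch only lists what is installed next to each specifier; it does not decide whether a different version is compatible.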

Installation

As a prerequisite, you must have PyTorch installed. We recommend creating a new virtual environment for the installation. For model training and for prediction from separate protein sequence(s), you can install the package with this one-liner.

pip install git+https://github.com/zhangyumeng1sjtu/DeepSecE.git

If you want to plot the sequence attention, install the logomaker package first.

pip install logomaker

If you want to predict secretion systems and effectors, install macsyfinder and hmmer first. You also need to download the TXSS profiles and decompress them into the data directory.

pip install macsyfinder
conda install -c bioconda hmmer
cd data
wget https://tool2-mml.sjtu.edu.cn/DeepSecE/TXSS_profiles.tar.gz
tar -zxvf TXSS_profiles.tar.gz

The weights of the DeepSecE model can be downloaded from https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt.
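
If you would rather fetch the weights from Python than from a browser, a minimal sketch with the standard library follows. The URL is the one given above; the helper names and the "checkpoints" directory are illustrative assumptions. The download call is commented out because it performs a network request.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

CHECKPOINT_URL = "https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt"


def local_name(url):
    """File name taken from the last path component of the URL."""
    return Path(urlparse(url).path).name


def download(url, dest_dir="."):
    """Download url into dest_dir unless the file already exists; return its path."""
    dest = Path(dest_dir) / local_name(url)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest


# Example (hypothetical target directory):
# weights = download(CHECKPOINT_URL, "checkpoints")
# print(weights)  # pass this path to --model_location
```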

Usage

Train model

You can train the DeepSecE model by running train.py or scripts/kfold_train.sh for cross-validation.

for i in {0..4}
do
   python train.py --model effectortransformer \
		--data_dir data \
		--batch_size 32 \
		--lr 5e-5 \
		--weight_decay 4e-5 \
		--dropout_rate 0.4 \
		--num_layers 1 \
		--num_heads 4 \
		--warm_epochs 1 \
		--patience 5 \
		--lr_scheduler cosine \
		--lr_decay_steps 30 \
		--kfold 5 \
		--fold_num $i \
		--log_dir runs/attempt_cv
done

Parameters:

  • --model train an effector transformer or fine-tune an ESM-1b model.
  • --data_dir directory that stores the training data (default: ./data).
  • --num_layers number of trainable transformer layers (default: 1).
  • --num_heads number of attention heads in the effector-specific transformer (default: 4).
  • --patience patience for early stopping during training.
  • --lr_scheduler learning rate scheduler [step, cosine].
  • --log_dir directory that stores training outputs (default: logs).
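
The cross-validation loop above can also be driven from Python, for example to log or modify each command before launching it. The sketch below only builds and prints the argument lists, using exactly the flags and values from the shell loop. The function name is illustrative.

```python
import shlex


def fold_command(fold, kfold=5, log_dir="runs/attempt_cv"):
    """Argument list for one cross-validation fold, mirroring the shell loop."""
    return [
        "python", "train.py",
        "--model", "effectortransformer",
        "--data_dir", "data",
        "--batch_size", "32",
        "--lr", "5e-5",
        "--weight_decay", "4e-5",
        "--dropout_rate", "0.4",
        "--num_layers", "1",
        "--num_heads", "4",
        "--warm_epochs", "1",
        "--patience", "5",
        "--lr_scheduler", "cosine",
        "--lr_decay_steps", "30",
        "--kfold", str(kfold),
        "--fold_num", str(fold),
        "--log_dir", log_dir,
    ]


commands = [fold_command(i) for i in range(5)]
for cmd in commands:
    # To launch a fold, pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems if a path such as the log directory contains spaces.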

Prediction

You can either predict only the types of secreted effectors you are interested in, or predict secretion systems and their corresponding effectors from scratch.

Predict secreted effectors

python predict.py --fasta_path examples/Test.fasta \
		--model_location [path to model weights] \
		--secretion_systems [I II III IV VI] \
		--out_dir examples [--save_attn --no_cuda]

Parameters:

  • --fasta_path path to the input protein FASTA file.
  • --model_location path to the model weights (download from https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt).
  • --secretion_systems type(s) of secretion system to predict (default: I II III IV VI).
  • --out_dir directory that stores prediction outputs.
  • --save_attn add this flag to save the sequence attention of predicted effectors.
  • --no_cuda add this flag when CUDA is not available.
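
Before running a prediction, it can help to confirm that the input file parses as protein FASTA. The sketch below is a small standard-library reader that reports each record's length and any characters outside the 20 standard amino acids. How DeepSecE itself treats such characters is not stated here, so treat the check as a sanity test only. The demo records are made up.

```python
import io

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def nonstandard_residues(seq):
    """Sorted list of characters that are not among the 20 standard residues."""
    return sorted(set(seq.upper()) - STANDARD_AA)


# Made-up records; replace io.StringIO(...) with open("examples/Test.fasta").
demo = io.StringIO(">protA demo\nMKTAYIAKQR\nQISFVK\n>protB\nMXXLB*\n")
for name, seq in read_fasta(demo):
    print(name, len(seq), nonstandard_residues(seq))
```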

Predict secretion system and effectors

Note: Make sure the input file contains the protein sequences encoded in a bacterial genome, listed in genomic order.

python predict_genome.py --fasta_path examples/NC_002516.2_protein.fasta \
			--model_location [path to model weights] \
			--data_dir data \
			--out_dir examples/NC_002516.2 [--save_attn --no_cuda]

Parameters:

  • --fasta_path path to the input protein FASTA file.
  • --model_location path to the model weights (download from https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt).
  • --data_dir directory that stores the TXSS profiles (download from https://tool2-mml.sjtu.edu.cn/DeepSecE/TXSS_profiles.tar.gz).
  • --out_dir directory that stores prediction outputs.
  • --save_attn add this flag to save the sequence attention of predicted effectors.
  • --no_cuda add this flag when CUDA is not available.

It takes about 5 minutes to predict effectors from a bacterial genome containing 3000 protein-coding sequences on an NVIDIA GeForce RTX 2080 Super GPU.
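
For a rough idea of runtime on a genome of a different size, the quoted figure can be scaled linearly, as sketched below. Linear scaling is an assumption: the actual time depends on sequence lengths, hardware, and the secretion system detection step, so treat the result as an order-of-magnitude guide.

```python
REFERENCE_PROTEINS = 3000  # genome size quoted above
REFERENCE_MINUTES = 5.0    # runtime quoted above (RTX 2080 Super)


def estimate_minutes(n_proteins):
    """Linear extrapolation from the quoted benchmark; an assumption, not a measurement."""
    return n_proteins * REFERENCE_MINUTES / REFERENCE_PROTEINS


for n in (1500, 3000, 6000):
    print(f"{n} proteins: about {estimate_minutes(n):.1f} min")
```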

Plot attention

If you save the attention output of the putative effectors (add --save_attn), you can run python scripts/plot_attention.py [directory of prediction output] to plot the saliency map from attention and infer potentially important regions related to protein secretion.
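
One simple way to turn per-residue attention into a candidate region is to look for the contiguous window with the highest mean weight. The sketch below illustrates that idea and is not what plot_attention.py does. The on-disk format of the saved attention is not described here, so the input is assumed to be a plain list of per-residue weights, and the values are made up.

```python
def top_window(weights, width):
    """Return (start, mean) of the contiguous window with the highest mean weight."""
    if not 0 < width <= len(weights):
        raise ValueError("width must be between 1 and len(weights)")
    window_sum = sum(weights[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(weights) - width + 1):
        # Slide by one residue: add the entering weight, drop the leaving one.
        window_sum += weights[start + width - 1] - weights[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / width


# Made-up per-residue weights for a 10-residue toy sequence.
attn = [0.01, 0.02, 0.30, 0.25, 0.20, 0.05, 0.04, 0.05, 0.04, 0.04]
start, mean = top_window(attn, width=3)
print(f"residues {start + 1}-{start + 3} (1-based), mean weight {mean:.3f}")
```

The window width is a free parameter; a few widths are worth trying, since secretion-related regions vary in length.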

Contact

Please contact Yumeng Zhang at zhangyumeng1@sjtu.edu.cn for questions.


