A Deep Learning Framework for Multi-class Secreted Effector Prediction in Gram-negative Bacteria.
DeepSecE
DeepSecE is an implementation of the effector-specific transformer model for predicting secreted effectors in Gram-negative bacteria. It achieves state-of-the-art performance in multi-class effector prediction by leveraging the pre-trained protein language model ESM-1b, and an additional transformer layer helps it capture secretion-related patterns. It also provides a rapid pipeline to identify type I-IV and VI secretion systems together with their corresponding effectors.
Performance Comparison
We compared several model architectures with different pre-trained models and training strategies, and evaluated them by cross-validation and on an independent test set. Performance metrics are reported in the table.
Pre-trained Model | Strategy | ACC (Valid) | ACC (Test) | F1 (Valid) | F1 (Test) | AUPRC (Valid) | AUPRC (Test) |
---|---|---|---|---|---|---|---|
/ | PSSM+CNN | 0.799 | 0.822 | 0.712 | 0.724 | 0.752 | 0.774 |
TAPEBert | Linear probing | 0.816 | 0.838 | 0.764 | 0.770 | 0.802 | 0.822 |
ESM-1b | Linear probing | 0.876 | 0.870 | 0.841 | 0.810 | 0.880 | 0.871 |
ESM-1b | Finetuning | 0.878 | 0.850 | 0.846 | 0.808 | 0.887 | 0.883 |
ESM-1b | Effector-specific transformer | 0.883 | 0.898 | 0.848 | 0.849 | 0.892 | 0.879 |
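As a reading aid for the ACC and F1 columns, here is a minimal pure-Python sketch of multi-class accuracy and F1. The README does not state how F1 is averaged across classes, so the macro (unweighted) average used here is an assumption, and the class labels are illustrative only. AUPRC needs per-class probability scores and is not covered; scikit-learn (listed in the requirements) provides `average_precision_score` for that.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of sequences whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: non-effector plus two effector types (hypothetical names).
y_true = ["non", "T3SE", "T4SE", "T3SE", "non", "T4SE"]
y_pred = ["non", "T3SE", "T3SE", "T3SE", "non", "T4SE"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```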
Set up
Requirements
- python==3.9.7
- torch==1.10.2
- biopython==1.79
- einops==0.4.1
- fair-esm>=0.4.0
- tqdm==4.64.0
- numpy==1.21.2
- scikit-learn==0.23.2
- matplotlib==3.5.1
- seaborn==0.11.0
- tensorboardX==2.0
- umap-learn==0.5.3
- warmup-scheduler==0.3.2
While we have not tested with other versions, any reasonably recent versions of these requirements should work.
Installation
As a prerequisite, you must have PyTorch installed. It is recommended to create a new virtual environment for installation. For model training and prediction from separate protein sequence(s), you can use this one-liner for installation.
pip install git+https://github.com/zhangyumeng1sjtu/DeepSecE.git
If you want to plot the sequence attention, you should install the logomaker package first.
pip install logomaker
If you want to predict secretion systems and effectors, you should install macsyfinder and hmmer first. You also need to download the TXSS profiles and decompress them into the data directory.
pip install macsyfinder
conda install -c bioconda hmmer
cd data
wget https://tool2-mml.sjtu.edu.cn/DeepSecE/TXSS_profiles.tar.gz
tar -zxvf TXSS_profiles.tar.gz
The weights of DeepSecE model can be downloaded from https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt.
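Where `wget` is not available, the same downloads can be scripted with the Python standard library. This is a sketch only: the two URLs are the ones given above, while the helper names `fetch` and `extract_profiles` and the local file paths are illustrative choices, not part of DeepSecE.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://tool2-mml.sjtu.edu.cn/DeepSecE"  # taken from the commands above


def fetch(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_profiles(archive, data_dir):
    """Unpack a .tar.gz archive into data_dir and list its top-level entries."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(data_dir)
    return sorted(p.name for p in data_dir.iterdir())


# Example usage (needs network access, so it is left commented out):
# archive = fetch(f"{BASE_URL}/TXSS_profiles.tar.gz", "data/TXSS_profiles.tar.gz")
# extract_profiles(archive, "data")
# fetch(f"{BASE_URL}/checkpoint.pt", "checkpoint.pt")
```

Skipping files that already exist keeps repeated runs from downloading the archive and the checkpoint again.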
Usage
Train model
You can train the DeepSecE model by running train.py, or scripts/kfold_train.sh for cross-validation.
for i in {0..4}
do
python train.py --model effectortransformer \
--data_dir data \
--batch_size 32 \
--lr 5e-5 \
--weight_decay 4e-5 \
--dropout_rate 0.4 \
--num_layers 1 \
--num_heads 4 \
--warm_epochs 1 \
--patience 5 \
--lr_scheduler cosine \
--lr_decay_steps 30 \
--kfold 5 \
--fold_num $i \
--log_dir runs/attempt_cv
done
Parameters:
- --model: train an effector transformer or fine-tune an ESM-1b model.
- --data_dir: directory that stores training data (default: ./data).
- --num_layers: number of trainable transformer layers (default: 1).
- --num_heads: number of attention heads in the effector-specific transformer (default: 4).
- --patience: patience for early stopping used in training.
- --lr_scheduler: learning rate scheduler [step, cosine].
- --log_dir: directory that stores training outputs (default: logs).
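The cross-validation loop above can also be driven from Python, which is convenient when the folds are part of a larger script. The sketch below only assembles the same command line for each fold; the flags and values are copied from the shell loop, while the helper names `kfold_commands` and `run_all` are hypothetical.

```python
import subprocess
import sys

# Hyperparameters copied from the shell loop above.
BASE_ARGS = {
    "--model": "effectortransformer",
    "--data_dir": "data",
    "--batch_size": "32",
    "--lr": "5e-5",
    "--weight_decay": "4e-5",
    "--dropout_rate": "0.4",
    "--num_layers": "1",
    "--num_heads": "4",
    "--warm_epochs": "1",
    "--patience": "5",
    "--lr_scheduler": "cosine",
    "--lr_decay_steps": "30",
}


def kfold_commands(kfold=5, log_dir="runs/attempt_cv", script="train.py"):
    """Build one argument list per fold (fold_num runs from 0 to kfold - 1)."""
    shared = [sys.executable, script]
    for flag, value in BASE_ARGS.items():
        shared += [flag, value]
    shared += ["--kfold", str(kfold), "--log_dir", log_dir]
    return [shared + ["--fold_num", str(i)] for i in range(kfold)]


def run_all(commands):
    """Run the folds one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Usage (run from the repository root so that train.py is found):
# run_all(kfold_commands())
```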
Prediction
You can either predict only the type(s) of secreted effectors you are interested in, or predict secretion systems and their corresponding effectors from scratch.
Predict secretion effector
python predict.py --fasta_path examples/Test.fasta \
--model_location [path to model weights] \
--secretion_systems [I II III IV VI] \
--out_dir examples [--save_attn --no_cuda]
Parameters:
- --fasta_path: path to the input protein FASTA file.
- --model_location: path to the model weights (available at https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt).
- --secretion_systems: type(s) of secretion system to predict (default: I II III IV VI).
- --out_dir: directory that stores prediction outputs.
- --save_attn: add to save the sequence attention of each effector.
- --no_cuda: add when CUDA is not available.
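Before passing a FASTA file to predict.py, a quick sanity check of the input can save a failed run. The sketch below is not part of DeepSecE: the helper names are hypothetical, and the 1022-residue limit reflects the context length of ESM-1b, not documented behaviour of predict.py (how longer sequences are handled is not stated here).

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_records(records, max_len=1022):
    """Return (name, problem) pairs for empty, non-standard or long sequences."""
    problems = []
    for header, seq in records:
        fields = header.split()
        name = fields[0] if fields else "<unnamed>"
        odd = set(seq.upper()) - STANDARD
        if not seq:
            problems.append((name, "empty sequence"))
        if odd:
            problems.append((name, "non-standard residues: " + "".join(sorted(odd))))
        if len(seq) > max_len:
            problems.append((name, f"length {len(seq)} exceeds {max_len}"))
    return problems


example = [">p1 demo", "MKT", "AYI", ">p2", "MXB*"]
records = list(read_fasta(example))
print(check_records(records))
```

For a real file, pass an open file handle, for example `check_records(read_fasta(open("examples/Test.fasta")))`.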
Predict secretion system and effectors
Note: Make sure the input file contains the protein sequences in the order in which they are encoded in the bacterial genome.
python predict_genome.py --fasta_path examples/NC_002516.2_protein.fasta \
--model_location [path to model weights] \
--data_dir data \
--out_dir examples/NC_002516.2 [--save_attn --no_cuda]
Parameters:
- --fasta_path: path to the input protein FASTA file.
- --model_location: path to the model weights (available at https://tool2-mml.sjtu.edu.cn/DeepSecE/checkpoint.pt).
- --data_dir: directory that stores the TXSS profiles (available at https://tool2-mml.sjtu.edu.cn/DeepSecE/TXSS_profiles.tar.gz).
- --out_dir: directory that stores prediction outputs.
- --save_attn: add to save the sequence attention of each effector.
- --no_cuda: add when CUDA is not available.
It takes about 5 minutes to predict effectors from a bacterial genome containing 3000 protein-coding sequences on an NVIDIA GeForce RTX 2080 Super GPU.
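Because the genome pipeline depends on macsyfinder and hmmer from the installation step, it can help to confirm that they are reachable before starting a run. The following is a sketch under assumptions: the executable names `macsyfinder` and `hmmsearch` are the commands those packages normally install, but which of them predict_genome.py actually calls is not stated in this README, and the helper name is hypothetical.

```python
import shutil
from pathlib import Path


def missing_prerequisites(data_dir="data", tools=("macsyfinder", "hmmsearch")):
    """Return a list of tools not found on PATH, plus data_dir if it is absent."""
    missing = [tool for tool in tools if shutil.which(tool) is None]
    if not Path(data_dir).is_dir():
        missing.append(f"directory {data_dir}")
    return missing


# Usage: an empty list means every check passed.
# print(missing_prerequisites())
```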
Plot attention
If you save the attention output of the putative effectors (add --save_attn), you can run python scripts/plot_attention.py [directory of prediction output] to plot the saliency map from attention, and infer potentially important regions related to protein secretion.
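To illustrate what inferring an important region from attention can mean in practice, the sketch below scans a per-residue weight vector for the contiguous window with the highest mean weight. It is an illustration only: the format of the files written by --save_attn is not described here, so the input is assumed to be a plain list of per-residue weights, and `top_window` is a hypothetical helper, not part of scripts/plot_attention.py.

```python
def top_window(weights, width=5):
    """Return (start, end, mean) of the window with the highest mean weight.

    start is inclusive and end is exclusive, both 0-based residue indices.
    """
    if not 0 < width <= len(weights):
        raise ValueError("width must be between 1 and len(weights)")
    total = sum(weights[:width])
    best_total, best_start = total, 0
    for start in range(1, len(weights) - width + 1):
        # Slide the window one residue to the right.
        total += weights[start + width - 1] - weights[start - 1]
        if total > best_total:
            best_total, best_start = total, start
    return best_start, best_start + width, best_total / width


# Made-up attention weights for a 7-residue sequence.
weights = [0.1, 0.1, 0.9, 0.8, 0.7, 0.1, 0.1]
start, end, mean = top_window(weights, width=3)
print(start, end, round(mean, 3))
```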
Contact
Please contact Yumeng Zhang at zhangyumeng1@sjtu.edu.cn for questions.