SENSE-PPI: Sequence-based EvolutioNary ScalE Protein-Protein Interaction prediction
SENSE-PPI
SENSE-PPI is a deep learning model for predicting physical protein-protein interactions from amino acid sequences. It is based on embeddings generated by ESM2 and uses a Siamese RNN architecture to perform binary classification.
Installation
SENSE-PPI requires Python 3.9 or higher. To install the package, run:
pip install senseppi
N.B.: if you intend to use the create_dataset command to generate new datasets from STRING, you must also install the MMseqs2 software (instructions can be found at: https://github.com/soedinglab/MMseqs2). The mmseqs command should be available in your PATH.
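Before running create_dataset, it can help to confirm that the mmseqs executable is actually reachable. The snippet below is a minimal sketch using only the Python standard library; the helper name find_mmseqs is hypothetical and is not part of the senseppi package.

```python
import shutil


def find_mmseqs(executable="mmseqs"):
    """Return the full path of the executable if it is on PATH, else None.

    Hypothetical helper for illustration; not part of senseppi.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_mmseqs()
    if path is None:
        print("mmseqs was not found on PATH; install MMseqs2 before using create_dataset")
    else:
        print(f"mmseqs found at {path}")
```

If the check reports that mmseqs is missing, follow the installation instructions in the MMseqs2 repository and make sure the directory containing the binary is listed in PATH.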
Usage
There are 5 commands available in the package:
- train: trains SENSE-PPI on a given dataset
- test: computes test metrics (AUROC, AUPRC, F1, MCC, Precision, Recall, Accuracy) on a given dataset
- predict: predicts interactions for a given dataset
- predict_string: predicts interactions for a dataset built from the STRING database: the interactions are taken from STRING (based on seed proteins), and the predictions are compared with the STRING database. Optionally, graphs can be constructed.
- create_dataset: creates a dataset from the STRING database based on the taxonomic ID of the organism
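For readers unfamiliar with the threshold-based metrics reported by the test command, the sketch below shows how Precision, Recall, F1, MCC and Accuracy follow from a binary confusion matrix. It is an illustration in pure Python, not the package's own implementation; the function name binary_metrics is hypothetical. AUROC and AUPRC are omitted because they require continuous scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute threshold-based metrics from 0/1 labels and 0/1 predictions.

    Hypothetical helper for illustration; not the senseppi implementation.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    # Matthews correlation coefficient
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels: 1 = interacting pair, 0 = non-interacting pair
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC is included alongside F1 because it takes all four cells of the confusion matrix into account, which makes it informative when positive and negative pairs are imbalanced.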
The package comes with one pretrained version of the model, fly_worm_human_chiken.ckpt (a checkpoint with weights), which is used by default if no model path is specified.
This model was trained on a dataset combining PPIs from D. melanogaster, C. elegans, H. sapiens and G. gallus, and it provides the best performance among the pretrained models.
The original SENSE-PPI repository also contains two models pretrained on human PPIs: senseppi.ckpt and dscript.ckpt, trained on the SENSE-PPI and DSCRIPT human datasets respectively.
For information about the other models that can be found in the pretrained_models folder, please refer to the original article.
N.B.: All pretrained models were made to work with proteins in the range of 50-800 amino acids.
- When you run the predict command, the model automatically takes 1 as the minimum length and the length of the longest protein in the dataset as the maximum length. However, it is strongly recommended to use proteins in the range of 50-800 amino acids for the best performance.
- If you use the --min_len and --max_len arguments, your FASTA file will be filtered automatically, so make sure you have a backup.
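Because length filtering modifies the FASTA file, one option is to make a backup and inspect a filtered copy yourself before passing the file to the tool. The following is a minimal sketch under stated assumptions: the function names and the ".bak" / ".filtered" file naming are hypothetical, and treating both length bounds as inclusive is an assumption rather than documented behaviour of --min_len and --max_len.

```python
import shutil
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=50, max_len=800):
    """Keep records with min_len <= length <= max_len (inclusive bounds assumed)."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


def backup_and_filter(fasta_path, min_len=50, max_len=800):
    """Copy the input to <name>.bak and write a filtered copy next to it.

    Hypothetical helper for illustration; the original file is left untouched.
    """
    fasta_path = Path(fasta_path)
    backup = fasta_path.with_name(fasta_path.name + ".bak")
    shutil.copy2(fasta_path, backup)

    kept = filter_by_length(read_fasta(fasta_path), min_len, max_len)
    filtered = fasta_path.with_name(
        fasta_path.stem + ".filtered" + fasta_path.suffix
    )
    with open(filtered, "w") as out:
        for header, seq in kept:
            out.write(f">{header}\n{seq}\n")
    return backup, filtered
```

Calling backup_and_filter("proteins.fasta") would leave proteins.fasta unchanged and create proteins.fasta.bak and proteins.filtered.fasta alongside it; the file name here is only an example.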
In order to cite the original SENSE-PPI paper, please use the following link: https://doi.org/10.1101/2023.09.19.558413
The documentation for the package can be found here.