Python package for generating various biochemical, physiochemical and structural descriptors/features of protein sequences.
protpy - Generate physiochemical, biochemical and structural descriptors of proteins from their constituent amino acids
Introduction
protpy is a Python software package for generating a variety of physiochemical, biochemical and structural descriptors for proteins. All of these descriptors are calculated using sequence-derived or physiochemical features of the amino acids that make up the proteins. These descriptors have been widely studied and used in a range of bioinformatics applications, including protein engineering, SAR (sequence-activity relationships), prediction of protein structure and function, subcellular localization, protein-protein interactions and drug-target interactions. The descriptors available in protpy include:
- MoreauBroto Autocorrelation (MBAuto)
- Moran Autocorrelation (MAuto)
- Geary Autocorrelation (GAuto)
- Amino Acid Composition (AAComp)
- Dipeptide Composition (DPComp)
- Tripeptide Composition (TPComp)
- Pseudo Amino Acid Composition (PAAComp)
- Amphiphilic Amino Acid Composition (AAAComp)
- Conjoint Triad (CTriad)
- CTD (Composition, Transition, Distribution) (CTD)
- Sequence Order Coupling Number (SOCN)
- Quasi Sequence Order (QSO)
This software is aimed at any researcher using protein sequence or structural data and was mainly created for use in my own project, pySAR, which uses protein sequence data to identify Sequence Activity Relationships (SAR) using Machine Learning [1]. protpy is built solely in Python 3 and was developed specifically in Python 3.10.
Requirements
Installation
Install the latest version of protpy using pip:
pip3 install protpy --upgrade
Install by cloning the repository:
git clone https://github.com/amckenna41/protpy.git
cd protpy
python3 setup.py install
Usage
Import protpy after installation:
import protpy
Import a protein sequence from a FASTA file:
from Bio import SeqIO
# read the first record in the FASTA file and convert its sequence to a string
with open("test_fasta.fasta") as pro:
    protein_seq = str(next(SeqIO.parse(pro, "fasta")).seq)
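If Biopython is not installed, the first record of a FASTA file can also be read with the standard library alone. The following is a minimal sketch, not part of protpy; the helper name `read_first_fasta_sequence` and the example record are hypothetical.

```python
from io import StringIO


def read_first_fasta_sequence(handle):
    """Return the first sequence in a FASTA-formatted text handle as a string."""
    sequence_lines = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here, so stop reading
            seen_header = True
            continue
        if seen_header:
            sequence_lines.append(line)
    return "".join(sequence_lines).upper()


# Hypothetical in-memory FASTA content; a real file opened with open() works the same way.
example_fasta = StringIO(">example_protein\nMKTAYIAKQR\nQISFVKSHFS\n>second\nAAAA\n")
protein_seq = read_first_fasta_sequence(example_fasta)
print(protein_seq)
```

The resulting string can be passed to the descriptor functions in the same way as the Biopython version.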
Composition Descriptors
Calculate Amino Acid Composition (AAComp):
amino_acid_comp = protpy.amino_acid_composition(protein_seq)
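As a rough illustration of what this descriptor measures, the sketch below computes the percentage of each of the 20 canonical amino acids in a sequence. It is not protpy's implementation; the function name `amino_acid_composition_sketch`, the use of percentages and the rounding to three decimals are assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition_sketch(sequence):
    """Percentage of each of the 20 canonical amino acids in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    # Counter returns 0 for residues that never occur
    return {aa: round(100.0 * counts[aa] / length, 3) for aa in AMINO_ACIDS}


composition = amino_acid_composition_sketch("MKTAYIAKQR")
print(composition["A"], composition["K"])
```

The result always has 20 entries, one per amino acid, regardless of sequence length.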
Calculate Dipeptide Composition (DPComp):
dipeptide_comp = protpy.dipeptide_composition(protein_seq)
Calculate Tripeptide Composition (TPComp):
tripeptide_comp = protpy.tripeptide_composition(protein_seq)
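Dipeptide and tripeptide composition both count overlapping windows of k residues (k = 2 and k = 3), giving 400 and 8000 features respectively. The sketch below shows the idea for a general k. It is an illustration only: the name `kmer_composition_sketch` is hypothetical, and reporting values as a percentage of all windows is an assumption, since a package may report raw counts instead.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition_sketch(sequence, k):
    """Percentage of each possible k-residue window among all overlapping windows."""
    sequence = sequence.upper()
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    # start every possible k-mer at zero so the output size is fixed (20**k)
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows with non-canonical letters are ignored
            counts[kmer] += 1
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


dipeptides = kmer_composition_sketch("MKTAYIAKQR", 2)
tripeptides = kmer_composition_sketch("MKTAYIAKQR", 3)
print(len(dipeptides), len(tripeptides))
```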
Calculate Pseudo Amino Acid Composition (PAAComp):
pseudo_comp = protpy.pseudo_amino_acid_composition(protein_seq, lamda=30, weight=0.05)
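To show how the `lamda` and `weight` parameters enter the calculation, here is a simplified sketch in the spirit of Chou's pseudo amino acid composition [11]. It is not protpy's implementation: it uses a single property (the Kyte-Doolittle hydropathy scale, standardized over the 20 amino acids) where the original method combines several, and the function name `pseudo_aac_sketch` is hypothetical.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values, used here as the only property (an assumption)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def pseudo_aac_sketch(sequence, lamda=3, weight=0.05):
    """Return 20 composition terms followed by `lamda` sequence-order terms."""
    seq = sequence.upper()
    n = len(seq)
    if lamda >= n:
        raise ValueError("lamda must be smaller than the sequence length")
    values = list(KYTE_DOOLITTLE.values())
    mu, sd = mean(values), pstdev(values)
    prop = {aa: (v - mu) / sd for aa, v in KYTE_DOOLITTLE.items()}
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    # theta_k: mean squared property difference between residues k positions apart
    thetas = [
        mean([(prop[seq[i]] - prop[seq[i + k]]) ** 2 for i in range(n - k)])
        for k in range(1, lamda + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


features = pseudo_aac_sketch("MKTAYIAKQRQISFVKSHFS", lamda=3, weight=0.05)
print(len(features))
```

The output has 20 + `lamda` values that sum to 1, and a larger `weight` shifts more of that total onto the sequence-order terms.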
Calculate Amphiphilic Amino Acid Composition (AAAComp):
amphiphilic_comp = protpy.amphiphilic_amino_acid_composition(protein_seq, lamda=30, weight=0.5)
Autocorrelation Descriptors
Calculate MoreauBroto Autocorrelation (MBAuto):
moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq, lag=30, normalize=True)
Calculate Moran Autocorrelation (MAuto):
moran_autocorrelation = protpy.moran_autocorrelation(protein_seq, lag=30, normalize=True)
Calculate Geary Autocorrelation (GAuto):
geary_autocorrelation = protpy.geary_autocorrelation(protein_seq, lag=30, normalize=True)
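The three autocorrelation descriptors differ only in how they compare a property value at position i with the value at position i + d, for each lag d. The sketch below uses the textbook definitions with a single property (Kyte-Doolittle hydropathy, standardized); the property choice, the function names and the always-normalized Moreau-Broto form are assumptions for illustration, not protpy's implementation.

```python
from statistics import mean, pstdev

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_profile(sequence):
    """Map each residue to its standardized property value."""
    values = list(KYTE_DOOLITTLE.values())
    mu, sd = mean(values), pstdev(values)
    return [(KYTE_DOOLITTLE[aa] - mu) / sd for aa in sequence.upper()]


def moreaubroto_sketch(sequence, lag):
    """Mean product of property values d positions apart, for d = 1..lag."""
    p = property_profile(sequence)
    n = len(p)
    return [sum(p[i] * p[i + d] for i in range(n - d)) / (n - d)
            for d in range(1, lag + 1)]


def moran_sketch(sequence, lag):
    """Covariance at distance d divided by the overall variance."""
    p = property_profile(sequence)
    n, p_bar = len(p), mean(p)
    variance = sum((x - p_bar) ** 2 for x in p) / n
    return [(sum((p[i] - p_bar) * (p[i + d] - p_bar) for i in range(n - d)) / (n - d)) / variance
            for d in range(1, lag + 1)]


def geary_sketch(sequence, lag):
    """Mean squared difference at distance d divided by the overall variance."""
    p = property_profile(sequence)
    n, p_bar = len(p), mean(p)
    variance = sum((x - p_bar) ** 2 for x in p) / (n - 1)
    return [(sum((p[i] - p[i + d]) ** 2 for i in range(n - d)) / (2 * (n - d))) / variance
            for d in range(1, lag + 1)]


example_seq = "MKTAYIAKQRQISFVKSHFS"
mb_values = moreaubroto_sketch(example_seq, lag=5)
moran_values = moran_sketch(example_seq, lag=5)
geary_values = geary_sketch(example_seq, lag=5)
print(len(mb_values), len(moran_values), len(geary_values))
```

Each function returns one value per lag for one property, so the lag must be smaller than the sequence length.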
Conjoint Triad Descriptors
Calculate Conjoint Triad (CTriad):
conj_triad = protpy.conjoint_triad(protein_seq)
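The conjoint triad method [9] groups the 20 amino acids into seven classes and counts every run of three consecutive classes, giving 7 x 7 x 7 = 343 features. The sketch below illustrates this with raw counts; the function name `conjoint_triad_sketch` is hypothetical, and any normalization a package applies afterwards is not shown.

```python
from itertools import product

# Seven amino acid classes as described by Shen et al. [9]
CLASSES = {
    "1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
    "5": "RK", "6": "DE", "7": "C",
}
AA_TO_CLASS = {aa: label for label, members in CLASSES.items() for aa in members}


def conjoint_triad_sketch(sequence):
    """Count each of the 343 possible class triads in the sequence."""
    # replace every residue by its class label, e.g. "MKT" -> "353"
    encoded = "".join(AA_TO_CLASS[aa] for aa in sequence.upper())
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    return counts


triads = conjoint_triad_sketch("MKTAYIAKQR")
print(len(triads), sum(triads.values()))
```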
CTD
Calculate Composition from CTD (CTD):
ctd_composition = protpy.ctd_composition(protein_seq)
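CTD descriptors [8] first map each residue to one of three classes for a given property and then summarize the resulting class string. The sketch below shows the Composition and Transition parts for the hydrophobicity grouping (polar, neutral, hydrophobic). The function names are hypothetical and only one property is covered, so this is an illustration rather than protpy's implementation.

```python
# Hydrophobicity grouping: 1 = polar, 2 = neutral, 3 = hydrophobic
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_CLASS = {
    aa: label
    for label, members in HYDROPHOBICITY_CLASSES.items()
    for aa in members
}


def ctd_composition_sketch(sequence):
    """Fraction of residues that fall in each of the three classes."""
    encoded = [AA_TO_CLASS[aa] for aa in sequence.upper()]
    return {label: encoded.count(label) / len(encoded)
            for label in HYDROPHOBICITY_CLASSES}


def ctd_transition_sketch(sequence):
    """Fraction of adjacent residue pairs that switch between two given classes."""
    encoded = [AA_TO_CLASS[aa] for aa in sequence.upper()]
    pairs = list(zip(encoded, encoded[1:]))
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        # a change in either direction (a->b or b->a) counts as one transition
        changes = sum(1 for x, y in pairs if {x, y} == {a, b})
        result[a + b] = changes / len(pairs)
    return result


ctd_c = ctd_composition_sketch("MKTAYIAKQR")
ctd_t = ctd_transition_sketch("MKTAYIAKQR")
print(ctd_c, ctd_t)
```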
Sequence Order Descriptors
Calculate Sequence Order Coupling Number (SOCN):
socn = protpy.sequence_order_coupling_number(protein_seq, lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json")
Calculate Quasi Sequence Order (QSO):
qso = protpy.quasi_sequence_order(protein_seq, lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json")
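Both sequence order descriptors are built from a pairwise distance between amino acids [10]. The sketch below shows the general shape of the calculation. The Schneider-Wrede matrix is not reproduced here, so `toy_distance` (based on Kyte-Doolittle hydropathy) is a hypothetical stand-in, and the function names and the default `weight=0.1` are assumptions as well.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def toy_distance(a, b):
    """Hypothetical stand-in for one entry of a physiochemical distance matrix."""
    # 9.0 is the range of the scale (4.5 - (-4.5)), so distances lie in [0, 1]
    return abs(KYTE_DOOLITTLE[a] - KYTE_DOOLITTLE[b]) / 9.0


def socn_sketch(sequence, lag):
    """Sum of squared distances between residues d positions apart, for d = 1..lag."""
    seq = sequence.upper()
    n = len(seq)
    return [sum(toy_distance(seq[i], seq[i + d]) ** 2 for i in range(n - d))
            for d in range(1, lag + 1)]


def qso_sketch(sequence, lag, weight=0.1):
    """20 composition terms plus `lag` coupling terms, normalized to sum to 1."""
    seq = sequence.upper()
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    taus = socn_sketch(seq, lag)
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


socn_values = socn_sketch("MKTAYIAKQRQISFVKSHFS", lag=5)
qso_values = qso_sketch("MKTAYIAKQRQISFVKSHFS", lag=5)
print(len(socn_values), len(qso_values))
```

With a real distance matrix, only `toy_distance` would need to change, for example to a lookup in a nested dictionary loaded from a JSON file.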
Directories
- /tests - unit and integration tests for the protpy package.
- /protpy - source code and all required external data files for the package.
- /docs - protpy documentation.
Tests
To run all tests, from the main protpy folder run:
python3 -m unittest discover tests
Contact
If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.
References
[1]: Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128(104016), 104016. https://doi.org/10.1016/j.jbi.2022.104016
[2]: Shuichi Kawashima, Minoru Kanehisa, AAindex: Amino Acid index database, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Page 374, https://doi.org/10.1093/nar/28.1.374
[3]: Dong, J., Yao, ZJ., Zhang, L. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 10, 16 (2018). https://doi.org/10.1186/s13321-018-0270-2
[4]: Reczko, M. and Bohr, H. (1994) The DEF data base of sequence based protein
fold class predictions. Nucleic Acids Res, 22, 3616-3619.
[5]: Hua, S. and Sun, Z. (2001) Support vector machine approach for protein
subcellular localization prediction. Bioinformatics, 17, 721-728.
[6]: Broto P, Moreau G, Vandycke C: Molecular structures: perception,
autocorrelation descriptor and SAR studies. Eur J Med Chem 1984, 19: 71–78.
[7]: Ong, S.A., Lin, H.H., Chen, Y.Z. et al. Efficacy of different protein
descriptors in predicting protein functional families. BMC Bioinformatics
8, 300 (2007). https://doi.org/10.1186/1471-2105-8-300
[8]: Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction
of protein folding class using global description of amino acid sequence.
Proc. Natl. Acad. Sci. USA, 1995, 92, 8700-8704.
[9]: Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen,
Yixue Li, Hualiang Jiang. Predicting protein-protein interactions based only
on sequences information. PNAS. 2007 (104) 4337-4341.
[10]: Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating
Quasi-Sequence-Order Effect. Biochemical and Biophysical Research
Communications 2000, 278, 477-483.
[11]: Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using
Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and
Genetics, 2001, 43: 246-255.
[12]: Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme
subfamily classes. Bioinformatics, 2005,21,10-19.