Python package for generating various biochemical, physicochemical and structural descriptors/features of protein sequences.
protpy - Used for generating physicochemical, biochemical and structural descriptors of proteins from their constituent amino acids.
Introduction
protpy is a Python software package for generating a variety of physicochemical, biochemical and structural descriptors for proteins. All of these descriptors are calculated using sequence-derived or physicochemical features of the amino acids that make up the proteins. These descriptors have been widely studied and used in a range of bioinformatics applications, including protein engineering, SAR (sequence-activity relationships), predicting protein structure and function, subcellular localization, protein-protein interactions and drug-target interactions. The descriptors available in protpy include:
- Moreau-Broto Autocorrelation (MBAuto)
- Moran Autocorrelation (MAuto)
- Geary Autocorrelation (GAuto)
- Amino Acid Composition (AAComp)
- Dipeptide Composition (DPComp)
- Tripeptide Composition (TPComp)
- Pseudo Amino Acid Composition (PAAComp)
- Amphiphilic Amino Acid Composition (AAAComp)
- Conjoint Triad (CTriad)
- CTD (Composition, Transition, Distribution) (CTD)
- Sequence Order Coupling Number (SOCN)
- Quasi Sequence Order (QSO)
This software is aimed at any researcher using protein sequence/structural data. It was mainly created for use in my own project, pySAR, which uses protein sequence data to identify Sequence Activity Relationships (SAR) using Machine Learning [1]. protpy is built solely in Python 3 and was specifically developed in Python 3.10.
A demo of the software is available here.
Requirements
Installation
Install the latest version of protpy using pip:
pip3 install protpy --upgrade
Install by cloning the repository:
git clone https://github.com/amckenna41/protpy.git
cd protpy
python3 setup.py install
Usage
Import protpy after installation:
import protpy as protpy
Import protein sequence from fasta:
from Bio import SeqIO
with open("test_fasta.fasta") as pro:
    protein_seq = str(next(SeqIO.parse(pro, 'fasta')).seq)
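If Biopython is not installed, the first sequence of a FASTA file can also be read with the standard library alone. The following is a minimal sketch, not part of protpy; the function name `read_first_fasta_sequence` and the toy record are hypothetical and used only for illustration.

```python
import io


def read_first_fasta_sequence(handle):
    """Return the sequence of the first record in an open FASTA text handle."""
    parts = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here, so stop reading
            seen_header = True
        elif seen_header:
            parts.append(line)  # sequence lines may be wrapped
    return "".join(parts).upper()


# Toy in-memory FASTA content (not real data) standing in for a file on disk.
demo = io.StringIO(">toy_record\nACDE\nFGH\n>second_record\nKLM\n")
first_seq = read_first_fasta_sequence(demo)
print(first_seq)
```

With a real file, the same function can be called as `read_first_fasta_sequence(open("test_fasta.fasta"))`, ideally inside a `with` block.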
Composition Descriptors
Calculate Amino Acid Composition (AAComp):
amino_acid_comp = protpy.amino_acid_composition(protein_seq)
#
Calculate Dipeptide Composition (DPComp):
dipeptide_comp = protpy.dipeptide_composition(protein_seq)
#
Calculate Tripeptide Composition (TPComp):
tripeptide_comp = protpy.tripeptide_composition(protein_seq)
#
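To make clear what the amino acid, dipeptide and tripeptide compositions measure, here is a plain-Python sketch that computes the percentage of each possible k-residue peptide over all overlapping windows of a sequence (k = 1, 2 or 3). It is an illustration under stated assumptions, not protpy's implementation: the helper name `kmer_composition` is hypothetical, and protpy's actual return type and rounding are not described here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def kmer_composition(sequence, k=1):
    """Percentage of each possible k-residue peptide in `sequence`."""
    sequence = sequence.upper()
    total = len(sequence) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    # Start every possible k-mer at zero so absent ones still appear.
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows with non-canonical letters are skipped
            counts[kmer] += 1
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


example = "ACDAAC"  # toy sequence, not real data
aa_comp = kmer_composition(example, k=1)   # 20 features
dp_comp = kmer_composition(example, k=2)   # 400 features
tp_comp = kmer_composition(example, k=3)   # 8000 features
print(aa_comp["A"], dp_comp["AC"], len(tp_comp))
```

The feature count grows as 20^k, which is why the tripeptide composition is by far the largest of the three.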
Calculate Pseudo Amino Acid Composition (PAAComp):
pseudo_comp = protpy.pseudo_amino_acid_composition(protein_seq, lamda=30, weight=0.05)
#
Calculate Amphiphilic Amino Acid Composition (AAAComp):
amphiphilic_comp = protpy.amphiphilic_amino_acid_composition(protein_seq, lamda=30, weight=0.5)
#
Autocorrelation Descriptors
Calculate Moreau-Broto Autocorrelation (MBAuto):
moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq, lag=30, normalize=True)
#
Calculate Moran Autocorrelation (MAuto):
moran_autocorrelation = protpy.moran_autocorrelation(protein_seq, lag=30, normalize=True)
#
Calculate Geary Autocorrelation (GAuto):
geary_autocorrelation = protpy.geary_autocorrelation(protein_seq, lag=30, normalize=True)
#
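The three autocorrelation descriptors all compare a physicochemical property at positions `i` and `i + lag` along the sequence. The sketch below shows the commonly used textbook definitions for a single lag. It makes several assumptions: the Kyte-Doolittle hydropathy index is used only as an example property scale, the property values are not standardized beforehand (libraries often do this), and the function names and the `average` argument are hypothetical and do not necessarily correspond to protpy's `normalize` parameter.

```python
# Kyte-Doolittle hydropathy index, used here only as an example property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _values(sequence, scale):
    """Map each residue to its property value."""
    return [scale[aa] for aa in sequence.upper()]


def moreau_broto_sketch(sequence, lag, scale=HYDROPATHY, average=True):
    """Sum of P(i) * P(i + lag), optionally divided by the number of pairs."""
    p = _values(sequence, scale)
    pairs = len(p) - lag
    total = sum(p[i] * p[i + lag] for i in range(pairs))
    return total / pairs if average else total


def moran_sketch(sequence, lag, scale=HYDROPATHY):
    """Lagged covariance of the property divided by its variance."""
    p = _values(sequence, scale)
    n, pairs = len(p), len(p) - lag
    mean = sum(p) / n
    covariance = sum((p[i] - mean) * (p[i + lag] - mean) for i in range(pairs)) / pairs
    variance = sum((v - mean) ** 2 for v in p) / n
    return covariance / variance


def geary_sketch(sequence, lag, scale=HYDROPATHY):
    """Mean squared lagged difference divided by the sample variance."""
    p = _values(sequence, scale)
    n, pairs = len(p), len(p) - lag
    mean = sum(p) / n
    squared_diff = sum((p[i] - p[i + lag]) ** 2 for i in range(pairs)) / (2 * pairs)
    variance = sum((v - mean) ** 2 for v in p) / (n - 1)
    return squared_diff / variance


toy = "AIA"  # toy sequence, not real data
print(moreau_broto_sketch(toy, lag=1), moran_sketch(toy, lag=2), geary_sketch(toy, lag=2))
```

Each function requires `0 < lag < len(sequence)`, so the largest usable lag is bounded by the shortest sequence in a dataset.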
Conjoint Triad Descriptors
Calculate Conjoint Triad (CTriad):
conj_triad = protpy.conjoint_triad(protein_seq)
#
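The conjoint triad descriptor of Shen et al. [9] first maps each residue to one of seven classes and then counts every run of three consecutive classes, giving 7 x 7 x 7 = 343 features. The following sketch shows the counting step only; the function name is hypothetical, and any normalization of the counts (which protpy may or may not apply) is left out.

```python
from itertools import product

# Seven residue classes used for the conjoint triad in Shen et al. [9].
CLASSES = {
    "1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
    "5": "RK", "6": "DE", "7": "C",
}
RESIDUE_TO_CLASS = {aa: label for label, group in CLASSES.items() for aa in group}


def conjoint_triad_counts(sequence):
    """Count each of the 343 class triads over all overlapping windows."""
    encoded = "".join(RESIDUE_TO_CLASS[aa] for aa in sequence.upper())
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    return counts


triads = conjoint_triad_counts("AGVC")  # toy sequence, encodes to "1117"
print(triads["111"], triads["117"], len(triads))
```

Because residues in the same class are treated as identical, sequences that differ only by conservative substitutions produce the same feature vector.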
CTD
Calculate Composition from CTD (CTD):
ctd_composition = protpy.ctd_composition(protein_seq)
#
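CTD descriptors group the 20 amino acids into three classes for a given property and then describe how those classes are composed, how often the class changes between neighbours (transition) and where each class sits along the sequence (distribution). The sketch below covers composition and transition for one property, using the hydrophobicity grouping commonly associated with Dubchak et al. [8]. The function names are hypothetical, and which properties protpy uses by default is not stated here.

```python
# Three-class hydrophobicity grouping commonly used for CTD descriptors.
HYDROPHOBICITY_CLASSES = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def ctd_composition_sketch(sequence, classes=HYDROPHOBICITY_CLASSES):
    """Fraction of residues that fall into each class."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    return {
        name: sum(sequence.count(aa) for aa in members) / len(sequence)
        for name, members in classes.items()
    }


def ctd_transition_sketch(sequence, classes=HYDROPHOBICITY_CLASSES):
    """Fraction of adjacent residue pairs that switch between two classes."""
    lookup = {aa: name for name, members in classes.items() for aa in members}
    labels = [lookup[aa] for aa in sequence.upper()]
    pairs = list(zip(labels, labels[1:]))
    names = list(classes)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Count switches in either direction between the two classes.
            changes = sum(1 for a, b in pairs if {a, b} == {first, second})
            result[f"{first}-{second}"] = changes / len(pairs)
    return result


toy = "RKGC"  # toy sequence: polar, polar, neutral, hydrophobic
composition = ctd_composition_sketch(toy)
transition = ctd_transition_sketch(toy)
print(composition, transition)
```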
Sequence Order Descriptors
Calculate Sequence Order Coupling Number (SOCN):
socn = protpy.sequence_order_coupling_number(protein_seq, lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json")
#
Calculate Quasi Sequence Order (QSO):
qso = protpy.quasi_sequence_order(protein_seq, lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json")
#
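Both sequence order descriptors are built from a pairwise distance between amino acids. Following Chou [10], the coupling number for a lag `d` is the sum of squared distances between residues `d` positions apart, and the quasi-sequence-order features rescale residue frequencies and coupling numbers so that together they sum to 1. The sketch below is an illustration under assumptions: the three-residue `TOY_DISTANCE` table is hypothetical and only stands in for a real matrix such as Schneider-Wrede, and the function names and the default `weight=0.1` are not taken from protpy.

```python
# Hypothetical toy distances for three residues; a real matrix covers all 20.
TOY_DISTANCE = {
    "A": {"A": 0.0, "C": 0.5, "D": 1.0},
    "C": {"A": 0.5, "C": 0.0, "D": 0.5},
    "D": {"A": 1.0, "C": 0.5, "D": 0.0},
}


def coupling_number_sketch(sequence, lag, distance):
    """Sum of squared distances between residues `lag` positions apart."""
    return sum(
        distance[sequence[i]][sequence[i + lag]] ** 2
        for i in range(len(sequence) - lag)
    )


def quasi_sequence_order_sketch(sequence, max_lag, distance, weight=0.1):
    """Residue frequencies and weighted coupling numbers, rescaled to sum to 1."""
    taus = [coupling_number_sketch(sequence, d, distance) for d in range(1, max_lag + 1)]
    freqs = {aa: sequence.count(aa) / len(sequence) for aa in sorted(distance)}
    denominator = sum(freqs.values()) + weight * sum(taus)
    features = {aa: f / denominator for aa, f in freqs.items()}
    for d, tau in enumerate(taus, start=1):
        features[f"lag{d}"] = weight * tau / denominator
    return features


toy = "ACDA"  # toy sequence, not real data
tau_1 = coupling_number_sketch(toy, 1, TOY_DISTANCE)
tau_2 = coupling_number_sketch(toy, 2, TOY_DISTANCE)
qso_features = quasi_sequence_order_sketch(toy, 2, TOY_DISTANCE)
print(tau_1, tau_2, qso_features)
```

The `weight` argument controls how strongly the sequence-order terms count relative to the plain composition terms.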
Directories
- /tests - unit and integration tests for the protpy package.
- /protpy - source code and all required external data files for the package.
- /docs - protpy documentation.
Tests
To run all tests, from the main protpy folder run:
python3 -m unittest discover tests
Contact
If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.
References
[1]: Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128(104016), 104016. https://doi.org/10.1016/j.jbi.2022.104016
[2]: Shuichi Kawashima, Minoru Kanehisa, AAindex: Amino Acid index database, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Page 374, https://doi.org/10.1093/nar/28.1.374
[3]: Dong, J., Yao, ZJ., Zhang, L. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 10, 16 (2018). https://doi.org/10.1186/s13321-018-0270-2
[4]: Reczko, M. and Bohr, H. (1994) The DEF data base of sequence based protein fold class predictions. Nucleic Acids Res, 22, 3616-3619.
[5]: Hua, S. and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
[6]: Broto P, Moreau G, Vandicke C: Molecular structures: perception, autocorrelation descriptor and SAR studies. Eur J Med Chem 1984, 19: 71–78.
[7]: Ong, S.A., Lin, H.H., Chen, Y.Z. et al. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics 8, 300 (2007). https://doi.org/10.1186/1471-2105-8-300
[8]: Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 1995, 92, 8700-8704.
[9]: Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, Huanliang Jiang. Predicting protein-protein interactions based only on sequences information. PNAS. 2007 (104) 4337-4341.
[10]: Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications 2000, 278, 477-483.
[11]: Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
[12]: Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 2005, 21, 10-19.