Python package for generating various biochemical, physicochemical and structural descriptors/features of protein sequences.
protpy - Package for generating protein physicochemical, biochemical and structural descriptors using their constituent amino acids.
- 🧬 A demo of the software is available here
- 📝 A Medium article about protpy and its background is available here
Table of Contents
- Introduction
- Requirements
- Installation
- Usage
- Documentation
- Directories
- Tests
- Issues
- Contact
- References
Introduction
protpy is a Python software package for generating a variety of physicochemical, biochemical and structural descriptors for proteins. All of these descriptors are calculated from sequence-derived or physicochemical features of the amino acids that make up the proteins. These descriptors have been widely studied and used in a range of bioinformatics applications, including protein engineering, SAR (sequence-activity relationships), prediction of protein structure and function, subcellular localization, protein-protein interactions and drug-target interactions.
This software is aimed at any researcher or developer working with protein sequence or structural data. It was mainly created for use in my own project, pySAR, which uses protein sequence data to identify Sequence Activity Relationships (SAR) using machine learning [1]. protpy is built and developed in Python 3.10.
The descriptors available in protpy include:
Composition Descriptors (23)
- Amino Acid Composition (AAComp)
- Dipeptide Composition (DPComp)
- Tripeptide Composition (TPComp)
- Grand Average of Hydropathy (GRAVY)
- Aromaticity
- Instability Index
- Isoelectric Point
- Molecular Weight
- Charge Distribution
- Hydrophobic/Polar/Charged Composition (HPC)
- Secondary Structure Propensity (SSP)
- k-mer Composition
- Reduced Alphabet Composition
- Motif Composition
- Amino Acid Pair Composition
- Aliphatic Index
- Extinction Coefficient
- Boman Index
- Aggregation Propensity
- Hydrophobic Moment
- Shannon Entropy
- Pseudo Amino Acid Composition (PAAComp)
- Amphiphilic Amino Acid Composition (APAAComp)
Autocorrelation Descriptors (3)
- Moreau-Broto Autocorrelation (MBAuto)
- Moran Autocorrelation (MAuto)
- Geary Autocorrelation (GAuto)
Conjoint Triad (1)
- Conjoint Triad (CTriad)
CTD Descriptors (4)
- CTD Composition
- CTD Transition
- CTD Distribution
- CTD Combined
Sequence Order Descriptors (5)
- Sequence Order Coupling Number — single (SOCN)
- Sequence Order Coupling Number — series
- Sequence Order Coupling Number — all matrices
- Quasi Sequence Order (QSO)
- Quasi Sequence Order — all matrices
More detail on each descriptor is available in the markdown file: DESCRIPTORS.md
Requirements
- Python >= 3.9
- aaindex >= 1.2.0
- numpy >= 2.4.4
- pandas >= 3.0.2
- varname >= 0.15.1
- biopython >= 1.87 (only required for testing)
Installation
Install the latest version of protpy using pip:
pip3 install protpy --upgrade
Install by cloning the repository:
git clone https://github.com/amckenna41/protpy.git
cd protpy
python3 setup.py install
Usage
Import protpy after installation:
import protpy
Import a protein sequence from a FASTA file:
from Bio import SeqIO
with open("test_fasta.fasta") as pro:
    # take the first record in the file and convert its sequence to a plain string
    protein_seq = str(next(SeqIO.parse(pro, 'fasta')).seq)
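Biopython is listed above as a testing-only requirement, so a FASTA file can also be read with the standard library. The following is a minimal sketch, not part of protpy; the helper name `read_fasta` is hypothetical, and it assumes a plain FASTA layout with `>` header lines followed by sequence lines.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# In-memory example record; a real file would be opened with open(...).
example = StringIO(">demo_protein\nMKTAYIAK\nQRQISFVK\n")
header, protein_seq = next(read_fasta(example))
print(header, protein_seq)  # demo_protein MKTAYIAKQRQISFVK
```

The resulting `protein_seq` string can be passed to the protpy functions in the same way as the Biopython-derived one.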
Composition Descriptors Usage Examples
Calculate Amino Acid Composition:
amino_acid_composition = protpy.amino_acid_composition(protein_seq)
# A C D E F ...
# 6.693 3.108 5.817 3.347 6.614 ...
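To make the output above concrete, amino acid composition can be sketched in plain Python as the percentage of each of the 20 canonical residues. This is an illustrative re-implementation, not protpy's code; the function name `aa_composition_sketch` and the rounding to three decimals are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition_sketch(sequence: str) -> dict:
    """Percentage of each of the 20 canonical amino acids in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    # residues absent from the sequence get 0.0 (Counter returns 0 for them)
    return {aa: round(100 * counts[aa] / length, 3) for aa in AMINO_ACIDS}


comp = aa_composition_sketch("ACDAAC")
print(comp["A"], comp["C"], comp["D"])  # 50.0 33.333 16.667
```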
Calculate Dipeptide Composition:
dipeptide_composition = protpy.dipeptide_composition(protein_seq)
# AA AC AD AE AF ...
# 0.72 0.16 0.48 0.4 0.24 ...
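The same idea extends to dipeptides: slide a window of width 2 along the sequence and report each of the 400 possible pairs as a percentage of all windows. The sketch below is a plain-Python illustration under that assumption; `dipeptide_composition_sketch` is a hypothetical name and protpy's exact normalisation and rounding may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition_sketch(sequence: str) -> dict:
    """Percentage of each of the 400 dipeptides over all overlapping windows."""
    sequence = sequence.upper()
    total = len(sequence) - 1  # number of overlapping windows of width 2
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair in counts:  # windows with non-canonical residues are ignored
            counts[pair] += 1
    return {pair: round(100 * n / total, 2) for pair, n in counts.items()}


dp = dipeptide_composition_sketch("AAAC")
print(dp["AA"], dp["AC"])  # 66.67 33.33
```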
Calculate Tripeptide Composition:
tripeptide_composition = protpy.tripeptide_composition(protein_seq)
# AAA AAC AAD AAE AAF ...
# 1 0 0 2 0 ...
Calculate GRAVY (Grand Average of Hydropathy):
gravy = protpy.gravy(protein_seq)
# GRAVY
# -0.045
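GRAVY is conventionally the mean Kyte-Doolittle hydropathy value over all residues. A minimal sketch of that definition follows; it is not protpy's implementation, and the name `gravy_sketch` and the choice to skip non-canonical residues are assumptions.

```python
# Kyte-Doolittle hydropathy scale
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_sketch(sequence: str) -> float:
    """Mean hydropathy of the canonical residues in a sequence."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no canonical amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


print(round(gravy_sketch("AI"), 3))  # 3.15
```

Positive values indicate a more hydrophobic sequence on average, negative values a more hydrophilic one.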
Calculate Aromaticity:
aromaticity = protpy.aromaticity(protein_seq)
# Aromaticity
# 0.118
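Aromaticity is commonly defined as the relative frequency of the aromatic residues Phe, Trp and Tyr. The short sketch below assumes that definition; `aromaticity_sketch` is a hypothetical name and not a protpy function.

```python
def aromaticity_sketch(sequence: str) -> float:
    """Fraction of residues that are Phe (F), Trp (W) or Tyr (Y)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    aromatic = sum(sequence.count(aa) for aa in "FWY")
    return aromatic / len(sequence)


print(aromaticity_sketch("FWYA"))  # 0.75
```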
Calculate Instability Index:
instability = protpy.instability_index(protein_seq)
# InstabilityIndex
# 31.836
Calculate Isoelectric Point:
pi = protpy.isoelectric_point(protein_seq)
# IsoelectricPoint
# 5.412
Calculate Molecular Weight:
mw = protpy.molecular_weight(protein_seq)
# MolecularWeight (Da)
# 139122.355
Calculate Charge Distribution:
charge = protpy.charge_distribution(protein_seq)
#using default parameters: ph=7.4
# PositiveCharge NegativeCharge NetCharge
# 99.526 114.956 -15.43
Calculate Hydrophobic/Polar/Charged Composition:
hpc = protpy.hydrophobic_polar_charged_composition(protein_seq)
# Hydrophobic Polar Charged
# 44.542 32.669 18.247
Calculate Secondary Structure Propensity:
ssp = protpy.secondary_structure_propensity(protein_seq)
# Helix Sheet Coil
# 0.983 1.05 1.043
Calculate k-mer Composition:
kmer = protpy.kmer_composition(protein_seq)
#using default parameters: k=2
# AA AC AD ...
# 0.797 0.159 ... ...
Calculate Reduced Alphabet Composition:
reduced = protpy.reduced_alphabet_composition(protein_seq)
#using default parameters: alphabet_size=6
# Group_1 Group_2 Group_3 Group_4 Group_5 Group_6
# 25.339 34.741 9.163 9.084 10.837 10.837
Calculate Motif Composition:
motif = protpy.motif_composition(protein_seq)
# NxST_glycosylation RGD_integrin KDEL_retention ...
# 23 0 0 ...
Calculate Amino Acid Pair Composition:
aapair = protpy.amino_acid_pair_composition(protein_seq)
# AA_Hydrophobic-Hydrophobic AA_Hydrophobic-Polar ...
# 0.797 0.159 ...
Calculate Aliphatic Index:
aliphatic = protpy.aliphatic_index(protein_seq)
# AliphaticIndex
# 82.725
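The aliphatic index is usually computed from the mole percentages of Ala, Val, Ile and Leu as X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)). The sketch below assumes that standard formula; the function name is hypothetical and protpy's rounding may differ.

```python
def aliphatic_index_sketch(sequence: str, a: float = 2.9, b: float = 3.9) -> float:
    """Aliphatic index from the mole percentages of A, V, I and L."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def mole_percent(aa: str) -> float:
        return 100 * seq.count(aa) / len(seq)

    # a and b weight the larger side chains of Val and Ile/Leu relative to Ala
    return (mole_percent("A")
            + a * mole_percent("V")
            + b * (mole_percent("I") + mole_percent("L")))


print(round(aliphatic_index_sketch("AVIL"), 3))  # 292.5
```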
Calculate Extinction Coefficient:
extinction = protpy.extinction_coefficient(protein_seq)
# ExtCoeff_Reduced ExtCoeff_Oxidized
# 140960 143335
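The two values above are consistent with the common convention of estimating the molar extinction coefficient at 280 nm from the Trp, Tyr and cystine content (5500, 1490 and 125 M^-1 cm^-1 respectively). The following sketch assumes that convention; it is not taken from protpy's source, and `extinction_coefficient_sketch` is a hypothetical name.

```python
def extinction_coefficient_sketch(sequence: str) -> dict:
    """Estimate extinction coefficients at 280 nm from W, Y and C counts."""
    seq = sequence.upper()
    # reduced form: all cysteines are free, so only Trp and Tyr contribute
    reduced = 5500 * seq.count("W") + 1490 * seq.count("Y")
    # oxidized form: assume every pair of cysteines forms one cystine
    oxidized = reduced + 125 * (seq.count("C") // 2)
    return {"reduced": reduced, "oxidized": oxidized}


print(extinction_coefficient_sketch("WWYCC"))
# {'reduced': 12490, 'oxidized': 12615}
```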
Calculate Boman Index:
boman = protpy.boman_index(protein_seq)
# BomanIndex
# 0.119
Calculate Aggregation Propensity:
aggregation = protpy.aggregation_propensity(protein_seq)
# AggregProneRegions AggregProneFraction
# 58 11.793
Calculate Hydrophobic Moment:
hm = protpy.hydrophobic_moment(protein_seq)
#using default parameters: window=11, angle=100
# HydrophobicMoment_Mean HydrophobicMoment_Max
# 0.272 0.813
Calculate Shannon Entropy:
se = protpy.shannon_entropy(protein_seq)
# ShannonEntropy
# 4.163
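Shannon entropy measures how evenly the residues are distributed. Assuming the usual definition in bits (base-2 logarithm), a sequence using all 20 amino acids equally would reach the maximum of log2(20), roughly 4.32. A minimal sketch with a hypothetical function name:

```python
from collections import Counter
from math import log2


def shannon_entropy_sketch(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue distribution of a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    # only residues that occur contribute, so log2(0) is never evaluated
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


print(shannon_entropy_sketch("ACDE"))  # 2.0
```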
Calculate Pseudo Amino Acid Composition:
pseudo_composition = protpy.pseudo_amino_acid_composition(protein_seq)
#using default parameters: lamda=30, weight=0.05, properties=[]
# PAAC_1 PAAC_2 PAAC_3 PAAC_4 PAAC_5 ...
# 0.127 0.059 0.111 0.064 0.126 ...
Calculate Amphiphilic Pseudo Amino Acid Composition:
amphiphilic_composition = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq)
#using default parameters: lamda=30, weight=0.5, properties=[hydrophobicity_, hydrophilicity_]
# APAAC_1 APAAC_2 APAAC_3 APAAC_4 APAAC_5 ...
# 6.624 3.076 5.757 3.032 5.988 ...
Autocorrelation Descriptors Usage Examples
Calculate Moreau-Broto Autocorrelation:
moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
# MBAuto_CIDH920105_1 MBAuto_CIDH920105_2 MBAuto_CIDH920105_3 MBAuto_CIDH920105_4 MBAuto_CIDH920105_5 ...
# -0.052 -0.104 -0.156 -0.208 0.246 ...
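For intuition, Moreau-Broto autocorrelation at lag d sums the products of a property value at positions i and i+d, and the normalized form divides by the number of pairs (N - d). The sketch below illustrates only that core sum. It is not protpy's implementation: the function name is hypothetical, the property table is a toy stand-in rather than real AAindex values, and the standardization of property values that implementations typically apply first is omitted.

```python
def moreau_broto_sketch(sequence: str, prop: dict, max_lag: int,
                        normalize: bool = True) -> dict:
    """Moreau-Broto autocorrelation for lags 1..max_lag of one property."""
    values = [prop[aa] for aa in sequence.upper()]  # KeyError on unknown residues
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    result = {}
    for lag in range(1, max_lag + 1):
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        result[lag] = total / (n - lag) if normalize else total
    return result


# Toy property table for illustration only (not AAindex data).
toy_property = {"A": 1.0, "C": -1.0}
auto = moreau_broto_sketch("ACAC", toy_property, max_lag=2)
print(auto)  # {1: -1.0, 2: 1.0}
```

In the alternating toy sequence, neighbouring residues have opposite values (lag 1 is negative) while residues two apart are identical (lag 2 is positive).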
Calculate Moran Autocorrelation:
moran_autocorrelation = protpy.moran_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
# MAuto_CIDH920105_1 MAuto_CIDH920105_2 MAuto_CIDH920105_3 MAuto_CIDH920105_4 MAuto_CIDH920105_5 ...
# -0.07786 -0.07879 -0.07906 -0.08001 0.14911 ...
Calculate Geary Autocorrelation:
geary_autocorrelation = protpy.geary_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True
# GAuto_CIDH920105_1 GAuto_CIDH920105_2 GAuto_CIDH920105_3 GAuto_CIDH920105_4 GAuto_CIDH920105_5 ...
# 1.057 1.077 1.04 1.02 1.013 ...
Conjoint Triad Descriptors Usage Examples
Calculate Conjoint Triad:
conjoint_triad = protpy.conjoint_triad(protein_seq)
# 111 112 113 114 115 ...
# 7 17 11 3 6 ...
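The column labels above are triads of class digits. In the conjoint triad method of Shen et al. [9], the 20 amino acids are grouped into 7 classes and every window of three consecutive residues is counted by its class triad, giving 7^3 = 343 features. The sketch below illustrates this with raw counts; the function name is hypothetical, and skipping windows that contain non-canonical residues is an assumption.

```python
from itertools import product

# Seven residue classes used by the conjoint triad method
CLASSES = {"AGV": "1", "ILFP": "2", "YMTS": "3", "HNQW": "4",
           "RK": "5", "DE": "6", "C": "7"}
AA_TO_CLASS = {aa: label for group, label in CLASSES.items() for aa in group}


def conjoint_triad_sketch(sequence: str) -> dict:
    """Count each of the 343 class triads over overlapping windows of 3."""
    seq = sequence.upper()
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(seq) - 2):
        window = seq[i:i + 3]
        if all(aa in AA_TO_CLASS for aa in window):
            counts["".join(AA_TO_CLASS[aa] for aa in window)] += 1
    return counts


triads = conjoint_triad_sketch("AGVIL")
print(triads["111"], triads["112"], triads["122"])  # 1 1 1
```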
CTD Descriptors Usage Examples
Calculate CTD:
ctd = protpy.ctd(protein_seq)
#using default parameters: property="hydrophobicity", all_ctd=True
# hydrophobicity_CTD_C_01 hydrophobicity_CTD_C_02 hydrophobicity_CTD_C_03 normalized_vdwv_CTD_C_01 ...
# 0.279 0.386 0.335 0.389 ...
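As a rough illustration of the C and T parts of CTD, each residue is first mapped to one of three classes for a property; composition is the fraction of residues in each class, and transition is the fraction of adjacent pairs that switch between two given classes. The sketch below uses the hydrophobicity grouping commonly used with this method (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW). Function and key names are hypothetical, protpy's column naming differs, and the distribution part is omitted.

```python
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def ctd_sketch(sequence: str):
    """Return (composition, transition) dicts for the hydrophobicity grouping."""
    # non-canonical residues are dropped before encoding (an assumption)
    encoded = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two canonical residues")
    composition = {f"C_{g}": encoded.count(g) / n for g in "123"}
    pairs = list(zip(encoded, encoded[1:]))
    transition = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        # a transition counts in either direction, e.g. 1->2 or 2->1
        switches = sum(1 for x, y in pairs if {x, y} == {a, b})
        transition[f"T_{a}{b}"] = switches / len(pairs)
    return composition, transition


composition, transition = ctd_sketch("RGCR")
print(composition)  # {'C_1': 0.5, 'C_2': 0.25, 'C_3': 0.25}
```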
Sequence Order Descriptors Usage Examples
Calculate Sequence Order Coupling Number (SOCN):
socn = protpy.sequence_order_coupling_number_(protein_seq)
#using default parameters: d=1, distance_matrix="schneider-wrede"
#401.387
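The sequence order coupling number at lag d, as introduced by Chou [10], is the sum of squared distances between residues d positions apart, with distances taken from a physicochemical distance matrix. The sketch below shows only that sum. The function name is hypothetical and the distance table is a toy stand-in, not Schneider-Wrede or Grantham values.

```python
def socn_sketch(sequence: str, distance: dict, lag: int) -> float:
    """Sum of squared residue-residue distances for pairs `lag` positions apart."""
    seq = sequence.upper()
    if not 1 <= lag < len(seq):
        raise ValueError("lag must be between 1 and len(sequence) - 1")
    return sum(distance[(seq[i], seq[i + lag])] ** 2
               for i in range(len(seq) - lag))


# Toy symmetric distance table for two residues (illustration only).
toy_distance = {("A", "A"): 0.0, ("C", "C"): 0.0,
                ("A", "C"): 0.5, ("C", "A"): 0.5}
print(socn_sketch("ACA", toy_distance, lag=1))  # 0.5
print(socn_sketch("ACA", toy_distance, lag=2))  # 0.0
```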
Calculate all SOCNs per distance matrix:
#using default parameters: lag=30, distance_matrix="schneider-wrede"
socn_all = protpy.sequence_order_coupling_number(protein_seq)
# SOCN_SW1 SOCN_SW2 SOCN_SW3 SOCN_SW4 SOCN_SW5 ...
# 401.387 409.243 376.946 393.042 396.196 ...
#using custom parameters: lag=10, distance_matrix="grantham"
socn_all = protpy.sequence_order_coupling_number(protein_seq, lag=10, distance_matrix="grantham")
# SOCN_Grant1 SOCN_Grant2 SOCN_Grant3 SOCN_Grant4 SOCN_Grant5 ...
# 399.125 402.153 387.820 393.111 409.096 ...
Calculate Quasi Sequence Order (QSO):
#using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede"
qso = protpy.quasi_sequence_order(protein_seq)
# QSO_SW1 QSO_SW2 QSO_SW3 QSO_SW4 QSO_SW5 ...
# 0.005692 0.002643 0.004947 0.002846 0.005625 ...
#using custom parameters: lag=10, weight=0.2, distance_matrix="grantham"
qso = protpy.quasi_sequence_order(protein_seq, lag=10, weight=0.2, distance_matrix="grantham")
# QSO_Grant1 QSO_Grant2 QSO_Grant3 QSO_Grant4 QSO_Grant5 ...
# 0.123287 0.079967 0.04332 0.039983 0.013332 ...
Documentation
The documentation for protpy is hosted on ReadTheDocs and is available here.
Directories
- /tests - unit and integration tests for the protpy package.
- /protpy - source code and all required external data files for the package.
- /docs - protpy documentation.
- /examples - example notebook for protpy.
Tests
To run all tests, from the main protpy folder run:
python3 -m unittest discover tests -v
-v: verbose output flag
Contact
If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.
References
[1]: McKenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128, 104016. https://doi.org/10.1016/j.jbi.2022.104016
[2]: Shuichi Kawashima, Minoru Kanehisa, AAindex: Amino Acid index database, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Page 374, https://doi.org/10.1093/nar/28.1.374
[3]: Dong, J., Yao, ZJ., Zhang, L. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 10, 16 (2018). https://doi.org/10.1186/s13321-018-0270-2
[4]: Reczko, M. and Bohr, H. (1994) The DEF data base of sequence based protein
fold class predictions. Nucleic Acids Res, 22, 3616-3619.
[5]: Hua, S. and Sun, Z. (2001) Support vector machine approach for protein
subcellular localization prediction. Bioinformatics, 17, 721-728.
[6]: Broto P, Moreau G, Vandycke C: Molecular structures: perception,
autocorrelation descriptor and SAR studies. Eur J Med Chem 1984, 19: 71–78.
[7]: Ong, S.A., Lin, H.H., Chen, Y.Z. et al. Efficacy of different protein
descriptors in predicting protein functional families. BMC Bioinformatics
8, 300 (2007). https://doi.org/10.1186/1471-2105-8-300
[8]: Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction
of protein folding class using global description of amino acid sequence.
Proc. Natl. Acad. Sci. USA, 1995, 92, 8700-8704.
[9]: Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen,
Yixue Li, Hualiang Jiang. Predicting protein-protein interactions based only
on sequences information. PNAS. 2007 (104) 4337-4341.
[10]: Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating
Quasi-Sequence-Order Effect. Biochemical and Biophysical Research
Communications 2000, 278, 477-483.
[11]: Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using
Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and
Genetics, 2001, 43: 246-255.
[12]: Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme
subfamily classes. Bioinformatics, 2005,21,10-19.