
Python package for generating various biochemical, physicochemical and structural descriptors/features of protein sequences.

Project description

protpy - A package for generating physicochemical, biochemical and structural descriptors of proteins from their constituent amino acids.

License: MIT

Introduction

protpy is a Python software package for generating a variety of physicochemical, biochemical and structural descriptors for proteins. All of these descriptors are calculated from sequence-derived or physicochemical features of the amino acids that make up the proteins. These descriptors have been widely studied and used in a range of bioinformatics applications, including protein engineering, sequence-activity relationships (SAR), prediction of protein structure and function, subcellular localization, protein-protein interactions and drug-target interactions. The descriptors available in protpy include:

  • Moreau-Broto Autocorrelation (MBAuto)
  • Moran Autocorrelation (MAuto)
  • Geary Autocorrelation (GAuto)
  • Amino Acid Composition (AAComp)
  • Dipeptide Composition (DPComp)
  • Tripeptide Composition (TPComp)
  • Pseudo Amino Acid Composition (PAAComp)
  • Amphiphilic Amino Acid Composition (AAAComp)
  • Conjoint Triad (CTriad)
  • CTD (Composition, Transition, Distribution) (CTD)
  • Sequence Order Coupling Number (SOCN)
  • Quasi Sequence Order (QSO)

This software is aimed at any researcher using protein sequence or structural data. It was created mainly for use in my own project, pySAR, which uses protein sequence data to identify Sequence Activity Relationships (SAR) using Machine Learning [1]. protpy is written entirely in Python 3 and was developed using Python 3.10.

A demo of the software is available here.

Requirements

Installation

Install the latest version of protpy using pip:

pip3 install protpy --upgrade

Install by cloning the repository:

git clone https://github.com/amckenna41/protpy.git
cd protpy
python3 setup.py install

Usage

Import protpy after installation:

import protpy

Import protein sequence from fasta:

from Bio import SeqIO

with open("test_fasta.fasta") as pro:
    protein_seq = str(next(SeqIO.parse(pro,'fasta')).seq)
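
If Biopython is not installed, the first record of a FASTA file can also be read with the standard library alone. The following is a minimal sketch and not part of protpy: the helper name read_first_fasta_sequence is hypothetical, and it assumes a plain-text FASTA stream in which each record starts with a ">" header line.

```python
import io

def read_first_fasta_sequence(handle):
    """Return the first sequence in a FASTA stream as an uppercase string."""
    sequence_lines = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here, so stop reading
            seen_header = True
            continue
        if seen_header:
            sequence_lines.append(line)
    return "".join(sequence_lines).upper()

# Demo on an in-memory FASTA string (made-up records).
example = io.StringIO(">seq1 example\nMKTAY\nIAKQR\n>seq2\nGGGG\n")
protein_seq = read_first_fasta_sequence(example)
print(protein_seq)  # MKTAYIAKQR
```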

Composition Descriptors

Calculate Amino Acid Composition:

amino_acid_comp = protpy.amino_acid_composition(protein_seq)
# A      C      D      E      F ...
# 6.693  3.108  5.817  3.347  6.614 ...
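
The example values above are consistent with each amino acid being reported as a percentage of the sequence length. A minimal pure-Python sketch of that definition is shown here; it is an illustration under that assumption rather than protpy's own code, and the function name amino_acid_composition_sketch is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition_sketch(sequence):
    """Percentage of each of the 20 standard amino acids in `sequence`."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return {aa: round(100.0 * sequence.count(aa) / length, 3) for aa in AMINO_ACIDS}

composition = amino_acid_composition_sketch("ACDAAC")
print(composition["A"], composition["C"], composition["D"])  # 50.0 33.333 16.667
```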

Calculate Dipeptide Composition:

dipeptide_comp = protpy.dipeptide_composition(protein_seq)
# AA    AC    AD   AE    AF ...
# 0.72  0.16  0.48  0.4  0.24 ...
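
Dipeptide composition extends the same idea to the 400 ordered residue pairs. The sketch below counts overlapping two-residue windows and reports each pair as a percentage of the number of windows; the windowing and rounding choices are assumptions made for illustration, and dipeptide_composition_sketch is a hypothetical name rather than a protpy function.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dipeptide_composition_sketch(sequence):
    """Percentage of each of the 400 dipeptides over all overlapping windows."""
    sequence = sequence.upper()
    windows = len(sequence) - 1
    if windows < 1:
        raise ValueError("sequence needs at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(windows):
        pair = sequence[i:i + 2]
        if pair in counts:  # ignore windows containing non-standard residues
            counts[pair] += 1
    return {pair: round(100.0 * n / windows, 2) for pair, n in counts.items()}

dipeptides = dipeptide_composition_sketch("AAAC")
print(len(dipeptides), dipeptides["AA"], dipeptides["AC"])  # 400 66.67 33.33
```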

Calculate Tripeptide Composition:

tripeptide_comp = protpy.tripeptide_composition(protein_seq)
# AAA  AAC  AAD  AAE  AAF ...
# 1    0    0    2    0 ...
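
The full tripeptide descriptor has 20^3 = 8000 columns, most of which are zero for a typical protein. As a small illustration (not protpy's implementation, and with a hypothetical function name), the raw counts can be sketched with a sparse counter that stores only the tripeptides that actually occur.

```python
from collections import Counter

def tripeptide_counts_sketch(sequence):
    """Count every overlapping three-residue window that occurs in `sequence`."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))

counts = tripeptide_counts_sketch("AAAAC")
print(counts["AAA"], counts["AAC"], counts["CCC"])  # 2 1 0
```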

Calculate Pseudo Amino Acid Composition:

pseudo_comp = protpy.pseudo_amino_acid_composition(protein_seq) 
#using default parameters: lamda=30, weight=0.05, properties=[]

# PAAC_1  PAAC_2  PAAC_3  PAAC_4  PAAC_5 ...
# 0.127   0.059   0.111   0.064   0.126 ...

Calculate Amphiphilic Amino Acid Composition:

amphiphilic_comp = protpy.amphiphilic_amino_acid_composition(protein_seq)
#using default parameters: lamda=30, weight=0.5, properties=[hydrophobicity_, hydrophilicity_]

# APAAC_1  APAAC_2  APAAC_3  APAAC_4  APAAC_5 ...
# 6.06    2.814    5.267     3.03    5.988 ...

Autocorrelation Descriptors

Calculate MoreauBroto Autocorrelation:

moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True

# MBAuto_CIDH920105_1  MBAuto_CIDH920105_2  MBAuto_CIDH920105_3  MBAuto_CIDH920105_4  MBAuto_CIDH920105_5 ...  
# -0.052               -0.104               -0.156               -0.208               0.246 ...
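
For readers unfamiliar with the descriptor, the normalized Moreau-Broto autocorrelation at lag d is commonly defined as the average of P(i) * P(i+d) over the sequence, where P is a property value standardized across the 20 amino acids. The sketch below follows that textbook definition. It is not protpy's implementation: it uses the Kyte-Doolittle hydropathy scale as a stand-in for the AAindex properties listed above, and the function names are hypothetical.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values, used here only as a stand-in property.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def standardize(scale):
    """Shift and scale a property to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in scale.items()}

def moreaubroto_sketch(sequence, scale=KD, max_lag=3):
    """Normalized Moreau-Broto autocorrelation for lags 1..max_lag."""
    z = standardize(scale)
    values = [z[aa] for aa in sequence.upper()]  # KeyError on non-standard residues
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [sum(values[i] * values[i + d] for i in range(n - d)) / (n - d)
            for d in range(1, max_lag + 1)]

print([round(x, 3) for x in moreaubroto_sketch("MKTAYIAKQRQISFVK")])
```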

Calculate Moran Autocorrelation:

moran_autocorrelation = protpy.moran_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True

# MAuto_CIDH920105_1  MAuto_CIDH920105_2  MAuto_CIDH920105_3  MAuto_CIDH920105_4  MAuto_CIDH920105_5 ...
# -0.07786            -0.07879            -0.07906            -0.08001            0.14911 ...
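
Moran autocorrelation at lag d is usually defined as the lagged covariance of the property along the sequence divided by its variance along the sequence. A minimal sketch of that definition follows; it is illustrative only, the function name is hypothetical, and the two-entry scale holds Kyte-Doolittle hydropathy values for Ala and Lys purely as demo data.

```python
def moran_sketch(sequence, scale, max_lag=2):
    """Moran autocorrelation for lags 1..max_lag using the property values in `scale`."""
    values = [scale[aa] for aa in sequence.upper()]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    p_bar = sum(values) / n
    variance = sum((v - p_bar) ** 2 for v in values) / n
    if variance == 0:
        raise ValueError("property is constant along the sequence")
    result = []
    for d in range(1, max_lag + 1):
        # Average product of deviations from the sequence mean, d positions apart.
        covariance = sum((values[i] - p_bar) * (values[i + d] - p_bar)
                         for i in range(n - d)) / (n - d)
        result.append(covariance / variance)
    return result

toy_scale = {"A": 1.8, "K": -3.9}  # demo values only
moran = moran_sketch("AKAKAKAK", toy_scale)
print([round(x, 6) for x in moran])  # [-1.0, 1.0]
```

A strictly alternating sequence is perfectly anti-correlated at lag 1 and perfectly correlated at lag 2, which is what the demo shows.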

Calculate Geary Autocorrelation:

geary_autocorrelation = protpy.geary_autocorrelation(protein_seq)
#using default parameters: lag=30, properties=["CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"], normalize=True

# GAuto_CIDH920105_1  GAuto_CIDH920105_2  GAuto_CIDH920105_3  GAuto_CIDH920105_4  GAuto_CIDH920105_5 ...
# 1.057               1.077               1.04                1.02                1.013 ...
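
Geary autocorrelation replaces the product of deviations with the squared difference between residues d positions apart, scaled by the sample variance of the property along the sequence. The sketch below follows that common definition. It is not taken from protpy, the function name is hypothetical, and the two-entry scale is demo data (Kyte-Doolittle hydropathy for Ala and Lys).

```python
def geary_sketch(sequence, scale, max_lag=2):
    """Geary autocorrelation for lags 1..max_lag using the property values in `scale`."""
    values = [scale[aa] for aa in sequence.upper()]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    p_bar = sum(values) / n
    sample_variance = sum((v - p_bar) ** 2 for v in values) / (n - 1)
    if sample_variance == 0:
        raise ValueError("property is constant along the sequence")
    result = []
    for d in range(1, max_lag + 1):
        # Half the mean squared difference between residues d positions apart.
        half_msd = sum((values[i] - values[i + d]) ** 2
                       for i in range(n - d)) / (2 * (n - d))
        result.append(half_msd / sample_variance)
    return result

toy_scale = {"A": 1.8, "K": -3.9}  # demo values only
geary = geary_sketch("AKAKAKAK", toy_scale)
print([round(x, 6) for x in geary])  # [1.75, 0.0]
```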

Conjoint Triad Descriptors

Calculate Conjoint Triad:

conj_triad = protpy.conjoint_triad(protein_seq)
# 111  112  113  114  115 ...
# 7    17   11   3    6 ...
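
The column labels above (111, 112, ...) are triads of residue classes. In the conjoint triad method of Shen et al. [9], the 20 amino acids are grouped into seven classes and every overlapping window of three residues is counted by its class triple, giving 7^3 = 343 features. A minimal sketch of that counting step, using the class assignment from [9], is given below; it is an illustration with a hypothetical function name, not protpy's code.

```python
from itertools import product

# Seven residue classes from Shen et al. [9].
CLASSES = {"AGV": "1", "ILFP": "2", "YMTS": "3", "HNQW": "4",
           "RK": "5", "DE": "6", "C": "7"}
RESIDUE_TO_CLASS = {aa: label for group, label in CLASSES.items() for aa in group}

def conjoint_triad_sketch(sequence):
    """Count each of the 343 class triads over all overlapping windows."""
    encoded = "".join(RESIDUE_TO_CLASS[aa] for aa in sequence.upper())
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    return counts

triads = conjoint_triad_sketch("AGVIL")
print(len(triads), triads["111"], triads["112"], triads["122"])  # 343 1 1 1
```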

CTD Descriptors

Calculate CTD:

ctd = protpy.ctd(protein_seq)
#using default parameters: property="hydrophobicity", all_ctd=True

# hydrophobicity_CTD_C_01  hydrophobicity_CTD_C_02  hydrophobicity_CTD_C_03  normalized_vdwv_CTD_C_01 ...
# 0.279                    0.386                    0.335                    0.389 ...                   
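
As a rough illustration of the C and T parts of CTD [8], the sketch below encodes a sequence with the three-class hydrophobicity grouping commonly used for this descriptor (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW), then computes the fraction of residues in each class and the frequency of changes between each pair of classes. Whether protpy uses exactly this grouping is an assumption, the function name is hypothetical, and the distribution (D) part is omitted for brevity.

```python
# Assumed hydrophobicity classes: 1 = polar, 2 = neutral, 3 = hydrophobic.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
RESIDUE_TO_GROUP = {aa: label for label, group in GROUPS.items() for aa in group}

def ctd_composition_transition_sketch(sequence):
    """Return (composition, transition) dictionaries for the three classes."""
    encoded = "".join(RESIDUE_TO_GROUP[aa] for aa in sequence.upper())
    n = len(encoded)
    if n < 2:
        raise ValueError("sequence needs at least two residues")
    composition = {g: round(encoded.count(g) / n, 3) for g in "123"}
    pairs = [encoded[i:i + 2] for i in range(n - 1)]
    transition = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        # Count changes in either direction between classes a and b.
        changes = pairs.count(a + b) + pairs.count(b + a)
        transition[a + b] = round(changes / (n - 1), 3)
    return composition, transition

comp, trans = ctd_composition_transition_sketch("RKGALV")
print(comp)   # {'1': 0.333, '2': 0.333, '3': 0.333}
print(trans)  # {'12': 0.2, '13': 0.0, '23': 0.2}
```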

Sequence Order Descriptors

Calculate Sequence Order Coupling Number (SOCN):

socn = protpy.sequence_order_coupling_number_(protein_seq)
#using default parameters: d=1, distance_matrix="schneider-wrede-physiochemical-distance-matrix"

# 401.387

Calculate all SOCN values for a distance matrix:

socn_all = protpy.sequence_order_coupling_number(protein_seq)
#using default parameters: lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json"

# SOCN_SW_1  SOCN_SW_2  SOCN_SW_3  SOCN_SW_4  SOCN_SW_5 ...
# 401.387    409.243    376.946    393.042    396.196 ...        
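
The sequence order coupling number at lag d is usually defined as the sum of squared distances between all residue pairs that are d positions apart, with distances taken from a matrix such as Schneider-Wrede. The sketch below shows the computation with a toy distance, the absolute difference of two Kyte-Doolittle hydropathy values, standing in for the real matrix. The names socn_sketch and toy_distance are hypothetical and this is not protpy's implementation.

```python
TOY_SCALE = {"A": 1.8, "K": -3.9}  # demo values only

def toy_distance(a, b):
    """Stand-in for a lookup in a physicochemical distance matrix."""
    return abs(TOY_SCALE[a] - TOY_SCALE[b])

def socn_sketch(sequence, max_lag=3, distance=toy_distance):
    """Sequence order coupling numbers for lags 1..max_lag."""
    seq = sequence.upper()
    n = len(seq)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [sum(distance(seq[i], seq[i + d]) ** 2 for i in range(n - d))
            for d in range(1, max_lag + 1)]

taus = socn_sketch("AKAK")
print([round(t, 2) for t in taus])  # [97.47, 0.0, 32.49]
```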

Calculate Quasi Sequence Order (QSO):

qso = protpy.quasi_sequence_order(protein_seq)
#using default parameters: lag=30, weight=0.1, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json"

# QSO_SW1   QSO_SW2   QSO_SW3   QSO_SW4   QSO_SW5 ...
# 0.005692  0.002643  0.004947  0.002846  0.005625 ...        
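
Quasi sequence order, as defined by Chou [10], combines the residue frequencies with the coupling numbers: the first 20 values are f_r / (sum(f) + w * sum(tau)) and the remaining values are w * tau_d / (sum(f) + w * sum(tau)). The sketch below implements that formula with f_r taken as the residue fraction. The scaling used inside protpy may differ, the function name is hypothetical, and the coupling numbers passed in are made-up demo values.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def quasi_sequence_order_sketch(sequence, coupling_numbers, weight=0.1):
    """Return 20 frequency terms followed by one term per coupling number."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    fractions = [seq.count(aa) / n for aa in AMINO_ACIDS]
    denominator = sum(fractions) + weight * sum(coupling_numbers)
    frequency_terms = [f / denominator for f in fractions]
    coupling_terms = [weight * tau / denominator for tau in coupling_numbers]
    return frequency_terms + coupling_terms

demo_taus = [97.47, 0.0, 32.49]  # hypothetical coupling numbers for lags 1-3
qso = quasi_sequence_order_sketch("AKAK", demo_taus)
print(len(qso), round(sum(qso), 6))  # 23 1.0
```

Because both parts share the same denominator, the full vector always sums to 1, which is a convenient sanity check.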

Directories

  • /tests - unit and integration tests for protpy package.
  • /protpy - source code and all required external data files for package.
  • /docs - protpy documentation.

Tests

To run all tests, run the following from the root protpy folder:

python3 -m unittest discover tests

Contact

If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.

References

[1]: Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128(104016), 104016. https://doi.org/10.1016/j.jbi.2022.104016
[2]: Shuichi Kawashima, Minoru Kanehisa, AAindex: Amino Acid index database, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Page 374, https://doi.org/10.1093/nar/28.1.374
[3]: Dong, J., Yao, ZJ., Zhang, L. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 10, 16 (2018). https://doi.org/10.1186/s13321-018-0270-2
[4]: Reczko, M. and Bohr, H. (1994) The DEF data base of sequence based protein fold class predictions. Nucleic Acids Res, 22, 3616-3619.
[5]: Hua, S. and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
[6]: Broto P, Moreau G, Vandicke C: Molecular structures: perception, autocorrelation descriptor and SAR studies. Eur J Med Chem 1984, 19: 71–78.
[7]: Ong, S.A., Lin, H.H., Chen, Y.Z. et al. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics 8, 300 (2007). https://doi.org/10.1186/1471-2105-8-300
[8]: Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 1995, 92, 8700-8704.
[9]: Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, Huanliang Jiang. Predicting protein-protein interactions based only on sequences information. PNAS. 2007 (104) 4337-4341.
[10]: Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications 2000, 278, 477-483.
[11]: Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
[12]: Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 2005,21,10-19.
