Python package for generating various biochemical, physiochemical and structural descriptors/features of protein sequences.

Project description

protpy - a package for generating physiochemical, biochemical and structural descriptors of proteins from their constituent amino acids

Introduction

protpy is a Python software package for generating a variety of physiochemical, biochemical and structural descriptors for proteins. All of these descriptors are calculated from sequence-derived or physiochemical features of the amino acids that make up the proteins. These descriptors have been widely studied and used in bioinformatics applications including protein engineering, SAR (sequence-activity relationships), prediction of protein structure and function, subcellular localization, protein-protein interactions and drug-target interactions. The descriptors available in protpy include:

  • MoreauBroto Autocorrelation (MBAuto)
  • Moran Autocorrelation (MAuto)
  • Geary Autocorrelation (GAuto)
  • Amino Acid Composition (AAComp)
  • Dipeptide Composition (DPComp)
  • Tripeptide Composition (TPComp)
  • Pseudo Amino Acid Composition (PAAComp)
  • Amphiphilic Amino Acid Composition (AAAComp)
  • Conjoint Triad (CTriad)
  • CTD (Composition, Transition, Distribution) (CTD)
  • Sequence Order Coupling Number (SOCN)
  • Quasi Sequence Order (QSO)

This software is aimed at any researcher using protein sequence or structural data. It was created mainly for use in my own project, pySAR, which uses protein sequence data to identify Sequence Activity Relationships (SAR) using Machine Learning [1]. protpy is written solely in Python 3 and was developed specifically with Python 3.10.

Requirements

Installation

Install the latest version of protpy using pip:

pip3 install protpy --upgrade

Install by cloning the repository:

git clone https://github.com/amckenna41/protpy.git
cd protpy
python3 setup.py install

Usage

Import protpy after installation:

import protpy

Import protein sequence from fasta:

from Bio import SeqIO

with open("test_fasta.fasta") as pro:
    protein_seq = str(next(SeqIO.parse(pro,'fasta')).seq)
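
If Biopython is not installed, the first record of a FASTA file can also be read with the standard library alone. The following is a minimal sketch: `read_first_fasta_sequence` is a hypothetical helper (not part of protpy or Biopython), and it assumes a plain, uncompressed FASTA stream.

```python
import io


def read_first_fasta_sequence(handle):
    """Return the sequence of the first FASTA record in `handle` as a string."""
    sequence_parts = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seen_header:
                break  # a second header means the first record is complete
            seen_header = True
            continue
        if seen_header:
            sequence_parts.append(line)
    return "".join(sequence_parts).upper()


# In-memory example; the sequences are made up for illustration only.
example_fasta = io.StringIO(">seq1 example\nMKTAY\nIAKQR\n>seq2\nGGGG\n")
protein_seq_example = read_first_fasta_sequence(example_fasta)
```

With a real file, the same helper could be called as `read_first_fasta_sequence(open("test_fasta.fasta"))` inside a `with` block.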

Composition Descriptors

Calculate Amino Acid Composition (AAComp):

amino_acid_comp = protpy.amino_acid_composition(protein_seq)
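
To make clear what this descriptor measures, here is a small standalone sketch of amino acid composition as the percentage of each of the 20 standard amino acids in a sequence. It is an independent illustration, not protpy's implementation; the name `amino_acid_composition_sketch` is hypothetical, and the output format and rounding used by protpy may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition_sketch(sequence):
    """Return the percentage of each standard amino acid in `sequence`."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return {
        aa: 100.0 * sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS
    }


# Toy sequence chosen so the percentages are easy to check by hand.
composition = amino_acid_composition_sketch("ACAC")
```

The result always has 20 entries, so sequences of different lengths map to feature vectors of the same size.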

Calculate Dipeptide Composition (DPComp):

dipeptide_comp = protpy.dipeptide_composition(protein_seq)
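
Dipeptide composition extends the same idea to adjacent residue pairs, giving 400 features. The sketch below is a standalone illustration under that definition (percentage of each pair over the `len(sequence) - 1` overlapping windows); `dipeptide_composition_sketch` is a hypothetical name and protpy's exact output may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def dipeptide_composition_sketch(sequence):
    """Return the percentage of each of the 400 dipeptides in `sequence`."""
    sequence = sequence.upper()
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total_windows = len(sequence) - 1
    for i in range(total_windows):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return {pair: 100.0 * n / total_windows for pair, n in counts.items()}


# "ACAC" contains the windows AC, CA, AC.
dipeptides = dipeptide_composition_sketch("ACAC")
```

Tripeptide composition follows the same pattern with windows of three residues and 8000 features.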

Calculate Tripeptide Composition (TPComp):

tripeptide_comp = protpy.tripeptide_composition(protein_seq)

Calculate Pseudo Amino Acid Composition (PAAComp):

pseudo_comp = protpy.pseudo_amino_acid_composition(protein_seq, lamda=30, weight=0.05)

Calculate Amphiphilic Amino Acid Composition (AAAComp):

amphiphilic_comp = protpy.amphiphilic_amino_acid_composition(protein_seq, lamda=30, weight=0.5)

Autocorrelation Descriptors

Calculate MoreauBroto Autocorrelation (MBAuto):

moreaubroto_autocorrelation = protpy.moreaubroto_autocorrelation(protein_seq, lag=30, normalize=True)

Calculate Moran Autocorrelation (MAuto):

moran_autocorrelation = protpy.moran_autocorrelation(protein_seq, lag=30, normalize=True)

Calculate Geary Autocorrelation (GAuto):

geary_autocorrelation = protpy.geary_autocorrelation(protein_seq, lag=30, normalize=True)
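
The three autocorrelation descriptors all compare a per-residue property value at position i with the value at position i + lag. The sketch below uses the commonly cited definitions for a single lag and a single property; it is not protpy's code. The function names and the toy property scale are assumptions made for illustration, and protpy may standardize or round property values differently. Each function requires `lag < len(values)`.

```python
def moreau_broto_sketch(values, lag):
    """Normalized Moreau-Broto: mean of P_i * P_(i+lag) over all valid i."""
    n = len(values)
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def moran_sketch(values, lag):
    """Moran: lagged covariance around the mean, divided by the variance."""
    n = len(values)
    mean = sum(values) / n
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    variance = sum((v - mean) ** 2 for v in values) / n
    return covariance / variance


def geary_sketch(values, lag):
    """Geary: mean squared lagged difference, divided by the sample variance."""
    n = len(values)
    mean = sum(values) / n
    squared_diff = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return squared_diff / sample_variance


# Toy property scale (illustrative values only, not taken from AAindex).
toy_scale = {"A": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}
values = [toy_scale[aa] for aa in "ACDE"]

mb_value = moreau_broto_sketch(values, lag=1)
moran_value = moran_sketch(values, lag=1)
geary_value = geary_sketch(values, lag=1)
```

A full descriptor repeats this for every lag from 1 up to the chosen maximum and for each selected property, which is why the number of features grows with `lag`.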

Conjoint Triad Descriptors

Calculate Conjoint Triad (CTriad):

conj_triad = protpy.conjoint_triad(protein_seq)
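
The conjoint triad descriptor [9] groups the 20 amino acids into seven classes and counts every run of three consecutive classes, giving 7 x 7 x 7 = 343 features. The sketch below is a standalone illustration of that idea using the seven classes as commonly reported for this method; `conjoint_triad_sketch` is a hypothetical name, the sketch returns raw counts without any normalization, and protpy's output may differ.

```python
from itertools import product

# Seven amino acid classes commonly used for the conjoint triad method.
TRIAD_CLASSES = {
    "1": "AGV",
    "2": "ILFP",
    "3": "YMTS",
    "4": "HNQW",
    "5": "RK",
    "6": "DE",
    "7": "C",
}
AA_TO_CLASS = {aa: label for label, aas in TRIAD_CLASSES.items() for aa in aas}


def conjoint_triad_sketch(sequence):
    """Count each of the 343 class triads in `sequence` (raw counts)."""
    # Residues outside the 20 standard amino acids are dropped before encoding.
    encoded = [AA_TO_CLASS[aa] for aa in sequence.upper() if aa in AA_TO_CLASS]
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts["".join(encoded[i:i + 3])] += 1
    return counts


# "AGVC" encodes to classes 1, 1, 1, 7, so it contains triads 111 and 117.
triads = conjoint_triad_sketch("AGVC")
```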

CTD

Calculate Composition from CTD (CTD):

ctd_composition = protpy.ctd_composition(protein_seq)
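
In CTD, each amino acid is first assigned to one of three groups for a chosen property, and the composition part is the fraction of residues in each group. The sketch below illustrates this with a widely used hydrophobicity grouping; which properties and groupings protpy applies is not stated here, so treat the grouping and the name `ctd_composition_sketch` as assumptions.

```python
# Three-group hydrophobicity split often used for CTD descriptors (assumed).
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def ctd_composition_sketch(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Return the fraction of residues in `sequence` falling in each group."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return {
        name: sum(sequence.count(aa) for aa in members) / len(sequence)
        for name, members in groups.items()
    }


# R and K are polar, G is neutral, C is hydrophobic.
ctd_comp = ctd_composition_sketch("RKGC")
```

The transition and distribution parts of CTD are computed from the same group-encoded sequence, by counting group changes between neighbours and by locating where each group's residues fall along the chain.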

Sequence Order Descriptors

Calculate Sequence Order Coupling Number (SOCN):

socn = protpy.sequence_order_coupling_number(protein_seq, lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json")
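
The sequence order coupling number for a lag d is usually defined as the sum, over all residue pairs d positions apart, of the squared distance between the two amino acids in the chosen distance matrix. The sketch below shows that calculation with a tiny made-up distance table; it does not read the Schneider-Wrede JSON file used above, and `sequence_order_coupling_sketch` and `toy_distance` are hypothetical names.

```python
def sequence_order_coupling_sketch(sequence, distance, max_lag):
    """Return [tau_1, ..., tau_max_lag] for `sequence`.

    `distance` is a callable giving the distance between two amino acids.
    """
    sequence = sequence.upper()
    n = len(sequence)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, max_lag + 1)
    ]


# Toy symmetric distances (illustrative values only, not Schneider-Wrede).
TOY_DISTANCES = {("A", "C"): 1.0, ("A", "D"): 2.0, ("C", "D"): 3.0}


def toy_distance(a, b):
    """Look up the toy distance for a pair, in either order."""
    if a == b:
        return 0.0
    return TOY_DISTANCES.get((a, b), TOY_DISTANCES.get((b, a)))


taus = sequence_order_coupling_sketch("ACDA", toy_distance, max_lag=2)
```

For "ACDA", lag 1 covers the pairs AC, CD and DA (1 + 9 + 4 = 14) and lag 2 covers AD and CA (4 + 1 = 5).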

Calculate Quasi Sequence Order (QSO):

qso = protpy.quasi_sequence_order(protein_seq, lag=30, distance_matrix="schneider-wrede-physiochemical-distance-matrix.json")

Directories

  • /tests - unit and integration tests for protpy package.
  • /protpy - source code and all required external data files for package.
  • /docs - protpy documentation.

Tests

To run all tests, from the main protpy folder run:

python3 -m unittest discover tests

Contact

If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.

References

[1]: Mckenna, A., & Dubey, S. (2022). Machine learning based predictive model for the analysis of sequence activity relationships using protein spectra and protein descriptors. Journal of Biomedical Informatics, 128, 104016. https://doi.org/10.1016/j.jbi.2022.104016
[2]: Kawashima, S., & Kanehisa, M. (2000). AAindex: Amino Acid index database. Nucleic Acids Research, 28(1), 374. https://doi.org/10.1093/nar/28.1.374
[3]: Dong, J., Yao, Z.J., Zhang, L., et al. (2018). PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform, 10, 16. https://doi.org/10.1186/s13321-018-0270-2
[4]: Reczko, M., & Bohr, H. (1994). The DEF data base of sequence based protein fold class predictions. Nucleic Acids Res, 22, 3616-3619.
[5]: Hua, S., & Sun, Z. (2001). Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
[6]: Broto, P., Moreau, G., & Vandicke, C. (1984). Molecular structures: perception, autocorrelation descriptor and SAR studies. Eur J Med Chem, 19, 71-78.
[7]: Ong, S.A., Lin, H.H., Chen, Y.Z., et al. (2007). Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics, 8, 300. https://doi.org/10.1186/1471-2105-8-300
[8]: Dubchak, I., Muchnik, I., Holbrook, S.R., & Kim, S.-H. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 92, 8700-8704.
[9]: Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y., & Jiang, H. (2007). Predicting protein-protein interactions based only on sequences information. PNAS, 104, 4337-4341.
[10]: Chou, K.-C. (2000). Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 278, 477-483.
[11]: Chou, K.-C. (2001). Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 43, 246-255.
[12]: Chou, K.-C. (2005). Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21, 10-19.
