Encode protein sequence as a distribution of its physicochemical properties

proteinko

Python package

Introduction

A protein is a sequence of amino acid residues connected by peptide bonds. Each amino acid residue is characterized by a unique combination of physical and chemical properties. proteinko takes advantage of this to represent a protein sequence as a spatial distribution of the physicochemical properties of its amino acid residues, capturing the complementing or cancelling effects of neighbouring residues.

proteinko produces a numerical representation of a protein sequence that preserves information about its underlying physicochemical properties. This allows the investigation of relationships and interactions between proteins, as well as the potential discovery of the physicochemical properties that facilitate those interactions.

Methods

proteinko implements a fairly simple algorithm. The protein sequence is mapped to a vector V representing the distribution of a chosen physicochemical property along the entire protein. Each amino acid residue Ai is modeled independently as a Gaussian curve Gi, scaled by the corresponding value from the encoding scheme. Gi is mapped to the slice of V that is centered at the position corresponding to Ai in the sequence and that spans L neighbouring slices on each side. Because the curves of neighbouring residues overlap, summing them captures the complementing or cancelling effects that neighbouring amino acid residues may exert on the local physicochemical property of the protein. The extent of overlap is determined by two parameters: the overlap distance (L) and sigma. The overlap distance determines how many neighbouring slices Gi spans on each side. Sigma determines the shape of the Gaussian curve of each amino acid residue (see example). proteinko accepts both parameters as function arguments, allowing users to adjust the shape of the final distribution as needed.

[Figure: plot1]
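
The overlap-and-sum idea described in Methods can be sketched in plain Python. This is only an illustration under stated assumptions, not proteinko's actual implementation: the function name `sketch_distribution` and the `points_per_residue` parameter (how many points of V each residue's slice occupies) are hypothetical, and x is measured in residue units so that sigma is also expressed in residue units.

```python
import math


def sketch_distribution(sequence, encoding_scheme, overlap_distance=2,
                        sigma=0.4, points_per_residue=40):
    """Sum one scaled Gaussian per residue into a single vector V.

    Hypothetical sketch: names and discretisation are assumptions,
    not proteinko's own code.
    """
    n = len(sequence) * points_per_residue
    v = [0.0] * n
    # Half-width of each curve in points: the residue's own half slice
    # plus overlap_distance neighbouring slices on each side.
    half = overlap_distance * points_per_residue + points_per_residue // 2
    for pos, residue in enumerate(sequence):
        centre = pos * points_per_residue + points_per_residue // 2
        weight = encoding_scheme[residue]
        for offset in range(-half, half + 1):
            idx = centre + offset
            if 0 <= idx < n:  # clip curves that run past the sequence ends
                x = offset / points_per_residue  # distance in residue units
                v[idx] += weight * math.exp(-x * x / (2 * sigma * sigma))
    return v


# Toy scheme with opposite values so the cancelling effect is visible.
toy_scheme = {'A': 1.0, 'G': -1.0}
toy = sketch_distribution('AG', toy_scheme, overlap_distance=2,
                          sigma=0.4, points_per_residue=10)
```

With this toy input the two curves have equal magnitude and opposite sign, so they cancel exactly at the midpoint between the two residues, while the peak of each residue is slightly reduced by the tail of its neighbour.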

Installation

pip install proteinko

Usage

proteinko implements two functions: model_distribution and encode_sequence. Both functions take an encoding_scheme parameter, which accepts a Python dictionary with amino acid one-letter codes as keys and the corresponding property values as values.
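
As a rough illustration of the dictionary shape described above, the sketch below builds an encoding scheme by hand from the Kyte-Doolittle hydropathy scale. The scale is used here only as example data. The helpers `missing_residues` and `encode` are hypothetical and not part of proteinko; `encode` merely mirrors what the release notes describe encode_sequence as doing, and how proteinko itself treats residues missing from the dictionary is not stated.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}


def missing_residues(sequence, encoding_scheme):
    """Return the residue codes in `sequence` that the scheme has no value for."""
    return sorted(set(sequence) - set(encoding_scheme))


def encode(sequence, encoding_scheme):
    """Replace each residue with its value from the scheme (hypothetical helper)."""
    return [encoding_scheme[residue] for residue in sequence]


values = encode('MEEP', KYTE_DOOLITTLE)
unknown = missing_residues('MEXB', KYTE_DOOLITTLE)
```

Checking for missing residues before encoding is a cheap way to catch ambiguity codes such as X or B, which a 20-letter scheme does not cover.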

Example 1:

from proteinko import model_distribution, encode_sequence
import matplotlib.pyplot as plt
from pyaaisc import Aaindex


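# Example protein sequence in one-letter amino acid codes
sequence = 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP'
# Fetch the ARGP820101 hydrophobicity index from AAindex to use as the encoding scheme
encoding_scheme = Aaindex().get('ARGP820101', dbkey='aaindex1').index_data

# Model the distribution with two different parameter settings
dist_1 = model_distribution(sequence, encoding_scheme, overlap_distance=2, sigma=0.4)
dist_2 = model_distribution(sequence, encoding_scheme, overlap_distance=3, sigma=0.8)
# Plain per-residue encoding, for comparison
encoded = encode_sequence(sequence, encoding_scheme)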

fig, ax = plt.subplots(3, 1, sharey=True, figsize=(12,5))
ax[0].plot(dist_1)
ax[0].grid()
ax[0].set_xticklabels([])
ax[0].set_title('Modeled distribution, overlap_distance=2, sigma=0.4')
ax[1].plot(dist_2)
ax[1].grid()
ax[1].set_xticklabels([])
ax[1].set_title('Modeled distribution, overlap_distance=3, sigma=0.8')
ax[1].set_ylabel('Hydrophobicity index - ARGP820101')
ax[2].bar(range(len(encoded)), encoded)
ax[2].grid()
ax[2].set_xticks(range(len(sequence)))
ax[2].set_xticklabels(list(sequence))
ax[2].set_title('Sequence')

plt.show()

[Figure: plot2]

Example 2:

from proteinko import model_distribution
import matplotlib.pyplot as plt
from pyaaisc import Aaindex


sequence = 'MEEPQSDPSVE'
encoding_scheme = Aaindex().get('ARGP820101', dbkey='aaindex1').index_data

# Full-resolution distribution
dist = model_distribution(sequence, encoding_scheme, overlap_distance=2, sigma=0.4)
# The same distribution sampled at 16 points
sampled_dist = model_distribution(sequence, encoding_scheme, overlap_distance=2, sigma=0.4, sampling_points=16)

fig, ax = plt.subplots(2, 1, figsize=(6,4))
ax[0].plot(dist)
ax[0].grid()
ax[0].set_xticklabels([])
ax[0].set_title('Modeled distribution')
ax[0].set_ylabel('Hydrophobicity index')

ax[1].bar(range(len(sampled_dist)), sampled_dist)
ax[1].grid()
ax[1].set_xticklabels([])
ax[1].set_title('Sampled distribution')
ax[1].set_ylabel('Hydrophobicity index')

plt.show()
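
Example 2 reduces the modeled distribution to a fixed number of points, which is convenient when sequences of different lengths need vectors of the same size. How proteinko picks those points is not described here, so the following is only one plausible approach: taking evenly spaced values, endpoints included. The function `sample_evenly` is hypothetical and not part of the package.

```python
def sample_evenly(values, sampling_points):
    """Pick `sampling_points` evenly spaced entries from `values`.

    Hypothetical helper; proteinko's own sampling strategy may differ.
    """
    if not values:
        raise ValueError("values must not be empty")
    if sampling_points < 1:
        raise ValueError("sampling_points must be at least 1")
    if sampling_points == 1:
        return [values[len(values) // 2]]  # single point: take the middle
    # Spacing between picked indices, so first and last entries are included.
    step = (len(values) - 1) / (sampling_points - 1)
    return [values[round(i * step)] for i in range(sampling_points)]


ramp = list(range(101))
picked = sample_evenly(ramp, 5)
```

Whatever the input length, the result always has exactly `sampling_points` entries, so vectors from different proteins can be compared element by element.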

Release Notes

release 5.0

Algorithm changes:

  • The number of overlapping neighbouring amino acid residues has been added as a function argument, with the default value set to overlap_distance=2.
  • The default sigma value has been changed from 0.8 to 0.4.
  • Normalization and standardization of the modeled distribution are deprecated. No pre- or post-processing is applied.
  • The scaling factor has been decreased from 100 to 40, reducing the number of computations and improving the performance of the algorithm.
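
Since the modeled distribution is now returned without post-processing, any rescaling has to be applied by the caller. The sketch below shows the generic textbook forms of min-max normalization and z-score standardization using only the standard library. Which variants earlier releases applied is not stated, so these helpers (`min_max_normalize`, `standardize`) are hypothetical and not a reproduction of the old behaviour.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
standardized = standardize([1.0, 2.0, 3.0])
```

Either helper can be applied to the list returned by model_distribution before plotting or feeding it to a downstream model.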

Major code changes:

  • The Proteinko class has been removed and the algorithm is implemented in the model_distribution function.
  • A new function, encode_sequence, has been introduced, which simply encodes a sequence with the values provided in the encoding table.
  • Encoding tables are now passed as Python dictionaries instead of pandas DataFrames.
  • Use of the pandas and scipy packages has been replaced with plain Python functions, making the code more lightweight and improving the performance of the algorithm.

Minor code changes:

  • The vlen parameter has been renamed to sampling_points, because it is the number of points to sample from the final distribution.
  • The schema parameter has been renamed to encoding_scheme.
