Encode protein sequence as a distribution of its physicochemical properties
proteinko
Introduction
A protein is a sequence of amino acid residues connected by peptide bonds. Each
amino acid residue is characterized by a unique combination of physical and chemical
properties. proteinko takes advantage of this to represent a protein sequence as
a spatial distribution of the physicochemical properties of its amino acid
residues, capturing the complementing or cancelling effects of neighbouring amino acid
residues.
proteinko enables a numerical representation of a protein sequence
that preserves the information about its underlying physicochemical properties. This
allows the investigation of relationships and interactions between proteins, as well as
the potential discovery of the physicochemical properties that facilitate those interactions.
Methods
proteinko implements a fairly simple algorithm. The protein sequence is mapped to a
vector V representing the distribution of a chosen physicochemical property over the entire protein.
Each amino acid residue Ai is modeled independently as a Gaussian curve Gi and
scaled by the corresponding value from the encoding scheme. Gi is mapped to
the slice of V that is centered at the position corresponding to the position of Ai
in the sequence and that spans L neighbouring slices on each side.
Where neighbouring curves overlap, their values are summed, which captures the
complementing or cancelling effects that neighbouring amino acid residues may exert on
the local physicochemical property of the protein. The extent of overlap is determined
by two factors: overlap distance (L) and sigma. Overlap distance determines how many
neighbouring slices Gi spans on each side. Sigma determines the shape of the Gaussian curve
of each amino acid residue (see Example 1). proteinko accepts both parameters as
function arguments, allowing users to adjust the shape of the final distribution as needed.
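The summing of overlapping Gaussian curves described above can be illustrated with a minimal sketch. This is not proteinko's implementation: the function name sketch_distribution, the points_per_residue resolution parameter, and the choice to express sigma in residue widths are all assumptions made for illustration only.

```python
import math


def sketch_distribution(sequence, encoding_scheme, overlap_distance=2,
                        sigma=0.4, points_per_residue=40):
    """Illustrative only: sum one scaled Gaussian per residue into a vector.

    points_per_residue is a hypothetical resolution parameter, and sigma is
    assumed to be measured in residue widths.
    """
    n_points = len(sequence) * points_per_residue
    v = [0.0] * n_points
    for i, residue in enumerate(sequence):
        weight = encoding_scheme[residue]
        # Centre of the slice that belongs to residue i.
        center = (i + 0.5) * points_per_residue
        # The curve spans overlap_distance slices on each side, clipped to V.
        start = max(0, (i - overlap_distance) * points_per_residue)
        stop = min(n_points, (i + overlap_distance + 1) * points_per_residue)
        for p in range(start, stop):
            # Distance from the residue centre, in residue widths.
            x = (p + 0.5 - center) / points_per_residue
            v[p] += weight * math.exp(-0.5 * (x / sigma) ** 2)
    return v


# Toy encoding scheme with made-up values, not taken from any real index.
toy_scheme = {"A": 1.0, "G": -1.0}
single = sketch_distribution("A", toy_scheme)
pair = sketch_distribution("AG", toy_scheme)
```

In the two-residue case the positive curve of A and the negative curve of G partly cancel near the boundary between their slices, which is the cancelling effect the text refers to. A larger sigma or overlap_distance lets each curve reach further into its neighbours.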
Installation
pip install proteinko
Usage
proteinko implements two functions: model_distribution and encode_sequence.
Both functions take an encoding_scheme parameter, which accepts a Python dictionary with
one-letter amino acid codes as keys and the corresponding property values as values.
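For readers who do not want to pull an index from a database, an encoding scheme can be written by hand. The sketch below uses the Kyte-Doolittle hydropathy scale as the dictionary and a hypothetical helper, encode_with_scheme, as a stand-in for a plain per-residue lookup. The helper is not part of proteinko, and whether encode_sequence validates missing keys in the same way is not stated in this document.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_with_scheme(sequence, encoding_scheme):
    """Hypothetical stand-in: look up one value per residue."""
    missing = sorted(set(sequence) - set(encoding_scheme))
    if missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise KeyError(f"no encoding value for residues: {missing}")
    return [encoding_scheme[residue] for residue in sequence]


encoded = encode_with_scheme("MEEPQ", KYTE_DOOLITTLE)
print(encoded)  # [1.9, -3.5, -3.5, -1.6, -3.5]
```

A dictionary of this shape should be usable wherever the examples below pass encoding_scheme, provided it covers every residue that occurs in the sequence.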
Example 1:
from proteinko import model_distribution, encode_sequence
import matplotlib.pyplot as plt
from pyaaisc import Aaindex
sequence = 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP'
encoding_scheme = Aaindex().get('ARGP820101', dbkey='aaindex1').index_data
dist_1 = model_distribution(sequence, encoding_scheme, overlap_distance=2, sigma=0.4)
dist_2 = model_distribution(sequence, encoding_scheme, overlap_distance=3, sigma=0.8)
encoded = encode_sequence(sequence, encoding_scheme)
fig, ax = plt.subplots(3, 1, sharey=True, figsize=(12,5))
ax[0].plot(dist_1)
ax[0].grid()
ax[0].set_xticklabels([])
ax[0].set_title('Modeled distribution, overlap_distance=2, sigma=0.4')
ax[1].plot(dist_2)
ax[1].grid()
ax[1].set_xticklabels([])
ax[1].set_title('Modeled distribution, overlap_distance=3, sigma=0.8')
ax[1].set_ylabel('Hydrophobicity index - ARGP820101')
ax[2].bar(range(len(encoded)), encoded)
ax[2].grid()
ax[2].set_xticks(range(len(sequence)))
ax[2].set_xticklabels(list(sequence))
ax[2].set_title('Sequence')
plt.show()
Example 2
from proteinko import model_distribution
import matplotlib.pyplot as plt
from pyaaisc import Aaindex
sequence = 'MEEPQSDPSVE'
encoding_scheme = Aaindex().get('ARGP820101', dbkey='aaindex1').index_data
dist = model_distribution(sequence, encoding_scheme, overlap_distance=2, sigma=0.4)
sampled_dist = model_distribution(sequence, encoding_scheme, overlap_distance=2, sigma=0.4, sampling_points=16)
fig, ax = plt.subplots(2, 1, figsize=(6,4))
ax[0].plot(dist)
ax[0].grid()
ax[0].set_xticklabels([])
ax[0].set_title('Modeled distribution')
ax[0].set_ylabel('Hydrophobicity index')
ax[1].bar(range(16), sampled_dist)
ax[1].grid()
ax[1].set_xticklabels([])
ax[1].set_title('Sampled distribution')
ax[1].set_ylabel('Hydrophobicity index')
plt.show()
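Example 2 reduces the modeled distribution to a fixed number of points with sampling_points. How proteinko picks those points is not described here, so the following is only one plausible approach, taking evenly spaced indices from a longer list. The name sample_evenly is hypothetical.

```python
def sample_evenly(distribution, sampling_points):
    """Illustrative only: pick evenly spaced values, endpoints included."""
    if sampling_points <= 0:
        raise ValueError("sampling_points must be positive")
    n = len(distribution)
    if sampling_points == 1:
        # A single sample is taken from the middle of the distribution.
        return [distribution[n // 2]]
    step = (n - 1) / (sampling_points - 1)
    return [distribution[round(i * step)] for i in range(sampling_points)]


ramp = list(range(101))
sampled = sample_evenly(ramp, 5)
print(sampled)  # [0, 25, 50, 75, 100]
```

A fixed number of sampling points gives vectors of equal length for proteins of different lengths, which is convenient when the result feeds a model that expects fixed-size input.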
Release Notes
release 5.0
Algorithm changes:
- The number of overlapping neighbouring amino acid residues has been added as a function argument, with the default value set to overlap_distance=2.
- The default sigma value has been changed from 0.8 to 0.4.
- Normalization and standardization of the modeled distribution are deprecated. No pre- or post-processing is applied.
- The scaling factor has been decreased from 100 to 40, reducing the number of computations and improving the performance of the algorithm.
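Because normalization and standardization are no longer applied, callers who need them have to post-process the returned values themselves. The helpers below are a minimal sketch using only the standard library. The names standardize and min_max_normalize are hypothetical, and nothing here claims to reproduce what earlier releases did.

```python
import statistics


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant input has no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


z_scores = standardize([1.0, 2.0, 3.0])
scaled = min_max_normalize([2.0, 4.0, 6.0])
print(scaled)  # [0.0, 0.5, 1.0]
```

Either helper can be applied to the list returned by model_distribution, assuming it is a plain sequence of numbers as the plotting examples suggest.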
Major code changes:
- The Proteinko class has been removed and the algorithm is implemented in the model_distribution function.
- A new function, encode_sequence, has been introduced, which simply encodes a sequence with the values provided in the encoding table.
- Encoding tables are now passed as Python dictionaries instead of pandas dataframes.
- Use of the pandas and scipy packages has been replaced with plain Python functions, making the code more lightweight and improving the performance of the algorithm.
Minor code changes:
- The vlen parameter has been renamed to sampling_points because it is the number of points to sample from the final distribution.
- The schema parameter has been renamed to encoding_scheme.