A Python package for analysing Protein Sequence Activity Relationships

Project description


pySAR


pySAR is a Python library for analysing Sequence Activity Relationships (SARs) of protein sequences. It lets you numerically encode a dataset of protein sequences using a wide range of available methodologies and features. The software uses physicochemical and biochemical features from the Amino Acid Index (AAI) database and can also calculate a range of structural protein descriptors.
After finding the optimal technique and feature set with which to encode your dataset of sequences, pySAR can be used to build a predictive regression model, with the encoded sequences as the training data and the experimentally determined activity values of each protein sequence as the training labels. The model can then be used to predict the activity/fitness value of a new, unseen sequence.
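
As a rough illustration of the encoding step described above, the sketch below maps each residue of a sequence to a value from a single amino acid index. The values are the Kyte-Doolittle hydropathy scale, chosen only as a familiar example; the function name `encode_sequence` and the toy sequences are hypothetical and are not part of pySAR's API.

```python
# Kyte-Doolittle hydropathy values, one per standard amino acid.
# Example index only; pySAR itself takes its indices from the AAI database.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(seq, index=HYDROPATHY):
    """Return the index value of each residue in seq, in order."""
    try:
        return [index[aa] for aa in seq.upper()]
    except KeyError as err:
        raise ValueError(f"unknown amino acid: {err.args[0]}") from None


# toy sequences, for illustration only
sequences = ["ARND", "MKVL"]
encoded = [encode_sequence(s) for s in sequences]
print(encoded)
```

Each sequence becomes a numeric vector of the same length, which is the kind of input a regression model or a DSP step can then work with.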

status

Development Stage

To-do list:

Add Github Workflow CI thing
Add Category and Descriptor Group to pySAR results DF.
Condense comments in functions, remove some whitespace lines
Add help function
Mention that the PyBioMed package is duplicated here because it is not available via PyPI, which would otherwise require the user to install the full PyBioMed zip
raise type errors instead of Value ?
index errors?
remove plot func from DSP
do StandardScaler after every AAIndex encoding and before model building
Change importing globals : import globals / globals.OUTPUT_DIR
Split up autocorrelation descriptors into their own functions
Allow fasta file to be input to Descriptor class?
GitHub workflow with Twine that automatically publishes to PyPI
provide example script for running on GCP or AWS resources?
don't return None after raising an exception??
add descriptions to each method in each class
remove spacing around equals in keyword args in class/function definitions
setters and getters to Evaluate class? using @property
add python version badge to readme
add pypi badge to readme
add introduction to readme
add references to descriptor module
integrate descriptor and AAIndex when using properties from AAIndex
look into setup.cfg or setup.py
add distance matrices json to data? : https://github.com/MartinThoma/propy3/blob/master/propy/QuasiSequenceOrder.py
split up QuasiSequenceOrder descriptor into its constituent quasi-seq-order descriptors
in readme show example usage for each module/class
change AAI method names from get_feature etc to get_record...
change get_feature_names to get_feature_desc
add AAI category to each AAI record
change all 'aa_index' to 'aaindex'
Add assertion comments to each unit test, got X wanted Y..
add test numbers in comments for each block of unit tests.
Go through each parameters list and refer to its previous reference rather than repeating it.
add cutoff index/value again just for testing
print out default parameters if using them.
remove verbose argument - dont need since tqdm prints progress bar
add if __name__ == "__main__" to encoding and pySAR class.
split function defs to two lines?
publish to conda?
pypi logo
license logo
leave = False on 2nd loop

Installation

Install using pip:

pip3 install pySAR

Usage

Building a predictive model from AAI and/or protein descriptors:

For example, the code below builds a PlsRegression model using the AAI index CIDH920105 and the amino acid composition descriptor. The index is passed through a DSP pipeline and transformed into its informational protein spectrum, here the power spectrum, with a Hamming window function applied to the output of the FFT.
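
To make the DSP step more concrete, here is a minimal, dependency-free sketch of a windowed power spectrum. It is not pySAR's implementation: the function names are hypothetical, the input values are toy numbers, and the window is applied to the signal before the transform, which is the conventional ordering in signal processing and an assumption here. In practice `numpy.fft` would normally replace the naive transform.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (same formula as numpy.hamming)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform; fine for short sequences."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def power_spectrum(encoded_seq):
    """Window the encoded sequence, transform it, and return |X_k|^2 per bin."""
    window = hamming(len(encoded_seq))
    windowed = [x * w for x, w in zip(encoded_seq, window)]
    return [abs(c) ** 2 for c in dft(windowed)]


# toy numeric encoding of a short sequence (illustrative values only)
encoded = [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4]
spectrum = power_spectrum(encoded)
print(len(spectrum))
```

The output has one non-negative value per frequency bin, and for a real-valued input the spectrum is symmetric about its midpoint.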

# first-party imports
from globals import OUTPUT_DIR, OUTPUT_FOLDER, DATA_DIR
from aaindex import AAIndex
from model import Model
from proDSP import ProDSP
from evaluate import Evaluate
import utils
from plots import plot_reg
import descriptors as desc
from pySAR import PySAR  # assumed module path; the original snippet omits this import

# create a PySAR instance that uses a PLS regression model
pySAR = PySAR(dataset="dataset.txt", seq_col="sequence", activity="activity",
  algorithm="PlsRegression", parameters={}, test_split=0.2)

# encode with one AAI index plus one descriptor, using the power spectrum and a Hamming window
results_df = pySAR.encode_aai_desc(indices="CIDH920105", descriptors="aa_composition",
  spectrum="power", window="hamming")
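
The `test_split=0.2` argument in the call above presumably holds out 20% of the dataset for testing, as is conventional; that reading is an assumption rather than something this README states. A minimal sketch of such a split, using a hypothetical helper and toy rows:

```python
import random


def train_test_split(rows, test_split=0.2, seed=0):
    """Shuffle rows and hold out a fraction of them for testing."""
    if not 0 < test_split < 1:
        raise ValueError("test_split must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_split))
    return shuffled[n_test:], shuffled[:n_test]


# toy (sequence id, activity) rows, for illustration only
rows = [(f"SEQ{i}", float(i)) for i in range(10)]
train, test = train_test_split(rows, test_split=0.2)
print(len(train), len(test))  # 8 2
```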

Encoding using all 566 AAIndex indices

# create instance of Encoding class, which inherits from the pySAR class
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
  algorithm="RandomForest", parameters={"n_estimators": 200, "max_depth": 50})

# encode with every AAI index, using the imaginary spectrum and a Blackman window
aai_encoding = encoding.aai_encoding(spectrum="imaginary", window="blackman")

Encoding using list of 4 AAIndex indices, with no DSP functionalities

encoding = Encoding(dataset="dataset.txt", activity="activity_col",
  algorithm="PLSRegression", parameters={})

# encode with the four listed AAI indices only, skipping the DSP step
aai_encoding = encoding.aai_encoding(use_dsp=False,
  aai_list=["PONP800102", "RICJ880102", "ROBB760107", "KARS160113"])

Encoding using protein descriptors

encoding = Encoding(dataset="dataset.txt", activity="activity_col",
  algorithm="RandomForest", parameters={}, descriptors_csv="descriptors.csv")

desc_encoding = encoding.desc_encoding(desc_combo=2, verbose=True)
# method signature, for reference:
# def descriptor_encoding(self, desc_list=None, desc_combo=1, verbose=True)
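
The `desc_combo` argument above appears to control how many descriptors are combined per encoding; that interpretation is an assumption. Under it, `desc_combo=2` would mean trying every pair of descriptors, which can be sketched with `itertools.combinations`. The helper name and two of the descriptor names below are hypothetical.

```python
from itertools import combinations

# hypothetical descriptor names; only "aa_composition" appears in this README
descriptors = ["aa_composition", "dipeptide_composition", "tripeptide_composition"]


def descriptor_combinations(names, desc_combo=1):
    """Return every unordered group of desc_combo descriptor names."""
    return list(combinations(names, desc_combo))


for combo in descriptor_combinations(descriptors, desc_combo=2):
    print(combo)
```

The number of combinations grows quickly with the number of descriptors, so larger `desc_combo` values mean many more models to build and evaluate.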

Encoding using AAI + protein descriptors

Generate all protein descriptors

desc = Descriptor(protein_seqs=data, desc_dataset="descriptors.csv",
  all_desc=True)

where protein_seqs is the dataset of protein sequences, desc_dataset is the name of the output csv used to store the calculated descriptors of the protein sequences, and all_desc means that the class will calculate all available descriptors.
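
As an example of what one such descriptor computes, the sketch below calculates amino acid composition, the fraction of each of the 20 standard amino acids in a sequence. It is a hypothetical stand-alone function, not pySAR's or PyBioMed's implementation, which may scale or round the values differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


comp = aa_composition("AACD")
print(comp["A"], comp["C"], comp["W"])  # 0.5 0.25 0.0
```

Every sequence maps to a fixed-length vector of 20 values regardless of its length, which is what makes descriptors convenient as model features.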

Get record from AAIndex database

desc = Descriptor(protein_seqs=data, desc_dataset="descriptors.csv",
  all_desc=True)

Output Results

| Descriptor | Index | R2 | RMSE | MSE |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| Content Cell | Content Cell | Content Cell | Content Cell | Content Cell |
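
For reference, the three metrics named in the results table can be computed as in the sketch below, using their standard definitions. The function name and the toy values are hypothetical; pySAR's own evaluation code may differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MSE for paired true and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    # R2 is undefined when all true values are identical
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"R2": r2, "RMSE": math.sqrt(mse), "MSE": mse}


# toy activity values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(regression_metrics(y_true, y_pred))
```

With these toy values the single error of 1.0 gives an MSE of 0.25, an RMSE of 0.5 and an R2 of 0.8.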


System Requirements

Python > 3.6
numpy >= 1.16.6
pandas >= 1.1.0
scikit-learn >= 0.24
scipy >= 1.4.1

Running Tests

To run tests, from the main pySAR folder run:

python -m unittest tests.MODULE_NAME -v

MODULE_NAME ->

Directory folders:

/pySAR/PyBioMed - package partially forked from https://github.com/gadsbyfly/PyBioMed, used in the calculation of the protein descriptors.
/Results - stores all calculated results that were generated for the research article, studying the SAR for a thermostability dataset.
/pySAR/tests - unit and integration tests for pySAR.
/pySAR/data - all required data and datasets are stored in this folder.

Contact

If you have any questions or comments, please contact: amckenna41@qub.ac.uk


Install required dependencies and packages:

python setup.py install

Project details

This version

0.0.3

May 7, 2021
