A Python package used to analysis Protein Sequence Activity Relationships
Project description
pySAR
pySAR is a Python library for analysing Sequence Activity Relationships (SARs) of protein sequences. pySAR offers extensive and verbose functionalities that allow you to numerically encode a dataset of protein sequences using a large abundance of available methodologies and features. The software uses physiochemical and biochemical features from the Amino Acid Index (AAI) database as well as allowing for the calculation of a range of structural protein descriptors.
After finding the optimal technique and feature set at which to encode your dataset of sequences, pySAR can then be used to build a predictive regression model with the training data being that of the encoded sequences and training labels being the experimentally pre-calculated activity values for each protein sequence. The model can then be used to predict the activity/fitness value of a new unseen sequence.
status
Development Stage
To DO List:
- Add Github Workflow CI thing
- Add Category and Descriptor Group to pySAR results DF.
- Condense comments in functions, remove some whitespace lines
- Add help function
- Mention that PyBioMed package duplicated here as it is not available via pyPI and would mean that user would have to install the full pybiomed zip
- raise type errors instead of Value ?
- index errors?
- remove plot func from DSP
- do StanardScaler after every AAIndex encoding and before model building####
- Change importing globals : import globals / globals.OUTPUT_DIR
- Split up autocorrelation descriptors into their own functions
- Allow fasta file to be input to Descriptor class?
- github workflow with Twine that automatically published to pypi
- provide example script for running on GCP or AWS resources?
- don't return None after raising an exception??
- add descriptions to each methods in each class
- remove spacing in equals in keyword args in class/function defintiion
- setters and getters to Evaluate class? using @property
- add python version badge to readme
- add pypi badge to readme
- add introduction to readme
- add references to descriptor module
- integrate descriptor and AAIndex when using properties from AAIndex
- look into setup.cfg or setup.py
- add distance matrices json to dara ? : https://github.com/MartinThoma/propy3/blob/master/propy/QuasiSequenceOrder.py
- split up QuasiSequenceOrder descriptor into its consitent quasi-seq-order
- in readme show example usage for each module/class
- change AAI method names from get_feature etc to get_record...
- change get_feature_names to get_feature_desc
- add AAI category to each AAI record
- change all 'aa_index' to 'aaindex'
- Add assertion comments to each unit test, got X wanted Y..
- add test numbers in comments for each block of unit tests.
- Go through each parameters list and refer to its previous reference rather than repeating it.
- add cutoff index/value again just for testing
- print out default parameters if using them.
- remove verbose argument - dont need since tqdm prints progress bar
- add if name == "main" to encoding and pySAR class.
- split function defs to two lines?
- publish to conda?
- pypi logo
- license logo
- leave = False on 2nd loop
Installation
Install using pip:
pi3 install pySAR
Usage
Building predictive model from AAI and or protein descriptors:
e.g the below code will build a PlsRegression model using the AAI index CIDH920105 and the amino acid composition descriptor. The index is passed through a DSP pipeline and is transformed into its informational protein spectra using the power spectra, with a hamming window function applied to the output of the FFT. spectrum after a window function is applied.
#first-party imports
from globals import OUTPUT_DIR, OUTPUT_FOLDER, DATA_DIR
from aaindex import AAIndex
from model import Model
from proDSP import ProDSP
from evaluate import Evaluate
import utils as utils
from plots import plot_reg
import descriptors as desc
pySAR = PySAR(dataset="dataset.txt",seq_col="sequence", activity="activity",algorithm = "PlsRegression", parameters={}, test_split=0.2)
results_df = pySAR.encode_aai_desc(indices="CIDH920105", descriptors="aa_composition", spectrum="power", window="hamming")
Encoding using all 566 AAIndex indices
#create instance of Encoding class, inherits from pySAR class
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
algorithm="RandomForest", parameters={"n_estimators":"200","max_depth":"50"})
aai_encoding = encoding.aai_encoding(spectrum='imaginary', window='blackman')
Encoding using list of 4 AAIndex indices, with no DSP functionalities
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
algorithm="PLSRegression", parameters={"":"","":"", })
aai_encoding = encoding.aai_encoding(use_dsp=False, aai_list=["PONP800102","RICJ880102","ROBB760107","KARS160113"])
Encoding using protein descriptors
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
algorithm="RandomForest", parameters={"":"","":"", }, descriptors_csv="descriptors.csv")
desc_encoding = encoding.desc_encoding(desc_combo = 2, verbose = True)
def descriptor_encoding(self, desc_list=None, desc_combo=1, verbose=True):
Encoding using AAI + protein descriptors
Generate all protein descriptors
desc = Descriptor(protein_seqs = data, desc_dataset = "descriptors.csv",
all_desc=True)
where protein_seqs is the dataset of protein sequences, desc_dataset is the name of the ouput csv used to store the calculated descriptors of the protein sequences and all_desc means that the class will get and calculate all descriptors.
Get record from AAIndex database
desc = Descriptor(protein_seqs = data, desc_dataset = "descriptors.csv",
all_desc=True)
Output Results
| Descriptor | Index | | R2 | RMSE | MSE
| ------------- | ------------- |
| Content Cell | Content Cell |
| Content Cell | Content Cell |
Command | Description |
---|---|
git status | List all new or modified files |
git diff | Show file differences that haven't been staged |
System Requirements
Python > 3.6 numpy >= 1.16.6 pandas >= 1.1.0 scikit-learn >= 0.24 scipy >= 1.4.1
Running Tests
To run tests, from the main pySAR folder run:
python -m unittest tests.MODULE_NAME -v
MODULE_NAME ->
Directory folders:
/pySAR/PyBioMed
- package partially forked from https://github.com/gadsbyfly/PyBioMed, used in the calculation of the protein descriptors./Results
- stores all calculated results that were generated for the research article, studying the SAR for a thermostability dataset./pySAR/tests
- unit and integration tests for pySAR./pySAR/data
- all required data and datasets are stored in this folder.
Contact
If you have any questions or comments, please contact: amckenna41@qub.ac.uk @
|Logo| image:: https://raw.githubusercontent.com/pySAR/pySAR/master/pySAR.png
Install required dependencies and packages:
python setup.py install
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.