A Python package for analysing Protein Sequence Activity Relationships
pySAR
pySAR is a Python library for analysing Sequence Activity Relationships (SARs) of protein sequences. It offers a wide range of functionality for numerically encoding a dataset of protein sequences using many of the available methodologies and features. The software uses physicochemical and biochemical features from the Amino Acid Index (AAI) database and can also calculate a range of structural protein descriptors.
After finding the optimal technique and feature set with which to encode your dataset of sequences, pySAR can be used to build a predictive regression model, where the training data are the encoded sequences and the training labels are the experimentally pre-calculated activity values for each protein sequence. The model can then be used to predict the activity/fitness value of a new, unseen sequence.
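The encode-then-regress workflow described above can be illustrated with a minimal, standard-library sketch. It does not use pySAR's API: the index values are the Kyte-Doolittle hydropathy scale (a stand-in for any AAI record), the helper names `encode_mean`, `fit_line` and `predict` are hypothetical, and the sequences and activity values are made up purely for illustration.

```python
# Toy illustration only: a single amino acid index is used as the feature and
# an ordinary least squares line stands in for the regression model.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_mean(sequence, index=HYDROPATHY):
    """Encode a sequence as the mean of its per-residue index values."""
    values = [index[aa] for aa in sequence]
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training set: sequences and made-up activity values.
train_seqs = ["AILV", "KRDE", "GSTP", "FMWC"]
train_activity = [4.0, 0.5, 2.0, 3.0]

features = [encode_mean(seq) for seq in train_seqs]
slope, intercept = fit_line(features, train_activity)


def predict(sequence):
    """Predict the activity of an unseen sequence from the fitted line."""
    return slope * encode_mean(sequence) + intercept


print(predict("AKLV"))
```

A real analysis would use far more sequences, richer encodings and a multivariate model; the sketch only shows how encoded sequences and measured activities fit together as training data and labels.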
status
Development Stage
To Do List:
- Add Github Workflow CI thing
- Add Category and Descriptor Group to pySAR results DF.
- Condense comments in functions, remove some whitespace lines
- Add help function
- Mention that PyBioMed package duplicated here as it is not available via pyPI and would mean that user would have to install the full pybiomed zip
- raise type errors instead of Value ?
- index errors?
- remove plot func from DSP
- do StandardScaler after every AAIndex encoding and before model building
- Change importing globals : import globals / globals.OUTPUT_DIR
- Split up autocorrelation descriptors into their own functions
- Allow fasta file to be input to Descriptor class?
- GitHub workflow with Twine that automatically publishes to PyPI
- provide example script for running on GCP or AWS resources?
- don't return None after raising an exception??
- add descriptions to each methods in each class
- remove spacing around equals in keyword args in class/function definitions
- setters and getters to Evaluate class? using @property
- add python version badge to readme
- add pypi badge to readme
- add introduction to readme
- add references to descriptor module
- integrate descriptor and AAIndex when using properties from AAIndex
- look into setup.cfg or setup.py
- add distance matrices json to data ? : https://github.com/MartinThoma/propy3/blob/master/propy/QuasiSequenceOrder.py
- split up QuasiSequenceOrder descriptor into its constituent quasi-seq-order
- in readme show example usage for each module/class
- change AAI method names from get_feature etc to get_record...
- change get_feature_names to get_feature_desc
- add AAI category to each AAI record
- change all 'aa_index' to 'aaindex'
- Add assertion comments to each unit test, got X wanted Y..
- add test numbers in comments for each block of unit tests.
- Go through each parameters list and refer to its previous reference rather than repeating it.
- add cutoff index/value again just for testing
- print out default parameters if using them.
- remove verbose argument - dont need since tqdm prints progress bar
- add if __name__ == "__main__" to encoding and pySAR class.
- split function defs to two lines?
- publish to conda?
- pypi logo
- license logo
- leave = False on 2nd loop
Installation
Install using pip:
pip3 install pySAR
Usage
Building a predictive model from AAI and/or protein descriptors:
For example, the code below builds a PlsRegression model using the AAI index CIDH920105 and the amino acid composition descriptor. The index is passed through a DSP pipeline and transformed into its informational protein spectrum using the power spectrum, with a Hamming window function applied to the output of the FFT.
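# first-party imports
from globals import OUTPUT_DIR, OUTPUT_FOLDER, DATA_DIR
from aaindex import AAIndex
from model import Model
from proDSP import ProDSP
from evaluate import Evaluate
import utils
from plots import plot_reg
import descriptors as desc
# assumption: the PySAR class is importable from the pySAR module
from pySAR import PySAR

# build the model from the dataset, its sequence/activity columns and the chosen algorithm
pySAR = PySAR(dataset="dataset.txt", seq_col="sequence", activity="activity",
              algorithm="PlsRegression", parameters={}, test_split=0.2)
# encode using one AAI index plus one descriptor, with the power spectrum and a hamming window
results_df = pySAR.encode_aai_desc(indices="CIDH920105", descriptors="aa_composition",
                                   spectrum="power", window="hamming")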
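The DSP step described in this section (index encoding, window function, Fourier transform, power spectrum) can be sketched with the standard library alone. This is not pySAR's ProDSP implementation: the function names are hypothetical, the index values are a few Kyte-Doolittle hydropathy values standing in for a real AAI record rather than CIDH920105, and the sketch applies the window to the encoded signal before the transform, which is the conventional order and may differ from the order pySAR uses.

```python
import cmath
import math

# Stand-in index values (Kyte-Doolittle hydropathy, subset); not CIDH920105.
TOY_INDEX = {"A": 1.8, "G": -0.4, "K": -3.9, "L": 3.8, "V": 4.2}


def encode(sequence, index=TOY_INDEX):
    """Map each residue to its index value, giving a numerical signal."""
    return [index[aa] for aa in sequence]


def hamming(n):
    """Hamming window of length n: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    if n < 2:
        return [1.0] * n
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform, adequate for short signals."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def power_spectrum(signal):
    """Window the signal, transform it, and return |X[k]|^2 for each bin."""
    window = hamming(len(signal))
    windowed = [x * w for x, w in zip(signal, window)]
    return [abs(c) ** 2 for c in dft(windowed)]


spectrum = power_spectrum(encode("AKLVGAKL"))
print(spectrum)
```

For real datasets an FFT routine such as the ones in numpy or scipy (both listed in the requirements) would replace the naive transform; the pure-Python version is used here only to keep the sketch dependency-free.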
Encoding using all 566 AAIndex indices
# create instance of Encoding class, which inherits from the pySAR class
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
                    algorithm="RandomForest",
                    parameters={"n_estimators": 200, "max_depth": 50})
# encode using every index, with the imaginary spectrum and a blackman window
aai_encoding = encoding.aai_encoding(spectrum="imaginary", window="blackman")
Encoding using list of 4 AAIndex indices, with no DSP functionalities
# empty parameters dict: no model parameters are overridden
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
                    algorithm="PLSRegression", parameters={})
# encode using only the listed indices, skipping the DSP step
aai_encoding = encoding.aai_encoding(use_dsp=False,
                                     aai_list=["PONP800102", "RICJ880102", "ROBB760107", "KARS160113"])
Encoding using protein descriptors
# descriptors_csv names the csv file of pre-calculated descriptors
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
                    algorithm="RandomForest", parameters={},
                    descriptors_csv="descriptors.csv")
desc_encoding = encoding.desc_encoding(desc_combo=2, verbose=True)
# method signature: descriptor_encoding(self, desc_list=None, desc_combo=1, verbose=True)
Encoding using AAI + protein descriptors
Generate all protein descriptors
desc = Descriptor(protein_seqs = data, desc_dataset = "descriptors.csv",
all_desc=True)
Here protein_seqs is the dataset of protein sequences, desc_dataset is the name of the output csv used to store the calculated descriptors of the protein sequences, and all_desc means that the class will get and calculate all descriptors.
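To make the idea of a protein descriptor concrete, the amino acid composition descriptor mentioned earlier can be sketched as follows. This is an illustrative stand-alone function, not the Descriptor class or the PyBioMed code; the name `aa_composition` and the choice to report percentages rounded to three decimals are assumptions.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the percentage of each standard amino acid in the sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    counts = Counter(sequence)
    length = len(sequence)
    # Counter returns 0 for residues that do not occur in the sequence.
    return {aa: round(100.0 * counts[aa] / length, 3) for aa in AMINO_ACIDS}


composition = aa_composition("AACD")
print(composition["A"], composition["C"], composition["W"])
```

Each sequence therefore becomes a fixed-length vector of 20 values regardless of its length, which is what allows descriptors to be used directly as model features.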
Get record from AAIndex database
desc = Descriptor(protein_seqs = data, desc_dataset = "descriptors.csv",
all_desc=True)
Output Results
| Descriptor | Index | R2 | RMSE | MSE |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| Content Cell | Content Cell | | | |
| Content Cell | Content Cell | | | |
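The R2, RMSE and MSE columns are standard regression metrics. As a reference for how they are defined, here is a minimal standard-library sketch; it is not pySAR's Evaluate class, and the function names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(r2(observed, predicted), rmse(observed, predicted), mse(observed, predicted))
```

scikit-learn, which is listed in the requirements, provides equivalent functions in sklearn.metrics (r2_score and mean_squared_error).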
System Requirements
- Python > 3.6
- numpy >= 1.16.6
- pandas >= 1.1.0
- scikit-learn >= 0.24
- scipy >= 1.4.1
Running Tests
To run tests, from the main pySAR folder run:
python -m unittest tests.MODULE_NAME -v
MODULE_NAME ->
Directory folders:
/pySAR/PyBioMed
- package partially forked from https://github.com/gadsbyfly/PyBioMed, used in the calculation of the protein descriptors.
/Results
- stores all calculated results that were generated for the research article, studying the SAR for a thermostability dataset.
/pySAR/tests
- unit and integration tests for pySAR.
/pySAR/data
- all required data and datasets are stored in this folder.
Contact
If you have any questions or comments, please contact: amckenna41@qub.ac.uk
Install required dependencies and packages:
python setup.py install