...
Project description
ProtLoc-Mex1
Introduction ProtLoc-Mex1
This project offers a comprehensive pipeline for the rapid development of subcellular localization prediction and model interpretation. It encompasses 42 amino acid feature characterization algorithms and Gene Ontology (GO) feature extraction based on the Doc2Vec approach. Additionally, two random forest models for protein localization prediction are provided.
with support by SHAP package warped into ProtLoc-mex1 module,everyone can easily use two random forest models above to get the global explanation and local explanation of the feature (physiochemical characteristics of amino acid sequence and GO annotation semantics) and model in protein localization prediction.
Note: If you are refer to our work titled "Interpretable Feature Extraction and Dimensionality Reduction in ESM2 for Protein Localization Prediction" , please only focus on the sections of Installation
, Dependencies
, and AA_count
. If you wish to explore more details related to the interpretation methods related to this work, please proceed to the GitHub repository directly. The content following the GO_count
section is independent of the mentioned work.
Installation
This project's core code has been uploaded to the PyPI repository. To get it using a conda virtual environment, follow the steps below:
First, create a new conda environment. For Windows systems, it is recommended to use Conda Prompt for this task. On Linux systems, you can use the Terminal. (You can also modify the environment name as needed, here, we use "myenv" as an example):
conda create -n myenv python=3.9
Then, activate the environment you just created:
conda activate myenv
Finally, use pip to install 'protloc_mex1' within this environment:
pip install protloc_mex1
Dependencies
ProtLoc-Mex1 requires Python == 3.9
Below are the Python packages required by ProtLoc-Mex1, which are automatically installed with it:
dependencies = [
"biopython==1.79",
"numpy==1.20.3",
"pandas==1.4.1",
"seaborn==0.11.2",
"matplotlib==3.5.1",
"shap==0.41.0",
"gensim==4.2.0"
]
and other not automatically installed but also required Python packages:
dependencies = [
"scikit-learn==1.2.2",
"captum == 0.6.0"
"torch == 1.12.1"
]
It is advised to obtain these dependent packages from their respective official sources, while carefully considering the implications of version compatibility.
How to use ProtLoc-Mex1
ProtLoc-Mex1 includes 6 modules: AA_count, GO_count, classifier_evalute, SHAP_conduct, SHAP_plus.
AA_count
In this module, we can perform protein sequence analysis. AA_count include three functions, dna_sequence_conduct()
, rna_sequence_conduct()
, and protein_sequence_conduct()
, they are designed to process DNA, RNA, and protein sequences, respectively, in a given DataFrame df
.
The dna_sequence_conduct()
and rna_sequence_conduct()
functions:
- Filter out illegal bases in the DNA/RNA sequences in the DataFrame.
- Calculate and add columns for the frequency of carbon, hydrogen, nitrogen, and oxygen elements in the DNA/RNA sequences.
- Calculate and add columns for the frequency of each base (A, T, C, G for DNA and A, U, C, G for RNA) in the DNA/RNA sequences.
The protein_sequence_conduct()
function:
- Adds new columns for 20 standard amino acids, elemental properties, and various protein properties including molecular weight, aromaticity, instability index, flexibility, isoelectric point, secondary structure fraction, molar extinction coefficient for reduced cysteines and disulfide bridges, and gravy.
- Calculates and adds the frequency of each amino acid in the protein sequences.
- For sequences containing only 20 standard amino acids, it calculates and adds values for molecular weight, aromaticity, instability index, flexibility, isoelectric point, secondary structure fraction, molar extinction coefficient for reduced cysteines and disulfide bridges, and gravy.
- Calculates and adds the count of each element (oxygen, sulfur, carbon, hydrogen, nitrogen) and acidic, basic, polar, and non-polar properties in the protein sequences.
for using AA_count example:
>>> import pandas as pd
>>> from protloc_mex1.AA_count import dna_sequence_conduct, rna_sequence_conduct, protein_sequence_conduct
# Example DataFrame with DNA sequences
>>> df_dna = pd.DataFrame({
... 'gene_seq': ['ATCGTGCA', 'TGCTAGCT', 'CTAGCTAG']
... })
# Call the dna_sequence_conduct function
>>> df_dna_processed = dna_sequence_conduct(df_dna, 'gene_seq')
>>> print(df_dna_processed)
# Assume you have a DataFrame containing RNA sequences
>>> df_rna = pd.DataFrame({
... 'gene_seq': ['AUCGUGCA', 'UGCUAGCU', 'CUAGCUAG']
... })
# Call the rna_sequence_conduct function
>>> df_rna_processed = rna_sequence_conduct(df_rna, 'gene_seq')
>>> print(df_rna_processed)
# Assume you have a DataFrame containing protein sequences
>>> df_protein = pd.DataFrame({
... 'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
... })
# Call the protein_sequence_conduct function
>>> df_protein_processed = protein_sequence_conduct(df_protein, 'Sequence')
>>> print(df_protein_processed)
GO_count
GO_count are capability in using Doc2vec to get GO representation,for initialize model can see below:
>>> import gensim
>>> from gensim.models.doc2vec import Doc2Vec
>>> from protloc_mex1.GO_count import GO_pre_Process, model_training, vec_create_GO
# Create dummy data
>>> data = pd.DataFrame({
... 'Entry': ['Gene1', 'Gene2', 'Gene3'],
... 'GO_BP': ['some document GO:0008150', 'some document GO:0008150;GO:0009987', 'some document GO:0009987'],
... 'GO_MF': ['some document GO:0003674', 'some document GO:0003674;GO:0003824', 'some document GO:0003824']
... })
# Preprocess data
>>> data = GO_pre_Process(data, '\[GO:\d+\]', 'GO_BP')
>>> data = GO_pre_Process(data, '\[GO:\d+\]', 'GO_MF')
# Initialize models
>>> BP_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, window=5, workers=1, dm=0, seed=0)
>>> MF_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, window=5, workers=1, dm=0, seed=0)
# Train models
>>> BP_model_trained = model_training(data, BP_model, 'GO_BP')
>>> MF_model_trained = model_training(data, MF_model, 'GO_MF')
# Generate word vectors
>>> BP_data = vec_create_GO(data, BP_model_trained, 'GO_BP')
>>> MF_data = vec_create_GO(data, MF_model_trained, 'GO_MF')
# Print word vectors
>>> print(BP_data)
>>> print(MF_data)
for using pre-training BP and MF model in this article, can see below:
## this example code and file will realse soon
classifier_evalute
## this example code and file will realse soon
SHAP_conduct
## this example code and file will realse soon
SHAP_plus
## this example code and file will realse soon
IG_calculator
## this example code and file will realse soon
Supplementary materials
All supplementary materials associated with the article can be found in the Supplementary_material folder. Furthermore, the code and detailed explanations related to the two experimental research cases mentioned in the article, Case 1 and Case 2, are available in their respective directories.
The subcellular localization classification models trained for Case 1 and Case 2 are stored separately in <Case1/Classification and feature filtering module/csae1_localization_model.pkl> and <Case2/Classification and feature filtering module/csae1_localization_model.pkl>. These models accept protein feature inputs identical to those in the demo file and generate corresponding predictions.
Citation
If this library has been helpful in your work, please cite it using the following format:
@misc{protloc_mex1,
title={{protloc_mex1}: a comprehensive pipeline for the rapid development of subcellular localization prediction and model interpretation},
author={Zeyu Luo, Rui Wang},
howpublished = {\url{https://pypi.org/project/protloc-mex1/}},
year={2023}
}
If you intend to use the SHAP method, please refer to Lundberg's work: Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In NIPS (2017). arXiv preprint arXiv:1705.07874.2017
Acknowledgments
we are acknowledge the contributions of the open-source community and the developers of the Python libraries used in this study.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for protloc_mex1-0.0.24-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 38f3136781c1c80146c50d0df5ee0a1aca53ac16be0afa41eaed87c05219b739 |
|
MD5 | 920a1c96a4c9047451678cc5d4e6764f |
|
BLAKE2b-256 | 0066d767431531daeeef455628d55b8f3bc46530452102d8306f04d9833576a3 |