ProtLoc-Mex1

Introduction

This project offers a comprehensive pipeline for the rapid development of subcellular localization prediction and model interpretation. It encompasses 42 amino acid feature characterization algorithms and Gene Ontology (GO) feature extraction based on the Doc2Vec approach. Additionally, two random forest models for protein localization prediction are provided.

With the SHAP package wrapped into the ProtLoc-Mex1 module, users can easily apply the two random forest models above to obtain global and local explanations of both the features (physicochemical characteristics of the amino acid sequence and GO annotation semantics) and the model in protein localization prediction.

Note: If you are referring to our work titled "Interpretable Feature Extraction and Dimensionality Reduction in ESM2 for Protein Localization Prediction", please focus only on the Installation, Dependencies, and AA_count sections. If you wish to explore the interpretation methods used in that work in more detail, please go directly to the GitHub repository. The content from the GO_count section onward is independent of that work.

Installation

This project's core code has been uploaded to the PyPI repository. To get it using a conda virtual environment, follow the steps below:

First, create a new conda environment. On Windows, it is recommended to use Conda Prompt for this task; on Linux, you can use the Terminal. You can change the environment name as needed; here, we use "myenv" as an example:

conda create -n myenv python=3.9

Then, activate the environment you just created:

conda activate myenv

Finally, use pip to install 'protloc_mex1' within this environment:

pip install protloc_mex1

Dependencies

ProtLoc-Mex1 requires Python == 3.9

Below are the Python packages required by ProtLoc-Mex1, which are automatically installed with it:

dependencies = [
        "biopython==1.79",
        "numpy==1.20.3",
        "pandas==1.4.1",
        "seaborn==0.11.2",
        "matplotlib==3.5.1",
        "shap==0.41.0",
        "gensim==4.2.0"
]

The following Python packages are also required but are not installed automatically:

dependencies = [
       "scikit-learn==1.2.2",
       "captum==0.6.0",
       "torch==1.12.1"
]

It is advised to obtain these dependent packages from their respective official sources, while carefully considering the implications of version compatibility.
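To check an existing environment against the pinned versions listed above, a small script along the following lines may help. It is a minimal sketch using only the standard library; the function name check_pins is hypothetical and is not part of ProtLoc-Mex1.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the two dependency lists above.
PINNED = {
    "biopython": "1.79",
    "numpy": "1.20.3",
    "pandas": "1.4.1",
    "seaborn": "0.11.2",
    "matplotlib": "3.5.1",
    "shap": "0.41.0",
    "gensim": "4.2.0",
    "scikit-learn": "1.2.2",
    "captum": "0.6.0",
    "torch": "1.12.1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    Hypothetical helper for illustration; not part of protloc_mex1.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

The comparison is an exact string match, so a build such as a torch wheel with a local version suffix will be reported as a mismatch even when the base version agrees.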

How to use ProtLoc-Mex1

ProtLoc-Mex1 includes 6 modules: AA_count, GO_count, classifier_evalute, SHAP_conduct, SHAP_plus, and IG_calculator.

AA_count

This module performs sequence analysis. AA_count includes three functions, dna_sequence_conduct(), rna_sequence_conduct(), and protein_sequence_conduct(), which process the DNA, RNA, and protein sequences, respectively, in a given DataFrame df.

The dna_sequence_conduct() and rna_sequence_conduct() functions:

  1. Filter out illegal bases in the DNA/RNA sequences in the DataFrame.
  2. Calculate and add columns for the frequency of carbon, hydrogen, nitrogen, and oxygen elements in the DNA/RNA sequences.
  3. Calculate and add columns for the frequency of each base (A, T, C, G for DNA and A, U, C, G for RNA) in the DNA/RNA sequences.
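The filtering and base-frequency steps above can be illustrated on a single sequence as follows. This is a sketch under two assumptions, not the package's implementation: "frequency" is taken to mean each base's share of the valid bases, and "filtering" is taken to mean dropping characters outside the alphabet. The name base_frequencies is hypothetical.

```python
from collections import Counter

DNA_BASES = "ATCG"
RNA_BASES = "AUCG"


def base_frequencies(sequence, alphabet=DNA_BASES):
    """Drop characters outside `alphabet`, then return each base's share.

    Hypothetical helper; the real dna_sequence_conduct() and
    rna_sequence_conduct() may filter and normalise differently.
    """
    cleaned = [base for base in sequence.upper() if base in alphabet]
    counts = Counter(cleaned)
    total = len(cleaned)
    # Guard against division by zero when nothing valid remains.
    return {base: (counts[base] / total if total else 0.0) for base in alphabet}


print(base_frequencies("ATCGTGCA"))
print(base_frequencies("AUCGNNUGCA", RNA_BASES))  # the two N characters are ignored
```

The element frequencies (carbon, hydrogen, nitrogen, oxygen) are left out of this sketch because they depend on which molecular formula is assumed for each nucleotide.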

The protein_sequence_conduct() function:

  1. Adds new columns for 20 standard amino acids, elemental properties, and various protein properties including molecular weight, aromaticity, instability index, flexibility, isoelectric point, secondary structure fraction, molar extinction coefficient for reduced cysteines and disulfide bridges, and gravy.
  2. Calculates and adds the frequency of each amino acid in the protein sequences.
  3. For sequences containing only 20 standard amino acids, it calculates and adds values for molecular weight, aromaticity, instability index, flexibility, isoelectric point, secondary structure fraction, molar extinction coefficient for reduced cysteines and disulfide bridges, and gravy.
  4. Calculates and adds the count of each element (oxygen, sulfur, carbon, hydrogen, nitrogen) and acidic, basic, polar, and non-polar properties in the protein sequences.
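As a rough illustration of steps 2 and 3, the sketch below computes amino acid frequencies for any sequence but computes GRAVY only when the sequence consists solely of the 20 standard amino acids. It uses the standard library and the Kyte-Doolittle hydropathy scale; the function names are hypothetical, and how protein_sequence_conduct() computes these values internally is not shown in this document.

```python
from collections import Counter

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, the scale conventionally used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aa_frequencies(sequence):
    """Share of each standard amino acid; the denominator is the full length."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: (counts[aa] / total if total else 0.0) for aa in STANDARD_AA}


def is_standard(sequence):
    """True for a non-empty sequence made only of the 20 standard residues."""
    return len(sequence) > 0 and set(sequence) <= set(STANDARD_AA)


def gravy(sequence):
    """Mean hydropathy, or None when non-standard residues are present."""
    if not is_standard(sequence):
        return None
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


seq = "ACDEFGHIKLMNPQRSTVWY"
print(aa_frequencies(seq)["A"])
print(gravy(seq))
print(gravy("ACDX"))  # X is not a standard residue, so no value is computed
```

Returning None for non-standard sequences mirrors the restriction in step 3; whether the package leaves such cells empty or fills them with another marker is an assumption here.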

Example usage of AA_count:

>>> import pandas as pd
>>> from protloc_mex1.AA_count import dna_sequence_conduct, rna_sequence_conduct, protein_sequence_conduct

# Example DataFrame with DNA sequences
>>> df_dna = pd.DataFrame({
...     'gene_seq': ['ATCGTGCA', 'TGCTAGCT', 'CTAGCTAG']
... })

# Call the dna_sequence_conduct function
>>> df_dna_processed = dna_sequence_conduct(df_dna, 'gene_seq')
>>> print(df_dna_processed)

# Assume you have a DataFrame containing RNA sequences
>>> df_rna = pd.DataFrame({
...     'gene_seq': ['AUCGUGCA', 'UGCUAGCU', 'CUAGCUAG']
... })

# Call the rna_sequence_conduct function
>>> df_rna_processed = rna_sequence_conduct(df_rna, 'gene_seq')
>>> print(df_rna_processed)

# Assume you have a DataFrame containing protein sequences
>>> df_protein = pd.DataFrame({
...     'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
... })

# Call the protein_sequence_conduct function
>>> df_protein_processed = protein_sequence_conduct(df_protein, 'Sequence')
>>> print(df_protein_processed)

GO_count

GO_count uses Doc2Vec to obtain vector representations of GO annotations. To initialize and train the models, see below:

>>> import gensim
>>> import pandas as pd
>>> from gensim.models.doc2vec import Doc2Vec
>>> from protloc_mex1.GO_count import GO_pre_Process, model_training, vec_create_GO

# Create dummy data
>>> data = pd.DataFrame({
...     'Entry': ['Gene1', 'Gene2', 'Gene3'],
...     'GO_BP': ['some document GO:0008150', 'some document GO:0008150;GO:0009987', 'some document GO:0009987'],
...     'GO_MF': ['some document GO:0003674', 'some document GO:0003674;GO:0003824', 'some document GO:0003824']
... })

# Preprocess data
>>> data = GO_pre_Process(data, r'\[GO:\d+\]', 'GO_BP')
>>> data = GO_pre_Process(data, r'\[GO:\d+\]', 'GO_MF')

# Initialize models
>>> BP_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, window=5, workers=1, dm=0, seed=0)
>>> MF_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, window=5, workers=1, dm=0, seed=0)

# Train models
>>> BP_model_trained = model_training(data, BP_model, 'GO_BP')
>>> MF_model_trained = model_training(data, MF_model, 'GO_MF')

# Generate word vectors
>>> BP_data = vec_create_GO(data, BP_model_trained, 'GO_BP')
>>> MF_data = vec_create_GO(data, MF_model_trained, 'GO_MF')

# Print word vectors
>>> print(BP_data)
>>> print(MF_data)
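The pattern passed to GO_pre_Process in the example above is an ordinary regular expression. The sketch below shows only what that pattern matches, using Python's re module; how GO_pre_Process applies it internally is not shown in this document, and the helper name extract_go_ids and the sample annotation string are hypothetical.

```python
import re

# Same pattern as in the example above, written as a raw string.
GO_ID_PATTERN = r"\[GO:\d+\]"


def extract_go_ids(annotation):
    """Return every bracketed GO identifier found in the annotation text."""
    return re.findall(GO_ID_PATTERN, annotation)


# Hypothetical annotation text with bracketed identifiers.
example = "biological_process [GO:0008150]; cellular process [GO:0009987]"
print(extract_go_ids(example))  # ['[GO:0008150]', '[GO:0009987]']
```

Note that the pattern requires the square brackets, so an unbracketed identifier such as GO:0008150 is not matched by it.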

To use the pre-trained BP and MF models from this article, see below:

## This example code and file will be released soon

classifier_evalute

## This example code and file will be released soon

SHAP_conduct

## This example code and file will be released soon

SHAP_plus

## This example code and file will be released soon

IG_calculator

## This example code and file will be released soon

Supplementary materials

All supplementary materials associated with the article can be found in the Supplementary_material folder. Furthermore, the code and detailed explanations related to the two experimental research cases mentioned in the article, Case 1 and Case 2, are available in their respective directories.

The subcellular localization classification models trained for Case 1 and Case 2 are stored separately in <Case1/Classification and feature filtering module/csae1_localization_model.pkl> and <Case2/Classification and feature filtering module/csae1_localization_model.pkl>. These models accept protein feature inputs identical to those in the demo file and generate corresponding predictions.

Citation

If this library has been helpful in your work, please cite it using the following format:

@misc{protloc_mex1,
    title={{protloc_mex1}: a comprehensive pipeline for the rapid development of subcellular localization prediction and model interpretation},
    author={Zeyu Luo and Rui Wang},
    howpublished = {\url{https://pypi.org/project/protloc-mex1/}},
    year={2023}
}

If you intend to use the SHAP method, please refer to Lundberg's work: Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems (NIPS), 2017. arXiv:1705.07874.

Acknowledgments

We acknowledge the contributions of the open-source community and the developers of the Python libraries used in this study.
