These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
- Python :: 3.9

Project description

ProtLoc-Mex1

Introduction to ProtLoc-Mex1

This project offers a comprehensive pipeline for the rapid development of subcellular localization prediction and model interpretation. It encompasses 42 amino acid feature characterization algorithms and Gene Ontology (GO) feature extraction based on the Doc2Vec approach. Additionally, two random forest models for protein localization prediction are provided.

With the SHAP package wrapped into the ProtLoc-Mex1 module, users can easily apply the two random forest models above to obtain global and local explanations of the features (physicochemical characteristics of the amino acid sequence and GO annotation semantics) and of the model in protein localization prediction.
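To make the distinction between global and local explanations concrete, here is a minimal sketch. It does not call ProtLoc-Mex1 or SHAP: the SHAP values and feature names are invented placeholders, and the aggregation shown (mean absolute SHAP value per feature) is one common convention for global importance, assumed here for illustration.

```python
# Hypothetical SHAP values: one row per protein, one column per feature.
# The numbers and feature names are invented for illustration only.
feature_names = ["gravy", "aromaticity", "GO_BP_dim0"]
shap_values = [
    [0.20, -0.05, 0.10],
    [-0.40, 0.15, 0.30],
    [0.30, -0.10, -0.20],
]


def local_explanation(sample_index):
    """Per-feature contributions for a single sample (local explanation)."""
    return dict(zip(feature_names, shap_values[sample_index]))


def global_importance():
    """Mean absolute SHAP value per feature across all samples (global explanation)."""
    n = len(shap_values)
    return {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(feature_names)
    }


local_first = local_explanation(0)
global_scores = global_importance()
ranking = sorted(global_scores, key=global_scores.get, reverse=True)
print(local_first)
print(ranking)
```

A local explanation keeps the sign of each contribution for one protein, while the global score discards the sign and ranks features by average magnitude.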

Note: If you are referring to our work titled "Interpretable Feature Extraction and Dimensionality Reduction in ESM2 for Protein Localization Prediction", please focus only on the Installation, Dependencies, and AA_count sections. If you wish to explore the interpretation methods used in that work in more detail, please go directly to the GitHub repository. The content from the GO_count section onward is independent of that work.

Installation

This project's core code has been uploaded to the PyPI repository. To get it using a conda virtual environment, follow the steps below:

First, create a new conda environment. On Windows, it is recommended to use Anaconda Prompt for this task; on Linux, you can use the terminal. (You can change the environment name as needed; here we use "myenv" as an example.)

conda create -n myenv python=3.9

Then, activate the environment you just created:

conda activate myenv

Finally, use pip to install 'protloc_mex1' within this environment:

pip install protloc_mex1

Dependencies

ProtLoc-Mex1 requires Python == 3.9

Below are the Python packages required by ProtLoc-Mex1, which are automatically installed with it:

dependencies = [
        "biopython==1.79",
        "numpy==1.20.3",
        "pandas==1.4.1",
        "seaborn==0.11.2",
        "matplotlib==3.5.1",
        "shap==0.41.0",
        "gensim==4.2.0"
]

The following Python packages are also required but are not installed automatically:

dependencies = [
       "scikit-learn==1.2.2",
       "captum==0.6.0",
       "torch==1.12.1"
]

It is advised to obtain these dependent packages from their respective official sources, while carefully considering the implications of version compatibility.
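One way to check version compatibility before using the package is to compare the installed versions against the pins listed above. The following is a minimal sketch using only the standard library; the helper name check_versions is hypothetical and not part of ProtLoc-Mex1.

```python
from importlib import metadata

# Pins listed in this README for the packages that are not installed automatically.
EXPECTED = {"scikit-learn": "1.2.2", "captum": "0.6.0", "torch": "1.12.1"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Some builds carry a local suffix such as "+cu113"; ignore it when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_versions(EXPECTED).items():
    print(f"{name}: installed={installed}, matches README pin={ok}")
```

A package reported as None is not installed in the active environment; a mismatch does not necessarily mean the package will fail, only that it differs from the version listed here.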

How to use ProtLoc-Mex1

ProtLoc-Mex1 includes 6 modules: AA_count, GO_count, classifier_evalute, SHAP_conduct, SHAP_plus, and IG_calculator.

AA_count

This module performs sequence analysis. AA_count includes three functions, dna_sequence_conduct(), rna_sequence_conduct(), and protein_sequence_conduct(), which process DNA, RNA, and protein sequences, respectively, in a given DataFrame df.

The dna_sequence_conduct() and rna_sequence_conduct() functions:

- Filter out illegal bases in the DNA/RNA sequences in the DataFrame.
- Calculate and add columns for the frequency of carbon, hydrogen, nitrogen, and oxygen elements in the DNA/RNA sequences.
- Calculate and add columns for the frequency of each base (A, T, C, G for DNA and A, U, C, G for RNA) in the DNA/RNA sequences.
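The filtering and base-frequency steps can be sketched in plain Python as follows. This is an illustration, not the package's implementation: the helper names are hypothetical, and it assumes that "filtering" means dropping sequences that contain characters outside the alphabet.

```python
from collections import Counter

DNA_BASES = "ATCG"  # use "AUCG" for RNA


def is_legal(seq, alphabet=DNA_BASES):
    """True if the sequence is non-empty and uses only bases from the alphabet."""
    return len(seq) > 0 and set(seq) <= set(alphabet)


def base_frequency(seq, alphabet=DNA_BASES):
    """Fraction of each base in the sequence."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in alphabet}


sequences = ["ATCGTGCA", "TGCTAGCT", "ATXGN"]  # the last one has illegal characters
legal = [s for s in sequences if is_legal(s)]
freqs = [base_frequency(s) for s in legal]

for seq, freq in zip(legal, freqs):
    print(seq, freq)
```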

The protein_sequence_conduct() function:

- Adds new columns for the 20 standard amino acids, elemental composition, and protein properties: molecular weight, aromaticity, instability index, flexibility, isoelectric point, secondary structure fraction, molar extinction coefficient (with reduced cysteines and with disulfide bridges), and GRAVY (grand average of hydropathy).
- Calculates and adds the frequency of each amino acid in the protein sequences.
- For sequences containing only the 20 standard amino acids, calculates and adds the values of the protein properties listed above.
- Calculates and adds the count of each element (oxygen, sulfur, carbon, hydrogen, nitrogen) and the counts for the acidic, basic, polar, and non-polar amino acid classes in the protein sequences.
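Two of these quantities, amino acid frequency and GRAVY, are simple enough to sketch in plain Python. The code below is an illustration under stated assumptions rather than the package's implementation: the function names are hypothetical, and GRAVY is computed as the mean Kyte-Doolittle hydropathy, the scale conventionally used for this index.

```python
from collections import Counter

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aa_frequency(seq):
    """Fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in STANDARD_AA}


def is_standard(seq):
    """True if the sequence is non-empty and uses only the 20 standard amino acids."""
    return len(seq) > 0 and set(seq) <= set(STANDARD_AA)


def gravy(seq):
    """Mean hydropathy; None when the sequence contains non-standard residues."""
    if not is_standard(seq):
        return None
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


example = "ACDEFGHIKLMNPQRSTVWY"
print(aa_frequency(example)["A"])
print(gravy(example))
print(gravy("ACDX"))  # contains a non-standard residue
```

Returning None for non-standard sequences mirrors the rule above that the property values are only computed for sequences made up of the 20 standard amino acids.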

Example usage of AA_count:

>>> import pandas as pd
>>> from protloc_mex1.AA_count import dna_sequence_conduct, rna_sequence_conduct, protein_sequence_conduct

# Example DataFrame with DNA sequences
>>> df_dna = pd.DataFrame({
...     'gene_seq': ['ATCGTGCA', 'TGCTAGCT', 'CTAGCTAG']
... })

# Call the dna_sequence_conduct function
>>> df_dna_processed = dna_sequence_conduct(df_dna, 'gene_seq')
>>> print(df_dna_processed)

# Assume you have a DataFrame containing RNA sequences
>>> df_rna = pd.DataFrame({
...     'gene_seq': ['AUCGUGCA', 'UGCUAGCU', 'CUAGCUAG']
... })

# Call the rna_sequence_conduct function
>>> df_rna_processed = rna_sequence_conduct(df_rna, 'gene_seq')
>>> print(df_rna_processed)

# Assume you have a DataFrame containing protein sequences
>>> df_protein = pd.DataFrame({
...     'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
... })

# Call the protein_sequence_conduct function
>>> df_protein_processed = protein_sequence_conduct(df_protein, 'Sequence')
>>> print(df_protein_processed)

GO_count

GO_count uses Doc2Vec to obtain GO representations. To initialize and train a model, see the example below:

>>> import pandas as pd
>>> import gensim
>>> from gensim.models.doc2vec import Doc2Vec
>>> from protloc_mex1.GO_count import GO_pre_Process, model_training, vec_create_GO

# Create dummy data
>>> data = pd.DataFrame({
...     'Entry': ['Gene1', 'Gene2', 'Gene3'],
...     'GO_BP': ['some document GO:0008150', 'some document GO:0008150;GO:0009987', 'some document GO:0009987'],
...     'GO_MF': ['some document GO:0003674', 'some document GO:0003674;GO:0003824', 'some document GO:0003824']
... })

# Preprocess data
>>> data = GO_pre_Process(data, r'\[GO:\d+\]', 'GO_BP')
>>> data = GO_pre_Process(data, r'\[GO:\d+\]', 'GO_MF')

# Initialize models
>>> BP_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, window=5, workers=1, dm=0, seed=0)
>>> MF_model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40, window=5, workers=1, dm=0, seed=0)

# Train models
>>> BP_model_trained = model_training(data, BP_model, 'GO_BP')
>>> MF_model_trained = model_training(data, MF_model, 'GO_MF')

# Generate word vectors
>>> BP_data = vec_create_GO(data, BP_model_trained, 'GO_BP')
>>> MF_data = vec_create_GO(data, MF_model_trained, 'GO_MF')

# Print word vectors
>>> print(BP_data)
>>> print(MF_data)
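The pattern passed to GO_pre_Process in the example above matches GO identifiers wrapped in square brackets, such as [GO:0008150]. The sketch below uses only the standard re module to show what that pattern does and does not match. It does not reproduce GO_pre_Process, whose internals are not described here; the helper name and the annotation strings are hypothetical.

```python
import re

# Same pattern as in the example above, written as a raw string.
GO_PATTERN = re.compile(r"\[GO:\d+\]")


def extract_go_ids(annotation):
    """Return the GO IDs that appear in bracketed form, without the brackets."""
    return [match.strip("[]") for match in GO_PATTERN.findall(annotation)]


bracketed = "cell cycle [GO:0007049]; mitotic cell cycle [GO:0000278]"
unbracketed = "some document GO:0008150"

print(extract_go_ids(bracketed))
print(extract_go_ids(unbracketed))  # no brackets, so the pattern finds nothing
```

If your annotation column stores GO IDs without brackets, the pattern would need to be adjusted accordingly.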

To use the pre-trained BP and MF models from this article, see below:

## This example code and file will be released soon

classifier_evalute

## This example code and file will be released soon

SHAP_conduct

## This example code and file will be released soon

SHAP_plus

## This example code and file will be released soon

IG_calculator

## This example code and file will be released soon

Supplementary materials

All supplementary materials associated with the article can be found in the Supplementary_material folder. Furthermore, the code and detailed explanations related to the two experimental research cases mentioned in the article, Case 1 and Case 2, are available in their respective directories.

The subcellular localization classification models trained for Case 1 and Case 2 are stored separately in <Case1/Classification and feature filtering module/csae1_localization_model.pkl> and <Case2/Classification and feature filtering module/csae1_localization_model.pkl>. These models accept protein feature inputs identical to those in the demo file and generate corresponding predictions.
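A .pkl model file of this kind can typically be loaded as sketched below. This is a hedged illustration: it assumes the files were written with Python's pickle module (a file saved with joblib would need joblib.load instead), and it uses a placeholder object in a temporary file so that the snippet runs without the real model. The helper name load_model is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Load a pickled object from disk. Only unpickle files from sources you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Stand-in object so the sketch runs without the real model file.
placeholder = {"kind": "placeholder, not a real classifier"}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_model.pkl"
    with demo_path.open("wb") as fh:
        pickle.dump(placeholder, fh)
    loaded = load_model(demo_path)

print(loaded == placeholder)
```

When loading the real models, it is safest to use the scikit-learn version listed under Dependencies, since pickled estimators are not guaranteed to load across versions.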

Citation

If this library has been helpful in your work, please cite it using the following format:

@misc{protloc_mex1,
    title={{protloc_mex1}: a comprehensive pipeline for the rapid development of subcellular localization prediction and model interpretation},
    author={Zeyu Luo and Rui Wang},
    howpublished = {\url{https://pypi.org/project/protloc-mex1/}},
    year={2023}
}

If you intend to use the SHAP method, please refer to Lundberg's work: Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems (NIPS), 2017. arXiv:1705.07874.

Acknowledgments

We acknowledge the contributions of the open-source community and the developers of the Python libraries used in this study.

Release history

This version

0.0.24

Jun 24, 2024

0.0.23

Jun 14, 2024

0.0.22

Jun 1, 2024

0.0.21

Jan 24, 2024

0.0.20

Jan 22, 2024

0.0.19

Jan 22, 2024

0.0.18

Jan 22, 2024

0.0.17

Jan 21, 2024

0.0.16

Oct 17, 2023

0.0.15

Oct 2, 2023

0.0.14

Sep 11, 2023

0.0.13

Sep 9, 2023

0.0.12

Jul 14, 2023

0.0.11

Jul 14, 2023

0.0.10

Jul 14, 2023

0.0.9

Jul 14, 2023

0.0.8

Jul 7, 2023

0.0.7

Jun 12, 2023

0.0.6

May 29, 2023

0.0.5

May 15, 2023

0.0.4

May 10, 2023

0.0.3

Apr 29, 2023

0.0.2

Apr 28, 2023

0.0.1

Apr 28, 2023

0.0.0

Apr 26, 2023

Download files

Source Distribution

protloc_mex1-0.0.24.tar.gz (29.5 kB)

Uploaded Jun 24, 2024 Source

Built Distribution

protloc_mex1-0.0.24-py3-none-any.whl (28.3 kB)

Uploaded Jun 24, 2024 Python 3

Hashes for protloc_mex1-0.0.24.tar.gz
Algorithm	Hash digest
SHA256	`f1bc776c021de46844c80e51d1c0b76570ba2a39bdc5ae365c18123e30902b9f`
MD5	`60080a48870ec58ac353c1a3afb9ce66`
BLAKE2b-256	`a09160259656ae100e2aed62e8a847ca1414728febc1022602ca4f1d4dd7a288`

Hashes for protloc_mex1-0.0.24-py3-none-any.whl
Algorithm	Hash digest
SHA256	`38f3136781c1c80146c50d0df5ee0a1aca53ac16be0afa41eaed87c05219b739`
MD5	`920a1c96a4c9047451678cc5d4e6764f`
BLAKE2b-256	`0066d767431531daeeef455628d55b8f3bc46530452102d8306f04d9833576a3`