
ProtLoc-mex_X

Introduction to ProtLoc-mex_X

protloc_mex_X integrates two modules: ESM2_fr and feature_correlation. ESM2_fr is based on the ESM2 model (the ESM2_650m checkpoint is supported) and extracts feature representations from protein sequences, including 'cls', 'eos', 'mean', 'segment_mean', and 'pho'. The feature_correlation module provides Spearman correlation analysis, letting users visualize correlation heatmaps and run feature crossover regression analysis. This helps users explore the relationships between data features and identify the features that are relevant to a target feature.
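To make the Spearman correlation mentioned above concrete, here is a minimal, dependency-free sketch: it ranks both variables (ties get the mean of their ranks) and computes the Pearson correlation of the ranks. The helper names and the sample values are illustrative assumptions and are not part of the feature_correlation API.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up feature and target values; the relationship is monotonic but not linear.
feature = [0.1, 0.4, 0.35, 0.8]
target = [1.0, 8.0, 2.0, 50.0]
rho = spearman(feature, target)
```

Because only the ordering matters, a monotonic but non-linear relationship such as the one above still yields a coefficient of (approximately) 1.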

Installation

This project's core code has been uploaded to the PyPI repository. To get it using a conda virtual environment, follow the steps below:

First, create a new conda environment. On Windows, Conda Prompt is recommended for this task; on Linux, you can use the Terminal. The environment name can be changed as needed; "myenv" is used here as an example:

conda create -n myenv python=3.10

Then, activate the environment you just created:

conda activate myenv

Finally, use pip to install 'protloc_mex_X' within this environment:

pip install protloc_mex_X

Dependencies

ProtLoc-mex_X requires Python 3.9 or 3.10.

Below are the Python packages required by ProtLoc-mex_X, which are automatically installed with it:

dependencies = [
        "numpy >=1.20.3",
        "pandas >=1.4.1",
        "seaborn >=0.11.2",
        "matplotlib >=3.5.1"
]

The following Python packages are also required but are not installed automatically:

dependencies = [
       "torch ==1.12.1",
       "tqdm ==4.63.0",
       "re ==2.2.1",
       "sklearn ==1.0.2",
       "transformers ==4.26.1"
]

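It is advised to obtain these packages from their respective official sources and to check version compatibility carefully. Note that re is part of the Python standard library and needs no separate installation, and that sklearn is published on PyPI as scikit-learn.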
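As a quick sanity check of an environment against the versions listed above, something like the following sketch can be used. It relies only on the standard library (importlib.metadata); the helper names are hypothetical, and the pins are simply copied from the list above, using PyPI distribution names.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

SUPPORTED_PYTHON = {(3, 9), (3, 10)}

# Pins copied from the list above, keyed by PyPI distribution name
# ("scikit-learn" rather than "sklearn"; the stdlib module "re" is left out).
PINNED = {
    "torch": "1.12.1",
    "tqdm": "4.63.0",
    "scikit-learn": "1.0.2",
    "transformers": "4.26.1",
}


def python_supported(version_info=sys.version_info):
    """True if the interpreter's major.minor version is 3.9 or 3.10."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def report(pins=PINNED):
    """Return {name: (wanted, found)} for every pin that is missing or differs."""
    mismatches = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    print("Python supported:", python_supported())
    for name, (wanted, found) in report().items():
        print(f"{name}: want {wanted}, found {found}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-specific torch wheel) is reported as a mismatch even when the base version agrees.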

How to use ProtLoc-mex_X

ProtLoc-mex_X includes two modules: ESM2_fr and feature_correlation.

ESM2_fr

ESM2_fr is a pre-trained deep learning model based on the ESM2 model. It is capable of extracting representation features from protein sequences and further optimizing the feature representation through weighted averaging.

It contains one class and three functions. The class is named Esm2LastHiddenFeatureExtractor, which includes the following three methods: get_last_hidden_features_combine(), get_last_hidden_phosphorylation_position_feature(), and get_amino_acid_representation(). The functions present in the code are get_last_hidden_features_single(), NetPhos_classic_txt_DataFrame(), and phospho_feature_sim_cosine_weighted_average().

Function get_last_hidden_features_single()

The get_last_hidden_features_single() function extracts different types of representation features from the input protein sequences. It accepts protein sequence data X_input, along with the model tokenizer and model, and returns a DataFrame containing the extracted features. Note: only single-batch inputs are supported.

Class Esm2LastHiddenFeatureExtractor()

The Esm2LastHiddenFeatureExtractor() class is used for extracting various types of representation features from protein sequences. It accepts amino acid sequence input, invokes the pre-trained ESM2 model, and obtains pre-trained representation vectors ('cls', 'eos', 'mean', 'segment_mean', 'pho').
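The relationship between these representation types can be sketched on a plain per-token matrix. This is only an illustration under stated assumptions: the matrix is laid out as [cls token, residue 1 .. residue L, eos token], 'segment_mean' is taken to mean the average over contiguous, near-equal chunks of the residues, and the n_segments parameter is hypothetical. The package's own pooling may differ in its details.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pool_last_hidden(hidden, n_segments=2):
    """Pool a per-token matrix assumed to be [<cls>, residue_1..residue_L, <eos>]."""
    cls_vec, eos_vec = hidden[0], hidden[-1]
    residues = hidden[1:-1]
    if not 1 <= n_segments <= len(residues):
        raise ValueError("n_segments must be between 1 and the number of residues")
    # Boundaries of n_segments contiguous chunks of near-equal size.
    bounds = [i * len(residues) // n_segments for i in range(n_segments + 1)]
    segments = [mean_vector(residues[a:b]) for a, b in zip(bounds, bounds[1:])]
    return {
        "cls": cls_vec,
        "eos": eos_vec,
        "mean": mean_vector(residues),
        "segment_mean": segments,
    }


# Toy 2-dimensional "hidden states" for a 4-residue sequence (made-up numbers).
hidden = [[9, 9], [1, 2], [3, 4], [5, 6], [7, 8], [0, 0]]
pooled = pool_last_hidden(hidden, n_segments=2)
```

With the 650M ESM2 checkpoint each of these vectors would have 1280 dimensions, the value used for dim in the example further down.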

The get_last_hidden_features_combine() method serves the same purpose as get_last_hidden_features_single(), but it handles multiple batches of input data. It takes protein sequence data X_input as input and returns a DataFrame containing the combined features extracted from all batches of protein sequences.
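The batch-then-combine idea can be sketched independently of the model. In this hedged example, extract_batch is a hypothetical stand-in for whatever processes a single batch; here it is replaced by a toy function so the sketch runs without ESM2.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_in_batches(sequences, extract_batch, batch_size=1):
    """Apply extract_batch to each batch and concatenate the resulting rows."""
    rows = []
    for batch in iter_batches(sequences, batch_size):
        rows.extend(extract_batch(batch))
    return rows


def toy_extract(batch):
    """Stand-in extractor: one row per sequence, with its length as the only feature."""
    return [{"Sequence": seq, "length": len(seq)} for seq in batch]


rows = extract_in_batches(["AC", "DEF", "G"], toy_extract, batch_size=2)
```

A smaller batch_size lowers peak memory use at the cost of more forward passes, which is presumably why the examples in this document use values of 1 and 2.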

The get_last_hidden_phosphorylation_position_feature() method extracts phosphorylation representation features from the input protein sequences. It takes protein sequence data X_input and returns a DataFrame containing phosphorylation representation features.

The get_amino_acid_representation() method calculates representation features for a specific amino acid at a given position in a protein sequence. Its main purpose is to support the characterization of phosphorylation sites.
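Before looking up a representation for a position, it can help to confirm that the position really points at a phosphorylatable residue. The sketch below assumes 1-based positions (as reported by NetPhos) and a per-token matrix with one leading cls token, so that the token row index equals the 1-based position; both assumptions, and all helper names, are illustrative rather than taken from the package.

```python
# NetPhos predicts phosphorylation at serine, threonine, and tyrosine residues.
PHOSPHO_RESIDUES = {"S", "T", "Y"}


def residue_at(sequence, position):
    """Return the residue at a 1-based position, raising if it is out of range."""
    if not 1 <= position <= len(sequence):
        raise IndexError(f"position {position} is outside 1..{len(sequence)}")
    return sequence[position - 1]


def token_index(position, has_cls_token=True):
    """Map a 1-based residue position to its row in a per-token matrix."""
    return position if has_cls_token else position - 1


def is_candidate_site(sequence, position):
    """True if the residue at the 1-based position is S, T, or Y."""
    return residue_at(sequence, position) in PHOSPHO_RESIDUES


sequence = "ACDEFGHIKLMNPQRSTVWY"
site_residue = residue_at(sequence, 16)
```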

Function NetPhos_classic_txt_DataFrame()

The NetPhos_classic_txt_DataFrame() function is designed to extract sequence information from the provided text data, which is derived from NetPhos (https://services.healthtech.dtu.dk/services/NetPhos-3.1/), and then it returns the extracted data in the form of a DataFrame.
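The kind of parsing involved can be illustrated with a regular expression over NetPhos-style text. The sample below is hand-written to resemble the classic output layout (name, position, residue, context, score, kinase, answer) and is not real NetPhos output; the field names other than 'Sequence' and 'position' are assumptions, and the sketch returns plain dictionaries rather than a DataFrame.

```python
import re

# Hand-written lines imitating the classic layout; not real predictions.
SAMPLE = """\
# Sequence   # x   Context     Score   Kinase    Answer
# -------------------------------------------------------
# seq1       4 S   -MKPSAGTL   0.912   unsp       YES
# seq1       9 T   AGTLTPQRS   0.301   PKC         .
# seq2      12 Y   AEIDYVNLK   0.644   SRC        YES
"""

WS = r"[ \t]+"  # whitespace within a line only, so matches never span lines
LINE = re.compile(
    rf"^#{WS}(?P<Sequence>\S+){WS}(?P<position>\d+){WS}(?P<residue>[STY]){WS}"
    rf"(?P<context>\S+){WS}(?P<score>\d*\.\d+){WS}(?P<kinase>\S+){WS}YES[ \t]*$",
    re.MULTILINE,
)


def parse_positive_sites(text):
    """Return one dict per line whose answer column is YES."""
    rows = []
    for match in LINE.finditer(text):
        row = match.groupdict()
        row["position"] = int(row["position"])  # integer positions are needed downstream
        row["score"] = float(row["score"])
        rows.append(row)
    return rows


sites = parse_positive_sites(SAMPLE)
```

The simpler pattern r".*YES" used in the example further down selects the same positive lines; the named groups here only make the individual columns explicit.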

Function phospho_feature_sim_cosine_weighted_average()

The phospho_feature_sim_cosine_weighted_average() function calculates the weighted average of phosphorylation features for protein sequences and returns the input DataFrame updated with weighted average values, which provide a characterization of the entire amino acid sequence's phosphorylation pattern.
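One plausible reading of the function name is sketched below: each phosphorylation-site vector is weighted by its cosine similarity to a reference vector (for example the 'cls' vector), and the weighted vectors are averaged. Normalizing the similarities with a softmax is an assumption made here so that the weights are positive and sum to 1; the weighting actually used by the package is not described in this document and may differ.

```python
from math import exp, sqrt


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax: positive weights that sum to 1."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_weighted_average(reference, site_vectors):
    """Average site_vectors, weighting each by its similarity to reference."""
    weights = softmax([cosine(reference, v) for v in site_vectors])
    dim = len(reference)
    return [sum(w * v[d] for w, v in zip(weights, site_vectors)) for d in range(dim)]


# Toy 2-dimensional vectors: the first site is aligned with the reference,
# the second is orthogonal to it, so the first receives the larger weight.
reference = [1.0, 0.0]
sites = [[1.0, 0.0], [0.0, 1.0]]
averaged = similarity_weighted_average(reference, sites)
```

The result is a single vector per protein, regardless of how many predicted sites it has, which is what allows it to sit next to the 'cls' features in one table.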

Example of using ESM2_fr:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
from protloc_mex_X.ESM2_fr import Esm2LastHiddenFeatureExtractor, get_last_hidden_features_single, phospho_feature_sim_cosine_weighted_average

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
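# output_hidden_states=True makes the model return its hidden states, as in the pho example.
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D", output_hidden_states=True)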

protein_sequence_df = pd.DataFrame({
    'Entry' : ['protein1','protein2'],
    'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
})

feature_extractor = Esm2LastHiddenFeatureExtractor(tokenizer, model,
                                                   compute_cls=True, compute_eos=True, compute_mean=True, compute_segments=True)

human_df_represent = feature_extractor.get_last_hidden_features_combine(protein_sequence_df, sequence_name='Sequence', batch_size= 1)
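The returned DataFrame holds one column per feature dimension, and the pho example selects such columns by name with patterns like ESM2_cls\d+. The sketch below shows that selection on a plain list of column names. Only the 'ESM2_clsX' naming is taken from this document; the other column names are made up for illustration.

```python
import re


def columns_with_prefix(columns, prefix):
    """Return the columns named exactly <prefix><digits>, in their original order."""
    pattern = re.compile(rf"{re.escape(prefix)}\d+")
    return [col for col in columns if pattern.fullmatch(col)]


# Hypothetical column names; only the ESM2_cls<number> form appears in this document.
columns = ["Entry", "ESM2_cls0", "ESM2_cls1", "ESM2_eos0", "ESM2_mean0", "ESM2_cls1_note"]
cls_columns = columns_with_prefix(columns, "ESM2_cls")
```

Using fullmatch instead of match rejects names that merely start with the pattern, such as the made-up 'ESM2_cls1_note' above.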

Example for pho feature representation:

import os
import protloc_mex_X
from protloc_mex_X.ESM2_fr import NetPhos_classic_txt_DataFrame
import random
import re

example_data = os.path.join(protloc_mex_X.__path__[0], "examples", "test1.txt")
# example_data is NetPhos output generated from protein sequences.

with open(example_data, "r") as f:
    data = f.read()
# print(data)
pattern = r".*YES"

result_df = NetPhos_classic_txt_DataFrame(pattern, data)
result_df.loc[:,'Entry']=result_df.loc[:,'Sequence']

"""
The sequences generated below are random and serve only as an example.
In real scenarios, use the actual full-length protein sequences, because NetPhos only reports
a short sequence fragment around each phosphorylation site, not the complete sequence.
The 'position' column in result_df also needs to be converted to an integer type.
"""

# Function to generate random amino acid sequence with a minimum length
def generate_random_sequence(min_length):
    amino_acids = 'ACDEFGHIKLMNPQRSTVWY'  # 20 standard amino acids
    return ''.join(random.choice(amino_acids) for _ in range(min_length))

# Convert 'position' to int (see the note above), then find the max position for each unique Entry
result_df['position'] = result_df['position'].astype(int)
max_positions = result_df.groupby('Entry')['position'].max()

# Generate a sequence for each Entry based on its max position
generated_sequences = {entry: generate_random_sequence(pos) for entry, pos in max_positions.items()}

# Update the 'Sequence' column
def update_sequence(row):
    entry = row['Entry']
    return generated_sequences[entry]

result_df['Sequence'] = result_df.apply(update_sequence, axis=1)


from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
from protloc_mex_X.ESM2_fr import Esm2LastHiddenFeatureExtractor, phospho_feature_sim_cosine_weighted_average

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D",output_hidden_states=True)

protein_sequence_df = pd.DataFrame({
    'Entry' : ['seq1','seq2'],
    'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
})

feature_extractor = Esm2LastHiddenFeatureExtractor(tokenizer, model,
                                                   compute_cls=False, compute_eos=False, 
                                                   compute_mean=False, compute_segments=False)

Netphos_df_represent = feature_extractor.get_last_hidden_phosphorylation_position_feature(result_df, sequence_name='Sequence', phosphorylation_positions='position', batch_size=2)
human_df_represent = feature_extractor.get_last_hidden_features_combine(protein_sequence_df, sequence_name='Sequence', batch_size= 1)

Netphos_df_represent.set_index('Entry', inplace=True)

# Extract all column names that match the 'ESM2_clsX' format.
cols = [col for col in human_df_represent.columns if re.match(r'ESM2_cls\d+', col)]

# Obtain a sub DataFrame consisting of these columns.
human_df_represent.set_index('Entry', inplace=True)
human_df_represent_cls = human_df_represent[cols]

# Extract all column names that match the 'ESM2_phospho_posX' format.
pho_cols = [col for col in Netphos_df_represent.columns if re.match(r'ESM2_phospho_pos\d+', col)]

# Obtain a sub DataFrame consisting of these columns.
Netphos_df_represent_pho = Netphos_df_represent[pho_cols]

#Set feature dimensions.
dim=1280

# The calculation function requires the pho entries to match the cls entries,
# so 'seq3' (which has no row in human_df_represent_cls) is removed from Netphos_df_represent_pho.
Netphos_df_represent_pho = Netphos_df_represent_pho[Netphos_df_represent_pho.index != 'seq3']

#Return cls and pho_average (which is the result of pho_total).
human_df_represent_cls_pho = phospho_feature_sim_cosine_weighted_average(dim, human_df_represent_cls, Netphos_df_represent_pho)

Because the articles related to this toolkit are still under review, not all information can be disclosed at the moment. If you need additional details or have specific questions about the toolkit, please contact the author, Zeyu Luo (1024226968@qq.com), who can provide more comprehensive and accurate details about the toolkit's functionality and features. Thank you for your understanding.

