These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

ProtLoc-mex_X

Introduction ProtLoc-mex_X

protloc_mex_X integrates two modules: ESM2_fr and feature_correlation. ESM2_fr is based on the ESM2 model (the ESM2_650m checkpoint is supported) and extracts feature representations from protein sequences, including 'cls', 'eos', 'mean', 'segment_mean', and 'pho'. The feature_correlation module provides Spearman correlation analysis, letting users visualize correlation heatmaps and run feature crossover regression analysis. This makes it possible to explore the relationships between data features and to identify features that are relevant to a target feature.
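To make the Spearman analysis mentioned above concrete, here is a minimal, standard-library sketch of the statistic itself: rank both variables (ties receive the average rank), then take the Pearson correlation of the ranks. It does not use the feature_correlation API; the function names `average_ranks` and `spearman` and the toy data are illustrative assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical feature and target columns with a monotonic relationship.
feature = [0.1, 0.4, 0.35, 0.8]
target = [1.0, 3.0, 2.0, 10.0]
rho = spearman(feature, target)
print(rho)
```

Because only ranks matter, a monotonic but non-linear relationship still scores highly. A constant column has zero rank variance and would raise a division error in this sketch.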

Installation

This project's core code has been uploaded to the PyPI repository. To install it in a conda virtual environment, follow the steps below:

First, create a new conda environment. On Windows, it is recommended to use Conda Prompt for this task; on Linux, you can use the Terminal. (You can change the environment name as needed; here we use "myenv" as an example.)

conda create -n myenv python=3.10

Then, activate the environment you just created:

conda activate myenv

Finally, use pip to install 'protloc_mex_X' within this environment:

pip install protloc_mex_X

Dependencies

ProtLoc-mex_X requires Python == 3.9 or 3.10.

Below are the Python packages required by ProtLoc-mex_X, which are automatically installed with it:

dependencies = [
        "numpy >=1.20.3",
        "pandas >=1.4.1",
        "seaborn >=0.11.2",
        "matplotlib >=3.5.1"
]

The following Python packages are also required but are not installed automatically:

dependencies = [
       "torch ==1.12.1",
       "tqdm ==4.63.0",
       "re ==2.2.1",
       "sklearn ==1.0.2",
       "transformers ==4.26.1"
]

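It is advised to obtain these packages from their respective official sources and to check version compatibility before installing. Note that re is part of the Python standard library and needs no separate installation, and that the sklearn module is provided by the scikit-learn distribution on PyPI.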
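As a quick way to compare an environment against the pinned versions listed above, the following sketch queries installed distributions with the standard-library `importlib.metadata`. The helper name `check_pins` is an assumption for illustration, and the distribution name "scikit-learn" is used in place of the sklearn import name.

```python
from importlib import metadata

# Distribution names as published on PyPI. "scikit-learn" provides the
# `sklearn` import; `re` is omitted because it ships with Python.
PINNED = {
    "torch": "1.12.1",
    "tqdm": "4.63.0",
    "scikit-learn": "1.0.2",
    "transformers": "4.26.1",
}


def check_pins(pins):
    """Map each distribution to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "check"
    print(f"{name}: pinned {PINNED[name]}, installed {installed} [{status}]")
```

The comparison is an exact string match, so a build with a local suffix (for example a CUDA-specific torch build) is reported as "check" even when the base version agrees.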

How to use ProtLoc-mex_X

ProtLoc-mex_X includes two modules: ESM2_fr and feature_correlation.

ESM2_fr

ESM2_fr builds on the pre-trained ESM2 deep learning model. It extracts representation features from protein sequences and can further refine the feature representation through weighted averaging.

It contains one class and three functions. The class is named Esm2LastHiddenFeatureExtractor and provides three methods: get_last_hidden_features_combine(), get_last_hidden_phosphorylation_position_feature(), and get_amino_acid_representation(). The module-level functions are get_last_hidden_features_single(), NetPhos_classic_txt_DataFrame(), and phospho_feature_sim_cosine_weighted_average().

Function `get_last_hidden_features_single()`:

The get_last_hidden_features_single() function extracts different types of representation features from the input protein sequences. It accepts protein sequence data X_input, along with the tokenizer and model, and returns a DataFrame containing the extracted features. (Note: only single-batch inputs are supported.)
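The 'cls', 'eos', and 'mean' representations can be pictured as different ways of pooling the model's last hidden state. The sketch below works on plain lists so that it runs without a model. It assumes the token layout `<cls> residue_1 ... residue_n <eos> [<pad> ...]` and assumes that 'mean' averages over residue tokens only; the function name `pool_last_hidden` is hypothetical, and the package's exact definitions may differ.

```python
def pool_last_hidden(hidden, attention_mask):
    """Pool a (tokens x dim) matrix into 'cls', 'eos' and 'mean' vectors.

    Assumed layout: <cls> residue_1 ... residue_n <eos> [<pad> ...].
    """
    n_tokens = sum(attention_mask)       # real tokens, padding excluded
    cls_vec = hidden[0]                  # first token
    eos_vec = hidden[n_tokens - 1]       # last real token
    residues = hidden[1:n_tokens - 1]    # everything between <cls> and <eos>
    dim = len(cls_vec)
    mean_vec = [sum(row[d] for row in residues) / len(residues) for d in range(dim)]
    return {"cls": cls_vec, "eos": eos_vec, "mean": mean_vec}


# Toy input: two residues, hidden size 2, one padding position at the end.
hidden = [[1.0, 0.0], [2.0, 4.0], [4.0, 8.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 1, 0]
pooled = pool_last_hidden(hidden, mask)
print(pooled)
```

Using the attention mask to locate the last real token keeps padded positions out of both the 'eos' lookup and the mean.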

Class `Esm2LastHiddenFeatureExtractor()`:

The Esm2LastHiddenFeatureExtractor() class extracts various types of representation features from protein sequences. It accepts amino acid sequences as input, invokes the pre-trained ESM2 model, and returns the pre-trained representation vectors ('cls', 'eos', 'mean', 'segment_mean', 'pho').

The get_last_hidden_features_combine() method serves the same purpose as get_last_hidden_features_single() but handles multiple batches of input data. It takes protein sequence data X_input and returns a DataFrame containing the combined features extracted from all batches.
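The multi-batch behaviour described above can be sketched as "split, extract per batch, concatenate". The code below is not the package's implementation; `iter_batches`, `extract_in_batches`, and the stand-in extractor `fake_extract` are hypothetical names used only to show the pattern.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_in_batches(sequences, batch_size, extract_one_batch):
    """Run a single-batch extractor over every batch and combine the rows."""
    rows = []
    for batch in iter_batches(sequences, batch_size):
        rows.extend(extract_one_batch(batch))
    return rows


def fake_extract(batch):
    """Stand-in for a real feature extractor: one row per sequence."""
    return [{"Sequence": seq, "length": len(seq)} for seq in batch]


sequences = ["ACDEFG", "HIKLMN", "PQRSTVWY"]
rows = extract_in_batches(sequences, batch_size=2, extract_one_batch=fake_extract)
print(rows)
```

Row order follows input order, so the combined result can be joined back to the original table by position.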

The get_last_hidden_phosphorylation_position_feature() method extracts phosphorylation representation features from the input protein sequences. It takes protein sequence data X_input and returns a DataFrame containing the phosphorylation representation features.

The get_amino_acid_representation() method calculates representation features for a specific amino acid at a given position in a protein sequence. Its main purpose is to support the characterization of phosphorylation sites.
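Looking up the representation of one residue amounts to indexing the per-token hidden states. The following sketch assumes 1-based residue positions and a `<cls>` token at index 0, so residue `position` sits at token index `position`. The function `residue_vector` and its optional `expected_residues` check are hypothetical and do not reflect the package's signature.

```python
def residue_vector(hidden, sequence, position, expected_residues=None):
    """Return the hidden-state row for the residue at a 1-based position."""
    if not 1 <= position <= len(sequence):
        raise IndexError(f"position {position} is outside 1..{len(sequence)}")
    residue = sequence[position - 1]
    if expected_residues is not None and residue not in expected_residues:
        raise ValueError(f"residue {residue!r} at {position} is not in {expected_residues!r}")
    # <cls> occupies token index 0, which shifts residues by exactly one.
    return hidden[position]


# Toy hidden states for "<cls> A S K <eos>" with hidden size 1.
hidden = [[0.0], [1.0], [2.0], [3.0], [4.0]]
vec = residue_vector(hidden, "ASK", 2, expected_residues="STY")
print(vec)
```

Passing `expected_residues="STY"` restricts the lookup to serine, threonine, and tyrosine, the residues commonly considered as phosphorylation sites; leave it as `None` to skip the check.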

Function `NetPhos_classic_txt_DataFrame()`:

The NetPhos_classic_txt_DataFrame() function extracts sequence information from text output produced by NetPhos (https://services.healthtech.dtu.dk/services/NetPhos-3.1/) and returns the extracted data as a DataFrame.
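The general idea of pulling positive predictions out of NetPhos text can be sketched with a regular expression such as `.*YES`, which keeps only the lines that end in a positive answer. The sample text and its column layout (name, position, residue, context, score, kinase, answer) are assumptions made for this illustration, and `parse_positive_sites` is a hypothetical helper that returns plain dictionaries rather than a DataFrame.

```python
import re

# Hypothetical excerpt; the real NetPhos layout may differ.
SAMPLE = """\
# seq1       4 S  MEPSSLELP  0.516  unsp  YES
# seq1      10 T  LELPTDAVV  0.412  PKC   .
# seq2       7 Y  GSHMYAKLE  0.731  SRC   YES
"""


def parse_positive_sites(text, pattern=r".*YES"):
    """Return one record per line matching `pattern`, split into named fields."""
    records = []
    for line in re.findall(pattern, text):
        fields = line.lstrip("# ").split()
        if len(fields) != 7:
            continue  # skip lines that do not follow the assumed layout
        name, position, residue, context, score, kinase, answer = fields
        records.append({
            "Sequence": name,
            "position": int(position),
            "residue": residue,
            "context": context,
            "score": float(score),
            "kinase": kinase,
        })
    return records


sites = parse_positive_sites(SAMPLE)
print(sites)
```

Since `.` does not match a newline, each match stays within one line, and lines without "YES" are dropped.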

Function `phospho_feature_sim_cosine_weighted_average()`:

The phospho_feature_sim_cosine_weighted_average() function calculates the weighted average of the phosphorylation features of each protein sequence and returns the input DataFrame updated with the weighted average values, which characterize the phosphorylation pattern of the entire amino acid sequence.
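One plausible reading of a cosine-similarity weighted average is sketched below: each phosphorylation-site vector is weighted by its cosine similarity to the sequence's 'cls' vector, and the weights are normalized with a softmax. Both the softmax normalization and the helper names `cosine` and `similarity_weighted_average` are assumptions; the package's actual weighting scheme is not documented here and may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_weighted_average(cls_vec, site_vecs):
    """Average site vectors with softmax weights over their similarity to cls_vec."""
    sims = [cosine(cls_vec, site) for site in site_vecs]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]  # positive and summing to 1
    dim = len(cls_vec)
    average = [sum(w * site[d] for w, site in zip(weights, site_vecs)) for d in range(dim)]
    return weights, average


# Toy data: one site aligned with the cls vector, one orthogonal to it.
cls_vec = [1.0, 0.0]
site_vecs = [[1.0, 0.0], [0.0, 1.0]]
weights, average = similarity_weighted_average(cls_vec, site_vecs)
print(weights, average)
```

The softmax keeps every weight positive even when a similarity is negative, which a plain division by the sum of similarities would not guarantee.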

Example of using ESM2_fr:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
from protloc_mex_X.ESM2_fr import Esm2LastHiddenFeatureExtractor, get_last_hidden_features_single, phospho_feature_sim_cosine_weighted_average

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")

protein_sequence_df = pd.DataFrame({
    'Entry' : ['protein1','protein2'],
    'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
})

feature_extractor = Esm2LastHiddenFeatureExtractor(tokenizer, model,
                                                   compute_cls=True, compute_eos=True, compute_mean=True, compute_segments=True)

human_df_represent = feature_extractor.get_last_hidden_features_combine(protein_sequence_df, sequence_name='Sequence', batch_size= 1)

Example for pho feature representation:

import os
import protloc_mex_X
from protloc_mex_X.ESM2_fr import NetPhos_classic_txt_DataFrame
import random
import re

example_data = os.path.join(protloc_mex_X.__path__[0], "examples", "test1.txt")
# example_data holds NetPhos output generated from protein sequences.

with open(example_data, "r") as f:
    data = f.read()
# print(data)
pattern = r".*YES"

result_df = NetPhos_classic_txt_DataFrame(pattern, data)
result_df.loc[:,'Entry']=result_df.loc[:,'Sequence']

"""
The sequences below are generated randomly and serve only as an example.
In real scenarios, use the actual sequences, because NetPhos only provides 6-bp phosphorylation sites,
which are not complete sequences.
Additionally, the 'position' column in result_df needs to be converted to an integer type.
"""

# Function to generate random amino acid sequence with a minimum length
def generate_random_sequence(min_length):
    amino_acids = 'ACDEFGHIKLMNPQRSTVWY'  # 20 standard amino acids
    return ''.join(random.choice(amino_acids) for _ in range(min_length))

# Find the max position for each unique Entry
max_positions = result_df['position'].astype(int).groupby(result_df['Entry']).max()

# Generate a sequence for each Entry based on its max position
generated_sequences = {entry: generate_random_sequence(pos) for entry, pos in max_positions.items()}

# Update the 'Sequence' column
def update_sequence(row):
    entry = row['Entry']
    return generated_sequences[entry]

result_df['Sequence'] = result_df.apply(update_sequence, axis=1)


from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import pandas as pd
from protloc_mex_X.ESM2_fr import Esm2LastHiddenFeatureExtractor, phospho_feature_sim_cosine_weighted_average

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D",output_hidden_states=True)

protein_sequence_df = pd.DataFrame({
    'Entry' : ['seq1','seq2'],
    'Sequence': ['ACDEFGHIKLMNPQRSTVWY', 'ACDEFGHIKLMNPQRSTVWY']
})

feature_extractor = Esm2LastHiddenFeatureExtractor(tokenizer, model,
                                                   compute_cls=False, compute_eos=False, 
                                                   compute_mean=False, compute_segments=False)

Netphos_df_represent = feature_extractor.get_last_hidden_phosphorylation_position_feature(result_df, sequence_name='Sequence', phosphorylation_positions='position', batch_size=2)
human_df_represent = feature_extractor.get_last_hidden_features_combine(protein_sequence_df, sequence_name='Sequence', batch_size= 1)

Netphos_df_represent.set_index('Entry', inplace=True)

# Extract all column names that match the 'ESM2_clsX' format.
cols = [col for col in human_df_represent.columns if re.match(r'ESM2_cls\d+', col)]

# Obtain a sub DataFrame consisting of these columns.
human_df_represent.set_index('Entry', inplace=True)
human_df_represent_cls = human_df_represent[cols]

# Extract all column names that match the 'ESM2_phospho_posX' format.
pho_cols = [col for col in Netphos_df_represent.columns if re.match(r'ESM2_phospho_pos\d+', col)]

# Obtain a sub DataFrame consisting of these columns.
Netphos_df_represent_pho = Netphos_df_represent[pho_cols]

# Set the feature dimension.
dim = 1280

# The calculation requires the entries in the pho DataFrame to match those in the cls DataFrame,
# so 'seq3' (absent from the cls DataFrame) is removed from Netphos_df_represent_pho.
Netphos_df_represent_pho = Netphos_df_represent_pho.drop(index='seq3', errors='ignore')

# Return cls and pho_average (which is the result of pho_total).
human_df_represent_cls_pho = phospho_feature_sim_cosine_weighted_average(dim, human_df_represent_cls, Netphos_df_represent_pho)
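Before combining the two tables, it can help to check which entries appear in only one of them, since the example above drops 'seq3' by hand. The sketch below does this on plain lists of entry names; `split_by_membership` is a hypothetical helper and the entry names are illustrative.

```python
def split_by_membership(cls_entries, pho_entries):
    """Report entries shared by both tables and entries unique to each one."""
    cls_set = set(cls_entries)
    pho_set = set(pho_entries)
    unique_pho = list(dict.fromkeys(pho_entries))  # de-duplicate, keep order
    return {
        "shared": [e for e in unique_pho if e in cls_set],
        "pho_only": [e for e in unique_pho if e not in cls_set],
        "cls_only": [e for e in cls_entries if e not in pho_set],
    }


# The pho table usually has several rows (sites) per entry.
cls_entries = ["seq1", "seq2"]
pho_entries = ["seq1", "seq1", "seq2", "seq3"]
membership = split_by_membership(cls_entries, pho_entries)
print(membership)
```

With DataFrames indexed by 'Entry', the same lists can be obtained from each frame's index, and the "pho_only" entries are the ones to drop before the weighted average is computed.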

Because the articles related to this toolkit are still under review, not all information can be disclosed at the moment. If you need additional details or have specific questions about the toolkit, please contact the author, Zeyu Luo (1024226968@qq.com), who can provide more comprehensive and accurate information about its functionality and features. Thank you for your understanding.

Release history

- 0.0.25: Jun 19, 2024
- 0.0.24: Jun 1, 2024
- 0.0.23: Jan 24, 2024
- 0.0.22: Jan 24, 2024
- 0.0.21: Dec 30, 2023
- 0.0.20: Dec 25, 2023
- 0.0.19: Dec 22, 2023
- 0.0.18: Dec 8, 2023
- 0.0.17: Nov 2, 2023
- 0.0.16: Nov 1, 2023
- 0.0.15: Oct 2, 2023
- 0.0.14: Sep 10, 2023
- 0.0.13: Sep 9, 2023
- 0.0.12: Sep 9, 2023
- 0.0.11: Sep 9, 2023
- 0.0.10: Sep 9, 2023
- 0.0.9: Sep 9, 2023 (this version)
- 0.0.8: Sep 9, 2023
- 0.0.7: Sep 9, 2023
- 0.0.6: Aug 2, 2023
- 0.0.5: Aug 2, 2023
- 0.0.4: Jun 2, 2023
- 0.0.3: May 31, 2023
- 0.0.2: May 31, 2023
- 0.0.1: May 29, 2023

Download files

Source Distribution

protloc_mex_x-0.0.9.tar.gz (13.9 kB), uploaded Sep 9, 2023

Built Distribution

protloc_mex_x-0.0.9-py3-none-any.whl (14.3 kB), uploaded Sep 9, 2023, Python 3

Hashes for protloc_mex_x-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`8d48d2268789e3d53a62b701698db5aadaa78c950a2391923d87555e31ff3e51`
MD5	`871b9093345c04ff861d5e798fb2936d`
BLAKE2b-256	`673adbc6fea60d0e376c4a868cdd0f3784bcc284901fc9c86ef4bbf3fc53e893`

Hashes for protloc_mex_x-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07ff47357610faca731124527b1526d689e9a2fb5331fd52368bb409efce3043`
MD5	`f42ca7f3273be729b2b9238be62e223b`
BLAKE2b-256	`d94db07764639916c2d0f2f6c961e19ce26ff18c82e11be062c224d76a660efa`
