ec_number_prediction
EC number prediction pipeline
Description
Enzyme Commission (EC) numbers are a hierarchical system that categorizes and organizes enzyme activities. In genome annotation, EC numbers are assigned to protein sequences to concisely represent specific chemical reaction patterns, which mirror the chemical transformations catalyzed by the enzymes carrying those EC numbers. The structure of EC numbers is divided into four levels: (1) the class (e.g., 1 for oxidoreductases, 2 for transferases, 3 for hydrolases, 4 for lyases, 5 for isomerases, 6 for ligases, and 7 for translocases), (2) the subclass (for instance, 1.2: acting on the aldehyde or oxo group of donors), (3) the sub-subclass (for instance, 1.2.1: with NAD(+) or NADP(+) as acceptor), and (4) the final level, which identifies the enzyme's substrate (for example, 1.2.1.3: aldehyde dehydrogenase (NAD(+))).
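As an illustration of this hierarchy, here is a minimal sketch (not part of the package) that splits an EC number string into its four levels:

# Illustrative only: break an EC number into its four hierarchical levels.
ec_number = "1.2.1.3"  # aldehyde dehydrogenase (NAD(+))
main_class, subclass, sub_subclass, serial = ec_number.split(".")
print(f"class EC {main_class}, subclass EC {main_class}.{subclass}, "
      f"sub-subclass EC {main_class}.{subclass}.{sub_subclass}, full EC {ec_number}")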
Table of contents:
- Requirements
- Installation
- Run pipeline to obtain the data
- Extract features
- Train models
- Predict EC numbers
- Post analysis - generate results and plots
Requirements
- Python >= 3.9
- BLAST >= 2.12.0
Install with conda
conda create -n ec_numbers_prediction python=3.9
conda activate ec_numbers_prediction
conda install bioconda::blast==2.12.0
Installation
Pip
pip install ec_numbers_prediction
From GitHub
pip install git+https://github.com/jcapels/ec_numbers_prediction.git
Run pipeline to obtain the data
Here is how you can obtain the data by running the pipeline with Luigi. The pipeline downloads the data from the UniProt database and processes it into the datasets used in this project. It generates several files in the working directory: some are intermediate files, while others are the final split into training, validation, and test sets. The final files are the ones used in the next steps of the pipeline.
The steps of this pipeline are the following:
- Download of UniProt data.
- Scraping of UniProt enzymes.
- Filtering of UniProt enzymes with UniRef90 cluster representatives.
- Enrichment of underrepresented EC classes with TrEMBL data.
- Generation of a multi-label binary matrix for the resulting dataset (illustrated after this list).
- Removal of underrepresented classes.
- Removal of temporary EC numbers (e.g., EC 3.5.1.n3).
- Split of the data into training, validation, and test sets (60/20/20).
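To illustrate the multi-label binarization step, here is a minimal sketch using scikit-learn's MultiLabelBinarizer; the pipeline's own implementation may differ, but the resulting matrix has the same shape:

# Illustrative sketch of multi-label binarization: each protein may carry
# several EC numbers, so each EC class becomes one binary column.
from sklearn.preprocessing import MultiLabelBinarizer

ec_labels = [
    ["1.2.1.3"],             # one EC number
    ["1.2.1.3", "3.5.1.4"],  # a promiscuous enzyme with two EC numbers
]
binarizer = MultiLabelBinarizer()
matrix = binarizer.fit_transform(ec_labels)
print(binarizer.classes_)  # ['1.2.1.3' '3.5.1.4']
print(matrix)              # [[1 0]
                           #  [1 1]]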
The files generated by the pipeline are the following:
- uniprot_sprot.xml.gz - SwissProt raw data.
- uniprot_trembl.xml.gz - TrEMBL raw data.
- uniref90.xml.gz - UniRef90 raw data.
- trembl_prot_ec.csv - TrEMBL enzymes with the respective assigned EC number.
- swiss_prot_ec.csv - SwissProt enzymes with the respective assigned EC number.
- trembl_prot_ec_filtered.csv - TrEMBL enzymes filtered with UniRef90 cluster representatives.
- swiss_prot_ec_filtered.csv - SwissProt enzymes filtered with UniRef90 cluster representatives.
- dataset_enriched.csv - Dataset with SwissProt enzymes enriched with TrEMBL.
- dataset_binarized.csv - Dataset with a multi-label binary matrix.
- dataset_binarized_filtered.csv - Dataset with underrepresented classes removed.
- dataset_binarized_filtered_without_n.csv - Dataset with underrepresented classes and temporary EC numbers removed.
- train.csv - Training data.
- test.csv - Test data.
- validation.csv - Validation data.
To run the pipeline:

from ec_number_prediction.run_data_processing_pipeline import EnzymesPipeline
import luigi

# Build and run the whole data-processing pipeline with a local Luigi scheduler.
luigi.build([EnzymesPipeline()], workers=1, scheduler_host='127.0.0.1',
            scheduler_port=8083, local_scheduler=True)
After this, shuffle all the datasets so they are ready for training and evaluation. Here's an example:
import pandas as pd

# Shuffle the training set; repeat for validation.csv and test.csv.
training_data = pd.read_csv("train.csv")
training_data = training_data.sample(frac=1)  # frac=1 keeps all rows, in random order
training_data.to_csv("train_shuffled.csv", index=False)
Extract features
The following functions take the datasets in your working directory (pass the directory paths as arguments) and generate features using ESM, ProtBERT, and one-hot encoding.
from ec_number_prediction.feature_extraction.generate_features import generate_esm_vectors, \
    generate_prot_bert_vectors, generate_one_hot_encodings

# Each function reads the datasets from dataset_directory and writes the
# corresponding feature files to save_folder.
generate_esm_vectors(esm_function="esm2_t33_650M_UR50D",
                     save_folder="/home/working_dir",
                     dataset_directory="/home/working_dir/data")
generate_prot_bert_vectors(save_folder="/home/working_dir",
                           dataset_directory="/home/working_dir/data")
generate_one_hot_encodings(save_folder="/home/working_dir",
                           dataset_directory="/home/working_dir/data")
Train models
Train baselines
Training the baselines is also easy:

from ec_number_prediction.train_models.train_baselines import train_dnn_baselines

train_dnn_baselines(model="esm2_t33_650M_UR50D", working_dir="/home/working_dir/")
Train models
Train models with specific feature sets, as well as the DeepEC and DSPACE architectures. Note that the set chosen in the following examples is set 1, but you can choose any of sets 1, 2, 3, or 4. Also, the model chosen in the examples is esm2_t33_650M_UR50D, but you can choose any of the following models:
- esm2_t33_650M_UR50D
- esm1b_t33_650M_UR50S
- esm2_t30_150M_UR50D
- esm2_t12_35M_UR50D
- esm2_t6_8M_UR50D
- esm2_t36_3B_UR50D
- prot_bert_vectors
- esm2_t48_15B_UR50D (requires GPUs with more than 25 GB of memory)
from ec_number_prediction.train_models.optimize_dnn import train_dnn_optimization
from ec_number_prediction.train_models.train_deep_ec import train_deep_ec
from ec_number_prediction.train_models.train_dspace import train_dspace

# Optimize the DNN on set 1 with the chosen embedding model, then train DeepEC and DSPACE.
train_dnn_optimization(set_="1", model="esm2_t33_650M_UR50D", working_dir="/home/working_dir/")
train_deep_ec(working_dir="/home/working_dir/")
train_dspace(working_dir="/home/working_dir/")
Train models with both training and validation sets
Train models with the training and validation sets merged.
from ec_number_prediction.train_models.optimize_dnn import train_dnn_trials_merged
from ec_number_prediction.train_models.train_deep_ec import train_deep_ec_merged
from ec_number_prediction.train_models.train_dspace import train_dspace_merged
train_dnn_trials_merged(set_="1", model="esm2_t33_650M_UR50D", working_dir="/home/working_dir/")
train_deep_ec_merged(working_dir="/home/working_dir/")
train_dspace_merged(working_dir="/home/working_dir/")
Train models with the whole data
Train models with the whole dataset; these correspond to the "all data" models used in the prediction section below.
from ec_number_prediction.train_models.optimize_dnn import train_dnn_optimization_all_data
train_dnn_optimization_all_data(set_="1", model="esm2_t33_650M_UR50D", working_dir="/home/working_dir/")
Predict EC numbers
Predict with model
Here you can see how to predict EC numbers with a model. Note that the model chosen in the following examples is "DNN ProtBERT all data", but you can choose any of the models:
- DNN ProtBERT all data
- DNN ESM1b all data
- DNN ESM2 3B all data - note that this model requires at least 12 GB of RAM to run. If you intend to use a GPU for the predictions, you need at least 20 GB of GPU memory or 4 GPUs with 8 GB each.
- ProtBERT trial 2 train plus validation (for this model, you need to pass all_data=False)
- DNN ESM1b trial 4 train plus validation (for this model, you need to pass all_data=False)
- DNN ESM2 3B trial 2 train plus validation (for this model, you need to pass all_data=False)
Here you can see the time taken and memory usage for each model when predicting for different numbers of data points (see the timing sketch at the end of this subsection):
Model | Data Points | Time Taken (h:mm:ss) | Memory Usage
---|---|---|---
DNN ProtBERT | 25 | 0:00:05 | 1 GB
DNN ProtBERT | 100 | 0:00:08 | 1 GB
DNN ProtBERT | 1000 | 0:00:56 | 1 GB
DNN ProtBERT | 10000 | 0:09:00 | 1 GB
DNN ProtBERT | 100000 | 1:55:08 | 7 GB
DNN ESM1b | 25 | 0:00:28 | 2 GB
DNN ESM1b | 100 | 0:00:40 | 2 GB
DNN ESM1b | 1000 | 0:02:22 | 2 GB
DNN ESM1b | 10000 | 0:19:22 | 2 GB
DNN ESM1b | 100000 | 3:35:04 | 7 GB
DNN ESM2 3B | 25 | 0:01:35 | 10 GB
DNN ESM2 3B | 100 | 0:03:40 | 10 GB
DNN ESM2 3B | 1000 | 0:28:27 | 10 GB
The parameters of the function are the following:
- pipeline: name of the model to use.
- dataset_path: path to the dataset to predict.
- output_path: path to the output file.
- ids_field: name of the column with the ids.
- all_data: whether to use the model trained on all the data (True) or on the training plus validation sets only (False).
- sequences_field: name of the column with the sequences.
- device: device to use for the predictions.
from ec_number_prediction.predictions import predict_with_model
predict_with_model(pipeline="DNN ProtBERT all data",
dataset_path="/home/jcapela/ec_numbers_prediction/data/test_data.csv",
output_path="predictions_prot_bert.csv",
ids_field="id",
all_data=True,
sequences_field="sequence",
device="cuda:1")
You can also make predictions using a FASTA file:
from ec_number_prediction.predictions import predict_with_model_from_fasta
predict_with_model_from_fasta(pipeline="DNN ProtBERT all data",
fasta_path="/home/jcapela/ec_numbers_prediction/data/test_data.fasta",
output_path="predictions_prot_bert.csv",
all_data=True,
device="cuda:1")
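To reproduce timings like those in the table above, you can wrap a call in a simple timer. This is only a sketch: the dataset path is a placeholder, and the memory figures in the table were measured separately.

import time

from ec_number_prediction.predictions import predict_with_model

start = time.perf_counter()
predict_with_model(pipeline="DNN ProtBERT all data",
                   dataset_path="test_data.csv",  # placeholder path
                   output_path="predictions_prot_bert.csv",
                   ids_field="id",
                   all_data=True,
                   sequences_field="sequence",
                   device="cpu")
print(f"Prediction took {time.perf_counter() - start:.1f} s")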
Predict with BLAST
Here you can see how to predict EC numbers with BLAST. Note that the database chosen in the following examples is "BLAST all data", but you can choose any of the databases:
- BLAST all data
- BLAST train plus validation
The parameters of the function are the following:
- database_name: name of the database to use.
- dataset_path: path to the dataset to predict.
- output_path: path to the output file.
- ids_field: name of the column with the ids.
- sequences_field: name of the column with the sequences.
from ec_number_prediction.predictions import predict_with_blast
predict_with_blast(database_name="BLAST all data",
dataset_path="/home/jcapela/ec_numbers_prediction/data/test_data.csv",
output_path="test_blast_predictions.csv",
ids_field="id",
sequences_field="sequence")
You can also make predictions using a FASTA file:
from ec_number_prediction.predictions import predict_with_blast_from_fasta
predict_with_blast_from_fasta(database_name="BLAST all data",
fasta_path="/home/jcapela/ec_numbers_prediction/data/test_data.fasta",
output_path="test_blast_predictions.csv")
Predict with an ensemble of BLAST and DL models
Here you can see how to predict EC numbers with an ensemble of BLAST and the DL models.
The parameters of the function are the following:
- dataset_path: path to the dataset to predict.
- output_path: path to the output file.
- ids_field: name of the column with the ids.
- sequences_field: name of the column with the sequences.
- device: device to use for the predictions.
from ec_number_prediction.predictions import predict_with_ensemble
predict_with_ensemble(dataset_path="/home/jcapela/ec_numbers_prediction/data/test_data.csv",
output_path="predictions_ensemble.csv",
ids_field="id",
sequences_field="sequence",
device="cuda:3")
You can also make predictions using a FASTA file:
from ec_number_prediction.predictions import predict_with_ensemble_from_fasta
predict_with_ensemble_from_fasta(fasta_path="/home/jcapela/ec_numbers_prediction/data/test_data.fasta",
output_path="predictions_ensemble.csv",
device="cuda:3")
Post analysis - generate results and plots
Here you can see how to perform the post analysis of the predictions.
We made use of the notebooks present in the notebooks folder.
Here is an explanation of each notebook:
- 0.1-make_predictions_for_all_models.ipynb: this notebook contains the code to generate the predictions for the test set of all models and save them in pickle files.
- 0.2-test_ensembles/0.2-test_ensembles.ipynb: this notebook contains the code to generate the predictions and evaluate the performance results of the ensemble of BLAST and DL models.
- 1-general_results.ipynb: this notebook contains the code to generate the general results and the plots for the performance of the models.
- 2-identity_intervals/: this folder contains the code to generate the plots for the performance results of the models for each identity interval.
- 3-test_hierarchical_prediction/3-test_consistency_predictions.ipynb: this notebook contains the code to generate the plots for the Hierarchical Consistency Error (HCE) of the predictions of the models.
- 4-test_models_for_other_datasets/: this folder contains the code to generate predictions and the plots for the performance results of the models for the evidence-level and promiscuous enzyme datasets.
- 4-test_models_for_other_datasets/4.1-test_evidence_based.ipynb: this notebook contains the code to generate predictions and the plots for the performance results of the models for the evidence-level dataset.
- 4-test_models_for_other_datasets/4.2-test_promiscuous.ipynb: this notebook contains the code to generate predictions and the plots for the performance results of the models for the promiscuous dataset.
- 5-differences_blast_and_dl_models/: this folder contains the code to generate the plots for the differences between BLAST and DL models.
- 5-differences_blast_and_dl_models/5.1-differences_blast_and_dl_models.ipynb: this notebook contains the code to generate the predictions and evaluate the performance results of the models for the whole dataset.
- 5-differences_blast_and_dl_models/5.2-analysis_on_specific_ec_numbers.ipynb: this notebook contains the code to generate the predictions and evaluate the performance results of the models for specific EC numbers.
- 6-analysis_for_benchmarks/: this folder contains the code to generate the plots for the performance results of the models for the halogenases and Price et al. datasets.
- 6-analysis_for_benchmarks/6.1-test_for_halogenases.ipynb: this notebook contains the code to generate the predictions and evaluate the performance results of the models for the halogenases dataset.
- 6-analysis_for_benchmarks/6.2-test_for_price_et_al.ipynb: this notebook contains the code to generate the predictions and evaluate the performance results of the models for the Price et al. dataset.
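If you prefer to script a quick check outside the notebooks, the sketch below compares a predictions file against ground-truth annotations. The column names (id, EC) are hypothetical placeholders; adapt them to the files you actually produce:

import pandas as pd

# Hypothetical evaluation sketch; column names are placeholders.
# Assumes both files have an "EC" column; the merge adds _true/_pred suffixes.
truth = pd.read_csv("test.csv")                   # ground-truth annotations
preds = pd.read_csv("predictions_prot_bert.csv")  # output of predict_with_model

merged = truth.merge(preds, on="id", suffixes=("_true", "_pred"))
exact = (merged["EC_true"] == merged["EC_pred"]).mean()
print(f"Exact-match accuracy on {len(merged)} sequences: {exact:.3f}")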