Tools for pre-processing and evaluating data associated with biocatalysis models.

These details have not been verified by PyPI

Project links

Homepage

Project description

:leaves: Biocatalysis Model

Biocatalysed Synthesis Planning using Data-driven Learning

Abstract
Data
- ECREACT
- Data Sources
Using the Pre-trained Model
Training your own Model

Abstract

Enzyme catalysts are an integral part of green chemistry strategies towards a more sustainable and resource-efficient chemical synthesis. However, the use of biocatalysed reactions in retrosynthetic planning clashes with the difficulties in predicting the enzymatic activity on unreported substrates and enzyme-specific stereo- and regioselectivity. As of now, only rule-based systems support retrosynthetic planning using biocatalysis, while initial data-driven approaches are limited to forward predictions. Here, we extend the data-driven forward reaction as well as retrosynthetic pathway prediction models based on the Molecular Transformer architecture to biocatalysis. The enzymatic knowledge is learned from an extensive data set of publicly available biochemical reactions with the aid of a new class token scheme based on the enzyme commission classification number, which captures catalysis patterns among different enzymes belonging to the same hierarchy. The forward reaction prediction model (top-1 accuracy of 49.6%), the retrosynthetic pathway (top-1 single-step round-trip accuracy of 39.6%) and the curated data set are made publicly available to facilitate the adoption of enzymatic catalysis in the design of greener chemistry processes.

Data

Enzymatic reactions and the accompanying EC numbers were extracted from four databases, namely Rhea, BRENDA, PathBank, and MetaNetX and merged into a new data set, named ECREACT, containing enzyme-catalysed reactions with the respective EC number.

ECREACT

The ECREACT data set contains samples of all 7 EC classes (1: Oxidoreductases, 2: Transferases, 3: Hydrolases, 4: Lyases, 5: Isomerases, 6: Ligases, 7: Translocases) distributed as shown in (a). The distributions of the substrates and products are shown in the TMAPS (b) and (c) respectively.

The data set is available as a .csv file in the following format:

rxn_smiles,ec,source
CC=O.O.O=O|1.2.3.1>>CC(=O)[O-],1.2.3.1,brenda_reaction_smiles
CC(N)=O.O|3.5.1.4>>CC(=O)O,3.5.1.4,brenda_reaction_smiles
...

The field rxn_smiles is a reaction SMILES extended with the EC number on the reactant side. The reactants and the EC number are separated by a pipe |. The field ec is the EC number. Be aware that they can also contain entries such as 1.20.1.M1 or 1.14.-.-. The field source describes the source database of the reaction information.

Download ECREACT 1.0

Data Sources

ECREACT is composed of data from four publicly accessible databases:

Rhea, an expert-curated knowledgebase of chemical and transport reactions of biological interest - and the standard for enzyme and transporter annotation in UniProtKB. :file_cabinet:
BRENDA, an information system representing one of the most comprehensive enzyme repositories. :file_cabinet:
PathBank, an interactive, visual database containing more than 100 000 machine-readable pathways found in model organisms such as humans, mice, E. coli, yeast, and Arabidopsis thaliana. :file_cabinet:
MetaNetX, an online platform for accessing, analyzing and manipulating genome-scale metabolic networks (GSM) as well as biochemical pathways. :file_cabinet:

The contributions by EC class from each of the data sources is shown in the plot below.

Figure Sources

Using the Pre-trained Model

We provide a model for retrosynthetic pathway prediction pre-trained on ECREACT as part of the IBM RXN for Chemistry platform. This model can also be used through the Python wrapper for the IBM RXN for Chemistry API. You can get a free API key here.

api_key = 'API_KEY'
from rxn4chemistry import RXN4ChemistryWrapper

rxn4chemistry_wrapper = RXN4ChemistryWrapper(api_key=api_key)

# NOTE: you can create a project or set an esiting one using:
# rxn4chemistry_wrapper.set_project('PROJECT_ID')
rxn4chemistry_wrapper.create_project('test_wrapper')
print(rxn4chemistry_wrapper.project_id)

response = rxn4chemistry_wrapper.predict_automatic_retrosynthesis(
    'OC1C(O)C=C(Br)C=C1', ai_model='enzymatic-2021-04-16'
)
results = rxn4chemistry_wrapper.get_predict_automatic_retrosynthesis_results(
    response['prediction_id']
)

print(results['status'])

# NOTE: upon 'SUCCESS' you can inspect the predicted retrosynthetic paths.
print(results['retrosynthetic_paths'][0])

Training your own Model

Setup Environment

As not all dependencies are available from PyPI for all platforms, we suggest you create a conda environment from the supplied conda.yml:

conda env create -f conda.yml
conda activate rxn-biocatalysis-tools

Alternatively, :leaves: RXN Biocatalysis Tools are available as a PyPI package and can be installed usining pip; however, not all dependencies will be installed depending on your platform.

pip install rxn_biocatalysis_tools

Data Pre-processing

The :leaves: RXN Biocatalysis Tools Python package installs a script that can be used to preprocess reaction data. Reaction data can be combined, filtered, or augmented as explained in the usage documentation below. After these initial steps, the data is tokenized and split into training, validation, and testing src (reactants + EC) and tgt (product/s) files. The output data structure generated by the script is the following, depending on the options set.

.
└── experiments
    ├── 1
    │   ├── combined.txt
    │   ├── src-test.txt
    │   ├── src-train.txt
    │   ├── src-valid.txt
    │   ├── tgt-test.txt
    │   ├── tgt-train.txt
    │   └── tgt-valid.txt
    ├── 2
    │   ├── combined.txt
    │   ├── src-test.txt
    │   ├── src-train.txt
    │   ├── src-valid.txt
    │   ├── tgt-test.txt
    │   ├── tgt-train.txt
    │   └── tgt-valid.txt
    └── 3
        ├── combined.txt
        ├── src-test.txt
        ├── src-train.txt
        ├── src-valid.txt
        ├── tgt-test.txt
        ├── tgt-train.txt
        └── tgt-valid.txt

Usage of the script:

rbt-preprocess INPUT_FILE ... OUTPUT_DIRECTORY [--remove-patterns=FILE_PATH] [--remove-molecules=FILE_PATH] [--ec-level=LEVEL; default=3] [-max-products=MAX_PRODUCTS; default=1] [--min-atom-count=MIN_ATOM_COUNT; default=4] [--bi-directional] [--split-products]

Argument / Option	Example	Description	Default
INPUT_FILE ...	file1.csv file2.csv	File(s) containing enzymatic reaction SMILES¹
OUTPUT_DIRECTORY	/output/directory/	The directory to which output files will be written
--remove-patterns	patterns.txt	SMARTS patterns for molecules to be removed²
--remove-molecules	molecules.txt	Molecule SMILES to be removed³
--ec-level	--ec-level 1 --ec-level 2	The number of EC levels to be exported, can be repreated	3
--max-products	--max-products 1	The max number of products (rxns with more are dropped)	1
--min-atom-count	--min-atom-count 4	The min atom count (smaller molecules are removed)	4
--bi-directional	--bi-directional	Whether to create the inverse of every reaction⁴
--split-products	--split-products	Whether to split reactions with multiple prodcuts⁵

¹Example of an enzymatic reaction SMILES: CC(N)=O.O|3.5.1.4>>CC(=O)O
²See patterns.txt for an example
³See molecules.txt for an example
⁴Example: For the reaction CC(N)=O.O|3.5.1.4>>CC(=O)O, the reaction CC(=O)O|3.5.1.4>>CC(N)=O.O will be added
⁵Example: The reaction A|3.5.1.4>>B.C is split into reactions A|3.5.1.4>>B and A|3.5.1.4>>C

Training using OpenNMT-py

The first step in the OpenNMT is to run onmt_preprocess for both the forward and backward models. In the examples below, the data with 3 EC-levels is used. You will probably have to adapt the paths, depending on your directory structure and platform.

The pre-processed USPTO files can be found here.

DATASET=data/uspto_dataset
DATASET_TRANSFER=experiments/3

# forward
onmt_preprocess -train_src "${DATASET}/src-train.txt" "${DATASET_TRANSFER}/src-train.txt" \
    -train_tgt "${DATASET}/tgt-train.txt" "${DATASET_TRANSFER}/tgt-train.txt" -train_ids uspto transfer \
    -valid_src "${DATASET}/src-valid.txt" -valid_tgt "${DATASET_TRANSFER}/tgt-valid.txt" \
    -save_data "preprocessing/multitask_forward" \
    -src_seq_length 3000 -tgt_seq_length 3000 \
    -src_vocab_size 3000 -tgt_vocab_size 3000 \
    -share_vocab

# backward
onmt_preprocess -train_src "${DATASET}/tgt-train.txt" "${DATASET_TRANSFER}/tgt-train.txt" \
    -train_tgt "${DATASET}/src-train.txt" "${DATASET_TRANSFER}/src-train.txt" -train_ids uspto transfer \
    -valid_src "${DATASET}/tgt-valid.txt" -valid_tgt "${DATASET_TRANSFER}/src-valid.txt" \
    -save_data "preprocessing/multitask_backward" \
    -src_seq_length 3000 -tgt_seq_length 3000 \
    -src_vocab_size 3000 -tgt_vocab_size 3000 \
    -share_vocab

Once the OpenNMT pre-preprocessing has finished, the actual training can be started:

# if forward
DATASET="preprocessing/multitask_forward"
OUTDIR="/model/multitask_forward"
LOGDIR="/logs/forward"
# end if

# if backward
DATASET="preprocessing/multitask_backward"
OUTDIR="model/multitask_backward"
LOGDIR="logs/backward"
# end if

W1=9
W2=1

onmt_train -data ${DATASET} \
    -save_model ${OUTDIR} \
    -data_ids uspto transfer --data_weights ${W1} ${W2} \
    -seed 42 -gpu_ranks 0 \
    -train_steps 250000 -param_init 0 \
    -param_init_glorot -max_generator_batches 32 \
    -batch_size 6144 -batch_type tokens \
    -normalization tokens -max_grad_norm 0  -accum_count 4 \
    -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam \
    -warmup_steps 8000 -learning_rate 2 -label_smoothing 0.0 \
    -layers 4 -rnn_size  384 -word_vec_size 384 \
    -encoder_type transformer -decoder_type transformer \
    -dropout 0.1 -position_encoding -share_embeddings  \
    -global_attention general -global_attention_function softmax \
    -self_attn_type scaled-dot -heads 8 -transformer_ff 2048 \
    --tensorboard --tensorboard_log_dir ${LOGDIR}

Evaluation

The test set is evaluated using onmt_translate. Three new files are generated:

tgt-pred.txt (forward prediction)
src-pred.txt (backward prediction)
tgt-pred-rtrp.txt (roundtriip prediction, a backward prediction followed by a forward prediction)

Before the roundtrip prediction, the SMILES in src-pred.txt should be standardized using the script rbt-preprocess. The script does not edit the file in place, so good practise is to rename src-pred.txt to src-pred-noncanon.txt and then run:

rbt-canonicalize src-pred-noncanon.txt src-pred.txt

Example evaluation scripts:

# forward prediction

DATASET_TRANSFER="experiments/3"

# Get the newest file from the model directory
MODEL=$(ls model/multitask_forward*.pt -t | head -1)

onmt_translate -model "${MODEL}" \
    -src "${DATASET_TRANSFER}/src-test.txt" \
    -output "${DATASET_TRANSFER}/tgt-pred.txt" \
    -n_best 10 -beam_size 10 -max_length 300 -batch_size 64 \
    -gpu 0

# backward prediction

DATASET_TRANSFER="experiments/3"

# Get the newest file from the model directory
MODEL=$(ls model/multitask_backward*.pt -t | head -1)

onmt_translate -model "${MODEL}" \
    -src "${DATASET_TRANSFER}/tgt-test.txt" \
    -output "${DATASET_TRANSFER}/src-pred.txt" \
    -n_best 10 -beam_size 10 -max_length 300 -batch_size 64 \
    -gpu 0

# roundtrip prediction

DATASET_TRANSFER="experiments/3"

# Get the newest file from the model directory
MODEL=$(ls model/multitask_forward*.pt -t | head -1)

onmt_translate -model "${MODEL}" \
    -src "${DATASET_TRANSFER}/src-pred.txt" \
    -output "${DATASET_TRANSFER}/tgt-pred-rtrp.txt" \
    -n_best 1 -beam_size 5 -max_length 300 -batch_size 64 \
    -gpu 0

The :leaves: RXN Biocatalysis Tools PyPI package contains an evaluation script that calculates accuracies from the files produced above. For these examples, the INPUT_FOLDER is the same as DATASET_TRANSFER, --n-best-fw, --top-n-fw, --top-n-bw, --top-n-rtr are all set to 10.

rbt-evaluate INPUT_FOLDER --name=NAME [--n-best-fw=N_BEST_FW; default=5] [--n-best-bw=N_BEST_BW; default=10] [--n-best-rtr=N_BEST_RTR; default=1] [--top-n-fw=TOP_N_FW; default=1] [--top-n-fw=TOP_N_FW; default=1] [--top-n-bw=TOP_N_BW; default=1] [--top-n-rtr=TOP_N_RTR; default=1] [--top-n-range] [--isomeric-smiles | --no-isomeric-smiles; default:--isomeric-smiles]

Argument / Option	Example	Description	Default
INPUT_FOLDER	/input/directory/	The folder containing the src and tgt files
--name	experiment-3	The name of the output evaluation csv file
--n-best-fw	--n-best-fw 10	The number of calculated tgt predictions per src	5
--n-best-bw	--n-best-bw 10	The number of calculated src predictions per tgt	10
--n-best-rtr	--n-best-rtr 1	The number of calculated (roundtrip) tgt predictions per predicted src	1
--top-n-fw	--top-n-fw 10	The number of forward predictions to consider in the evaluation	1
--top-n-bw	--top-n-bw 10	The number of backward predictions to consider in the evaluation	1
--top-n-rtr	--top-n-rtr 10	The number of roundtrip predictions to consider in the evaluation	1
--top-n-range	--top-n-range	Whether to consider the forward, backward, and roundtrip predictions up to the respective top-n numbers in the evaluation
--isomeric-smiles	--isomeric-smiles	Do not ignore stereochemistry during the evaluation
--no-isomeric-smiles	--no-isomeric-smiles	Ignore stereochemistry during the evaluation

The evaluation will produce multiple new files in INPUT_FOLDER: A .csv file with the fields metric, type, top, ec, and value; and multiple files containing correct and incorrect forward, backward, and roundtrip predictions for each top-n. In addition, the accuracy of only predicting the EC number is calculated and the respective files written out as well.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

1.0.1

Feb 17, 2022

1.0.0

Feb 17, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rxn-biocatalysis-tools-1.0.1.tar.gz (26.0 kB view details)

Uploaded Feb 17, 2022 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rxn_biocatalysis_tools-1.0.1-py3-none-any.whl (25.1 kB view details)

Uploaded Feb 17, 2022 Python 3

File details

Details for the file rxn-biocatalysis-tools-1.0.1.tar.gz.

File metadata

Download URL: rxn-biocatalysis-tools-1.0.1.tar.gz
Upload date: Feb 17, 2022
Size: 26.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.4 CPython/3.6.13

File hashes

Hashes for rxn-biocatalysis-tools-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`58f52291a5c611f1b76f921b69bb8d979151f443e765ae7c51e9374b3d026769`
MD5	`052d26cc4b471b5e2e070d4ed1b06fe9`
BLAKE2b-256	`b598f901748a8ed352db0dd082c94b34e57bbea439c93f246571a586c3ca5e93`

See more details on using hashes here.

File details

Details for the file rxn_biocatalysis_tools-1.0.1-py3-none-any.whl.

File metadata

Download URL: rxn_biocatalysis_tools-1.0.1-py3-none-any.whl
Upload date: Feb 17, 2022
Size: 25.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.4 CPython/3.6.13

File hashes

Hashes for rxn_biocatalysis_tools-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`77ef0cffc8579bfc434d43895a4fc3184da076df9e158f33bed2b409c5b8f2e8`
MD5	`f482450586261971927c51e6fae245c6`
BLAKE2b-256	`58915dd01edbae7008a37b7ae384ce4568282ee36e101b63974442a75fcc3eb9`

See more details on using hashes here.

rxn-biocatalysis-tools 1.0.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

:leaves: Biocatalysis Model

Table of Contents

Abstract

Data

ECREACT

Data Sources

Using the Pre-trained Model

Training your own Model

Setup Environment

Data Pre-processing

Training using OpenNMT-py

Evaluation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes