Reaction preprocessing tools
RXN reaction preprocessing
This repository is devoted to preprocessing chemical reactions: standardization, filtering, etc. It also includes code for stable train/test/validation splits and data augmentation.
The documentation can be found here.
System Requirements
This package is supported on all operating systems. It has been tested on the following systems:
- macOS: Big Sur (11.1)
- Linux: Ubuntu 18.04.4
A Python version of 3.6 or greater is recommended.
Installation guide
The package can be installed from PyPI:
pip install rxn-reaction-preprocessing
The RDKit dependency is not installed automatically. Install it separately via Conda or PyPI:
# Install RDKit from Conda
conda install -c conda-forge rdkit
# Install RDKit from PyPI
pip install rdkit
# for Python<3.7
# pip install rdkit-pypi
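Because RDKit is not pulled in automatically, it can be useful to confirm that it is importable before running the pipeline. The following is a minimal sketch using only the standard library; the helper name `is_module_available` is hypothetical and not part of this package.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_module_available("rdkit"):
        print("RDKit is installed.")
    else:
        print("RDKit is missing; install it via Conda or pip.")
```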
For local development, the package can be installed with:
pip install -e ".[dev]"
Usage
The following command line scripts are installed with the package.
rxn-data-pipeline
Wrapper for all other scripts. Allows constructing flexible data pipelines. Entrypoint for Hydra structured configuration.
For an overview of all available configuration parameters and default values, run rxn-data-pipeline --cfg job.
Configuration using YAML (see the file config.py for more options and their meaning):
defaults:
  - base_config
data:
  path: /tmp/inference/input.csv
  proc_dir: /tmp/rxn-preproc/exp
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - TOKENIZE
  fragment_bond: TILDE
preprocess:
  min_products: 0
split:
  split_ratio: 0.05
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.processed.train.csv
      out: ${data.proc_dir}/${data.name}.processed.train
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
rxn-data-pipeline --config-dir . --config-name example_config
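The split step in the configuration above takes a `split_ratio`. To illustrate what a stable split can look like, the sketch below assigns each reaction to a split based on a hash of its string, so the assignment does not depend on row order or on a random seed. This is not the package's implementation: the function `assign_split` is hypothetical, and the sketch assumes that `split_ratio` is the fraction of reactions sent to each of the validation and test sets.

```python
import hashlib


def assign_split(reaction: str, split_ratio: float = 0.05) -> str:
    """Deterministically map a reaction string to 'train', 'validation' or 'test'."""
    digest = hashlib.sha256(reaction.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number between 0 and 1.
    value = int.from_bytes(digest[:8], "big") / 2**64
    if value < split_ratio:
        return "test"
    if value < 2 * split_ratio:
        return "validation"
    return "train"


if __name__ == "__main__":
    for rxn in ["CC(=O)O.OCC>>CC(=O)OCC", "CCO.CC(=O)Cl>>CC(=O)OCC"]:
        print(rxn, "->", assign_split(rxn))
```

With this kind of scheme, a reaction keeps its split when the dataset is reshuffled or extended, which is one reasonable reading of "stable".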
Configuration using command line arguments (example):
rxn-data-pipeline \
  data.path=/path/to/data/rxns-small.csv \
  data.proc_dir=/path/to/proc/dir \
  common.fragment_bond=TILDE \
  rxn_import.data_format=TXT \
  tokenize.input_output_pairs.0.out=train.txt \
  tokenize.input_output_pairs.1.out=validation.txt \
  tokenize.input_output_pairs.2.out=test.txt
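When the pipeline is launched from another Python program, the dotted overrides shown above can be generated from a nested dictionary. The sketch below is one way to do this; the helpers `iter_overrides` and `build_command` are hypothetical and not part of the package, and the sketch only handles string and numeric leaf values.

```python
from typing import Any, Iterator, List


def iter_overrides(config: Any, prefix: str = "") -> Iterator[str]:
    """Yield 'dotted.key=value' strings for a nested dict/list structure."""
    if isinstance(config, dict):
        for key, value in config.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_overrides(value, child)
    elif isinstance(config, (list, tuple)):
        # List elements are addressed by their index, as in the command above.
        for index, value in enumerate(config):
            child = f"{prefix}.{index}" if prefix else str(index)
            yield from iter_overrides(value, child)
    else:
        yield f"{prefix}={config}"


def build_command(config: dict) -> List[str]:
    """Build the argument list for the rxn-data-pipeline script."""
    return ["rxn-data-pipeline", *iter_overrides(config)]


overrides = {
    "data": {
        "path": "/path/to/data/rxns-small.csv",
        "proc_dir": "/path/to/proc/dir",
    },
    "common": {"fragment_bond": "TILDE"},
    "rxn_import": {"data_format": "TXT"},
    "tokenize": {
        "input_output_pairs": [
            {"out": "train.txt"},
            {"out": "validation.txt"},
            {"out": "test.txt"},
        ]
    },
}
command = build_command(overrides)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, provided the package is installed in the active environment.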
Note about reading CSV files
Pandas cannot always re-read a CSV file that it wrote itself if the data contains Windows carriage returns (\r).
To make the scripts work despite this, every pd.read_csv call should include the argument lineterminator='\n'.
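As a small illustration of the argument mentioned above, the sketch below reads an in-memory CSV whose first data row contains a bare carriage return inside a field. The sample data is made up for this example. With lineterminator='\n', only the newline character ends a row, so the carriage return stays inside the field.

```python
import io

import pandas as pd

# Made-up sample: the "note" field of the first row contains a bare "\r".
csv_text = "id,note\n1,first\rpart\n2,plain\n"

# Treat only "\n" as the end of a row when parsing.
df = pd.read_csv(io.StringIO(csv_text), lineterminator="\n")

print(len(df))
print(repr(df.loc[0, "note"]))
```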
Examples
A pipeline supporting augmentation
A config that augments the training split, saved as train-augmentation-config.yaml:
defaults:
  - base_config
data:
  name: pipeline-with-augmentation
  path: /tmp/file-with-reactions.txt
  proc_dir: /tmp/rxn-preprocessing/experiment
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - AUGMENT
    - TOKENIZE
  fragment_bond: TILDE
rxn_import:
  data_format: TXT
preprocess:
  min_products: 1
split:
  input_file_path: ${preprocess.output_file_path}
  split_ratio: 0.05
augment:
  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv
  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv
  permutations: 10
  tokenize: false
  random_type: rotated
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv
      out: ${data.proc_dir}/${data.name}.augmented.train
      reaction_column_name: rxn_rotated
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
rxn-data-pipeline --config-dir . --config-name train-augmentation-config
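The `${...}` references in the config above are resolved against other config values when the pipeline runs. To make the resulting file names concrete, the sketch below mimics that substitution with a simplified resolver. It is a stand-in written for illustration, not the Hydra/OmegaConf implementation: the functions `lookup` and `resolve` are hypothetical, and nested interpolations are not handled.

```python
import re
from typing import Any, Mapping

_INTERPOLATION = re.compile(r"\$\{([^}]+)\}")


def lookup(config: Mapping[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'data.name' through nested mappings."""
    value: Any = config
    for part in dotted_key.split("."):
        value = value[part]
    return value


def resolve(template: str, config: Mapping[str, Any]) -> str:
    """Replace every ${dotted.key} in the template with its config value."""
    return _INTERPOLATION.sub(lambda m: str(lookup(config, m.group(1))), template)


config = {
    "data": {
        "name": "pipeline-with-augmentation",
        "proc_dir": "/tmp/rxn-preprocessing/experiment",
    }
}
path = resolve("${data.proc_dir}/${data.name}.augmented.train.csv", config)
print(path)
```

Under these values, the augmented training file is expected at /tmp/rxn-preprocessing/experiment/pipeline-with-augmentation.augmented.train.csv.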