Reaction preprocessing tools

RXN reaction preprocessing

This repository is devoted to preprocessing chemical reactions: standardization, filtering, etc. It also includes code for stable train/test/validation splits and data augmentation.

System Requirements

This package is supported on all operating systems. It has been tested on the following systems:

  • macOS: Big Sur (11.1)
  • Linux: Ubuntu 18.04.4

Python 3.7 or later is recommended.

Installation guide

The package can be installed from PyPI:

pip install rxn-reaction-preprocessing[rdkit]

You can leave out [rdkit] if you prefer to install RDKit manually (via Conda or PyPI).

For local development, the package can be installed with:

pip install -e ".[dev]"

Usage

The following command line scripts are installed with the package.

rxn-data-pipeline

Wrapper for all other scripts. Allows constructing flexible data pipelines. Entrypoint for Hydra structured configuration.

For an overview of all available configuration parameters and default values, run: rxn-data-pipeline --cfg job.

Configuration using YAML (see the file config.py for more options and their meaning):

defaults:
  - base_config

data:
  path: /tmp/inference/input.csv
  proc_dir: /tmp/rxn-preproc/exp
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - TOKENIZE
  fragment_bond: TILDE
preprocess:
  min_products: 0
split:
  split_ratio: 0.05
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.processed.train.csv
      out: ${data.proc_dir}/${data.name}.processed.train
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test

With the configuration above saved as example_config.yaml in the current directory, run:

rxn-data-pipeline --config-dir . --config-name example_config

Configuration using command line arguments (example):

rxn-data-pipeline \
  data.path=/path/to/data/rxns-small.csv \
  data.proc_dir=/path/to/proc/dir \
  common.fragment_bond=TILDE \
  rxn_import.data_format=TXT \
  tokenize.input_output_pairs.0.out=train.txt \
  tokenize.input_output_pairs.1.out=validation.txt \
  tokenize.input_output_pairs.2.out=test.txt
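
When the overrides are generated programmatically (for example, one run per dataset), it can be convenient to assemble the same key=value arguments in Python. The sketch below only builds the argument list; the helper name build_command is hypothetical and not part of the package, and the override keys are copied from the example above.

```python
import shlex


def build_command(overrides: dict) -> list:
    """Return an argv list: the script name followed by key=value overrides.

    Hypothetical helper for illustration; it is not part of the package.
    """
    return ["rxn-data-pipeline"] + [
        f"{key}={value}" for key, value in overrides.items()
    ]


# Override keys and values taken from the command-line example above.
overrides = {
    "data.path": "/path/to/data/rxns-small.csv",
    "data.proc_dir": "/path/to/proc/dir",
    "common.fragment_bond": "TILDE",
    "rxn_import.data_format": "TXT",
}

command = build_command(overrides)

# Print the command in a form that can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the package is installed.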

Note about reading CSV files

pandas appears unable, in some cases, to re-read a CSV file that it wrote itself when the file contains Windows carriage returns (\r). For the scripts to work despite this, every pd.read_csv call should pass the argument lineterminator='\n'.
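
A minimal sketch of the effect, assuming a recent pandas with its default C parser. The inline string stands in for a file on disk, and the column names rxn and comment are made up for the illustration.

```python
import io

import pandas as pd

# Hypothetical file content: one record whose comment field holds a bare
# carriage return (\r), as can happen with text originating on Windows.
raw = "rxn,comment\nCCO>>CC=O,first line\rsecond line\n"

# Default parsing: the parser may treat the \r as the end of a line,
# which splits the single record.
default_df = pd.read_csv(io.StringIO(raw))

# Restricting the line terminator to \n keeps the \r inside the field.
fixed_df = pd.read_csv(io.StringIO(raw), lineterminator="\n")

print(len(default_df), len(fixed_df))
print(repr(fixed_df.loc[0, "comment"]))
```

With lineterminator="\n" the record stays on a single row and the carriage return is preserved as part of the field value.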

Examples

A pipeline supporting augmentation

The following config, saved as train-augmentation-config.yaml, adds augmentation of the training split:

defaults:
  - base_config

data:
  name: pipeline-with-augmentation
  path: /tmp/file-with-reactions.txt
  proc_dir: /tmp/rxn-preprocessing/experiment
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - AUGMENT
    - TOKENIZE
  fragment_bond: TILDE
rxn_import:
  data_format: TXT
preprocess:
  min_products: 1
split:
  input_file_path: ${preprocess.output_file_path}
  split_ratio: 0.05
augment:
  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv
  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv
  permutations: 10
  tokenize: false
  random_type: rotated
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv
      out: ${data.proc_dir}/${data.name}.augmented.train
      reaction_column_name: rxn_rotated
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test

With this file in the current directory, run:

rxn-data-pipeline --config-dir . --config-name train-augmentation-config
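
The ${...} references in the config are resolved by Hydra/OmegaConf when the pipeline runs. To make the resulting file names concrete, the sketch below mimics that substitution in plain Python. It is a simplified stand-in, not the resolver the package uses, and the function name resolve is hypothetical.

```python
import re

# Values copied from the `data` section of train-augmentation-config.yaml.
config = {
    "data": {
        "name": "pipeline-with-augmentation",
        "proc_dir": "/tmp/rxn-preprocessing/experiment",
    }
}


def resolve(template: str, cfg: dict) -> str:
    """Replace each ${a.b} reference with the value stored at cfg["a"]["b"].

    Simplified stand-in for OmegaConf interpolation (no nesting, no defaults).
    """

    def lookup(match):
        value = cfg
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


# Input and output of the AUGMENT step, as written in the config.
augment_input = resolve("${data.proc_dir}/${data.name}.processed.train.csv", config)
augment_output = resolve("${data.proc_dir}/${data.name}.augmented.train.csv", config)

print(augment_input)
print(augment_output)
```

The resolved augment output path is the same file that the first tokenize input_output_pairs entry reads, which is how the AUGMENT and TOKENIZE steps are chained in this config.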
