
Extraction and cleaning of chemical reaction data from ORD

Project description

ORDerly

🧪 Cleaning chemical reaction data 🧪

🎯 Condition Prediction Benchmark 🎯

Quick Install

Requires Python 3.10 (tested on macOS and Linux)

pip install orderly

🤔 What is this?

Machine learning has the potential to provide tremendous value to chemistry. However, large amounts of clean, high-quality data are needed to train models.

ORDerly cleans chemical reaction data from the growing Open Reaction Database (ORD).

Use ORDerly to extract and clean chemical reaction data from ORD, and to benchmark reaction condition prediction models.

[Abstract figure]

📖 Extract and clean a dataset

Download data from ORD

Data in ORD format should be placed in a folder called /data/ord/. You can use either your own data or the open-source ORD data.

To download the ORD data, follow the instructions in the ORD repository (i.e. install Git LFS and clone their repository), then place the data in a folder called /data/ord/.

Extract data from the ORD files

python -m orderly.extract

To run ORDerly on your own data, or to specify the input and output paths explicitly:

python -m orderly.extract --input_path="/data/ord/" --output_path="/data/orderly/"

This will generate a parquet file for each ORD file.
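
As a quick sanity check, the generated parquet files can be listed and one of them inspected with pandas. This is a minimal sketch assuming the output path from the command above; the subdirectory layout of the output is not assumed, so the files are found with a recursive glob:

import pathlib

import pandas as pd

# Assumed output path from the extraction command above.
output_dir = pathlib.Path("/data/orderly/")

# One parquet file is generated per ORD file; search recursively in case
# the extraction writes into subfolders.
parquet_files = sorted(output_dir.glob("**/*.parquet"))
print(f"Found {len(parquet_files)} extracted parquet files")

if parquet_files:
    # Column names are whatever ORDerly writes; just take a quick look.
    df = pd.read_parquet(parquet_files[0])
    print(df.shape)
    print(df.columns.tolist())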

Clean the data

python -m orderly.clean

This will produce train and test parquet files, along with a .json file recording the arguments used and a .log file recording the operations run.
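
Once cleaning has finished, the outputs can be loaded for a quick look. This is a minimal sketch with placeholder file names; substitute the actual train/test parquet files and the .json file that orderly.clean wrote to your output path:

import json

import pandas as pd

# Placeholder paths: replace with the files actually written by orderly.clean.
train_df = pd.read_parquet("data/orderly/orderly_train.parquet")
test_df = pd.read_parquet("data/orderly/orderly_test.parquet")

# The accompanying .json file records the arguments the cleaning run used.
with open("data/orderly/orderly_args.json") as f:
    cleaning_args = json.load(f)

print(train_df.shape, test_df.shape)
print(cleaning_args)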

🚀 Download the condition prediction benchmark dataset

Reaction condition prediction is the problem of predicting what goes "above the arrow" in a chemical reaction (for example, solvents and agents).

There are three options for downloading the benchmark.

  1. If you have orderly installed, you can download the benchmark with this command:

python -m orderly.download.benchmark

  2. Alternatively, download the ORDerly condition prediction benchmark dataset directly from Figshare.

  3. Or use the following code to download it without installing ORDerly. Make sure to install the needed dependencies first (shown below).


pip install requests fastparquet pandas

import pathlib
import zipfile

import pandas as pd
import requests


def download_benchmark(
    benchmark_zip_file="orderly_benchmark.zip",
    benchmark_directory="orderly_benchmark/",
    version=2,
):
    """Download the ORDerly benchmark archive from Figshare and extract it."""
    figshare_url = (
        f"https://figshare.com/ndownloader/articles/23298467/versions/{version}"
    )
    print(f"Downloading benchmark from {figshare_url} to {benchmark_zip_file}")
    r = requests.get(figshare_url, allow_redirects=True)
    r.raise_for_status()  # fail early if the download did not succeed
    with open(benchmark_zip_file, "wb") as f:
        f.write(r.content)

    print("Unzipping benchmark")
    benchmark_directory = pathlib.Path(benchmark_directory)
    benchmark_directory.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(benchmark_zip_file, "r") as zip_ref:
        zip_ref.extractall(benchmark_directory)


download_benchmark()

# The extracted archive contains the train and test splits as parquet files.
train_df = pd.read_parquet("orderly_benchmark/orderly_benchmark_train.parquet")
test_df = pd.read_parquet("orderly_benchmark/orderly_benchmark_test.parquet")
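
After the download completes, train_df and test_df are ordinary pandas DataFrames, so a quick sanity check might look like this (no particular column names are assumed):

# Basic inspection of the downloaded benchmark splits.
print(train_df.shape, test_df.shape)
print(train_df.columns.tolist())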

📋 Reproducing results from paper

To reproduce the results from the paper, clone the repository and use poetry to install the requirements. Towards the bottom of the Makefile you will find an eight-step procedure that generates all the datasets and reproduces all the results presented in the paper.

Results

We run the condition prediction model on four different datasets, and find that trusting the labelling of the ORD data leads to overly confident test accuracy. We conclude that applying chemical logic to the reaction string is necessary to get a high-quality dataset, and that the best strategy for dealing with rare molecules is to delete reactions where they appear.

Top-3 exact-match combination accuracy (%), reported as frequency-informed guess // model prediction // AIB%:

Dataset            | A (labeling; rare->"other") | B (labeling; rare->delete rxn) | C (reaction string; rare->"other") | D (reaction string; rare->delete rxn)
Solvents           | 47 // 58 // 21%             | 50 // 61 // 22%                | 23 // 42 // 26%                    | 24 // 45 // 28%
Agents             | 54 // 70 // 35%             | 58 // 72 // 32%                | 19 // 39 // 25%                    | 21 // 42 // 27%
Solvents & Agents  | 31 // 44 // 19%             | 33 // 47 // 21%                |  4 // 21 // 18%                    |  5 // 24 // 21%

Here AIB% is the Average Improvement of the model over the Baseline (i.e. a frequency-informed guess), where $A_m$ is the accuracy of the model and $A_b$ is the accuracy of the baseline: $AIB = (A_m - A_b) / (1 - A_b)$.
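
As a worked example of this formula (a minimal sketch using the Solvents / dataset A entry from the table above, where the baseline is 47% and the model is 58%):

# AIB = (A_m - A_b) / (1 - A_b): the fraction of the headroom above the
# baseline that the model actually recovers.
def aib(model_accuracy: float, baseline_accuracy: float) -> float:
    return (model_accuracy - baseline_accuracy) / (1 - baseline_accuracy)

# Solvents, dataset A: baseline 47%, model 58% -> roughly 21%, matching the table.
print(f"{aib(0.58, 0.47):.0%}")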

Full API documentation

Extraction

There are two different ways to extract data from ORD files: trusting the labelling, or using the reaction string (controlled by the trust_labelling boolean). Below are the arguments that can be passed to the extraction script; change them as appropriate:

python -m orderly.extract --name_contains_substring="uspto" --trust_labelling=False --output_path="data/orderly/uspto_no_trust" --consider_molecule_names=False

Cleaning

There are also a number of customisable options for the cleaning step:

python -m orderly.clean --output_path="data/orderly/datasets_$(dataset_version)/orderly_no_trust_no_map.parquet" --ord_extraction_path="data/orderly/uspto_no_trust/extracted_ords" --molecules_to_remove_path="data/orderly/uspto_no_trust/all_molecule_names.csv" --min_frequency_of_occurrence=100 --map_rare_molecules_to_other=False --set_unresolved_names_to_none_if_mapped_rxn_str_exists_else_del_rxn=True --remove_rxn_with_unresolved_names=False --set_unresolved_names_to_none=False --num_product=1 --num_reactant=2 --num_solv=2 --num_agent=3 --num_cat=0 --num_reag=0 --consistent_yield=True --scramble=True --train_test_split_fraction=0.9

A list of solvents (names and SMILES) commonly used in pharmaceutical chemistry can be found at orderly/data/solvents.csv.
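
As an illustration, this file can be loaded directly with pandas (a minimal sketch; the path assumes a clone of the repository, and the exact column names are not assumed here):

import pandas as pd

# Path relative to a clone of the ORDerly repository.
solvents = pd.read_csv("orderly/data/solvents.csv")
print(len(solvents), "solvents")
print(solvents.columns.tolist())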

Issues?

Submit an issue or send an email to dsw46@cam.ac.uk.

Citing

If you find this project useful, we encourage you to:

  • Star this repository ⭐

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

orderly-1.0.0.tar.gz (18.5 MB)

Uploaded Source

Built Distribution

orderly-1.0.0-py3-none-any.whl (18.5 MB)

Uploaded Python 3

File details

Details for the file orderly-1.0.0.tar.gz.

File metadata

  • Download URL: orderly-1.0.0.tar.gz
  • Upload date:
  • Size: 18.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.1 CPython/3.10.8 Darwin/22.5.0

File hashes

Hashes for orderly-1.0.0.tar.gz

  • SHA256: d3236bd499f4cd12a3dd869c3f5488fbfb5648adf98ab52f4c8c8cb61784cb21
  • MD5: 1697c4ce2e6ac0874aada21a3961206e
  • BLAKE2b-256: ed589fe933a88ed9facb34816cb1103ab505eeb45c0e2947688d0817e4b41b58

See more details on using hashes here.

File details

Details for the file orderly-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: orderly-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 18.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.1 CPython/3.10.8 Darwin/22.5.0

File hashes

Hashes for orderly-1.0.0-py3-none-any.whl

  • SHA256: 9575b15f40fa12f468f38b45e5c98ee65a27431129c6581db4caf7385436a4aa
  • MD5: 8871e0afece23d64ce2e0efb83f27cf7
  • BLAKE2b-256: b0391d31164914c427171a31e7f5118f2bb6d58e6b2057e219c03c21a0e0b430

See more details on using hashes here.
