The computer-assisted chemical synthesis data source research project.

Data Source

Welcome to the computer-assisted chemical synthesis data source research project!

Over the past decade, computer-assisted chemical synthesis has re-emerged as a prominent research subject. Although the idea of using computers to assist chemical synthesis has existed for nearly as long as computers themselves, the inherent complexity of the problem has repeatedly exceeded the available resources. Recent machine learning approaches have shown the potential to break this pattern. However, their performance depends on data that frequently suffer from limited quantity, quality, visibility, and accessibility, which poses significant challenges to potential scientific breakthroughs. Consequently, the primary objective of the Data Source research project is to systematically curate and facilitate access to relevant open computer-assisted chemical synthesis data sources.

Utilization Instructions

The utilization instructions of this repository are structured as follows:

  • Installation of the Package
  • Utilization of the Package
  • Utilization of the Scripts

Installation of the Package

The data_source package can be installed in an existing environment using the pip command as follows:

pip install ncsw-data-source

A local environment can be created using the git and conda commands as follows:

git clone https://github.com/neo-chem-synth-wave/data-source.git

cd data-source

conda env create -f environment.yaml

conda activate data-source-env

pip install .
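
After either installation route, it can be useful to confirm that the package is visible to the active environment. The following is a minimal sketch that relies only on the Python standard library. The distribution name ncsw-data-source and the import name data_source are taken from the commands and examples in this document, while the helper name describe_installation is hypothetical.

```python
from importlib import metadata, util
from typing import Optional, Tuple


def describe_installation(
    distribution_name: str = "ncsw-data-source",
    module_name: str = "data_source"
) -> Tuple[Optional[str], bool]:
    """Return the installed distribution version (or None) and whether the module is importable."""
    try:
        version = metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the active environment.
        version = None

    # find_spec returns None if the top-level module cannot be located.
    importable = util.find_spec(module_name) is not None

    return version, importable


print(describe_installation())
```

If the first element is None or the second element is False, the package is most likely not installed in the environment that is currently active.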

Utilization of the Package

The data_source package supports three alternatives for downloading, extracting, and formatting a specific version of computer-assisted chemical synthesis data from a specific source. The first alternative is to import and use the individual data source utility classes:

from data_source.compound.zinc.utility import (
    ZINCCompoundDatabaseDownloadUtility,
    ZINCCompoundDatabaseExtractionUtility,
    ZINCCompoundDatabaseFormattingUtility
)

ZINCCompoundDatabaseDownloadUtility.download_v_building_block(
    version="v_building_block_bb_30",
    output_directory_path="/path/to/the/directory_a"
)

ZINCCompoundDatabaseExtractionUtility.extract_v_building_block(
    version="v_building_block_bb_30",
    input_directory_path="/path/to/the/directory_a",
    output_directory_path="/path/to/the/directory_b"
)

ZINCCompoundDatabaseFormattingUtility.format_v_building_block(
    version="v_building_block_bb_30",
    input_directory_path="/path/to/the/directory_b",
    output_directory_path="/path/to/the/directory_c"
)
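
Each step above reads from the directory that the previous step wrote to. The following is a minimal sketch of a helper that prepares such a three-stage directory layout before the utilities are called. The names StageDirectories and create_stage_directories are hypothetical, and whether the package creates missing directories on its own is not stated here, so the sketch creates them explicitly.

```python
import tempfile
from pathlib import Path
from typing import NamedTuple


class StageDirectories(NamedTuple):
    """Hypothetical container for the download, extraction, and formatting directories."""
    download: Path
    extraction: Path
    formatting: Path


def create_stage_directories(root_directory_path: str) -> StageDirectories:
    """Create one sub-directory per processing stage under the given root directory."""
    root = Path(root_directory_path)

    stages = StageDirectories(
        download=root / "directory_a",
        extraction=root / "directory_b",
        formatting=root / "directory_c",
    )

    for directory_path in stages:
        directory_path.mkdir(parents=True, exist_ok=True)

    return stages


with tempfile.TemporaryDirectory() as temporary_directory_path:
    stages = create_stage_directories(temporary_directory_path)
    print([directory_path.name for directory_path in stages])

    # The paths can then be passed to the utility classes shown above, e.g.:
    # ZINCCompoundDatabaseDownloadUtility.download_v_building_block(
    #     version="v_building_block_bb_30",
    #     output_directory_path=str(stages.download)
    # )
```

The paths are converted to strings in the commented call because the examples in this document pass string paths.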

The second alternative is to import and use the individual data source classes:

from data_source.compound.zinc import ZINCCompoundDatabase

zinc_compound_db = ZINCCompoundDatabase()

zinc_compound_db.get_supported_versions()

zinc_compound_db.download(
    version="v_building_block_bb_30",
    output_directory_path="/path/to/the/directory_a"
)

zinc_compound_db.extract(
    version="v_building_block_bb_30",
    input_directory_path="/path/to/the/directory_a",
    output_directory_path="/path/to/the/directory_b"
)

zinc_compound_db.format(
    version="v_building_block_bb_30",
    input_directory_path="/path/to/the/directory_b",
    output_directory_path="/path/to/the/directory_c"
)

The third alternative is to import and use the data source category classes:

from data_source.compound import CompoundDataSource

compound_data_source = CompoundDataSource()

compound_data_source.get_names_of_supported_data_sources()

compound_data_source.get_supported_versions(
    name="zinc"
)

compound_data_source.download(
    name="zinc",
    version="v_building_block_bb_30",
    output_directory_path="/path/to/the/directory_a"
)

compound_data_source.extract(
    name="zinc",
    version="v_building_block_bb_30",
    input_directory_path="/path/to/the/directory_a",
    output_directory_path="/path/to/the/directory_b"
)

compound_data_source.format(
    name="zinc",
    version="v_building_block_bb_30",
    input_directory_path="/path/to/the/directory_b",
    output_directory_path="/path/to/the/directory_c"
)
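
The data source classes and the data source category classes expose the same download, extract, and format methods, with the category classes taking an additional name argument. The following is a minimal sketch of a wrapper that chains the three calls for either kind of object. The function download_extract_and_format and the RecordingDataSource stand-in are hypothetical. The stand-in only records the calls so that the sketch runs without network access, and the keyword arguments mirror those used in the examples above.

```python
from typing import Any, Dict, List, Optional, Tuple


def download_extract_and_format(
    data_source: Any,
    version: str,
    directory_a: str,
    directory_b: str,
    directory_c: str,
    name: Optional[str] = None
) -> None:
    """Chain the download, extract, and format calls of a data source object."""
    # Category classes expect the data source name, individual classes do not.
    extra: Dict[str, Any] = {} if name is None else {"name": name}

    data_source.download(
        version=version, output_directory_path=directory_a, **extra
    )
    data_source.extract(
        version=version, input_directory_path=directory_a,
        output_directory_path=directory_b, **extra
    )
    data_source.format(
        version=version, input_directory_path=directory_b,
        output_directory_path=directory_c, **extra
    )


class RecordingDataSource:
    """Hypothetical stand-in that records calls instead of processing data."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def download(self, **kwargs: Any) -> None:
        self.calls.append(("download", kwargs))

    def extract(self, **kwargs: Any) -> None:
        self.calls.append(("extract", kwargs))

    def format(self, **kwargs: Any) -> None:
        self.calls.append(("format", kwargs))


stand_in = RecordingDataSource()

download_extract_and_format(
    stand_in,
    version="v_building_block_bb_30",
    directory_a="/path/to/the/directory_a",
    directory_b="/path/to/the/directory_b",
    directory_c="/path/to/the/directory_c",
    name="zinc"
)

print([step for step, _ in stand_in.calls])
```

In practice, the stand-in would be replaced by an instance such as ZINCCompoundDatabase() without the name argument, or CompoundDataSource() with name="zinc".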

Utilization of the Scripts

The purpose of the scripts directory is to illustrate how to utilize the data_source package to download, extract, and format a specific version of computer-assisted chemical synthesis data from a specific source. The download_extract_and_format_data script can be utilized as follows:

# Get the chemical reaction data source name information.
python scripts/download_extract_and_format_data.py \
  --data_source_category "reaction" \
  --get_data_source_name_information
# Get the USPTO chemical reaction dataset version information.
python scripts/download_extract_and_format_data.py \
  --data_source_category "reaction" \
  --data_source_name "uspto" \
  --get_data_source_version_information
# Download, extract, and format the data from the USPTO chemical reaction dataset.
python scripts/download_extract_and_format_data.py \
  --data_source_category "reaction" \
  --data_source_name "uspto" \
  --data_source_version "v_50k_by_20171116_coley_c_w_et_al" \
  --output_directory_path "/path/to/the/output/directory"

The full list of script arguments is as follows:

  • --data_source_category or -dsc → The category of the data source (i.e., compound, compound_pattern, reaction, or reaction_pattern).
  • --get_data_source_name_information or -gdsni → The indicator of whether to get the data source name information.
  • --data_source_name or -dsn → The name of the data source (i.e., chembl, crd, miscellaneous, ord, rdkit, retro_rules, rhea, uspto, or zinc).
  • --get_data_source_version_information or -gdsvi → The indicator of whether to get the data source version information.
  • --data_source_version or -dsv → The version of the data source.
  • --output_directory_path or -odp → The path to the output directory where the data should be downloaded, extracted, and formatted.
  • --number_of_processes or -nop → The number of processes, if relevant.
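
To make the argument list above more concrete, the following is a minimal sketch of how such a command-line interface could be declared with argparse. It is not the code of the actual script. The option names and category choices are taken from the list above, whereas the function name build_argument_parser, the required flag, the store_true actions, and the default number of processes are assumptions.

```python
import argparse


def build_argument_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented script arguments (a sketch, not the actual script)."""
    parser = argparse.ArgumentParser(
        description="Download, extract, and format data from a data source."
    )

    parser.add_argument(
        "--data_source_category", "-dsc",
        type=str,
        choices=["compound", "compound_pattern", "reaction", "reaction_pattern"],
        required=True,  # Assumption: the category is always needed.
    )
    # Assumption: the two indicators are plain flags without a value.
    parser.add_argument(
        "--get_data_source_name_information", "-gdsni", action="store_true"
    )
    parser.add_argument("--data_source_name", "-dsn", type=str)
    parser.add_argument(
        "--get_data_source_version_information", "-gdsvi", action="store_true"
    )
    parser.add_argument("--data_source_version", "-dsv", type=str)
    parser.add_argument("--output_directory_path", "-odp", type=str)
    # Assumption: a single process is used unless stated otherwise.
    parser.add_argument("--number_of_processes", "-nop", type=int, default=1)

    return parser


arguments = build_argument_parser().parse_args([
    "-dsc", "reaction",
    "-dsn", "uspto",
    "-dsv", "v_50k_by_20171116_coley_c_w_et_al",
    "-odp", "/path/to/the/output/directory",
])

print(arguments.data_source_category, arguments.data_source_name)
```

The short options used in the example call correspond to the long options of the third command shown above.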

Supported Data Sources

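The following data sources are supported:

  • Chemical Compounds
  • Chemical Compound Patterns
  • Chemical Reactions
  • Chemical Reaction Patterns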

Chemical Compounds

The following chemical compound data sources are supported:

  • ZINC
  • ChEMBL
  • COCONUT
  • Miscellaneous Chemical Compound Data Sources

The chemical compound data source relationships can be illustrated as follows:

[Figure: chemical_compound_data_sources.png]

ZINC

The following ZINC [1, 2, 3] chemical compound database versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_building_block_{building_block_subset_name} [2] | 10.1021/acs.jcim.0c00675 | :green_circle: |
| v_catalog_{catalog_subset_name} [2] | 10.1021/acs.jcim.0c00675 | :green_circle: |

:green_circle: Completely Implemented

ChEMBL

The following ChEMBL [4] chemical compound database versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_release_{release_number ≥ 25} [4] | 10.6019/CHEMBL.database.{release_number} | :green_circle: |

:green_circle: Completely Implemented
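
The ChEMBL version name and DOI are both parameterized by the release number. The following is a minimal sketch of how a concrete pair could be derived from the templates in the table above. The helper name chembl_version_and_doi is hypothetical, and the assumption is that the placeholder is replaced by the plain release number, which should be checked against the output of get_supported_versions.

```python
from typing import Tuple

# The minimum supported release number, as stated in the table above.
MINIMUM_CHEMBL_RELEASE_NUMBER = 25


def chembl_version_and_doi(release_number: int) -> Tuple[str, str]:
    """Fill the version name and DOI templates for a ChEMBL release number."""
    if release_number < MINIMUM_CHEMBL_RELEASE_NUMBER:
        raise ValueError(
            f"The release number must be at least {MINIMUM_CHEMBL_RELEASE_NUMBER}, "
            f"got {release_number}."
        )

    # Assumption: the placeholder is replaced by the plain release number.
    version = f"v_release_{release_number}"
    doi = f"10.6019/CHEMBL.database.{release_number}"

    return version, doi


print(chembl_version_and_doi(release_number=33))
```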

COCONUT

The following COCONUT [5, 6] chemical compound database versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_2_0_by_20241126_chandrasekhar_v_et_al [6] | 10.5281/zenodo.13382750 | :green_circle: |
| v_2_0_complete_by_20241126_chandrasekhar_v_et_al [6] | 10.5281/zenodo.13382750 | :green_circle: |

:green_circle: Completely Implemented

Miscellaneous Chemical Compound Data Sources

The following miscellaneous chemical compound data sources are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_moses_by_20201218_polykovskiy_d_et_al [7] | 10.3389/fphar.2020.565644 | :green_circle: |

:green_circle: Completely Implemented

Chemical Compound Patterns

The following chemical compound pattern data sources are supported:

  • RDKit

The chemical compound pattern data source relationships can be illustrated as follows:

[Figure: chemical_compound_pattern_data_sources.png]

RDKit

The following RDKit [8] chemical compound pattern dataset versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_htl_by_20080307_brenk_r_et_al [9] | 10.1002/cmdc.200700139 | :green_circle: |
| v_pains_by_20100204_baell_j_b_and_holloway_g_a [10] | 10.1021/jm901137j | :green_circle: |

:green_circle: Completely Implemented

Chemical Reactions

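The following chemical reaction data sources are supported:

  • United States Patent and Trademark Office (USPTO)
  • Open Reaction Database (ORD)
  • Chemical Reaction Database (CRD)
  • Rhea
  • Miscellaneous Chemical Reaction Data Sources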

The chemical reaction data source relationships can be illustrated as follows:

[Figure: chemical_reaction_data_sources.png]

United States Patent and Trademark Office (USPTO)

The following United States Patent and Trademark Office (USPTO) [11] chemical reaction dataset versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_1976_to_2013_rsmi_by_20121009_lowe_d_m [11] | 10.6084/m9.figshare.12084729.v1 | :green_circle: |
| v_50k_by_20141226_schneider_n_et_al [12] | 10.1021/ci5006614 | :green_circle: |
| v_50k_by_20161122_schneider_n_et_al [13] | 10.1021/acs.jcim.6b00564 | :green_circle: |
| v_15k_by_20170418_coley_c_w_et_al [14] | 10.1021/acscentsci.7b00064 | :green_circle: |
| v_1976_to_2016_cml_by_20121009_lowe_d_m [11] | 10.6084/m9.figshare.5104873.v1 | :yellow_circle: |
| v_1976_to_2016_rsmi_by_20121009_lowe_d_m [11] | 10.6084/m9.figshare.5104873.v1 | :green_circle: |
| v_50k_by_20170905_liu_b_et_al [15] | 10.1021/acscentsci.7b00303 | :green_circle: |
| v_50k_by_20171116_coley_c_w_et_al [16] | 10.1021/acscentsci.7b00355 | :green_circle: |
| v_480k_or_mit_by_20171204_jin_w_et_al [17] | 10.48550/arXiv.1709.04555 | :green_circle: |
| v_480k_or_mit_by_20180622_schwaller_p_et_al [18] | 10.1039/C8SC02339E | :green_circle: |
| v_stereo_by_20180622_schwaller_p_et_al [18] | 10.1039/C8SC02339E | :green_circle: |
| v_lef_by_20181221_bradshaw_j_et_al [19] | 10.48550/arXiv.1805.10970 | :green_circle: |
| v_1k_tpl_by_20210128_schwaller_p_et_al [20] | 10.1038/s42256-020-00284-w | :green_circle: |
| v_1976_to_2016_remapped_by_20210407_schwaller_p_et_al [21] | 10.1126/sciadv.abe4166 | :green_circle: |
| v_1976_to_2016_remapped_by_20240313_chen_s_et_al [22] | 10.6084/m9.figshare.25046471.v1 | :green_circle: |
| v_50k_remapped_by_20240313_chen_s_et_al [22] | 10.6084/m9.figshare.25046471.v1 | :green_circle: |
| v_mech_31k_by_20240810_chen_s_et_al [23] | 10.6084/m9.figshare.24797220.v2 | :green_circle: |

:green_circle: Completely Implemented
:yellow_circle: Partially Implemented (Limited to Reaction SMILES Strings)
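
The USPTO version names above appear to follow a common pattern: a v_ prefix, a descriptive label, the token _by_, an eight-digit date, and the author names. The following is a minimal sketch of a parser for that pattern, which can help when sorting or filtering versions. The convention is inferred from the table rather than documented, and the names VERSION_PATTERN, ParsedVersion, and parse_version are hypothetical.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Inferred convention: v_<label>_by_<yyyymmdd>_<authors>
VERSION_PATTERN = re.compile(
    r"^v_(?P<label>.+)_by_(?P<date>\d{8})_(?P<authors>.+)$"
)


class ParsedVersion(NamedTuple):
    """Hypothetical container for the components of a version name."""
    label: str
    version_date: date
    authors: str


def parse_version(version: str) -> Optional[ParsedVersion]:
    """Split a version name into its components, or return None if it does not fit the pattern."""
    match = VERSION_PATTERN.match(version)

    if match is None:
        return None

    try:
        version_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # The eight digits do not form a valid calendar date.
        return None

    return ParsedVersion(
        label=match.group("label"),
        version_date=version_date,
        authors=match.group("authors"),
    )


print(parse_version("v_50k_by_20171116_coley_c_w_et_al"))
```

Version names that do not contain the _by_ token followed by a date, such as the release-based names used by other data sources, are returned as None by this sketch.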

Open Reaction Database (ORD)

The following Open Reaction Database (ORD) [24] versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_release_0_1_0 [24] | 10.1021/jacs.1c09820 | :yellow_circle: |
| v_release_main [24] | 10.1021/jacs.1c09820 | :yellow_circle: |

:green_circle: Completely Implemented
:yellow_circle: Partially Implemented (Limited to Reaction SMILES Strings)

Chemical Reaction Database (CRD)

The following Chemical Reaction Database (CRD) [25] versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_reaction_smiles_2001_to_2021 [25] | 10.6084/m9.figshare.20279733.v1 | :green_circle: |
| v_reaction_smiles_2001_to_2023 [25] | 10.6084/m9.figshare.22491730.v1 | :green_circle: |
| v_reaction_smiles_2023 [25] | 10.6084/m9.figshare.24921555.v1 | :green_circle: |
| v_reaction_smiles_1976_to_2024 [25] | 10.6084/m9.figshare.28230053.v1 | :green_circle: |

:green_circle: Completely Implemented

Rhea

The following Rhea [26] chemical reaction database versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_release_{release_number ≥ 126} [26] | 10.1093/nar/gkab1016 | :green_circle: |

:green_circle: Completely Implemented

Miscellaneous Chemical Reaction Data Sources

The following miscellaneous chemical reaction data sources are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_20131008_kraut_h_et_al [27] | 10.1021/ci400442f | :green_circle: |
| v_20161014_wei_j_n_et_al [28] | 10.1021/acscentsci.6b00219 | :green_circle: |
| v_20200508_grambow_c_et_al [29] | 10.5281/zenodo.3581266 | :green_circle: |
| v_add_on_by_20200508_grambow_c_et_al [29] | 10.5281/zenodo.3731553 | :green_circle: |
| v_golden_dataset_by_20211102_lin_a_et_al [30] | 10.1002/minf.202100138 | :green_circle: |
| v_rdb7_by_20220718_spiekermann_k_et_al [31] | 10.5281/zenodo.5652097 | :green_circle: |
| v_orderly_condition_by_20240422_wigh_d_s_et_al [32] | 10.6084/m9.figshare.23298467.v4 | :green_circle: |
| v_orderly_forward_by_20240422_wigh_d_s_et_al [32] | 10.6084/m9.figshare.23298467.v4 | :green_circle: |
| v_orderly_retro_by_20240422_wigh_d_s_et_al [32] | 10.6084/m9.figshare.23298467.v4 | :green_circle: |

:green_circle: Completely Implemented

Chemical Reaction Patterns

The following chemical reaction pattern data sources are supported:

  • RetroRules
  • Miscellaneous Chemical Reaction Pattern Data Sources

The chemical reaction pattern data source relationships can be illustrated as follows:

[Figure: chemical_reaction_pattern_data_sources.png]

RetroRules

The following RetroRules [33] chemical reaction pattern database versions are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_release_rr01_rp2_hs [33] | 10.5281/zenodo.5827427 | :green_circle: |
| v_release_rr02_rp2_hs [33] | 10.5281/zenodo.5828017 | :green_circle: |
| v_release_rr02_rp3_hs [33] | 10.5281/zenodo.5827977 | :green_circle: |
| v_release_rr02_rp3_nohs [33] | 10.5281/zenodo.5827969 | :green_circle: |

:green_circle: Completely Implemented

Miscellaneous Chemical Reaction Pattern Data Sources

The following miscellaneous chemical reaction pattern data sources are supported:

| Version | DOI | Status |
| --- | --- | --- |
| v_retro_transform_db_by_20180421_avramova_s_et_al [34] | 10.5281/zenodo.1209312 | :green_circle: |
| v_dingos_by_20190701_button_a_et_al [35] | 10.24433/CO.6930970.v1 | :green_circle: |
| v_auto_template_by_20240627_chen_l_and_li_y [36] | 10.1186/s13321-024-00869-2 | :green_circle: |

:green_circle: Completely Implemented

Data

The purpose of the data directory is to archive and back up the data sources that are hosted on GitHub, GitLab, and CodeOcean repositories.

License Information

The contents of this repository are published under the MIT license. Please refer to the individual references for more details regarding the license information of external resources utilized within the repository.

Contact

If you are interested in contributing to this research project by reporting bugs, asking questions, providing feedback, or contributing code and data, please refer to the contribution guidelines.

References

[1] Sterling, T. and Irwin, J.J. ZINC15 – Ligand Discovery for Everyone. J. Chem. Inf. Model., 2015, 55, 11, 2324-2337.

[2] Irwin, J.J. et al. ZINC20 - A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model., 2020, 60, 12, 6065-6073.

[3] Tingle, B.I. et al. ZINC22 - A Free Multi-billion-scale Database of Tangible Compounds for Ligand Discovery. J. Chem. Inf. Model., 2023, 63, 4, 1166-1176.

[4] Zdrazil, B. et al. The ChEMBL Database in 2023: A Drug Discovery Platform Spanning Multiple Bioactivity Data Types and Time Periods. Nucleic Acids Research, 52, D1, 2024, D1180-D1192.

[5] Sorokina, M. et al. COCONUT Online: Collection of Open Natural Products Database. J. Cheminform., 13, 2, 2021.

[6] Chandrasekhar, V. et al. COCONUT 2.0: A Comprehensive Overhaul and Curation of the Collection of Open Natural Products Database. Nucleic Acids Research, 53, D1, 2025, D634–D643.

[7] Polykovskiy, D. et al. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol., 11, 2020.

[8] RDKit: Open-source Cheminformatics: https://www.rdkit.org. Accessed on: 2025/09/25.

[9] Brenk, R. et al. Lessons Learnt from Assembling Screening Libraries for Drug Discovery for Neglected Diseases. ChemMedChem, 2008, 3, 435-444.

[10] Baell, J.B. and Holloway, G.A. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for their Exclusion in Bioassays. J. Med. Chem., 2010, 53, 7, 2719–2740.

[11] Lowe, D.M. Extraction of Chemical Structures and Reactions from the Literature. Ph.D. Thesis, University of Cambridge, Department of Chemistry, Pembroke College, 2012.

[12] Schneider, N. et al. Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-scale Reaction Classification and Similarity. J. Chem. Inf. Model., 2015, 55, 1, 39–53.

[13] Schneider, N. et al. What’s What: The (Nearly) Definitive Guide to Reaction Role Assignment. J. Chem. Inf. Model., 2016, 56, 12, 2336–2346.

[14] Coley, C.W. et al. Prediction of Organic Reaction Outcomes using Machine Learning. ACS Cent. Sci., 2017, 3, 5, 434–443.

[15] Liu, B. et al. Retrosynthetic Reaction Prediction Using Neural Sequence-to-sequence Models. ACS Cent. Sci., 2017, 3, 10, 1103-1113.

[16] Coley, C.W. et al. Computer-assisted Retrosynthesis Based on Molecular Similarity. ACS Cent. Sci., 2017, 3, 12, 1237–1245.

[17] Jin, W. et al. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems, 30, 2017.

[18] Schwaller, P. et al. "Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-sequence Models. Chem. Sci., 2018, 9, 6091-6098.

[19] Bradshaw, J. et al. A Generative Model for Electron Paths. International Conference on Learning Representations, 2019.

[20] Schwaller, P. et al. Mapping the Space of Chemical Reactions using Attention-based Neural Networks. Nat. Mach. Intell., 3, 144-152, 2021.

[21] Schwaller, P. et al. Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions. Sci. Adv., 7, eabe4166, 2021.

[22] Chen, S. et al. Precise Atom-to-atom Mapping for Organic Reactions via Human-in-the-loop Machine Learning. Nat. Commun., 15, 2250, 2024.

[23] Chen, S. et al. A Large-scale Reaction Dataset of Mechanistic Pathways of Organic Reactions. Sci. Data, 11, 863, 2024.

[24] Kearnes, S.M. et al. The Open Reaction Database. J. Am. Chem. Soc., 2021, 143, 45, 18820–18826.

[25] The Chemical Reaction Database (CRD): https://kmt.vander-lingen.nl. Accessed on: 2025/09/25.

[26] Bansal, P. et al. Rhea, the Reaction Knowledgebase in 2022. Nucleic Acids Research, 50, D1, 2022, D693–D700.

[27] Kraut, H. et al. Algorithm for Reaction Classification. J. Chem. Inf. Model., 2013, 53, 11, 2884–2895.

[28] Wei, J.N. et al. Neural Networks for the Prediction of Organic Chemistry Reactions. ACS Cent. Sci., 2016, 2, 10, 725–732.

[29] Grambow, C.A. et al. Reactants, Products, and Transition States of Elementary Chemical Reactions based on Quantum Chemistry. Sci. Data, 7, 137, 2020.

[30] Lin, A. et al. Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies. Mol. Inf., 2022, 41, 2100138.

[31] Spiekermann, K. et al. High Accuracy Barrier Heights, Enthalpies, and Rate Coefficients for Chemical Reactions. Sci. Data, 9, 417, 2022.

[32] Wigh, D.S. et al. ORDerly: Data Sets and Benchmarks for Chemical Reaction Data. J. Chem. Inf. Model., 2024, 64, 9, 3790–3798.

[33] Duigou, T. et al. RetroRules: A Database of Reaction Rules for Engineering Biology. Nucleic Acids Research, 47, D1, 2019, D1229–D1235.

[34] Avramova, S. et al. RetroTransformDB: A Dataset of Generic Transforms for Retrosynthetic Analysis. Data, 2018, 3, 14.

[35] Button, A. et al. Automated De Novo Molecular Design by Hybrid Machine Intelligence and Rule-driven Chemical Synthesis. Nat. Mach. Intell., 1, 307-315, 2019.

[36] Chen, L. and Li, Y. AutoTemplate: Enhancing Chemical Reaction Datasets for Machine Learning Applications in Organic Chemistry. J. Cheminform., 16, 74, 2024.
