Package for extracting chemical reaction data, serialized with Google Protocol Buffers in the Open Reaction Database (ORD) schema, into a relational database (RDB) and the Resource Description Framework (RDF).

Project Description for ord_rxn_converter

Introduction

ord_rxn_converter is a Python package designed to streamline the transformation of chemical reaction data from the Open Reaction Database (ORD), stored in Google Protocol Buffers format, into structured datasets suitable for downstream machine learning and data analysis tasks. It provides modular tools for parsing, extracting, and converting the complex reaction schema into interpretable tables, lists, and dictionaries that can be easily ingested by models or used in exploratory chemical data analysis.

The library is organized into specialized modules that handle different components of the reaction schema (identifiers, inputs, conditions, setup, workups, outcomes, and notes/observations), as well as utility functions for key operations and dataset generation. The package is structured for clarity and extensibility, enabling researchers to adapt it to varying needs in computational chemistry or cheminformatics pipelines.

The codebase is written in Python 3 and supports integration into Jupyter notebooks, standalone scripts, or larger ML pipelines for tasks such as property prediction, reaction classification, or synthesis planning.

Motivation

Chemical reaction data is often stored in highly nested or semi-structured formats that are difficult to work with directly in data science workflows. The Open Reaction Database provides a valuable standardized format, but researchers and developers often require a flat, structured format with clean fields to build models or perform analysis.

ord_rxn_converter was developed to automate and standardize this transformation process. It allows users to systematically convert the complex data in ORD protobuf files into simplified Python structures (lists, dictionaries, Pandas DataFrames), reducing time spent on preprocessing and improving reproducibility in ML workflows. By modularizing the conversion process, the package promotes clarity, flexibility, and easier debugging.

The project originated as part of a broader effort to accelerate machine learning-driven synthesis planning by improving the usability of publicly available chemical data.
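The flattening step described above can be illustrated without the package itself. The following is a minimal sketch, not the package's implementation: the function `flatten`, the dotted-key convention, and the example record (including its field names and values) are assumptions made for illustration only.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts and lists into one dict with dotted keys.

    Hypothetical helper for illustration; not part of ord_rxn_converter.
    Empty dicts and empty lists contribute no keys.
    """
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        # List positions become key segments: "identifiers.0.type"
        items = ((str(i), v) for i, v in enumerate(record))
    else:
        # Scalar leaf: emit it under the accumulated key
        return {parent_key: record}

    flat = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten(value, new_key, sep))
    return flat


# A made-up nested record standing in for one decoded reaction.
reaction = {
    "reaction_id": "rxn-001",
    "conditions": {"temperature": {"value": 25.0, "units": "CELSIUS"}},
    "identifiers": [{"type": "REACTION_SMILES", "value": "CCO>>CC=O"}],
}

row = flatten(reaction)
for key, value in row.items():
    print(key, "=", value)
```

Each flattened record is a single-level dictionary, so a list of such records can serve directly as the rows of a table.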

Limitations

  • The package currently assumes that input ORD data conforms closely to the expected schema. It may require modification or additional error handling for incomplete or non-standard records.

  • Complex reaction pathways involving multi-step synthesis or overlapping outcomes may not be fully supported in this version.

  • The current modules focus primarily on extraction rather than validation or correction of chemical information. Users are advised to preprocess or sanitize their data before applying the conversion tools if needed.

  • While the package is modular, it is not yet fully abstracted for plug-and-play use in non-ORD schemas. Adapting it to other chemical data formats (e.g., USPTO, Reaxys) would require extension.

  • The project is in active development, and interface or function-level changes may occur in future versions.
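Because incomplete or non-standard records may need extra error handling, a caller might guard field access before conversion. The sketch below shows one possible approach on plain dictionaries; `get_nested`, `extract_temperatures`, and the field names are hypothetical and do not come from the package.

```python
def get_nested(record, path, default=None):
    """Walk a sequence of keys through nested dicts; return default if any is missing."""
    current = record
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current


def extract_temperatures(reactions):
    """Collect temperature rows and the IDs of records that lack the field."""
    rows, skipped = [], []
    for rxn in reactions:
        value = get_nested(rxn, ("conditions", "temperature", "value"))
        if value is None:
            # Record is incomplete: note it instead of raising
            skipped.append(rxn.get("reaction_id"))
            continue
        rows.append({"reaction_id": rxn.get("reaction_id"), "temperature": value})
    return rows, skipped


# Made-up records: one complete, two missing the temperature field.
reactions = [
    {"reaction_id": "a", "conditions": {"temperature": {"value": 25.0}}},
    {"reaction_id": "b", "conditions": {}},
    {"reaction_id": "c"},
]

rows, skipped = extract_temperatures(reactions)
print(rows)
print(skipped)
```

Returning the skipped IDs alongside the rows keeps incomplete records visible rather than silently dropping them.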

Affiliations:

Materials Data Science for Stockpile Stewardship Center of Excellence (MDS3-COE), Solar Durability and Lifetime Extension (SDLE) Research Center, Materials Science and Engineering, Case Western Reserve University, Cleveland, OH 44106, USA

Package Usage:

The package converts a dataset (typically containing hundreds to thousands of reactions) stored in the ORD schema as Google Protocol Buffers into a dictionary of pandas DataFrames, one for each portion of a reaction: reaction identifiers, reaction inputs, reaction conditions, reaction setup, reaction outcomes, and reaction notes and observations.
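The shape of that output, one table per reaction portion, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's API: `split_by_section`, `write_table`, the section keys, and the sample records are all hypothetical, each section is assumed to be a flat dictionary already, and lists of row dictionaries stand in for DataFrames. The SQLite step hints at how such tables could be loaded into a relational database.

```python
import sqlite3

# Hypothetical section keys mirroring the reaction portions listed above.
SECTIONS = ("identifiers", "inputs", "conditions", "setup", "outcomes", "notes")


def split_by_section(reactions):
    """Return {section: [row, ...]}, each row keyed by its reaction_id."""
    tables = {section: [] for section in SECTIONS}
    for rxn in reactions:
        for section in SECTIONS:
            if section in rxn:
                row = {"reaction_id": rxn["reaction_id"]}
                row.update(rxn[section])  # assumes the section is a flat dict
                tables[section].append(row)
    return tables


def write_table(conn, name, rows):
    """Create one SQLite table per section; sections with no rows are skipped."""
    if not rows:
        return
    columns = sorted({col for row in rows for col in row})
    col_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{name}" ({col_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{name}" VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )


# Made-up records for illustration.
reactions = [
    {"reaction_id": "r1", "conditions": {"temperature_c": 25.0},
     "outcomes": {"yield_percent": 80.0}},
    {"reaction_id": "r2", "conditions": {"temperature_c": 60.0}},
]

tables = split_by_section(reactions)
conn = sqlite3.connect(":memory:")
for name, rows in tables.items():
    write_table(conn, name, rows)

print(conn.execute('SELECT * FROM "conditions" ORDER BY reaction_id').fetchall())
```

Keeping `reaction_id` in every table lets the sections be joined back together later. Each list of row dictionaries could also be passed to `pandas.DataFrame` to obtain a DataFrame per section.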

Python package documentation

https://sphinx-rtd-tutorial.readthedocs.io/en/latest/index.html

Acknowledgements:

This work was supported by the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) under Solar Energy Technologies Office (SETO) Agreement Numbers DE-EE0009353 and DE-EE0009347, the Department of Energy (National Nuclear Security Administration) under Award Number DE-NA0004104 and Contract Number B647887, and the U.S. National Science Foundation under Award Number 2133576.
