Package for extracting chemical reaction data, serialized with Google Protocol Buffers in the Open Reaction Database (ORD) schema, into a relational database (RDB) and the Resource Description Framework (RDF).
Project Description for ord_rxn_converter
Introduction
ord_rxn_converter is a Python package designed to streamline the transformation of chemical reaction data from the Open Reaction Database (ORD) in Google Protocol Buffer format into structured datasets suitable for downstream machine learning and data analysis tasks. It provides modular tools for parsing, extracting, and converting complex reaction schema into interpretable tables, lists, and dictionaries that can be easily ingested by models or used in exploratory chemical data analysis.
The library is organized into specialized modules that handle different components of the reaction schema (identifiers, inputs, conditions, setup, workups, outcomes, and notes/observations), along with utility functions for key operations and dataset generation. The package is structured for clarity and extensibility, so researchers can adapt it to varying needs in computational chemistry or cheminformatics pipelines.
The codebase is written in Python 3 and supports integration into Jupyter notebooks, standalone scripts, or larger ML pipelines for tasks such as property prediction, reaction classification, or synthesis planning.
Motivation
Chemical reaction data is often stored in highly nested or semi-structured formats that are difficult to work with directly in data science workflows. The Open Reaction Database provides a valuable standardized format, but researchers and developers often require a flat, structured format with clean fields to build models or perform analysis.
ord_rxn_converter was developed to automate and standardize this transformation process. It allows users to systematically convert the complex data in ORD protobuf files into simplified Python structures (lists, dictionaries, Pandas DataFrames), reducing time spent on preprocessing and improving reproducibility in ML workflows. By modularizing the conversion process, the package promotes clarity, flexibility, and easier debugging.
The project originated as part of a broader effort to accelerate machine learning-driven synthesis planning by improving the usability of publicly available chemical data.
Limitations
- The package currently assumes that input ORD data conforms closely to the expected schema. It may require modification or additional error handling for incomplete or non-standard records.
- Complex reaction pathways involving multi-step synthesis or overlapping outcomes may not be fully supported in this version.
- The current modules focus primarily on extraction rather than validation or correction of chemical information. Users are advised to preprocess or sanitize their data before applying the conversion tools if needed.
- While the package is modular, it is not yet fully abstracted for plug-and-play use in non-ORD schemas. Adapting it to other chemical data formats (e.g., USPTO, Reaxys) would require extension.
- The project is in active development, and interface or function-level changes may occur in future versions.
Affiliations:
Materials Data Science for Stockpile Stewardship Center of Excellence (MDS3-COE), Solar Durability and Lifetime Extension (SDLE) Research Center, Materials Science and Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
Package Usage:
The package converts a dataset in the ORD schema (typically hundreds to thousands of reactions, serialized with Google Protocol Buffers) into a dictionary of pandas DataFrames, one for each part of the reaction: identifiers, inputs, conditions, setup, outcomes, and notes and observations.
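The idea of one table per reaction section can be illustrated with a minimal, standard-library sketch. It is not the ord_rxn_converter API: the `reactions` records, the `flatten` and `to_tables` helpers, and the dotted column names are all hypothetical, and plain dictionaries stand in for the real ORD protobuf messages.

```python
from collections import defaultdict

# Hypothetical, heavily simplified stand-ins for ORD reaction records.
# Real ORD data are protobuf messages with many more fields; these dicts
# only mimic the nesting.
reactions = [
    {
        "reaction_id": "rxn-1",
        "identifiers": [{"type": "REACTION_SMILES", "value": "CCO>>CC=O"}],
        "conditions": {"temperature": {"setpoint": {"value": 25.0, "units": "CELSIUS"}}},
    },
    {
        "reaction_id": "rxn-2",
        "identifiers": [{"type": "REACTION_TYPE", "value": "esterification"}],
        "conditions": {"temperature": {"setpoint": {"value": 80.0, "units": "CELSIUS"}}},
    },
]


def flatten(nested, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_tables(records):
    """Group flattened rows by reaction section, keyed on reaction_id."""
    tables = defaultdict(list)
    for record in records:
        rid = record["reaction_id"]
        for section, content in record.items():
            if section == "reaction_id":
                continue
            # A section may hold one message (dict) or repeated messages (list).
            items = content if isinstance(content, list) else [content]
            for item in items:
                tables[section].append({"reaction_id": rid, **flatten(item)})
    return dict(tables)


tables = to_tables(reactions)
for section, rows in tables.items():
    print(section, rows)
```

Each value in `tables` is a list of row dictionaries, so passing it to `pandas.DataFrame(rows)` would give one DataFrame per section, joined back to its reaction through the `reaction_id` column.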
Python package documentation
https://sphinx-rtd-tutorial.readthedocs.io/en/latest/index.html
Acknowledgements:
This work was supported by the U.S. Department of Energy’s Office of Energy Efficiency and Renewable Energy (EERE) under Solar Energy Technologies Office (SETO) Agreement Numbers DE-EE0009353 and DE-EE0009347, the Department of Energy (National Nuclear Security Administration) under Award Number DE-NA0004104 and Contract Number B647887, and the U.S. National Science Foundation under Award Number 2133576.
File details
Details for the file ord_rxn_converter-0.0.4.tar.gz.
File metadata
- Download URL: ord_rxn_converter-0.0.4.tar.gz
- Upload date:
- Size: 24.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e35046f38948b9ab78ec456830fff0319688d65b7fac61a13053e9e083785de4 |
| MD5 | 50444bbf0d15ae2b05a370fc29e44d40 |
| BLAKE2b-256 | 93c746891fe134baa792349209263f38ab5dcb1c950905767c6708c031263e6b |
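A downloaded archive can be checked against the published SHA256 digest before installation. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha256_of` and `matches_published_hash` and the local file path in the usage note are illustrative assumptions, not part of the package.

```python
import hashlib

# SHA256 digest listed above for ord_rxn_converter-0.0.4.tar.gz.
EXPECTED_SHA256 = "e35046f38948b9ab78ec456830fff0319688d65b7fac61a13053e9e083785de4"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected one, ignoring hex case."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("ord_rxn_converter-0.0.4.tar.gz")` would return `True` only if a file at that (assumed) local path has the digest shown in the table.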
Provenance
The following attestation bundles were made for ord_rxn_converter-0.0.4.tar.gz:
Publisher: python-publish.yml on quynhdtran17/ord_rxn_converter
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ord_rxn_converter-0.0.4.tar.gz
- Subject digest: e35046f38948b9ab78ec456830fff0319688d65b7fac61a13053e9e083785de4
- Sigstore transparency entry: 222885631
- Sigstore integration time:
- Permalink: quynhdtran17/ord_rxn_converter@a3fd25dfb52c91ceb50390359f3791f570b3de2b
- Branch / Tag: refs/tags/v0.0.4
- Owner: https://github.com/quynhdtran17
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@a3fd25dfb52c91ceb50390359f3791f570b3de2b
- Trigger Event: release
File details
Details for the file ord_rxn_converter-0.0.4-py3-none-any.whl.
File metadata
- Download URL: ord_rxn_converter-0.0.4-py3-none-any.whl
- Upload date:
- Size: 26.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ed5b8e646723c22676ebd812fc4d16f0013c594b3af34dfeef5c30fbde8c609 |
| MD5 | a2dfd8f7a940b86b72cf8fb00eccbbbc |
| BLAKE2b-256 | f6b1e30fb7b27312832c4230057ba961b93e555bb8f4f3a8c2e6d1dd77801ebd |
Provenance
The following attestation bundles were made for ord_rxn_converter-0.0.4-py3-none-any.whl:
Publisher: python-publish.yml on quynhdtran17/ord_rxn_converter
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ord_rxn_converter-0.0.4-py3-none-any.whl
- Subject digest: 6ed5b8e646723c22676ebd812fc4d16f0013c594b3af34dfeef5c30fbde8c609
- Sigstore transparency entry: 222885639
- Sigstore integration time:
- Permalink: quynhdtran17/ord_rxn_converter@a3fd25dfb52c91ceb50390359f3791f570b3de2b
- Branch / Tag: refs/tags/v0.0.4
- Owner: https://github.com/quynhdtran17
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@a3fd25dfb52c91ceb50390359f3791f570b3de2b
- Trigger Event: release