DRFP

An NLP-inspired chemical reaction fingerprint based on basic set arithmetic.

Read the associated open access article

Description

Predicting the nature and outcome of reactions using computational methods is an important tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint DRFP. The DRFP algorithm takes a reaction SMILES as input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed to the left and right of the reaction arrow, respectively, without the need to distinguish between reactants and reagents. We show that DRFP outperforms DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, and that it reaches the performance of state-of-the-art learned fingerprints in both tasks while being data-independent.
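The set arithmetic described above can be illustrated with a small toy sketch. It is not the DRFP implementation: the n-gram sets are hand-written strings rather than circular substructures extracted with RDKit, and the SHA-256 based hash and the names hash_ngram and toy_difference_fingerprint are assumptions made for illustration only. It shows just the two core steps, taking the symmetric difference of the two sets and folding the hashed result into a fixed-length binary vector.

```python
import hashlib


def hash_ngram(ngram: str) -> int:
    """Deterministically hash a substructure string to a non-negative integer.

    SHA-256 is a stand-in chosen for this sketch, not necessarily the hash DRFP uses.
    """
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def toy_difference_fingerprint(left: set, right: set, n_folded_length: int = 2048) -> list:
    """Fold the symmetric difference of two n-gram sets into a binary vector."""
    # n-grams that occur on exactly one side of the reaction arrow
    changed = left ^ right
    fingerprint = [0] * n_folded_length
    for ngram in changed:
        # Modulo hashing: map each hashed n-gram to one of n_folded_length bits
        fingerprint[hash_ngram(ngram) % n_folded_length] = 1
    return fingerprint


# Hand-written toy sets standing in for the n-grams of each side of a reaction
left_ngrams = {"CO", "C(=O)OC", "N", "c1cccnc1"}
right_ngrams = {"C(=O)N", "N", "c1cccnc1"}

fp = toy_difference_fingerprint(left_ngrams, right_ngrams, n_folded_length=32)
print(sum(fp), "of", len(fp), "bits set")
```

n-grams shared by both sides ("N" and "c1cccnc1" here) cancel out, so only the substructures that change during the reaction contribute bits. Because folding can map two different n-grams to the same bit, the number of set bits may be smaller than the size of the symmetric difference.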

Getting Started

The best way to start exploring DRFP is on Binder. The following notebook gets you started on creating and using DRFP:

Binder

A notebook that explains how you can use SHAP to analyse and interpret your machine learning models when using DRFP:

Binder

Installation and Usage

DRFP can be installed from PyPI using pip install drfp. However, it depends on RDKit, which is best installed using conda.

Once DRFP is installed, there are two ways you can use it: the CLI app drfp or the library provided by the package.

CLI

drfp my_rxn_smiles.txt my_rxn_fps.pkl -d 512

This will create a pickle dump containing a NumPy ndarray of DRFP fingerprints with a dimensionality of 512. To also export the mapping, use the flag --mapping. This will create the additional file my_rxn_fps.map.pkl. You can call drfp --help to show all available flags and options.
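The pickle files written by the CLI can be read back with the standard library. The following is a minimal sketch; the helper names load_pickle and describe_fingerprints are hypothetical and not part of DRFP, and the file names in the trailing comments are the ones from the command above. NumPy must be installed for the real fingerprint file to unpickle, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickled object (e.g. the fingerprints or the mapping) from disk."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def describe_fingerprints(fps):
    """Return (number of fingerprints, length of the first fingerprint)."""
    n_fingerprints = len(fps)
    width = len(fps[0]) if n_fingerprints else 0
    return n_fingerprints, width


# Example usage, assuming the CLI command above was run with --mapping:
#   fps = load_pickle("my_rxn_fps.pkl")
#   mapping = load_pickle("my_rxn_fps.map.pkl")
#   print(describe_fingerprints(fps))
```

With -d 512 as in the command above, the second value returned by describe_fingerprints should be 512.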

Library

The following is a basic example of how to use DRFP in a Python script.

from drfp import DrfpEncoder

rxn_smiles = [
    "CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
    "CCOC(=O)C(CC)c1cccnc1.Cl.O>>CCC(C(=O)O)c1cccnc1",
]

fps = DrfpEncoder.encode(rxn_smiles)

The variable fps now points to a list containing the fingerprints for the two reaction SMILES as numpy arrays.

Documentation

The library contains the class DrfpEncoder with one public method encode.

| DrfpEncoder.encode() | Description | Type | Default |
|---|---|---|---|
| X | An iterable (e.g. a list) of reaction SMILES or a single reaction SMILES to be encoded | Iterable or str | |
| n_folded_length | The folded length of the fingerprint (the parameter for the modulo hashing) | int | 2048 |
| min_radius | The minimum radius of a substructure (0 includes single atoms) | int | 0 |
| radius | The maximum radius of a substructure | int | 3 |
| rings | Whether to include full rings as substructures | bool | True |
| mapping | Return a feature to substructure mapping in addition to the fingerprints. If true, the return signature of this method is Tuple[List[np.ndarray], Dict[int, Set[str]]] | bool | False |
| atom_index_mapping | Return the atom indices of mapped substructures for each reaction | bool | False |
| root_central_atom | Whether to root the central atom of substructures when generating SMILES | bool | True |
| include_hydrogens | Whether to explicitly include hydrogens in the molecular graph | bool | False |
| show_progress_bar | Whether to show a progress bar when encoding reactions | bool | False |
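When mapping is set to True, the returned Dict[int, Set[str]] can be used to look up which substructures are behind the set bits of a fingerprint. The sketch below works on toy data and does not call DRFP itself; the helper name substructures_for_on_bits is hypothetical, and it assumes that the dictionary keys are bit positions in the folded fingerprint, which you should verify against your installed version.

```python
def substructures_for_on_bits(fingerprint, mapping):
    """Map each set bit of a fingerprint to the substructure SMILES recorded for it.

    Assumes the keys of `mapping` are bit indices of the folded fingerprint.
    """
    on_bits = [index for index, bit in enumerate(fingerprint) if bit]
    return {index: sorted(mapping.get(index, set())) for index in on_bits}


# Toy data shaped like the documented return types (not real DRFP output)
toy_fp = [0, 1, 0, 0, 1, 0, 0, 0]
toy_mapping = {1: {"CO"}, 4: {"C(=O)N", "N"}, 6: {"Cl"}}

print(substructures_for_on_bits(toy_fp, toy_mapping))
# → {1: ['CO'], 4: ['C(=O)N', 'N']}
```

With the real library, the two inputs would come from a call such as fps, mapping = DrfpEncoder.encode(rxn_smiles, mapping=True), following the return signature listed in the table. A bit can list more than one substructure, because folding may map several substructures to the same position.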

Reproduce

Want to reproduce the results in our paper? You can find all the data in the data folder and encoding and training scripts in the scripts folder.

Cite Us

@article{probst2022reaction,
  title={Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP},
  author={Probst, Daniel and Schwaller, Philippe and Reymond, Jean-Louis},
  journal={Digital Discovery},
  year={2022},
  publisher={Royal Society of Chemistry}
}
