An NLP-inspired chemical reaction fingerprint based on basic set arithmetic.

DRFP

Read the associated open access article

Description

Predicting the nature and outcome of reactions using computational methods is an important tool to accelerate chemical research. The recent application of deep learning-based learned fingerprints to reaction classification and reaction yield prediction has shown an impressive increase in performance compared to previous methods such as DFT- and structure-based fingerprints. However, learned fingerprints require large training data sets, are inherently biased, and are based on complex deep learning architectures. Here we present the differential reaction fingerprint (DRFP). The DRFP algorithm takes a reaction SMILES as input and creates a binary fingerprint based on the symmetric difference of two sets containing the circular molecular n-grams generated from the molecules listed to the left and to the right of the reaction arrow, respectively, without the need to distinguish between reactants and reagents. We show that DRFP outperforms DFT-based fingerprints in reaction yield prediction and other structure-based fingerprints in reaction classification, and that it reaches the performance of state-of-the-art learned fingerprints in both tasks while being data-independent.
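The set arithmetic described above can be illustrated with a toy sketch. This is not the DRFP implementation: the n-gram strings are hand-written placeholders rather than substructures generated from molecules, the function name toy_fingerprint is hypothetical, and SHA-256 is used here only because it is deterministic across runs. The library's own hashing scheme may differ.

```python
import hashlib


def toy_fingerprint(left_ngrams, right_ngrams, n_bits=16):
    """Fold the symmetric difference of two n-gram sets into a binary vector."""
    # Keep only n-grams that occur on exactly one side of the reaction arrow.
    changed = set(left_ngrams) ^ set(right_ngrams)

    fp = [0] * n_bits
    for ngram in changed:
        # Hash the n-gram, then fold the hash into n_bits positions via modulo.
        digest = hashlib.sha256(ngram.encode("utf-8")).digest()
        fp[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return fp, changed


# Hand-written placeholder n-grams (assumption: not produced by any toolkit).
left = {"CO", "C=O", "N", "CC"}
right = {"C=O", "N", "CC", "C(=O)N"}

fp, changed = toy_fingerprint(left, right)
print(sorted(changed))
print(fp)
```

N-grams shared by both sides cancel out, so only the substructures that appear or disappear during the reaction contribute bits. Two different n-grams can fold onto the same position, which is why a larger folded length reduces collisions.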

Getting Started

The best way to start exploring DRFP is on Binder. The following notebook gets you started on creating and using DRFP:

Binder

A notebook that explains how you can use SHAP to analyse and interpret your machine learning models when using DRFP:

Binder

Installation and Usage

DRFP can be installed from PyPI using pip install drfp.

Once DRFP is installed, there are two ways to use it: the CLI app drfp or the library provided by the package.

CLI

drfp my_rxn_smiles.txt my_rxn_fps.pkl -d 512

This creates a pickle dump containing a NumPy ndarray of DRFP fingerprints with a dimensionality of 512. To also export the mapping, use the flag --mapping, which creates the additional file my_rxn_fps.map.pkl. Call drfp --help to show all available flags and options.
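The files written by the CLI can be read back with Python's standard pickle module. The helper below is a minimal sketch. The name load_pickle is hypothetical, and the file names in the usage comment are the ones from the command above. Unpickling the fingerprint file requires NumPy to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Read one pickled object from disk (trusted files only)."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Usage, assuming the CLI command above has been run with --mapping:
#   fps = load_pickle("my_rxn_fps.pkl")          # fingerprints
#   mapping = load_pickle("my_rxn_fps.map.pkl")  # exported mapping
```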

Library

The following is a basic example of how to use DRFP in a Python script.

from drfp import DrfpEncoder

rxn_smiles = [
    "CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
    "CCOC(=O)C(CC)c1cccnc1.Cl.O>>CCC(C(=O)O)c1cccnc1",
]

fps = DrfpEncoder.encode(rxn_smiles)

The variable fps now points to a list containing one fingerprint per reaction SMILES, each as a NumPy array.
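Because the fingerprints are binary, one common way to compare two of them is the Tanimoto (Jaccard) similarity. The function below is an illustrative sketch and not part of the drfp package. It accepts any two equal-length sequences of 0/1 values, such as the arrays in fps, and returning 1.0 for two all-zero fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Count positions set in both fingerprints and in at least one of them.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention (assumption): two empty fingerprints count as identical.
    return both / either if either else 1.0


# Usage with the list from the example above:
#   similarity = tanimoto(fps[0], fps[1])
```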

Documentation

The library contains the class DrfpEncoder with one public method encode.

| DrfpEncoder.encode() | Description | Type | Default |
|---|---|---|---|
| X | An iterable (e.g. a list) of reaction SMILES or a single reaction SMILES to be encoded | Iterable or str | |
| n_folded_length | The folded length of the fingerprint (the parameter for the modulo hashing) | int | 2048 |
| min_radius | The minimum radius of a substructure (0 includes single atoms) | int | 0 |
| radius | The maximum radius of a substructure | int | 3 |
| rings | Whether to include full rings as substructures | bool | True |
| mapping | Return a feature to substructure mapping in addition to the fingerprints. If true, the return signature of this method is Tuple[List[np.ndarray], Dict[int, Set[str]]] | bool | False |
| atom_index_mapping | Return the atom indices of mapped substructures for each reaction | bool | False |
| root_central_atom | Whether to root the central atom of substructures when generating SMILES | bool | True |
| include_hydrogens | Whether to explicitly include hydrogens in the molecular graph | bool | False |
| show_progress_bar | Whether to show a progress bar when encoding reactions | bool | False |
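As a sketch of how the mapping option might be used, the helpers below request the feature to substructure mapping and look up the substructures behind the set bits of one fingerprint. The names encode_with_mapping and substructures_for are hypothetical. The sketch assumes that the integer keys of the returned dictionary are positions in the folded fingerprint, which the table above does not state explicitly.

```python
def encode_with_mapping(rxn_smiles, n_folded_length=2048):
    """Encode reactions and also return the feature to substructure mapping."""
    # Imported lazily so the helper can be defined without drfp installed.
    from drfp import DrfpEncoder

    # With mapping=True, encode returns a (fingerprints, mapping) tuple.
    fps, mapping = DrfpEncoder.encode(
        rxn_smiles, n_folded_length=n_folded_length, mapping=True
    )
    return fps, mapping


def substructures_for(fp, mapping):
    """Collect the substructures recorded for every set bit of one fingerprint."""
    # Assumption: mapping keys are bit positions in the folded fingerprint.
    on_bits = [i for i, bit in enumerate(fp) if bit]
    return {i: mapping.get(i, set()) for i in on_bits}


# Usage (requires drfp to be installed):
#   fps, mapping = encode_with_mapping(rxn_smiles, n_folded_length=512)
#   print(substructures_for(fps[0], mapping))
```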

Reproduce

Want to reproduce the results in our paper? You can find all the data in the data folder and encoding and training scripts in the scripts folder.

Cite Us

@article{probst2022reaction,
  title={Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP},
  author={Probst, Daniel and Schwaller, Philippe and Reymond, Jean-Louis},
  journal={Digital Discovery},
  year={2022},
  publisher={Royal Society of Chemistry}
}

Development Setup

This project uses UV for dependency management. To set up a development environment:

  1. Install UV following the official instructions

  2. Clone the repository:

git clone https://github.com/reymond-group/drfp
cd drfp

  3. Install dependencies including development packages:

uv sync --dev

  4. Run tests:

uv run pytest
