Integration of RDKit functionality into scikit-learn pipelines.

MolPipeline

MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.

Background

The scikit-learn package provides a wide variety of machine learning algorithms and data processing tools. One of them is the Pipeline class, which lets users prepend custom data processing steps to a machine learning model. MolPipeline extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object.
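To make the idea of prepending processing steps concrete, here is a minimal pure-Python sketch of how a pipeline chains transformers in front of a final model. The names `MiniPipeline`, `CountCarbons`, and `NearestNeighborModel` are hypothetical and exist only for this illustration; this is neither scikit-learn's nor MolPipeline's implementation.

```python
class CountCarbons:
    """Toy featurizer: count 'C'/'c' characters in a SMILES string.

    This is deliberately naive (it would also count the 'C' in 'Cl').
    """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[smiles.upper().count("C")] for smiles in X]


class NearestNeighborModel:
    """Toy regressor: return the target of the closest training point."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            distances = [abs(row[0] - train_row[0]) for train_row in self.X_]
            predictions.append(self.y_[distances.index(min(distances))])
        return predictions


class MiniPipeline:
    """Run every step but the last as a transformer, the last as the model."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


toy_pipeline = MiniPipeline(
    [("count_c", CountCarbons()), ("model", NearestNeighborModel())]
)
toy_pipeline.fit(["CCCCCC", "CC"], [0.2, 0.4])
toy_prediction = toy_pipeline.predict(["CCC"])
print(toy_prediction)  # [0.4], because "CCC" is closer to "CC" than to "CCCCCC"
```

The caller only ever passes raw SMILES strings; the featurization travels with the model, which is the property that makes such pipelines easy to deploy.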

MolPipeline aims to provide:

  • Automated end-to-end processing from molecule data sets to deployable machine learning models.
  • Scalable parallel processing and low memory usage through instance-based processing.
  • Standard pipeline building blocks for flexibly building custom pipelines for various cheminformatics tasks.
  • Consistent error handling for tracking, logging, and replacing failed instances (e.g., a SMILES string that could not be parsed correctly).
  • Integrated and self-contained pipeline serialization for easy deployment and tracking in version control.
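The error-handling goal above can be pictured as a filter-and-reinsert pattern: failed instances are removed before the downstream steps run and a fill value is put back at their original positions afterwards. The sketch below is a conceptual illustration in plain Python, not MolPipeline's implementation; `toy_featurize` and `process_with_reinsertion` are hypothetical helpers, and the "parser" only checks for balanced parentheses.

```python
import math


def toy_featurize(smiles):
    """Stand-in for parsing plus featurization; returns None on failure.

    Only checks that the string is non-empty and its parentheses balance,
    which is far weaker than real SMILES parsing.
    """
    depth = 0
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if depth < 0:
            return None
    if depth != 0 or not smiles:
        return None
    return float(len(smiles))


def process_with_reinsertion(inputs, fill_value=math.nan):
    """Skip failed instances, then restore the original output length."""
    kept_positions, features, failed_positions = [], [], []
    for position, item in enumerate(inputs):
        value = toy_featurize(item)
        if value is None:
            failed_positions.append(position)  # tracked for logging
        else:
            kept_positions.append(position)
            features.append(value)

    # The downstream step (a stand-in for a model) sees valid instances only.
    predictions = [2.0 * feature for feature in features]

    # Reinsert a fill value where instances failed.
    output = [fill_value] * len(inputs)
    for position, prediction in zip(kept_positions, predictions):
        output[position] = prediction
    return output, failed_positions


results, failed = process_with_reinsertion(["CCO", "C(C", "c1ccccc1"])
print(results)  # [6.0, nan, 16.0]
print(failed)   # [1]
```

Keeping the output aligned with the input means one malformed entry does not abort a batch, and callers can still match each prediction to its source row.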

Publications

Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A python package for processing molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024
Further links: arXiv

Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural fingerprint-based models, 2024
Further links: repository

Installation

pip install molpipeline
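To check from Python whether the installation succeeded, one option is to query the installed distribution metadata with the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, not part of MolPipeline.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


molpipeline_version = installed_version("molpipeline")
print(molpipeline_version or "molpipeline is not installed")
```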

Documentation

The notebooks folder contains many basic and advanced examples of how to use MolPipeline.

A good introduction to basic usage is the 01_getting_started_with_molpipeline notebook.

Quick Start

Model building

Create a fingerprint-based prediction model:

from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.mol2mol import (
    ElementFilter,
    SaltRemover,
)

from sklearn.ensemble import RandomForestRegressor

# set up pipeline
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),  # reading molecules
        ("element_filter", ElementFilter()),  # standardization
        ("salt_remover", SaltRemover()),  # standardization
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),  # fingerprints and featurization
        ("RandomForestRegressor", RandomForestRegressor()),  # machine learning model
    ],
    n_jobs=4,
)

# fit the pipeline
pipeline.fit(X=["CCCCCC", "c1ccccc1"], y=[0.2, 0.4])
# make predictions from SMILES strings
pipeline.predict(["CCC"])
# example output (varies between runs because no random_state is set): array([0.29])

Feature calculation

Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can be calculated like this:

from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToRDKitPhysChem

pipeline_physchem = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "physchem",
            MolToRDKitPhysChem(
                standardizer=None,
                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
            ),
        ),
    ],
    n_jobs=-1,
)
physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
physchem_matrix
# output: array([[72.066,  0.   ,  0.   ],
#                [88.065, 20.23 ,  1.   ]])

MolPipeline provides further features and descriptors from RDKit, for example Morgan (binary/count) fingerprints and MACCS keys. See the 04_feature_calculation notebook for more examples.

Clustering

MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be clustered by their Murcko scaffold. See the 02_scaffold_split_with_custom_estimators notebook for scaffold splits and further examples.

from molpipeline.estimators import MurckoScaffoldClustering

scaffold_smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
]
linear_smiles = ["CC", "CCC", "CCCN"]

# run the scaffold clustering
scaffold_clustering = MurckoScaffoldClustering(
    make_generic=False, linear_molecules_strategy="own_cluster", n_jobs=16
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)
# output: array([1., 0., 1., 2., 2., 2.])
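Cluster labels like these can drive a scaffold split, in which whole clusters are assigned to either the training or the test set so that no scaffold appears on both sides. The following is a minimal sketch in plain Python; `split_by_cluster` and its greedy smallest-cluster-first strategy are assumptions made for illustration and are not taken from MolPipeline or its notebooks.

```python
from collections import defaultdict


def split_by_cluster(cluster_labels, test_fraction=0.3):
    """Assign whole clusters to the test set until it reaches test_fraction."""
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    # Smallest clusters first, ties broken by label, for a deterministic result.
    ordered = sorted(members.items(), key=lambda item: (len(item[1]), item[0]))

    target_size = test_fraction * len(cluster_labels)
    train, test = [], []
    for _, indices in ordered:
        if len(test) + len(indices) <= target_size:
            test.extend(indices)
        else:
            train.extend(indices)
    return sorted(train), sorted(test)


# Labels copied from the clustering output above.
labels = [1.0, 0.0, 1.0, 2.0, 2.0, 2.0]
train_idx, test_idx = split_by_cluster(labels, test_fraction=0.5)
print(train_idx)  # [3, 4, 5]
print(test_idx)   # [0, 1, 2]
```

Here the three linear molecules share cluster 2 and end up together in the training set, while the two ring-containing clusters form the test set.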

Explainability

Machine learning model pipelines can be explained using the explainability module. MolPipeline uses the SHAP library to compute Shapley values for explanations. The Shapley values can be mapped to the molecular structure to visualize the importance of atoms for the prediction.
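For intuition about what a Shapley value is, the sketch below computes exact values for a tiny toy model by enumerating every feature subset. This brute-force approach is for illustration only: it scales exponentially with the number of features, and it is not how the SHAP library or MolPipeline computes explanations. `exact_shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial


def exact_shapley_values(value_function, n_features):
    """Average each feature's marginal contribution over all subsets."""
    features = range(n_features)
    shapley = [0.0] * n_features
    for feature in features:
        others = [f for f in features if f != feature]
        for size in range(len(others) + 1):
            # Weight of a subset of this size in the Shapley formula.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                without_feature = value_function(frozenset(subset))
                with_feature = value_function(frozenset(subset) | {feature})
                shapley[feature] += weight * (with_feature - without_feature)
    return shapley


def toy_model(present_bits):
    """Hypothetical model output given which fingerprint bits are set."""
    score = 0.0
    if 0 in present_bits:
        score += 1.0  # bit 0 contributes on its own
    if 1 in present_bits and 2 in present_bits:
        score += 2.0  # bits 1 and 2 only contribute together
    return score


toy_shap = exact_shapley_values(toy_model, n_features=3)
print(toy_shap)  # approximately [1.0, 1.0, 1.0]
```

The interaction between bits 1 and 2 is shared equally between them, and the values sum to the difference between the model output with all bits set and with none set.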

The advanced_03_introduction_to_explainable_ai notebook gives a detailed introduction to explainability. It also compares explanations of tree-based models with those of neural networks using the structure-activity relationship (SAR) data from Harren et al. 2022.

Use the following example code to explain a model's predictions and visualize the explanation as heatmaps.

from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.experimental.explainability import (
    SHAPTreeExplainer,
    structure_heatmap_shap,
)
from sklearn.ensemble import RandomForestRegressor

X = ["CCCCCC", "c1ccccc1"]
y = [0.2, 0.4]

pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),
        ("RandomForest", RandomForestRegressor()),
    ],
    n_jobs=4,
)
pipeline.fit(X, y)

# explain the model
explainer = SHAPTreeExplainer(pipeline)
explanations = explainer.explain(X)

# visualize the explanation
image = structure_heatmap_shap(explanation=explanations[0])
image.save("explanation.png")

Note that the explainability module is fully functional but lives in the 'experimental' directory because its API might still change.

License

This software is licensed under the MIT license. See the LICENSE file for details.
