Skip to main content

Integration of rdkit functionality into sklearn pipelines.

Project description

MolPipeline

MolPipeline is a Python package providing RDKit functionality in a Scikit-learn like fashion.

Background

The open-source package scikit-learn provides a large variety of machine learning algorithms and data processing tools, among which is the Pipeline class, allowing users to prepend custom data processing steps to the machine learning model. MolPipeline extends this concept to the field of chemoinformatics by wrapping default functionalities of RDKit, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule-object.

A notable difference to the Pipeline class of scikit-learn is that the Pipline from MolPipeline allows for instances to fail during processing without interrupting the whole pipeline. Such behaviour is useful when processing large datasets, where some SMILES strings might not encode valid molecules or some descriptors might not be calculable for certain molecules.

Publications

The publication is freely available here.

Installation

Not yet available in pypi. For now, Please download and install via:

pip install git+https://github.com/basf/MolPipeline.git

Usage

See the notebooks folder for basic and advanced examples of how to use Molpipeline.

A basic example of how to use MolPipeline to create a fingerprint-based model is shown below (see also the notebook):

from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.mol2mol import (
    ElementFilter,
    SaltRemover,
)

from sklearn.ensemble import RandomForestRegressor

# set up pipeline
pipeline = Pipeline([
      ("auto2mol", AutoToMol()),                                     # reading molecules
      ("element_filter", ElementFilter()),                           # standardization
      ("salt_remover", SaltRemover()),                               # standardization
      ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),        # fingerprints and featurization
      ("RandomForestRegressor", RandomForestRegressor())             # machine learning model
    ],
    n_jobs=4)

# fit the pipeline
pipeline.fit(X=["CCCCCC", "c1ccccc1"], y=[0.2, 0.4])
# make predictions from SMILES strings
pipeline.predict(["CCC"])
# output: array([0.29])

Molpipeline also provides custom estimators for standard cheminformatics tasks that can be integrated into pipelines, like clustering for scaffold splits (see also the notebook):

from molpipeline.estimators import MurckoScaffoldClustering

scaffold_smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
]
linear_smiles = ["CC", "CCC", "CCCN"]

# run the scaffold clustering
scaffold_clustering = MurckoScaffoldClustering(
    make_generic=False, linear_molecules_strategy="own_cluster", n_jobs=16
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)
# output: array([1., 0., 1., 2., 2., 2.])

License

This software is licensed under the MIT license. See the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

molpipeline-0.8.0.tar.gz (84.7 kB view details)

Uploaded Source

Built Distribution

molpipeline-0.8.0-py3-none-any.whl (123.8 kB view details)

Uploaded Python 3

File details

Details for the file molpipeline-0.8.0.tar.gz.

File metadata

  • Download URL: molpipeline-0.8.0.tar.gz
  • Upload date:
  • Size: 84.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for molpipeline-0.8.0.tar.gz
Algorithm Hash digest
SHA256 492b7e919c158ee1c162ff2718a9a4b78aac4f3f9bfd3a0b5f8fd0589b09521b
MD5 e8183ea5fb78c13f1c741608be0af911
BLAKE2b-256 5665be48ab7e465900728e03ad77a407f434e3f5caf5861a5affd7f5b2adca26

See more details on using hashes here.

File details

Details for the file molpipeline-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: molpipeline-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 123.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for molpipeline-0.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d85ed85a53c54ee478a4b5842ffaf3f01a112380686783eb573bf78429bc889a
MD5 7fb90fec0285e6ec349651b8c460e04d
BLAKE2b-256 494a93d8d3fa4f47ef38ad412af770169ad1149d399994232c4d9ed5c4bd6a3b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page