Chemperium package

Chemperium

Portmanteau of the Latin words Chemia and Imperium: "a chemical empire".

Chemperium is a deep learning toolkit that aims to conquer the chemical space of compounds and properties. Its main focus is the applicability and accuracy of trained models. Although many publications, tools, and datasets on molecular property prediction already exist, Chemperium targets both experts and non-experts in cheminformatics who need fast and accurate predictions.
In this package, we provide a validated software tool and trained machine learning models to make reliable molecular property predictions with a minimum of code and time.

1. Installation

Chemperium is built upon NumPy, Pandas, RDKit, TensorFlow, Keras, and Scikit-Learn. The package can be installed with pip from a clone of the repository.

Create a virtual environment in Anaconda and install the package:

conda create -n chemperium python=3.11
conda activate chemperium
git clone https://github.com/mrodobbe/chemperium.git
cd chemperium
pip install .
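
After installation, a short check can confirm that the dependencies listed above are visible in the active environment. This is a minimal sketch and not part of chemperium: the helper name missing_packages is hypothetical, and the module names (for example sklearn for Scikit-Learn) are the usual import names of these libraries, assumed here.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Usual import names of the dependencies named in this section (assumed).
DEPENDENCIES = ["numpy", "pandas", "rdkit", "tensorflow", "keras", "sklearn", "chemperium"]

if __name__ == "__main__":
    missing = missing_packages(DEPENDENCIES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only looks up each module without importing it, so it stays fast even when TensorFlow is installed.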

2. Usage

Chemperium can be loaded as a Python package with import chemperium. It offers several options for predicting molecular properties and for training new models.

Predicting properties with chemperium

A distinction is made between liquid-phase properties and thermochemistry.

Liquid-phase properties

Liquid-phase properties are predicted with the class chemperium.training.predict.Liquid. It requires the target property, the dimension of molecular information (2D or 3D), and the location of the trained models.


2D example for boiling point

import chemperium as cp

# Replace <folder> with the path to the directory that holds the trained models.
bp_model = cp.Liquid("bp", "2d", "<folder>")
prediction = bp_model.predict("COc1ccccc1")

(currently supported properties: bp, tc, pc, vp, logp, logs)
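
Because the property and dimension are passed as plain strings, it can be useful to validate them before loading a model. The sketch below is an illustration under assumptions: check_liquid_request is a hypothetical helper, not a chemperium function, and lowercase spelling is assumed only because the examples here use it.

```python
SUPPORTED_PROPERTIES = ("bp", "tc", "pc", "vp", "logp", "logs")
SUPPORTED_DIMENSIONS = ("2d", "3d")


def check_liquid_request(prop, dimension):
    """Normalize and validate the first two arguments before loading a model."""
    prop = prop.strip().lower()
    dimension = dimension.strip().lower()
    if prop not in SUPPORTED_PROPERTIES:
        raise ValueError(f"Unsupported property {prop!r}; choose from {SUPPORTED_PROPERTIES}")
    if dimension not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"Unsupported dimension {dimension!r}; choose from {SUPPORTED_DIMENSIONS}")
    return prop, dimension


print(check_liquid_request("BP", "2D"))  # prints ('bp', '2d')
```

The returned pair can then be passed as the first two arguments of cp.Liquid.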

Thermochemistry

Thermochemical properties are predicted in a similar way with the class chemperium.training.predict.Thermo. Separate functions predict the enthalpy of formation, the entropy of formation, and the Gibbs free energy of formation. Predictions are possible at temperatures between 298 K and 1500 K. When 3D predictions are chosen, Δ-machine learning is used and a lower level-of-theory estimate must be provided. Currently, all enthalpy predictions are in kcal/mol and all entropy predictions are in cal/mol/K.
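
Since enthalpy and entropy come in different energy units (kcal versus cal), combining them by hand needs a factor of 1000. The sketch below shows the standard relation G = H - T*S with that conversion, plus a conversion to kJ/mol. The helper names and the input values are hypothetical, and this is not a description of how predict_gibbs works internally.

```python
KCAL_TO_KJ = 4.184  # thermochemical calorie: 1 kcal = 4.184 kJ


def check_temperature(t):
    """Raise if t (in K) lies outside the 298-1500 K range stated above."""
    if not 298.0 <= t <= 1500.0:
        raise ValueError(f"Temperature {t} K is outside 298-1500 K")
    return t


def gibbs_kcal_per_mol(h_kcal_per_mol, s_cal_per_mol_k, t):
    """Combine H (kcal/mol) and S (cal/mol/K) into G = H - T*S in kcal/mol."""
    check_temperature(t)
    return h_kcal_per_mol - t * s_cal_per_mol_k / 1000.0


def kcal_to_kj(value_kcal):
    """Convert an energy from kcal/mol to kJ/mol."""
    return value_kcal * KCAL_TO_KJ


# Hypothetical values, for illustration only (not chemperium output).
g = gibbs_kcal_per_mol(-10.0, 80.0, 1000.0)
print(g, kcal_to_kj(g))
```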


3D example for CBS-QB3

import chemperium as cp

smi = "COc1ccccc1"
xyz = '16\n' \
      '\n' \
      'C          2.76930        0.32250       -0.00050\n' \
      'O          1.76340       -0.67620       -0.00000\n' \
      'C          0.45600       -0.27750       -0.00000\n' \
      'C         -0.49220       -1.31180       -0.00020\n' \
      'C         -1.84930       -1.01160       -0.00010\n' \
      'C         -2.28360        0.31900        0.00010\n' \
      'C         -1.33830        1.34160        0.00020\n' \
      'C          0.03130        1.05620        0.00020\n' \
      'H          3.72200       -0.21080       -0.00090\n' \
      'H          2.71000        0.95720       -0.89500\n' \
      'H          2.71080        0.95730        0.89390\n' \
      'H         -0.13750       -2.33800       -0.00030\n' \
      'H         -2.57430       -1.82150       -0.00030\n' \
      'H         -3.34470        0.55070        0.00010\n' \
      'H         -1.65940        2.38020        0.00040\n' \
      'H          0.74700        1.87060        0.00030'
llot = -5.13245

# Replace <folder> with the path to the directory that holds the trained models.
thermo = cp.Thermo("cbs-qb3", "3d", "<folder>")

# Predict the standard enthalpy of formation at 298 K
h298_prediction = thermo.predict_enthalpy(smi, xyz, llot, quality_check=True)

# Predict the Gibbs free energy at 1000 K
g1000_prediction = thermo.predict_gibbs(smi, xyz, llot, t=1000)

# Predict the thermochemistry in Chemkin format
chemkin_inp = thermo.get_nasa_polynomials("anisole", smi, xyz, llot, chemkin=True)
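
The 3D functions take the geometry as an XYZ-format string: an atom count, a comment line, and one line per atom. A hand-written string is easy to get wrong, so the sketch below checks that the atom count matches the coordinate lines. parse_xyz is a hypothetical helper, not part of chemperium, and the water coordinates are made up for illustration.

```python
def parse_xyz(xyz):
    """Parse an XYZ-format string into (symbol, x, y, z) tuples.

    Raises ValueError if the atom count on the first line does not match
    the number of coordinate lines.
    """
    lines = xyz.split("\n")
    n_atoms = int(lines[0].strip())
    # Line 2 is the comment line; skip empty lines after the coordinates.
    atom_lines = [line for line in lines[2:] if line.strip()]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"Header says {n_atoms} atoms, found {len(atom_lines)}")
    atoms = []
    for line in atom_lines:
        symbol, x, y, z = line.split()
        atoms.append((symbol, float(x), float(y), float(z)))
    return atoms


# Small water example with made-up coordinates, for illustration only.
water = "3\n\nO 0.000 0.000 0.117\nH 0.000 0.757 -0.467\nH 0.000 -0.757 -0.467"
print(len(parse_xyz(water)))  # prints 3
```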

Training machine learning models

The Thermo and Liquid models are trained in advance. It is also possible to train models yourself with the function chemperium.training.train.train. It takes the location of a CSV file with training data, a list of target properties, the folder in which to save the trained models, and (optionally) a dictionary with training arguments.

Training a 3D MPNN for prediction of logP and logS:

import chemperium as cp

csv_location = "examples/example_data.csv"
props = ["logp", "logs"]
save_dir = "examples/output"
input_args = {"rdf": True, 
              "cutoff": 2.1, 
              "num_layers": 3, 
              "hidden_size": 128, 
              "depth": 4}
cp.train(csv_location, props, save_dir, input_args)
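
Whether cp.train creates the output folder itself is not stated here, so a cautious script can check the input file and prepare the folder first. The following is a minimal sketch; prepare_training_paths is a hypothetical helper, not a chemperium function.

```python
from pathlib import Path


def prepare_training_paths(csv_location, save_dir):
    """Check that the CSV exists and make sure the output folder is present."""
    csv_path = Path(csv_location)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Training data not found: {csv_path}")
    out_dir = Path(save_dir)
    # Create the folder (and any missing parents) if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    return str(csv_path), str(out_dir)
```

The returned strings can be passed to cp.train in place of csv_location and save_dir.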

Testing trained machine learning models

Models trained with the function chemperium.training.train.train can be used for prediction with the function chemperium.training.test.test. Its usage closely resembles that of the train function, and it takes the following arguments:

  • smiles: a list of SMILES identifiers
  • prop: the target property or properties
  • save_dir: the folder where the models are stored
  • xyz: (optional) a list with the 3D coordinates of the target compounds
  • return_results: (optional) a bool that states whether the results should be returned as a DataFrame. Defaults to False
  • input_args: (optional) a dictionary with the training arguments of the trained models

Testing a 3D MPNN for prediction of logP and logS:

import chemperium as cp

smi = ["COc1ccccc1"]
xyz = ['16\n' \
       '\n' \
       'C          2.76930        0.32250       -0.00050\n' \
       'O          1.76340       -0.67620       -0.00000\n' \
       'C          0.45600       -0.27750       -0.00000\n' \
       'C         -0.49220       -1.31180       -0.00020\n' \
       'C         -1.84930       -1.01160       -0.00010\n' \
       'C         -2.28360        0.31900        0.00010\n' \
       'C         -1.33830        1.34160        0.00020\n' \
       'C          0.03130        1.05620        0.00020\n' \
       'H          3.72200       -0.21080       -0.00090\n' \
       'H          2.71000        0.95720       -0.89500\n' \
       'H          2.71080        0.95730        0.89390\n' \
       'H         -0.13750       -2.33800       -0.00030\n' \
       'H         -2.57430       -1.82150       -0.00030\n' \
       'H         -3.34470        0.55070        0.00010\n' \
       'H         -1.65940        2.38020        0.00040\n' \
       'H          0.74700        1.87060        0.00030']
props = ["logp", "logs"]
save_dir = "examples/output"
input_args = {"rdf": True, 
              "cutoff": 2.1, 
              "num_layers": 3, 
              "hidden_size": 128, 
              "depth": 4}
results = cp.test(smi, props, save_dir, xyz, True, input_args)

Creating a learned representation

3. Scripts

It is also possible to train and test models from the command line. The commands below reproduce the training and testing examples from the Usage section.

Training a model via command line

The script can be found in scripts/train.py.

python train.py --data "examples/example_data.csv" --save_dir "examples/output" --property "logp,logs" \
    --rdf --cutoff 2.1 --num_layers 3 --hidden_size 128 --depth 4

Testing a model via command line

python test.py --test_data "examples/example_test_data.csv" --save_dir "examples/output" --property "logp,logs" \
    --rdf --cutoff 2.1 --num_layers 3 --hidden_size 128 --depth 4
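
The command-line flags in these examples carry the same names as the keys of the input_args dictionary used in the Python examples. The sketch below translates such a dictionary into command-line tokens. build_cli_args is a hypothetical helper, and the one-to-one mapping between keys and flags is an assumption that this document only shows for these five options.

```python
def build_cli_args(input_args):
    """Translate a training-argument dictionary into command-line tokens."""
    tokens = []
    for key, value in input_args.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            # Boolean options such as --rdf are passed as bare flags when True.
            if value:
                tokens.append(flag)
        else:
            tokens.extend([flag, str(value)])
    return tokens


input_args = {"rdf": True, "cutoff": 2.1, "num_layers": 3, "hidden_size": 128, "depth": 4}
print(" ".join(build_cli_args(input_args)))
# prints: --rdf --cutoff 2.1 --num_layers 3 --hidden_size 128 --depth 4
```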

4. Tutorial

A small demo notebook is available in notebooks/demo.ipynb.

5. Datasets

All datasets are available in the Zenodo repository.

6. Reference

When using chemperium for your own work, please cite the original publication:
M. R. Dobbelaere, I. Lengyel, C. V. Stevens, and K. M. Van Geem, Geometric Deep Learning for Molecular Property Predictions with Chemical Accuracy Across Chemical Space, Submitted, 2024.

@ARTICLE{Dobbelaere2024,
  title     = "Geometric Deep Learning for Molecular Property Predictions with Chemical Accuracy Across Chemical Space",
  author    = "Dobbelaere, Maarten R and Lengyel, Istvan and Stevens,
               Christian V and Van Geem, Kevin M",
  journal   = "Submitted",
  year      =  2024,
  language  = "en"
}

Acknowledgments

This software tool has been developed with support from the Research Fund of Flanders (FWO-Vlaanderen, grant 1S45522N), the European Research Council (ERC grant 818607), and the European Union's Horizon Programme (grant 101057816, "TransPharm").
