Package for prediction of chemical species properties from SMILES.
tgBoost
tgBoost is a pipeline encompassing a QSPR model optimized for predicting the glass transition temperature (Tg) of monomer organic compounds. The pipeline is based on mol2vec, a machine learning (ML) algorithm that converts molecular SMILES into molecular embeddings. The pipeline can be expanded to include further QSAR/QSPR models developed from SMILES notation.
Motivation
tgBoost is a kickstart project that aims to expand the use of ML, data engineering, and QSAR/QSPR models in atmospheric and physical chemistry. The pipeline comes with a pretrained, ML-powered QSPR model that predicts the Tg of monomer organic compounds. The model is based on the Extreme Gradient Boosting framework (XGBoost) and was developed from the largest dataset of experimental Tg of monomer organic molecules (Koop et al., 2011).
Requirements
- Python >=3.6.0 (Python 2.x is not supported)
- NumPy
- pandas
- scikit-learn
- gensim
- RDKit
- mol2vec
- xgboost
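Before installing tgBoost, it can help to check which of the requirements above are already available. The following is a minimal sketch using only the standard library; the helper names `missing_modules` and `python_ok` are illustrative and not part of tgBoost, and the import names are assumed to be the usual ones for each package (scikit-learn is imported as `sklearn`).

```python
import sys
from importlib.util import find_spec

# Usual import names for the packages listed above (assumption: none of them
# has been renamed in your environment). scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "pandas", "sklearn", "gensim", "rdkit", "mol2vec", "xgboost"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter version is at least 3.6.0."""
    return tuple(version_info[:3]) >= (3, 6, 0)


if __name__ == "__main__":
    if not python_ok():
        print("Python >=3.6.0 is required")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found")
```

This only checks that the modules can be located; it does not verify versions or that each package imports without errors.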
Installation
pip install git+https://github.com/U0M0Z/tgpipe
The tgBoost library requires a separate installation of mol2vec from GitHub via pip:
pip install git+https://github.com/samoturk/mol2vec
Build status
Build status of continuous integration, e.g. Travis, AppVeyor, etc.
Documentation
Details on the statistical analysis performed to develop the model and pipeline are found in the supporting article.
Usage
Basic use
This code uses tgPipeline to train tgBoost, a QSPR model for Tg prediction. The QSPR model is based on RDKit, mol2vec, and XGBoost. To use the model on your machine, you need to retrain it so that it conforms to the C++ signature of your processor.
The tgBoost model is built, trained, and saved in ./trained_models with the command:
python tgPipeline/tgboost/train_pipeline.py
Check for the following message to confirm successful model training:
*** EXTRACTION step
n_input SMILES: 415
*** TRANSFORMING step
n_output SMILES: 298
~~ DATA info
Xtrain: 298 ytrain: 298 Xtest: 0 ytest: 0
*** REGRESSION step
PIPELINE completed:
_ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ ^ ~ _ ~ ^ ~ _
__ ___ __
/ /____ _/ _ )___ ___ ___ / /_
/ __/ _ `/ _ / _ \/ _ \(_-</ __/
\__/\_, /____/\___/\___/___/\__/
/___/
_ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ _ ~ ^ ~ _ ~ ^ ~ _ ~ ^ ~ _
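If the training output is captured as text (for example, redirected to a file), the confirmation message can also be checked programmatically. The sketch below assumes the log has the same wording as the message shown above; `summarize_training_log` is an illustrative helper, not a tgBoost function.

```python
import re


def summarize_training_log(log_text):
    """Extract the SMILES counts and the completion flag from a training log."""

    def count(label):
        # Matches lines such as "n_input SMILES: 415".
        match = re.search(rf"{label} SMILES:\s*(\d+)", log_text)
        return int(match.group(1)) if match else None

    return {
        "n_input": count("n_input"),
        "n_output": count("n_output"),
        "completed": "PIPELINE completed" in log_text,
    }


# Example log copied from the message shown above (banner omitted).
example_log = """\
*** EXTRACTION step
n_input SMILES: 415
*** TRANSFORMING step
n_output SMILES: 298
~~ DATA info
Xtrain: 298 ytrain: 298 Xtest: 0 ytest: 0
*** REGRESSION step
PIPELINE completed:
"""

summary = summarize_training_log(example_log)
print(summary)
```

In the example message, 298 of the 415 input SMILES remain after the transforming step, and the run is reported as completed.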
As a Python module
from tgboost.processing import smiles_manager as sm
from tgboost import predict
The first line imports functions to open and preprocess files containing SMILES used for predictions, and the second line imports functions for predicting Tg of SMILES.
Check the notebooks repository for examples and details.
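The function names exposed by these two modules are not listed here. One way to discover them is to inspect the modules after installation. This is a minimal sketch using the standard library; `public_callables` is an illustrative helper, and the module paths are assumed from the import lines above.

```python
import importlib
import inspect


def public_callables(module_name):
    """Return sorted names of public functions and classes exposed by a module."""
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, obj in inspect.getmembers(module)
        if not name.startswith("_")
        and (inspect.isfunction(obj) or inspect.isclass(obj))
    )


if __name__ == "__main__":
    # Module paths assumed from the import lines above.
    for name in ("tgboost.processing.smiles_manager", "tgboost.predict"):
        try:
            print(name, public_callables(name))
        except ImportError as exc:
            print(f"{name}: not importable ({exc})")
```

The listing includes names the module imports from elsewhere, so treat it as a starting point and rely on the notebooks for actual usage.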
How to cite?
✨ 🍰 ✨
@Article{D1EA00090J,
author ="Galeazzo, Tommaso and Shiraiwa, Manabu",
title ="Predicting glass transition temperature and melting point of organic compounds via machine learning and molecular embeddings",
journal ="Environ. Sci.: Atmos.",
year ="2022",
volume ="2",
issue ="3",
pages ="362-374",
publisher ="RSC",
doi ="10.1039/D1EA00090J",
url ="http://dx.doi.org/10.1039/D1EA00090J"
}
Contribute
Contact at tommaso.galeazzo@gmail.com
Credits
Initial development was supported by AirUCI, Irvine, CA.
License
BSD 3-clause © Tommaso Galeazzo