This is the code for "SolvBERT for solvation free energy and solubility prediction: a demonstration of an NLP model for predicting the properties of molecular complexes" paper. The preprint of this paper can be found in ChemRxiv with https://doi.org/10.26434/chemrxiv-2022-0hl5p
Project description
SolvBERT
Description
This is the code for "SolvBERT for solvation free energy and solubility prediction: a demonstration of an NLP model for predicting the properties of molecular complexes" paper. The preprint of this paper can be found in ChemRxiv with https://doi.org/10.26434/chemrxiv-2022-0hl5p
Installation
- python 3.7.12
- transformers 2.11.0
- wandb 0.12.15
- tokenizers 0.7.0
- rxnfp 0.0.7
use pip install solv-bert==0.0.7
install this package
Dataset
CombiSolv-QM. The CombiSolv-QM dataset originally came from a study by Vermeire et al. who computed the dataset using a commercial software called COSMOtherm. The dataset consists of 1 million datapoints randomly selected from all possible combinations of 284 commonly used solvents and 11,029 solutes.
CombiSolv-Exp-8780. We managed to download a portion of the CombiSolv-Exp dataset, which originally contains experimental solvent free energy data for 10,145 different solute and solvents combinations from Vermeire et al. The dataset was curated from multiple sources, including the Minnesota Solvation Database, the FreeSolv database, the CompSol database, and a dataset published by Abraham et al. We named the downloaded subset containing 8,780 combinations as CombiSolv-Exp-8780 to distinguish it from the original CombiSolv-Exp dataset.
Solubility. The Solubility dataset was originally from Boobier et al. It was curated from the Open Notebook Science Challenges water solubility dataset and the Reaxys database. This dataset includes ethanol with 695 solutes, benzene with 464 solutes, acetone with 452 solutes, and water with 900 solutes, for a total of 2,511 different combinations, with solubility expressed as log S.
Train
Model use the SMILESlanguagemodel.py to train and the Finetune.py to fine-tune. Each pre-training (SMILESlanguagemodel.py) needs to save the best model for the next step of fine-tune(Finetune.py).
Test
Model use the Predictandeval.py to shart testing. In this step, we also need to save the best model trained in the previous step (Finetune.py) for prediction and evaluation
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.