A Human-in-the-loop active learning workflow to improve molecular property predictors with human expert feedback for goal-oriented molecule generation.
Human-in-the-loop Active Learning for Goal-Oriented Molecule Generation
We present an interactive workflow to fine-tune predictive machine learning models for target molecular properties based on expert feedback and foster human-machine collaboration for goal-oriented molecular design and optimization.
In this study, we simulated the process of producing novel drug candidates using machine learning and then validating them in the lab.
The goal is to generate a high number of successful top-scoring molecules (i.e., promising with respect to a target molecular property) according to both the machine learning predictive model, which scores the molecules at each iteration of the drug design process, and the lab simulator, which evaluates the best molecules at the end of the process.
Because simulators are expensive to query, both at each iteration of the drug design process and for iteratively fine-tuning the predictive model (i.e., active learning), we instead allow "weaker" oracles (i.e., human experts) to be queried for fine-tuning the predictive model (i.e., human-in-the-loop active learning).
Human experts supervised the predictive machine learning model that scores the molecules by accepting or refuting some of its predictions of top-scoring molecules. This improved the process' outcome by progressively aligning the machine-predicted probabilities of success with those of an experimental simulator and enhancing other metrics such as drug-likeness and synthetic accessibility.
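The accept/refute feedback described above can be illustrated with a minimal sketch of a simulated expert. It assumes the expert is modelled as the simulator's score perturbed by Gaussian noise and then thresholded; the function name `simulated_expert_feedback` and its parameters are hypothetical and are not taken from this package, whose actual noise model may differ.

```python
import random


def simulated_expert_feedback(oracle_score, sigma_noise=0.1, threshold=0.5, rng=None):
    """Return 1 (accept) or 0 (refute) for a single molecule.

    oracle_score: simulator probability in [0, 1] that the molecule is a success.
    sigma_noise:  standard deviation of the Gaussian noise that stands in for
                  expert imperfection (an assumption of this sketch).
    """
    rng = rng or random.Random()
    noisy_score = oracle_score + rng.gauss(0.0, sigma_noise)
    noisy_score = min(1.0, max(0.0, noisy_score))  # keep the score in [0, 1]
    return int(noisy_score >= threshold)


# Illustrative scores only; they are not results from the paper.
rng = random.Random(3)
scores = [0.05, 0.45, 0.55, 0.95]
labels = [simulated_expert_feedback(s, rng=rng) for s in scores]
print(labels)
```

With `sigma_noise=0.0` the simulated expert agrees with the simulator exactly; larger values make borderline molecules (scores near the threshold) more likely to be mislabelled.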
This workflow uses REINVENT 3.2 for molecule generation.
System Requirements
- Python 3.7
- This code and REINVENT 3.2 have been tested on Linux
Installation
- Install Conda
- Clone this Git repository
- In a shell terminal, go to the repository and clone the Reinvent repository from this URL
- In the cloned Reinvent repository, run
  $ cp configs/example.config.json configs/config.json
- Create the Conda environment for REINVENT using
  $ conda env create -f reinvent.yml
- Activate the environment, then install reinvent-scoring/ as a pip package using
  $ conda activate reinvent.v3.2
  $ pip install -e ./reinvent-scoring/
  $ conda deactivate
- Copy or move the reinvent.v3.2 environment directory into this repository, or modify its path in HITL_AL_GOMG/path.py
- Create the Conda environment for the HITL-AL workflow using
  $ conda env create -f environment.yml
- Activate the environment, then install this repository as a pip package using
  $ conda activate hitl-al-gomg
  $ pip install -e .
Usage
Below are command examples for training a target property predictor (e.g., for DRD2 bioactivity) and running the workflow using a simulated expert.
For training the predictor:
- Go to HITL_AL_GOMG/models/ and create a directory predictors/ to store the trained predictors and a directory simulators/ to store the simulators
In simulators/, you need to have a copy of the DRD2 bioactivity simulator (drd2.pkl) which you can download from this URL. Then, you can run
$ python train.py --task drd2 --path_to_param_grid ../../example_files/rfc_param_grid.json --train True --demo True
The directory example_files/ contains examples of hyperparameter grids for scikit-learn Random Forest models.
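As a rough illustration of what a hyperparameter grid describes, the sketch below parses a JSON grid and expands it into every candidate setting. The grid contents are hypothetical (the keys are real scikit-learn `RandomForestClassifier` parameters, but the values are not copied from `rfc_param_grid.json`), and `expand_grid` is a helper written for this example, not part of the package.

```python
import itertools
import json

# Hypothetical grid: the values are illustrative, not those shipped in example_files/.
grid_json = """
{
  "n_estimators": [100, 300],
  "max_depth": [10, null],
  "min_samples_leaf": [1, 5]
}
"""


def expand_grid(param_grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))


param_grid = json.loads(grid_json)  # JSON null becomes Python None
candidates = list(expand_grid(param_grid))
print(len(candidates))  # 8 combinations (2 x 2 x 2)
```

A dictionary of this shape can also be passed as the `param_grid` argument of scikit-learn's `GridSearchCV`; whether `train.py` searches the grid that way is not stated here.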
For running the HITL-AL workflow using a simulated expert:
- Create an output directory to store REINVENT generation results and change the variable demos in HITL_AL_GOMG/path.py to the path of your output directory
- In HITL_AL_GOMG/, run a simulation
  - without HITL active learning:
    $ python run.py --seed 3 --rounds 4 --num_opt_steps 100 --scoring_model drd2 --model_type classification --scoring_component_name bioactivity --threshold_value 0.5 --dirname demo_drd2 --init_train_set drd2_train --acquisition None --task drd2 --expert_model drd2
  - then with HITL active learning (e.g., using entropy-based sampling):
    $ python run.py --seed 3 --rounds 4 --num_opt_steps 100 --scoring_model drd2 --model_type classification --scoring_component_name bioactivity --threshold_value 0.5 --dirname demo_drd2 --init_train_set drd2_train --acquisition entropy --al_iterations 5 --n_queries 10 --noise 0.1 --task drd2 --expert_model drd2
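To give an intuition for entropy-based sampling, here is a minimal sketch for a binary classifier: the molecules whose predicted probability is closest to 0.5 have the highest predictive entropy and are the ones sent to the expert. The helper names `binary_entropy` and `select_queries` are hypothetical, and the acquisition function in this package may be implemented differently.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def select_queries(probabilities, n_queries):
    """Return indices of the n_queries most uncertain predictions."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: binary_entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:n_queries]


# Illustrative predicted probabilities for five generated molecules.
probs = [0.02, 0.48, 0.91, 0.60, 0.35]
print(select_queries(probs, n_queries=2))  # [1, 3]
```

In the command above, `--n_queries 10` would correspond to selecting ten such molecules per active learning iteration.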
For calculating oracle scores and metrics from MOSES:
- In the REINVENT output directory, create a data_for_figures/ directory to store all metric values
- In HITL_AL_GOMG/, run
  $ python evaluate_results.py --job_name demo_drd2 --seed 3 --rounds 4 --n_opt_steps 100 --task drd2 --model_type classification --score_component_name bioactivity --scoring_model drd2 --init_data drd2 --acquisition None
  $ python evaluate_results.py --job_name demo_drd2 --seed 3 --rounds 4 --n_opt_steps 100 --task drd2 --model_type classification --score_component_name bioactivity --scoring_model drd2 --init_data drd2 --acquisition entropy --al_iterations 5 --n_queries 10 --sigma_noise 0.1
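The goal stated in the introduction (molecules that score highly under both the predictive model and the simulator) can be expressed as a simple count. The sketch below is only an illustration of that idea; `count_successes` is a hypothetical helper, the 0.5 threshold mirrors `--threshold_value 0.5` from the usage commands, and the metrics actually computed by `evaluate_results.py` may differ.

```python
def count_successes(model_scores, oracle_scores, threshold=0.5):
    """Count molecules that both the predictor and the oracle score at or above threshold."""
    if len(model_scores) != len(oracle_scores):
        raise ValueError("model_scores and oracle_scores must have the same length")
    return sum(
        1
        for model_score, oracle_score in zip(model_scores, oracle_scores)
        if model_score >= threshold and oracle_score >= threshold
    )


# Illustrative scores for four molecules; not results from the paper.
model_scores = [0.9, 0.8, 0.7, 0.2]
oracle_scores = [0.95, 0.3, 0.6, 0.1]
print(count_successes(model_scores, oracle_scores))  # 2
```

A large gap between the number of molecules the predictor rates highly and this joint count suggests the predictor is overconfident relative to the simulator, which is the misalignment that expert feedback is meant to reduce.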
Data
- We provide data for training the penalized LogP and DRD2 bioactivity predictors, as well as a sample from ChEMBL on which the REINVENT prior agent was pre-trained.
- A copy of the REINVENT pre-trained prior agent is available at HITL_AL_GOMG/models/priors/random.prior.new.
- The experimental simulator for DRD2 bioactivity and the hERG model described in the multi-objective generation use case are available at this URL.
Notebooks
In notebooks/, we provide Jupyter notebooks with code to reproduce the paper's figures for both simulation and human experiments.
Acknowledgements
- We acknowledge the following works, which were extremely helpful in developing this workflow:
  - Sundin, I., Voronov, A., Xiao, H. et al. Human-in-the-loop assisted de novo molecular design. J Cheminform 14, 86 (2022). https://doi.org/10.1186/s13321-022-00667-8
  - Bickford Smith, F., Kirsch, A., Farquhar, S., Gal, Y., Foster, A., Rainforth, T. Prediction-oriented Bayesian active learning. International Conference on Artificial Intelligence and Statistics (2023). https://arxiv.org/abs/2304.08151
- We thank Vincenzo Palmacci for his contribution in refactoring parts of this code.