bambu (bioassays model builder) is a CLI tool to build QSAR models based on PubChem BioAssays datasets
Bambu
Bambu (BioAssays Model Builder) is a simple tool to generate QSAR models based on PubChem BioAssays datasets. It relies on RDKit and the FLAML AutoML framework and provides utilities for downloading and preprocessing datasets, as well as for training and running the predictive models.
Try it!
Try Bambu on this Google Colab Notebook ^^
Installing
Installing as a PyPI package using pip:
$ pip install bambu-qsar
Note: RDKit must be installed separately.
Installing as an environment using conda on Linux:
$ git clone https://github.com/omixlab/bambu-v2
$ cd bambu-v2
$ make setup PLATFORM=linux
$ conda activate bambu-qsar
Running it with Docker
$ docker run -ti omixlab/bambu-qsar:latest
Downloading PubChem BioAssays data
Downloads PubChem BioAssays data and saves it in a CSV file containing the InChI representation of each molecule and a label indicating whether it was found to be active or inactive against the given target.
$ bambu-download \
--pubchem-assay-id 29 \
--pubchem-InchI-chunksize 100 \
--output 29_raw.csv
The generated output contains the columns pubchem_molecule_id (Substance ID or Compound ID, depending on the option selected during download), InChI and activity. Only the fields InChI and activity are used in further steps.
pubchem_molecule_id | pubchem_molecule_type | InChI | activity |
---|---|---|---|
596 | compounds | InChI=1S/C9H13N3O5/c10-5-1-2-12(9(16)11-5)8-7(15)6(14)4(3-13)17-8/h1-2,4,6-8,13-15H,3H2,(H2,10,11,16) | active |
1821 | compounds | InChI=1S/C9H11FN2O6/c10-3-1-12(9(17)11-7(3)16)8-6(15)5(14)4(2-13)18-8/h1,4-6,8,13-15H,2H2,(H,11,16,17) | active |
2019 | compounds | InChI=1S/C62H86N12O16/c1-27(2)42-59(84)73-23-17-19-36(73)57(82)69(13)25-38(75)71(15)48(29(5)6)61(86)88-33(11)44(55(80)65-42)67-53(78)35-22-21-31(9)51-46(35)64-47-40(41(63)50(77)32(10)52(47)90-51)54(79)68-45-34(12)89-62(87)49(30(7)8)72(16)39(76)26-70(14)58(83)37-20-18-24-74(37)60(85)43(28(3)4)66-56(45)81/h21-22,27-30,33-34,36-37,42-45,48-49H,17-20,23-26,63H2,1-16H3,(H,65,80)(H,66,81)(H,67,78)(H,68,79) | active |
2082 | compounds | InChI=1S/C12H15N3O2S/c1-3-6-18-8-4-5-9-10(7-8)14-11(13-9)15-12(16)17-2/h4-5,7H,3,6H2,1-2H3,(H2,13,14,15,16) | active |
2569 | compounds | InChI=1S/C15H19N3O5/c1-8-11(17-3-4-17)14(20)10(9(22-2)7-23-15(16)21)12(13(8)19)18-5-6-18/h9H,3-7H2,1-2H3,(H2,16,21) | active |
2674 | compounds | InChI=1S/C29H26O10/c1-10(30)5-12-18-19-13(6-11(2)31)29(37-4)27(35)21-15(33)8-17-23(25(19)21)22-16(38-9-39-17)7-14(32)20(24(18)22)26(34)28(12)36-3/h7-8,10-11,30-31,34-35H,5-6,9H2,1-4H3 | active |
2693 | compounds | InChI=1S/C31H30N6O6S4/c1-33-25(42)30(15-38)34(2)23(40)28(33,44-46-30)12-17-13-36(21-11-7-4-8-18(17)21)27-14-29-24(41)35(3)31(16-39,47-45-29)26(43)37(29)22(27)32-20-10-6-5-9-19(20)27/h4-11,13,22,32,38-39H,12,14-16H2,1-3H3 | active |
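Under the hood, this step talks to PubChem's public PUG REST API. The snippet below is only a rough, hand-rolled sketch of that idea (not bambu's actual code): it pulls the concise activity table for AID 29 and fetches InChIs in chunks of 100 CIDs. The URL layout and column names follow PUG REST's concise JSON format, and the resulting CSV columns differ slightly from the ones bambu-download produces.

```python
import requests
import pandas as pd

AID = 29
PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Concise activity table for the assay: one row per tested substance/compound
table = requests.get(f"{PUG}/assay/aid/{AID}/concise/JSON").json()["Table"]
columns = table["Columns"]["Column"]
rows = pd.DataFrame([dict(zip(columns, row["Cell"])) for row in table["Row"]])
rows = rows[rows["Activity Outcome"].isin(["Active", "Inactive"])]

# Fetch InChIs in chunks of 100 CIDs (analogous to --pubchem-InchI-chunksize)
cids = sorted(set(rows["CID"]) - {""})
properties = []
for start in range(0, len(cids), 100):
    chunk = ",".join(cids[start:start + 100])
    response = requests.get(f"{PUG}/compound/cid/{chunk}/property/InChI/JSON")
    properties += response.json()["PropertyTable"]["Properties"]

inchis = pd.DataFrame(properties)          # columns: CID, InChI
inchis["CID"] = inchis["CID"].astype(str)  # align types for the merge
dataset = rows.merge(inchis, on="CID")[["CID", "InChI", "Activity Outcome"]]
dataset.to_csv("29_raw.csv", index=False)
```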
Computing descriptors or fingerprints
Computes molecule descriptors or Morgan fingerprints for a dataset produced by bambu-download (or following the same format). The output is split into train and test subsets, whose sizes are defined by the --train-test-split argument. The --undersample flag may be used to perform a random undersampling of the dataset, as most HTS datasets are heavily unbalanced. The path passed to --output is used as a template to generate the train and test files, in this case 29_preprocess_train.csv and 29_preprocess_test.csv respectively.
$ bambu-preprocess \
--input 29_raw.csv \
--output 29_preprocess.csv \
--output-preprocessor 29_preprocessor.pickle \
--feature-type morgan-2048 \
--train-test-split 0.75 \
--undersample
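Conceptually, the morgan-2048 feature type corresponds to 2048-bit Morgan fingerprints of radius 2 as computed by RDKit. The sketch below illustrates the general recipe with RDKit and scikit-learn, assuming the InChI and activity columns produced by bambu-download; it is not bambu's implementation, and the actual preprocessor object (saved to 29_preprocessor.pickle) and output CSV layout may differ.

```python
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split

data = pd.read_csv("29_raw.csv")

fingerprints, labels = [], []
for inchi, activity in zip(data["InChI"], data["activity"]):
    mol = Chem.MolFromInchi(inchi)
    if mol is None:  # skip InChIs that RDKit cannot parse
        continue
    bits = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    array = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(bits, array)
    fingerprints.append(array)
    labels.append(1 if activity == "active" else 0)

X, y = np.stack(fingerprints), np.array(labels)

# Random undersampling so both classes end up with the same number of examples
rng = np.random.default_rng(42)
smallest = min((y == 0).sum(), (y == 1).sum())
keep = np.concatenate([
    rng.choice(np.where(y == label)[0], size=smallest, replace=False)
    for label in (0, 1)
])

X_train, X_test, y_train, y_test = train_test_split(
    X[keep], y[keep], train_size=0.75, stratify=y[keep], random_state=42)
```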
Train
Trains a classification model using the FLAML AutoML framework based on the bambu-preprocess output datasets. The user may adjust most of the flaml.automl.AutoML parameters through command-line arguments. In this case we are using an Extra Trees classifier.
$ bambu-train \
--input-train 29_preprocess_train.csv \
--output 29_model.pickle \
--time-budget 3600 \
--estimators extra_tree
A list of all available estimators can be accessed using the command bambu-train --list-estimators. Currently, only rf (Random Forest) and extra_tree (Extra Trees) are available.
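For reference, the equivalent call to FLAML directly looks roughly like the sketch below. It assumes the preprocessed CSV stores one feature per column plus an activity label column, which may not match bambu's exact file layout.

```python
import pickle
import pandas as pd
from flaml import AutoML

train = pd.read_csv("29_preprocess_train.csv")
X_train = train.drop(columns=["activity"])  # assumed: one feature per column
y_train = train["activity"]

automl = AutoML()
automl.fit(
    X_train=X_train,
    y_train=y_train,
    task="classification",
    time_budget=3600,               # seconds, mirrors --time-budget
    estimator_list=["extra_tree"],  # mirrors --estimators extra_tree
)

with open("29_model.pickle", "wb") as handle:
    pickle.dump(automl, handle)
```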
Validation
A y-randomization validation can be performed using the command bambu-validate, which computes accuracy, recall, precision, F1-score and ROC AUC by training on the original training dataset and validating on the test one, and then randomizes the training labels several times (--randomizations). For each randomization, the classification metrics are computed again and a significance value (p-value) is computed based on the z-score-normalized metrics.
$ bambu-validate \
--input-train 29_preprocess_train.csv \
--input-test 29_preprocess_test.csv \
--model 29_model.pickle \
--output 29_model.validation.json \
--randomizations 100
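The idea behind y-randomization is to retrain the model on shuffled labels and check that the real model scores well above that null distribution. A minimal sketch using scikit-learn, reduced to ROC AUC only (bambu computes several metrics and may aggregate them differently), could look like this:

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score

def test_roc_auc(X_train, y_train, X_test, y_test):
    model = ExtraTreesClassifier(random_state=0).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

def y_randomization(X_train, y_train, X_test, y_test, randomizations=100):
    rng = np.random.default_rng(0)
    observed = test_roc_auc(X_train, y_train, X_test, y_test)
    null_scores = [
        # shuffling the labels breaks the structure-activity relationship
        test_roc_auc(X_train, rng.permutation(y_train), X_test, y_test)
        for _ in range(randomizations)
    ]
    # z-score of the real model against the null distribution, one-sided p-value
    z = (observed - np.mean(null_scores)) / np.std(null_scores)
    return observed, norm.sf(z)
```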
Predict
Receives an input file, preprocesses it using a preprocessor object (generated using bambu-preprocess) and then runs a classification model (generated using bambu-train). Results are saved in a CSV file.
$ bambu-predict \
--input pubchem_compounds.sdf \
--preprocessor 29_preprocessor.pickle \
--model 29_model.pickle \
--output 29_predictions.csv
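In rough terms, prediction amounts to featurizing the input molecules the same way as during preprocessing and feeding them to the trained model. The sketch below assumes a scikit-learn/FLAML-style model pickle and reuses the Morgan fingerprint recipe from the preprocessing sketch; in bambu this featurization is handled by the preprocessor pickle instead.

```python
import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

with open("29_model.pickle", "rb") as handle:
    model = pickle.load(handle)

names, fingerprints = [], []
for mol in Chem.SDMolSupplier("pubchem_compounds.sdf"):
    if mol is None:  # skip records RDKit cannot parse
        continue
    bits = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    array = np.zeros((2048,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(bits, array)
    names.append(mol.GetProp("_Name"))
    fingerprints.append(array)

predictions = model.predict(np.stack(fingerprints))
pd.DataFrame({"molecule": names, "prediction": predictions}).to_csv(
    "29_predictions.csv", index=False)
```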
Contact
Feel free to open issues or pull requests! You may also contact us by email.
Dr. Frederico Schmitt Kremer, PhD. E-mail: fred.s.kremer@gmail.com.