Automated Machine Learning pipeline for bioactivity prediction using molecular fingerprints
DrugAutoML
DrugAutoML is an end-to-end Automated Machine Learning (AutoML) pipeline for bioactivity prediction in drug discovery.
It automates every stage, from reading raw datasets to generating predictions for new molecules, and produces both high-performance models and explainable outputs.
🚀 Features
- Data Preprocessing: Reads raw datasets, cleans and standardizes SMILES, removes invalid molecules, and labels compounds as active or inactive based on pChEMBL cutoffs or existing binary labels.
- Molecular Featurization: Generates ECFP (Extended-Connectivity Fingerprints) using RDKit with customizable radius, bit size, and count-based features.
- Data Splitting: Splits data into training and testing sets using:
  - Scaffold Split (structure-aware)
  - Stratified Random Split (class-proportion preserving)
- Model Selection: Hyperparameter optimization with Optuna for Random Forest, Extra Trees, Logistic Regression, Linear SVC, XGBoost, and LightGBM. Uses repeated stratified k-fold CV and produces a ranked leaderboard.
- Model Finalization: Trains the best model, applies probability calibration, selects optimal classification threshold, evaluates on the test set, and saves the model.
- Explainability:
  - SHAP global importance plots (beeswarm, bar, signed bar)
  - Bit Gallery visualizations: highlights ECFP bits in test molecules that strongly influence predictions.
- Prediction on New Data: Scores unlabeled or labeled molecules, outputs probabilities and predictions, and computes metrics if labels are available.
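The labeling and stratified-split steps listed above can be illustrated with a minimal, library-free sketch. It does not use DrugAutoML's own API, which this page does not show. The function names `label_by_pchembl` and `stratified_split`, the cutoff of 6.0, the 80/20 ratio, and the toy records are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def label_by_pchembl(records, cutoff=6.0):
    """Label each (smiles, pchembl) record as active (1) or inactive (0).

    The cutoff of 6.0 is an illustrative assumption, not a known
    DrugAutoML default.
    """
    return [(smiles, 1 if pchembl >= cutoff else 0) for smiles, pchembl in records]


def stratified_split(labeled, test_fraction=0.2, seed=0):
    """Split (smiles, label) pairs so each class keeps roughly the same share."""
    rng = random.Random(seed)

    # Group the records by class label.
    by_class = defaultdict(list)
    for item in labeled:
        by_class[item[1]].append(item)

    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        # Take at least one test item per class; a class with a single
        # member therefore ends up only in the test set.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Hypothetical toy data: (SMILES, pChEMBL value)
records = [
    ("CCO", 4.2), ("c1ccccc1", 7.1), ("CC(=O)O", 5.0), ("CCN", 6.5),
    ("CCCC", 3.9), ("c1ccncc1", 8.0), ("CCOC", 5.5), ("CC(C)O", 6.0),
    ("CCCl", 4.8), ("c1ccsc1", 7.4),
]
labeled = label_by_pchembl(records)
train, test = stratified_split(labeled, test_fraction=0.2, seed=42)
print(len(train), len(test))  # prints: 8 2
```

With five actives and five inactives in the toy data, each class contributes one molecule to the test set, so class proportions match between the two sets. A scaffold split would differ in one respect: molecules would first be grouped by core scaffold (for example with RDKit's `MurckoScaffold` module) and whole groups would be assigned to one side, so that structurally similar compounds do not appear in both sets.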
File details
Details for the file drugautoml-0.1.1-py3-none-any.whl.
File metadata
- Download URL: drugautoml-0.1.1-py3-none-any.whl
- Upload date:
- Size: 34.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ec82ad75e2e2e22707ad93f880898c9e2e11510fd3c8c55a4b725ee906c86ca |
| MD5 | 2cdc1712a84e0b58ba739e61ad4e85e0 |
| BLAKE2b-256 | 046efaf44e8e1eb51907ac6fb5e18d21daf5534d987bde1eed2e34118c69a560 |