Automated Machine Learning pipeline for bioactivity prediction using molecular fingerprints

DrugAutoML

DrugAutoML is an end-to-end Automated Machine Learning (AutoML) pipeline for bioactivity prediction in drug discovery.
It automates every stage, from reading raw datasets to generating predictions for new molecules, and produces both high-performance models and explainable outputs.


🚀 Features

  • Data Preprocessing
    Reads raw datasets, cleans and standardizes SMILES, removes invalid molecules, and labels compounds as active or inactive based on pChEMBL cutoffs or existing binary labels.

  • Molecular Featurization
    Generates ECFP (Extended-Connectivity Fingerprints) with RDKit, with a customizable radius and bit size and support for count-based features.

  • Data Splitting
    Splits data into training and testing sets using:

    • Scaffold Split (structure-aware)
    • Stratified Random Split (class-proportion preserving)

  • Model Selection
    Hyperparameter optimization with Optuna for Random Forest, Extra Trees, Logistic Regression, Linear SVC, XGBoost, and LightGBM.
    Uses repeated stratified k-fold CV and produces a ranked leaderboard.

  • Model Finalization
    Trains the best model, applies probability calibration, selects an optimal classification threshold, evaluates on the test set, and saves the model.

  • Explainability

    • SHAP global importance plots (beeswarm, bar, signed bar)
    • Bit Gallery visualizations that highlight the ECFP bits in test molecules that most strongly influence predictions.
  • Prediction on New Data
    Scores unlabeled or labeled molecules, outputs probabilities and predictions, and computes metrics if labels are available.
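
The activity labeling described under Data Preprocessing can be illustrated with a minimal sketch. It is not DrugAutoML's own code: the function name `label_by_pchembl`, the record layout, and the default cutoffs (6.5 and 5.5) are hypothetical, and dropping compounds that fall between the two cutoffs is an assumption about how a two-cutoff scheme might work.

```python
from typing import Optional


def label_by_pchembl(
    pchembl: float,
    active_cutoff: float = 6.5,    # hypothetical default, not from the package
    inactive_cutoff: float = 5.5,  # hypothetical default, not from the package
) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None for values between the cutoffs."""
    if inactive_cutoff > active_cutoff:
        raise ValueError("inactive_cutoff must not exceed active_cutoff")
    if pchembl >= active_cutoff:
        return 1
    if pchembl <= inactive_cutoff:
        return 0
    return None  # ambiguous potency; assumed to be excluded from training


# Toy records; the SMILES strings are placeholders, not real bioactivity data.
records = [
    {"smiles": "CCO", "pchembl": 7.2},
    {"smiles": "c1ccccc1", "pchembl": 4.8},
    {"smiles": "CC(=O)O", "pchembl": 6.0},
]

labeled = []
for record in records:
    label = label_by_pchembl(record["pchembl"])
    if label is not None:
        labeled.append({**record, "label": label})

for row in labeled:
    print(row["smiles"], row["label"])
```

Setting both cutoffs to the same value turns this into a single-threshold rule with no excluded compounds.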

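The threshold selection mentioned under Model Finalization can be sketched in a similar way. The description above does not say which criterion DrugAutoML optimizes, so this example assumes Youden's J statistic (sensitivity + specificity - 1) purely for illustration; the function names and the toy probabilities are hypothetical.

```python
from typing import Sequence, Tuple


def youden_j(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Youden's J = sensitivity + specificity - 1 at the given threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0


def select_threshold(y_true: Sequence[int], y_prob: Sequence[float]) -> Tuple[float, float]:
    """Scan each distinct predicted probability and keep the one with the highest J."""
    best_threshold, best_j = 0.5, float("-inf")
    for candidate in sorted(set(y_prob)):
        j = youden_j(y_true, y_prob, candidate)
        if j > best_j:  # strict comparison: ties keep the lower threshold
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


# Toy calibrated probabilities for seven validation molecules (made-up values).
y_true = [0, 0, 1, 0, 1, 1, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90]

threshold, j = select_threshold(y_true, y_prob)
print(f"threshold={threshold:.2f}, J={j:.2f}")
```

In practice the threshold would be chosen on held-out or cross-validated predictions rather than on the test set, so that the reported test metrics stay unbiased.
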
Download files

Source Distribution

drugautoml-0.1.2.tar.gz (28.2 kB, Source)

Built Distribution

drugautoml-0.1.2-py3-none-any.whl (34.3 kB, Python 3)
