A library to quickly build QSAR models

Ersilia's LazyQSAR

A Python library for building supervised binary QSAR (Quantitative Structure-Activity Relationship) models quickly, with minimal configuration. LazyQSAR automates descriptor computation, feature selection, and hyperparameter tuning to produce robust ensemble models from chemical structures.

Two usage modes:

  • SMILES-based: pass molecule SMILES strings directly; descriptors are computed automatically
  • Descriptor-agnostic: bring your own pre-computed descriptor arrays or HDF5 files

Installation

Install LazyQSAR from source:

git clone https://github.com/ersilia-os/lazy-qsar.git
cd lazy-qsar
python -m pip install -e .

To use the built-in LazyQSAR descriptors, install the optional dependencies:

python -m pip install -e .[descriptors]

This enables descriptor (featurizer) calculation. The first time you run LazyQSAR with deep-learning descriptors, it will download the Chemeleon and CDDD model checkpoints. To complete this setup in advance, run:

lazyqsar-setup

Use as a Python API

Binary Classification

LazyQSAR's binary classifier can run either with built-in descriptors (takes SMILES as input) or with custom pre-computed descriptors.

Built-in descriptors

Instantiate LazyBinaryQSAR with the mode of your choice:

| Mode | Descriptors used | Speed |
|------|------------------|-------|
| `fast` | RDKit, Morgan fingerprints | Fastest, no deep-learning descriptors |
| `default` | Chemeleon, RDKit, CDDD | Balanced |
| `slow` | Chemeleon, Morgan, RDKit, CDDD | Most thorough |

from lazyqsar.qsar import LazyBinaryQSAR

model = LazyBinaryQSAR(mode="default")
model.fit(smiles_list=smiles_train, y=y_train)
y_hat = model.predict_proba(smiles_list=smiles_test)[:, 1]
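Since `predict_proba(...)[:, 1]` returns probabilities for the positive class, a common follow-up is thresholding them into hard 0/1 labels. A minimal sketch (the 0.5 cutoff is an illustrative assumption, not a library default):

```python
import numpy as np

# Probabilities for the positive class, e.g. from predict_proba(...)[:, 1]
y_hat = np.array([0.12, 0.87, 0.55, 0.31])

# Convert to hard 0/1 labels with a 0.5 cutoff (tune per use case)
y_pred = (y_hat >= 0.5).astype(int)
print(y_pred.tolist())  # [0, 1, 1, 0]
```

In practice, the cutoff can be chosen on a validation set to balance sensitivity and specificity for the task at hand.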

Custom descriptors

Pre-calculate your own descriptors and pass them directly. We recommend the Ersilia Model Hub for this, since its .h5 output format is supported natively. Alternatively, pass descriptors as a NumPy array.

from lazyqsar.agnostic import LazyBinaryClassifier

# From a NumPy array
model = LazyBinaryClassifier(mode="default")
model.fit(X=X_train, y=y_train)
y_hat = model.predict_proba(X=X_test)[:, 1]

# From an Ersilia .h5 file
model.fit(h5_file="descriptors.h5", y=y_train)
y_hat = model.predict_proba(h5_file="descriptors.h5")[:, 1]

Saving and loading models

Models are saved as ONNX files by default, so inference only requires the ONNX runtime (no scikit-learn dependency at prediction time).

# Save after training
model.save(model_dir)

# Load for inference (auto-detects ONNX or raw format)
from lazyqsar.agnostic import LazyBinaryClassifier

model = LazyBinaryClassifier.load(model_dir)
y_hat = model.predict_proba(X=X)[:, 1]

You can also save and load as a .zip archive:

model.save("my_model.zip")
model = LazyBinaryClassifier.load("my_model.zip")

The same save/load interface applies to LazyBinaryQSAR:

from lazyqsar.qsar import LazyBinaryQSAR

model = LazyBinaryQSAR(mode="default")
model.fit(smiles_list=smiles_train, y=y_train)
model.save(model_dir)

model = LazyBinaryQSAR.load(model_dir)
y_hat = model.predict_proba(smiles_list=smiles_test)[:, 1]

Tests and benchmarks

Quick testing

The tests/ folder contains scripts for quickly verifying that the code works. The Bioavailability dataset is used as an example.

python tests/test_binary_classification.py
python tests/test_binary_classification.py --agnostic

Additional flags:

| Flag | Description |
|------|-------------|
| `--mode {fast,default,slow}` | Select descriptor mode |
| `--agnostic` | Use the descriptor-agnostic LazyBinaryClassifier |
| `--no-onnx` | Skip ONNX conversion |
| `--no-zip` | Skip ZIP archive save/load |
| `--clean` | Remove temporary files after the run |

Benchmarking

The benchmark repository contains performance results for the default estimators and descriptors on the TDCommons ADMET dataset.

Use as a CLI

The CLI expects a data_dir containing one CSV file per task. Each CSV must have SMILES in the first column and binary labels (0/1) in the second column, with a header row.
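Assuming that layout, a minimal one-task training CSV might be generated like this (the column names are arbitrary since only the positions matter; the molecules and labels are illustrative):

```python
import csv

# Minimal example task CSV: SMILES in column 1, binary label in column 2
rows = [
    ("smiles", "activity"),          # header row (names are arbitrary)
    ("CCO", "0"),                    # ethanol, labeled inactive
    ("c1ccccc1O", "1"),              # phenol, labeled active
    ("CC(=O)Oc1ccccc1C(=O)O", "1"),  # aspirin, labeled active
]
with open("task1.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Placing one such file per task inside `$DATA_DIR` is all the fit command needs.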

Fit:

lazyqsar-binary-fit --data_dir $DATA_DIR --model_dir $MODEL_DIR --mode default

Optionally, pass a --models_txt file listing which tasks (CSV filenames without extension) to train, one per line. Without it, all CSVs in the directory are used.

lazyqsar-binary-fit --data_dir $DATA_DIR --model_dir $MODEL_DIR --models_txt models.txt

Predict:

lazyqsar-binary-predict --input_csv $INPUT_CSV --model_dir $MODEL_DIR --output_csv $OUTPUT_CSV

The output CSV contains the input SMILES and one predicted probability column per task. Optionally use --models_txt to run predictions only for a subset of tasks.
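Given that structure, the predictions can be consumed with any CSV reader. A sketch, assuming hypothetical task names `task1` and `task2` (the actual column names follow your task CSV filenames):

```python
import csv
import io

# Illustrative output: input SMILES plus one probability column per task
output_csv = io.StringIO(
    "smiles,task1,task2\n"
    "CCO,0.12,0.45\n"
    "c1ccccc1O,0.88,0.61\n"
)
reader = csv.DictReader(output_csv)
probs = {row["smiles"]: float(row["task1"]) for row in reader}
print(probs["c1ccccc1O"])  # 0.88
```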

How It Works

LazyQSAR builds a weighted ensemble of up to 8 model variants per descriptor set:

  1. Preprocessing — missing value imputation, variance filtering, and scaling (StandardScaler for dense data, TF-IDF for sparse fingerprints)
  2. Feature selection — univariate F-test (fs) and RandomForest-based (mfs) selection pipelines run in parallel, producing two reduced feature sets
  3. Latent variables — optional SparseRandomProjection for dimensionality reduction, with the number of components chosen by PCA explained-variance heuristics
  4. Classifiers — Logistic Regression, Linear SVM, Extra Trees, and MLP (PyTorch); each head is tuned over a small fixed grid of hyperparameter configurations using stratified cross-validation
  5. Ensemble — predictions are averaged with weights derived from each head's cross-validation ROC-AUC score, with shrinkage toward uniform weights at small sample sizes

The active set of heads is selected automatically based on dataset size and feature dimensionality. All components are exported to ONNX for lightweight, dependency-free inference.
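The AUC-weighted averaging with shrinkage in step 5 can be sketched as follows. This is an illustrative reconstruction, not LazyQSAR's exact implementation: the shrinkage formula and the `n0` constant are assumptions.

```python
import numpy as np

def ensemble_predict(probas, aucs, n_samples, n0=100):
    """Average per-head probabilities, weighted by CV ROC-AUC,
    shrinking toward uniform weights for small datasets."""
    probas = np.asarray(probas, dtype=float)  # shape: (n_heads, n_mols)
    # Raw weights: how far each head beats a random classifier (AUC 0.5)
    raw = np.clip(np.asarray(aucs, dtype=float) - 0.5, 1e-6, None)
    raw /= raw.sum()
    # Shrinkage factor in [0, 1): small n -> weights closer to uniform
    lam = n_samples / (n_samples + n0)
    uniform = np.full(len(raw), 1.0 / len(raw))
    weights = lam * raw + (1 - lam) * uniform
    return weights @ probas

# Two heads (one with a better CV AUC) scoring three molecules
p = ensemble_predict(
    probas=[[0.2, 0.9, 0.5], [0.4, 0.7, 0.6]],
    aucs=[0.90, 0.70],
    n_samples=400,
)
print(np.round(p, 3))
```

The design intuition is that at small sample sizes the cross-validation AUC estimates are noisy, so trusting them fully would overweight heads that got lucky; blending toward uniform weights hedges against that.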

Use in an Ersilia Model Hub template

LazyQSAR models can be used inside an Ersilia Model Hub template structure. See eos1lb5 for an example.

Given a checkpoints folder with the following structure:

checkpoints/
├── task1/
│   ├── cddd/
│   │   ├── featurizer.json
│   │   └── model.onnx
│   ├── chemeleon/
│   │   ├── featurizer.json
│   │   └── model.onnx
│   └── rdkit/
│       ├── featurizer.json
│       └── model.onnx
└── task2/
    ├── cddd/
    ├── chemeleon/
    └── rdkit/

The code/main.py script should look like this:

import os
import sys
import csv

from lazyqsar.api.binary_qsar_predict import predict

root = os.path.dirname(os.path.abspath(__file__))
checkpoints_dir = os.path.abspath(os.path.join(root, "..", "checkpoints"))

input_file = sys.argv[1]
output_file = sys.argv[2]

predict(model_dir=checkpoints_dir, input_csv=input_file, output_csv=output_file)

Note that in this setup the output columns are ordered alphabetically by task name. For finer control over column ordering, see the eos1lb5 repository for an example.

Disclaimer

This library is intended for quick QSAR modeling. For a more complete automated QSAR pipeline, refer to Zaira Chem.

About Us

Learn about the Ersilia Open Source Initiative!
