# Ersilia's LazyQSAR

A library to quickly build QSAR models.
A Python library for building supervised binary QSAR (Quantitative Structure-Activity Relationship) models quickly, with minimal configuration. LazyQSAR automates descriptor computation, feature selection, and hyperparameter tuning to produce robust ensemble models from chemical structures.
Two usage modes:
- SMILES-based: pass molecule SMILES strings directly; descriptors are computed automatically
- Descriptor-agnostic: bring your own pre-computed descriptor arrays or HDF5 files
## Installation
Install LazyQSAR from source:
```bash
git clone https://github.com/ersilia-os/lazy-qsar.git
cd lazy-qsar
python -m pip install -e .
```
To use the built-in LazyQSAR descriptors, install the optional dependencies:
```bash
python -m pip install -e ".[descriptors]"
```
This enables descriptor (featurizer) calculation. The first time you run LazyQSAR with deep-learning descriptors, it will download the Chemeleon and CDDD model checkpoints. To complete this setup in advance, run:
```bash
lazyqsar-setup
```
## Use as a Python API

### Binary Classification
LazyQSAR's binary classifier can run either with built-in descriptors (takes SMILES as input) or with custom pre-computed descriptors.
#### Built-in descriptors
Instantiate `LazyBinaryQSAR` with the mode of your choice:

| Mode | Descriptors used | Speed |
|---|---|---|
| `fast` | RDKit, Morgan fingerprints | Fastest, no deep-learning descriptors |
| `default` | Chemeleon, RDKit, CDDD | Balanced |
| `slow` | Chemeleon, Morgan, RDKit, CDDD | Most thorough |
```python
from lazyqsar.qsar import LazyBinaryQSAR

model = LazyBinaryQSAR(mode="default")
model.fit(smiles_list=smiles_train, y=y_train)
y_hat = model.predict_proba(smiles_list=smiles_test)[:, 1]
```
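Once fitted, the returned probabilities can be scored like any other binary prediction. A minimal sketch, assuming held-out labels are available (the arrays below are made-up stand-ins for `y_test` and the `y_hat` returned by `predict_proba`):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up stand-ins for held-out labels and predicted probabilities
y_test = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0.1, 0.7, 0.8, 0.9, 0.6, 0.3])

# ROC-AUC is the metric LazyQSAR itself uses to weight ensemble heads
auc = roc_auc_score(y_test, y_hat)
print(f"ROC-AUC: {auc:.3f}")
```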
#### Custom descriptors
Pre-calculate your own descriptors and pass them directly. We recommend the Ersilia Model Hub for this — its .h5 output format is supported natively. Alternatively, pass descriptors as a NumPy array.
```python
from lazyqsar.agnostic import LazyBinaryClassifier

# From a NumPy array
model = LazyBinaryClassifier(mode="default")
model.fit(X=X_train, y=y_train)
y_hat = model.predict_proba(X=X_test)[:, 1]

# From an Ersilia .h5 file
model.fit(h5_file="descriptors.h5", y=y_train)
y_hat = model.predict_proba(h5_file="descriptors.h5")[:, 1]
```
### Saving and loading models
Models are saved as ONNX files by default, so inference only requires the ONNX runtime (no scikit-learn dependency at prediction time).
```python
# Save after training
model.save(model_dir)

# Load for inference (auto-detects ONNX or raw format)
from lazyqsar.agnostic import LazyBinaryClassifier

model = LazyBinaryClassifier.load(model_dir)
y_hat = model.predict_proba(X=X)[:, 1]
```
You can also save and load as a .zip archive:
model.save("my_model.zip")
model = LazyBinaryClassifier.load("my_model.zip")
The same save/load interface applies to `LazyBinaryQSAR`:

```python
from lazyqsar.qsar import LazyBinaryQSAR

model = LazyBinaryQSAR(mode="default")
model.fit(smiles_list=smiles_train, y=y_train)
model.save(model_dir)

model = LazyBinaryQSAR.load(model_dir)
y_hat = model.predict_proba(smiles_list=smiles_test)[:, 1]
```
## Tests and benchmarks

### Quick testing
The `tests/` folder contains scripts for quickly verifying that the code works. The Bioavailability dataset is used as an example.
```bash
python tests/test_binary_classification.py
python tests/test_binary_classification.py --agnostic
```
Additional flags:
| Flag | Description |
|---|---|
| `--mode fast\|default\|slow` | Select descriptor mode |
| `--agnostic` | Use descriptor-agnostic `LazyBinaryClassifier` |
| `--no-onnx` | Skip ONNX conversion |
| `--no-zip` | Skip ZIP archive save/load |
| `--clean` | Remove temporary files after the run |
### Benchmarking
The benchmark repository contains performance results for the default estimators and descriptors on the TDCommons ADMET dataset.
## Use as a CLI
The CLI expects a data_dir containing one CSV file per task. Each CSV must have SMILES in the first column and binary labels (0/1) in the second column, with a header row.
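For reference, a task CSV in the expected layout can be generated like this. The file name `herg.csv` and the column names are hypothetical; only the header row and the column order matter:

```python
import csv

# One CSV per task: header row, SMILES first, binary label second.
# "herg.csv" and the column names are hypothetical examples.
rows = [
    ("smiles", "activity"),
    ("CCO", 0),
    ("c1ccccc1", 1),
]
with open("herg.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```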
Fit:
```bash
lazyqsar-binary-fit --data_dir $DATA_DIR --model_dir $MODEL_DIR --mode default
```
Optionally, pass a --models_txt file listing which tasks (CSV filenames without extension) to train, one per line. Without it, all CSVs in the directory are used.
```bash
lazyqsar-binary-fit --data_dir $DATA_DIR --model_dir $MODEL_DIR --models_txt models.txt
```
Predict:
```bash
lazyqsar-binary-predict --input_csv $INPUT_CSV --model_dir $MODEL_DIR --output_csv $OUTPUT_CSV
```
The output CSV contains the input SMILES and one predicted probability column per task. Optionally use --models_txt to run predictions only for a subset of tasks.
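Downstream code can consume the output with any CSV reader. A sketch of turning per-task probabilities into binary calls at a 0.5 threshold (the column names `task1` and `task2` are hypothetical):

```python
import csv
import io

# Hypothetical prediction output: input SMILES plus one probability
# column per task, in the shape produced by lazyqsar-binary-predict
output_csv = io.StringIO(
    "smiles,task1,task2\n"
    "CCO,0.12,0.91\n"
    "c1ccccc1,0.77,0.05\n"
)
reader = csv.DictReader(output_csv)
calls = {
    row["smiles"]: {t: float(row[t]) >= 0.5 for t in ("task1", "task2")}
    for row in reader
}
```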
## How It Works
LazyQSAR builds a weighted ensemble of up to 8 model variants per descriptor set:
- Preprocessing — missing value imputation, variance filtering, and scaling (StandardScaler for dense data, TF-IDF for sparse fingerprints)
- Feature selection — univariate F-test (`fs`) and RandomForest-based (`mfs`) selection pipelines run in parallel, producing two reduced feature sets
- Latent variables — optional SparseRandomProjection for dimensionality reduction, with the number of components chosen by PCA explained-variance heuristics
- Classifiers — Logistic Regression, Linear SVM, Extra Trees, and MLP (PyTorch); each head is tuned over a small fixed grid of hyperparameter configurations using stratified cross-validation
- Ensemble — predictions are averaged with weights derived from each head's cross-validation ROC-AUC score, with shrinkage toward uniform weights at small sample sizes
The active set of heads is selected automatically based on dataset size and feature dimensionality. All components are exported to ONNX for lightweight, dependency-free inference.
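The weighting step can be pictured with a small sketch. The shrinkage formula below is an illustrative assumption, not LazyQSAR's exact internal scheme:

```python
import numpy as np

def ensemble_predict(probas, aucs, n_samples, shrink_n=200):
    """Average per-head probabilities with AUC-derived weights, shrunk
    toward uniform weights for small datasets (illustrative formula)."""
    probas = np.asarray(probas, dtype=float)  # shape: (n_heads, n_preds)
    raw = np.clip(np.asarray(aucs) - 0.5, 0.0, None)  # AUC above chance
    if raw.sum() == 0:
        raw = np.ones_like(raw)
    w = raw / raw.sum()
    uniform = np.full_like(w, 1.0 / len(w))
    lam = min(1.0, n_samples / shrink_n)  # 0 -> uniform, 1 -> AUC weights
    w = lam * w + (1 - lam) * uniform
    return w @ probas

# Three heads, two test molecules
probas = [[0.9, 0.2], [0.7, 0.4], [0.6, 0.5]]
aucs = [0.9, 0.8, 0.6]
p = ensemble_predict(probas, aucs, n_samples=100)
```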
## Use in an Ersilia Model Hub template
LazyQSAR models can be used inside an Ersilia Model Hub template structure. See eos1lb5 for an example.
Given a checkpoints folder with the following structure:
```text
checkpoints/
├── task1/
│   ├── cddd/
│   │   ├── featurizer.json
│   │   └── model.onnx
│   ├── chemeleon/
│   │   ├── featurizer.json
│   │   └── model.onnx
│   └── rdkit/
│       ├── featurizer.json
│       └── model.onnx
└── task2/
    ├── cddd/
    ├── chemeleon/
    └── rdkit/
```
The `code/main.py` script should look like this:
```python
import os
import sys

from lazyqsar.api.binary_qsar_predict import predict

root = os.path.dirname(os.path.abspath(__file__))
checkpoints_dir = os.path.abspath(os.path.join(root, "..", "checkpoints"))

input_file = sys.argv[1]
output_file = sys.argv[2]

predict(model_dir=checkpoints_dir, input_csv=input_file, output_csv=output_file)
```
Note that in this setup the output columns are ordered alphabetically by task name. For finer control over column ordering, see the eos1lb5 repository for an example.
## Disclaimer
This library is intended for quick QSAR modeling. For a more complete automated QSAR pipeline, refer to Zaira Chem.
## About Us
Learn about the Ersilia Open Source Initiative!