MolALKit: A Toolkit for Active Learning in Molecular Data.

MolALKit is a toolkit for active learning on molecular data.

Installation

Check the GPU and CUDA requirements of mgktools, which provides the marginalized graph kernel model. Installation without CUDA is not supported.

Python 3.12, CUDA 12.4, and GCC 11.2 are recommended.

Minimum installation

pip install git+https://gitlab.com/Xiangyan93/graphdot.git@v0.8.2 molalkit

Support Chemprop

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install git+https://github.com/Xiangyan93/chemprop4molalkit.git@v0.0.0

Support GraphGPS

pip install torch-scatter torch-sparse torch-geometric pytorch-lightning yacs torchmetrics performer-pytorch ogb git+https://github.com/Xiangyan93/graphgps4molalkit.git@v0.0.0 -f https://data.pyg.org/whl/torch-2.6.0+cu124.html

Support MolFormer

pip install transformers pytorch-fast-transformers git+https://github.com/Xiangyan93/molformer4molalkit.git@v0.0.0
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation git+https://github.com/NVIDIA/apex.git

Then download the pretrained model from https://github.com/IBM/molformer.

QuickStart

A GPU is required for the graph kernel. Setting up the environment and running the demo takes about 10 minutes.

Google Colab notebook.

Data

MolALKit currently supports active learning only on single-task datasets. The task can be either classification or regression.

Custom Dataset

The data file must be in CSV format with a header row, structured as follows:

smiles,p_np
[Cl].CC(C)NCC(O)COc1cccc2ccccc12,1
C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl,1
...

The following arguments are required to run active learning on a custom dataset:

--data_path <dataset.csv> --smiles_columns <smiles> --targets_columns <target> --task_type <classification/regression>
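
Before launching a run, it can help to confirm that the CSV header contains the column names you plan to pass as --smiles_columns and --targets_columns. The following is a minimal sketch using only the Python standard library. The `check_columns` helper is hypothetical and is not part of MolALKit; it only checks the header and looks for empty cells.

```python
import csv
import io


def check_columns(csv_file, smiles_column, target_column):
    """Check the header for both columns and return the number of data rows.

    Hypothetical helper for illustration; not part of MolALKit.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in (smiles_column, target_column) if c not in header]
    if missing:
        raise ValueError(f"missing columns in header: {missing}")
    n_rows = 0
    for row in reader:
        # A short or blank row yields None or "" for the affected cells.
        if not row[smiles_column] or row[target_column] in ("", None):
            raise ValueError(f"empty value in data row {n_rows + 1}")
        n_rows += 1
    return n_rows


# The two example rows shown above, held in memory instead of a file.
example = io.StringIO(
    "smiles,p_np\n"
    "[Cl].CC(C)NCC(O)COc1cccc2ccccc12,1\n"
    "C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl,1\n"
)
n = check_columns(example, "smiles", "p_np")
print(n)
```

For a file on disk, open it with `open(path, newline="")` and pass the file object in place of the in-memory example.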

Public Dataset

The toolkit bundles several popular public datasets, including ones from MoleculeNet and TDC, which can be used directly with --data_public <dataset name>.

Here is the list of available datasets:

from molalkit.data.datasets import AVAILABLE_DATASETS
print(AVAILABLE_DATASETS)

ActiveLearning/Validation Split

Our code supports several methods of splitting data into an active learning set and a validation set. The active learning set is used for active learning, and the validation set is used to evaluate the performance of the active learning model.

random: The data is split randomly.
scaffold_order: The data is split by molecular scaffold, so that the same scaffold never appears in both the active learning set and the validation set. The scaffold containing the most molecules is placed in the active learning set. This method aligns with the implementation in DeepChem and is independent of the random seed.
scaffold_random: Scaffolds are assigned at random to either the active learning set or the validation set. The resulting split depends on the random seed.

The following arguments are required for data split:

--split_type <random/scaffold_order/scaffold_random> --split_sizes <active learning set ratio> <validation set ratio> --seed <random seed>
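
The difference between the two scaffold-based splits can be illustrated with a small sketch. It assumes the scaffold of each molecule has already been computed and grouped, and it is not MolALKit's or DeepChem's actual implementation. The `scaffold_split` function, its parameters, and the example groups are all hypothetical.

```python
import random


def scaffold_split(scaffold_to_indices, al_ratio, seed=None):
    """Split molecule indices by scaffold group (illustrative sketch only).

    seed=None -> groups visited largest first, independent of any seed
                 (in the spirit of scaffold_order)
    seed=int  -> groups visited in a seeded random order
                 (in the spirit of scaffold_random)
    """
    groups = list(scaffold_to_indices.values())
    n_total = sum(len(g) for g in groups)
    if seed is None:
        groups.sort(key=len, reverse=True)
    else:
        random.Random(seed).shuffle(groups)
    al_set, val_set = [], []
    for group in groups:
        # A whole scaffold group goes to one side, so no scaffold is shared.
        if len(al_set) + len(group) <= al_ratio * n_total:
            al_set.extend(group)
        else:
            val_set.extend(group)
    return al_set, val_set


# Hypothetical scaffold groups: scaffold SMILES -> indices of its molecules.
groups = {
    "c1ccccc1": [0, 1, 2, 3],
    "c1ccncc1": [4, 5],
    "C1CCCCC1": [6, 7],
    "c1ccoc1": [8],
    "c1ccsc1": [9],
}
al_idx, val_idx = scaffold_split(groups, al_ratio=0.5)
print(sorted(al_idx), sorted(val_idx))
```

In this sketch the largest group always lands in the active learning set when no seed is given, while passing a seed changes which groups end up on each side. In both cases a scaffold is never divided between the two sets.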

Machine Learning Model

Each machine learning model used in this package is described by a JSON config file. Here is the list of built-in machine learning models:

from molalkit.models.configs import AVAILABLE_MODELS
print(AVAILABLE_MODELS)

The model config files are located in molalkit/models/configs. The following argument is required to choose a machine learning model:

--model_configs <model_config_file>

First Example

Here's an example of running active learning using MolALKit with the BACE dataset, a 50:50 scaffold split, and Random Forest as the machine learning model:

molalkit_run --data_public bace --metrics roc_auc mcc accuracy precision recall f1_score --model_configs RandomForest_Morgan_Config --split_type scaffold_order --split_sizes 0.5 0.5 --evaluate_stride 10 --seed 0 --save_dir bace --init_size 2 --select_method explorative --s_batch_size 1 --max_iter 100
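
When scripting several runs (for example, over different seeds), it can be convenient to assemble the command from Python. The sketch below rebuilds the command shown above as an argument list and prints it. It uses only the flags that appear in that command, and the `run_args` dictionary is just one illustrative way to organize them. The command is printed rather than executed.

```python
import shlex

# Flags and values copied from the example command above.
run_args = {
    "--data_public": ["bace"],
    "--metrics": ["roc_auc", "mcc", "accuracy", "precision", "recall", "f1_score"],
    "--model_configs": ["RandomForest_Morgan_Config"],
    "--split_type": ["scaffold_order"],
    "--split_sizes": ["0.5", "0.5"],
    "--evaluate_stride": ["10"],
    "--seed": ["0"],
    "--save_dir": ["bace"],
    "--init_size": ["2"],
    "--select_method": ["explorative"],
    "--s_batch_size": ["1"],
    "--max_iter": ["100"],
}

command = ["molalkit_run"]
for flag, values in run_args.items():
    command.append(flag)
    command.extend(values)

command_line = shlex.join(command)
print(command_line)
```

To launch the run from Python, pass `command` to `subprocess.run(command, check=True)` in an environment where MolALKit is installed.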

Release history

1.1.0: Mar 9, 2026
1.0.1: Feb 23, 2026 (this version)
1.0.0: Feb 7, 2026
0.10.1: Nov 3, 2024
0.10.0: Oct 31, 2024
0.9.1: Mar 27, 2024
0.9.0: Mar 24, 2024
0.8.0: Mar 4, 2024
0.7.0: Jan 14, 2024
0.6.3: Jan 10, 2024
0.6.2: Dec 26, 2023
0.6.1: Dec 20, 2023
0.6.0: Dec 15, 2023
0.5.2: Dec 14, 2023
0.5.1: Dec 11, 2023
0.5.0: Dec 11, 2023
0.4.0: Dec 9, 2023
0.3.0: Dec 7, 2023
0.2.0: Dec 2, 2023
0.1.0: Nov 30, 2023
0.0.1: Nov 24, 2023
0.0.0: Nov 20, 2023

Download files

Source Distribution

molalkit-1.0.1.tar.gz (2.6 MB)

Uploaded Feb 23, 2026 Source

Built Distribution

molalkit-1.0.1-py3-none-any.whl (2.7 MB)

Uploaded Feb 23, 2026 Python 3

File details

Details for the file molalkit-1.0.1.tar.gz.

File metadata

Download URL: molalkit-1.0.1.tar.gz
Upload date: Feb 23, 2026
Size: 2.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for molalkit-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`68739db0a32c3b1cdbdadb3c0b136c5653c4027177f56dfae6394de74ba5f9f9`
MD5	`3194e992327dcf497aae617526e181a4`
BLAKE2b-256	`f39839b4913b76886efba74143a77e39d8f91eeec52b65d5638044d0929dbaf6`

File details

Details for the file molalkit-1.0.1-py3-none-any.whl.

File metadata

Download URL: molalkit-1.0.1-py3-none-any.whl
Upload date: Feb 23, 2026
Size: 2.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for molalkit-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3da7b5164ddcee76b9d5f43e0b264b099e3b8077fada2c738c1fc08a2781409b`
MD5	`fa851a18579b1c8860d82f5d36e3a319`
BLAKE2b-256	`6f634167bdf1a9f2be4bb2981927a1f42b559b101011fa4f916f79336fb6ca9a`
