The official implementation of the WelQrate dataset and benchmark
WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery
Installation
We provide the recommended environment, which was used for benchmarking in the original paper. Users can also build their own environments to suit their needs.
conda create -n welqrate python=3.9
conda activate welqrate
pip install welqrate
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
Load the Dataset
Users can download and preprocess the datasets by instantiating the `WelQrateDataset` class. Available datasets include AID1798, AID435008, AID435034, AID1843, AID2258, AID463087, AID488997, AID2689, and AID485290; please refer to our website for more details. In addition, users can choose between 2D and 3D molecular representations by setting `mol_repr` to `2d_graph` or `3d_graph`.
from welqrate.dataset import WelQrateDataset
# Load the 2D dataset
AID1798_dataset_2d = WelQrateDataset(dataset_name='AID1798', root='./datasets', mol_repr='2d_graph')
# Load the 3D dataset
AID1843_dataset_3d = WelQrateDataset(dataset_name='AID1843', root='./datasets', mol_repr='3d_graph')
# Load a split dictionary
split_dict = AID1798_dataset_2d.get_idx_split(split_scheme='random_cv1')  # or 'scaffold_seed1'; we provide 1-5 for both random_cv and scaffold_seed
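The split dictionary maps each split name to the dataset indices belonging to that split. A minimal self-contained sketch of that structure (the indices here are made up for illustration, not the real WelQrate splits):

```python
# Toy illustration of the structure returned by get_idx_split();
# the real index lists come from the WelQrate random_cv/scaffold splits.
split_dict = {
    "train": [0, 1, 2, 3, 4, 5],
    "valid": [6, 7],
    "test": [8, 9],
}

# The three splits partition the dataset: every index appears exactly once.
all_idx = split_dict["train"] + split_dict["valid"] + split_dict["test"]
assert len(all_idx) == len(set(all_idx)), "splits must be disjoint"
print(sorted(all_idx))  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Indexing the dataset with one of these lists (e.g. `dataset[split_dict['train']]`) yields the corresponding subset, which is what the data loaders consume in the training example below.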
Train a model
We can store hyperparameters related to the model, training scheme, and dataset in a configuration file. Users can refer to the configuration files in `./config/` for different models. Then, we can configure the model and start training by calling the `train` function.
import yaml
import torch

from welqrate.dataset import WelQrateDataset
# get_train_loader/get_valid_loader/get_test_loader, GCN_Model, and train
# are also imported from the welqrate package

dataset_name = 'AID1798'
split_scheme = 'random_cv1'
AID1798_2d = WelQrateDataset(dataset_name=dataset_name, root='./datasets', mol_repr='2d_graph',
                             source='inchi')
split_dict = AID1798_2d.get_idx_split(split_scheme)
train_loader = get_train_loader(AID1798_2d[split_dict['train']], batch_size=128, num_workers=0, seed=1)
valid_loader = get_valid_loader(AID1798_2d[split_dict['valid']], batch_size=128, num_workers=0)
test_loader = get_test_loader(AID1798_2d[split_dict['test']], batch_size=128, num_workers=0)

config = {}
# merge the default train config with the model-specific config
for config_file in ['./config/train.yaml', './config/gcn.yaml']:
    with open(config_file) as file:
        config.update(yaml.safe_load(file))
# initialize model
hidden_channels = config['model']['hidden_channels']
num_layers = config['model']['num_layers']
model = GCN_Model(hidden_channels = hidden_channels,
num_layers = num_layers)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
train(model = model,
train_loader = train_loader,
valid_loader = valid_loader,
test_loader = test_loader,
config = config,
device = device,
save_path = f'./results/{dataset_name}/{split_scheme}/gcn'
)
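Note that the config-merging loop above uses a plain `dict.update`, which is a shallow merge: each top-level key from a later file replaces the earlier value wholesale. A self-contained sketch of that behavior, using in-memory dicts in place of the YAML files (the keys and values here are illustrative, not the actual contents of `train.yaml` or `gcn.yaml`):

```python
# Simulate merging a base training config with a model-specific config,
# mirroring the config.update(yaml.safe_load(file)) loop above.
train_cfg = {"train": {"epochs": 100, "lr": 1e-3}}
model_cfg = {"model": {"hidden_channels": 64, "num_layers": 3}}

config = {}
for cfg in (train_cfg, model_cfg):
    config.update(cfg)  # shallow merge: top-level keys are replaced whole

hidden_channels = config["model"]["hidden_channels"]
num_layers = config["model"]["num_layers"]
print(hidden_channels, num_layers)  # -> 64 3
```

Because the merge is shallow, if two config files both defined a `model` section, the later file's section would replace the earlier one entirely rather than being merged key by key, so it is simplest to keep training and model hyperparameters in separate files as shown.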
Citation
If you find our work helpful, please cite our paper:
@article{dong2024welqrate,
title={WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking},
author={Liu, Yunchao and Dong, Ha and Wang, Xin and Moretti, Rocco and Wang, Yu and Su, Zhaoqian and Gu, Jiawei and Bodenheimer, Bobby and Weaver, Charles David and Meiler, Jens and Derr, Tyler and others},
journal={arXiv preprint arXiv:2411.09820},
year={2024}
}