MLM MachineLearning Molecular Benchmarch
Project description
MLMBench - MachineLearning Molecular Benchmarch
MLMBench collects datasets and splits them to do FAIR ML benchmarks. MLMBench can be used with different ML algorithms and data representations for molecular property/activity predictions and more.
The scope of this code is:
- keep a simple API representation
- no need of other libraries
- keep the dataset offline and represented as CSV file (RFC 4180 standard) or SMILES string list.
Splits are made using well-known rational approaches such as:
- random split
- meaningful split for model target extrapolation
- meaningful split for chemical diversity extrapolation
- literature published split
The datasets are stored in the "data" directory in subfolders. Every subfolder needs the following files with the following names:
- Readme.txt: explain some dataset info (provenience, type of data, descriptors version, and so on)
- cv.splits: the split required to do a fair trainin, test, validation in any ml algorithm
- dataset.csv: the matrix of features
- target.csv: the matrix of target/targets
- dataset.smi: the smiles list
Install
pip3 install mlmbench
How to use
#!/usr/bin/env python3
from mlmbench.data import Datasets
ds = Datasets()
print(ds.get_available_datasets())
print(f'Dataset info: {ds.get_info("esol-random")}')
for train_data, test_data, val_data in ds.ttv_generator("esol-random"):
print("train ", train_data["xdata"].shape, train_data["target"].shape, len(train_data["smi"]))
print("test ", test_data["xdata"].shape, test_data["target"].shape, len(test_data["smi"]))
print("val ", val_data["xdata"].shape, val_data["target"].shape, len(val_data["smi"]))
# Do ml training/test validation, collect the results and store it in your
# appropriate format to do your analysis.
print("-"*40)
Submit new dataset
-
Fork the project!
-
Clone the forked project
-
Add the dataset in this form: dataset.csv: tabular data for any kind of descriptors target.csv: tabular data for one or multiple targets dataset.smi: smiles of the molecule in its appropriate format "c1ccccc1 benzene" cv.split: The split you like. This specific file needs to be compatible with the following standard. The file comprises lines representing the model, groups split by the ";" character, and every group representing the compound name, and every name is split using the "," character. i.e. train keys test keys validation keys line 1 mol1,mol2,mol3,.. ; mol200,mol201,... ; mol400,mol401,... line 2 ... line 3 ..
Readme.md: Info regarding the dataset(i.e. source and so on)
-
Create a pull request and 99.9% will be merged
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.