MLM MachineLearning Molecular Benchmark
MLMBench - MachineLearning Molecular Benchmark
MLMBench collects datasets and splits them so that ML benchmarks are fair. It can be used with different ML algorithms and data representations for molecular property or activity prediction, among other tasks.
The goals of this project are:
- keep the API simple
- require no other libraries
- keep the datasets offline, stored as CSV files (RFC 4180 standard) or as lists of SMILES strings.
Splits are made using well-known rational approaches such as:
- random split
- meaningful split for model target extrapolation
- meaningful split for chemical diversity extrapolation
- literature published split
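As an illustration of the simplest approach in the list above, here is a minimal sketch of a seeded random split. It is not the code shipped with MLMBench: the function name `random_split`, the fraction parameters, and the default seed are assumptions made for this example.

```python
import random


def random_split(names, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Shuffle compound names and cut them into train, test and validation lists.

    Hypothetical helper for illustration; not part of the mlmbench API.
    """
    if test_fraction < 0 or val_fraction < 0 or test_fraction + val_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(names)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, test, val


names = [f"mol{i}" for i in range(1, 11)]
train, test, val = random_split(names)
print("train:", train)
print("test: ", test)
print("val:  ", val)
```

Fixing the seed makes the split repeatable, which matters when several models are compared on the same partition.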
The datasets are stored in the "data" directory in subfolders. Every subfolder needs the following files with the following names:
- Readme.txt: describes the dataset (provenance, type of data, descriptor version, and so on)
- cv.splits: the splits required for fair training, testing, and validation with any ML algorithm
- dataset.csv: the matrix of features
- target.csv: the matrix of target/targets
- dataset.smi: the SMILES list
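The layout above can be checked with a few lines of standard-library Python. This is a sketch only: `missing_dataset_files` and `REQUIRED_FILES` are hypothetical names introduced here, not part of mlmbench, and the file names are taken from the list above.

```python
import tempfile
from pathlib import Path

# File names as listed above; adjust if your dataset folder uses other names.
REQUIRED_FILES = ("Readme.txt", "cv.splits", "dataset.csv", "target.csv", "dataset.smi")


def missing_dataset_files(folder):
    """Return the required file names that are absent from a dataset subfolder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Demonstration on a throwaway folder that contains only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dataset.csv").write_text("name,x1\nmol1,0.5\n")
    (Path(tmp) / "target.csv").write_text("name,y\nmol1,1.2\n")
    print(missing_dataset_files(tmp))
```

An empty list means the subfolder contains every expected file.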
Install
pip3 install mlmbench
Split types per dataset
mlmbench includes two different splits for every dataset, plus a literature split where one is available:
- random split using "mkrndsplits.py", starting from a list of names
- target extrapolation using "mktgtextrapsplits.py", starting from the target file. In this case, the algorithm first imports the target file and then, for every column, ranks the values from minimum to maximum and divides the ordered targets into "N" splits, where N is selected by the user. This split aims to check for "extrapolation."
- literature split (if available). In this case, we try to preserve particular splits published in the literature.
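The ranking step of the target extrapolation split can be sketched as follows. This is an illustration of the idea described above, not the code of "mktgtextrapsplits.py": the function name is hypothetical, it handles a single target column, and it leaves open how the resulting groups are assigned to training, test, and validation sets.

```python
def target_extrapolation_groups(names, targets, n_splits):
    """Rank compounds by target value and cut the ranking into contiguous groups.

    Hypothetical helper for illustration; handles one target column only.
    """
    if len(names) != len(targets):
        raise ValueError("names and targets must have the same length")
    if not 1 <= n_splits <= len(names):
        raise ValueError("n_splits must be between 1 and the number of compounds")
    # Sort from minimum to maximum target value (stable for equal values).
    ranked = [name for _, name in sorted(zip(targets, names), key=lambda pair: pair[0])]
    # Spread the remainder over the first groups so sizes differ by at most one.
    base, extra = divmod(len(ranked), n_splits)
    groups, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        groups.append(ranked[start:start + size])
        start += size
    return groups


names = ["a", "b", "c", "d", "e", "f"]
targets = [3.1, 0.5, 2.2, 4.8, 1.0, 3.9]
for group in target_extrapolation_groups(names, targets, 3):
    print(group)
```

Because each group covers a contiguous range of target values, holding out the lowest or highest group forces a model to predict outside the range it was trained on.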
Available datasets
- BACE-moleculenet
- BACE-random
- BACE-tgt_extrapolation
- FU-random
- FU-tgt_extrapolation
- HLMCLint-random
- HLMCLint-tgt_extrapolation
- MeltingPoint-random
- MeltingPoint-tgt_extrapolation
- NIR_Gasoline-random
- NIR_Gasoline-tgt_extrapolation
- SteroidsLSS-isomers
- SteroidsLSS-random
- SteroidsLSS-tgt_extrapolation
- esol-chemdiversity
- esol-random
- esol-tgt_extrapolation
- logDpH7.4-random
- logDpH7.4-tgt_extrapolation
How to use
#!/usr/bin/env python3
from mlmbench.data import Datasets
ds = Datasets()
print(ds.get_available_datasets())
print(f'Dataset info: {ds.get_info("esol-random")}')
for train_data, test_data, val_data in ds.ttv_generator("esol-random"):
    print("train ", train_data["xdata"].shape, train_data["target"].shape, len(train_data["smi"]))
    print("test ", test_data["xdata"].shape, test_data["target"].shape, len(test_data["smi"]))
    print("val ", val_data["xdata"].shape, val_data["target"].shape, len(val_data["smi"]))
    # Train, test, and validate your ML model here, then collect the results
    # and store them in the format that suits your analysis.
    print("-"*40)
Submit new dataset
- Fork the project.
- Clone the forked project.
- Add the dataset in this form:
  - dataset.csv: tabular data for any kind of descriptors
  - target.csv: tabular data for one or multiple targets
  - dataset.smi: the SMILES of each molecule in the appropriate format, e.g. "c1ccccc1 benzene"
  - cv.split: the split you like. This file must follow this format: every line represents one model; a line contains three groups separated by the ";" character (train keys, test keys, validation keys); every group lists compound names separated by the "," character. For example:
    line 1: mol1,mol2,mol3,.. ; mol200,mol201,... ; mol400,mol401,...
    line 2: ...
    line 3: ..
  - Readme.md: information about the dataset (e.g. source and so on)
- Create a pull request; it will be merged 99.9% of the time.
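Before opening a pull request, it can help to check that a split file follows the format described above. The following is a minimal sketch of a reader for that format. `parse_cv_split` is a hypothetical name, not a function provided by mlmbench, and the compound names in the example are made up.

```python
def parse_cv_split(text):
    """Parse split lines into (train, test, validation) lists of compound names.

    Each non-empty line holds three groups separated by ";", and each group
    holds names separated by ",".
    """
    splits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        groups = line.split(";")
        if len(groups) != 3:
            raise ValueError(f"line {line_number}: expected 3 groups, found {len(groups)}")
        train, test, val = (
            [name.strip() for name in group.split(",") if name.strip()]
            for group in groups
        )
        splits.append((train, test, val))
    return splits


example = (
    "mol1,mol2,mol3 ; mol200,mol201 ; mol400,mol401\n"
    "mol1,mol200 ; mol2,mol400 ; mol3,mol201,mol401\n"
)
for train, test, val in parse_cv_split(example):
    print(len(train), len(test), len(val))
```

A line with more or fewer than three groups raises a `ValueError` that names the offending line, which makes formatting mistakes easy to locate.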