MLM MachineLearning Molecular Benchmark

MLMBench - MachineLearning Molecular Benchmark

Licence: GPL v3

MLMBench collects datasets and splits them to enable fair ML benchmarks. It can be used with different ML algorithms and data representations for molecular property/activity predictions and more.

The scope of this code is to:

  • keep a simple API
  • require no other libraries
  • keep the datasets offline, represented as CSV files (RFC 4180 standard) or SMILES string lists.

Splits are made using well-known rational approaches such as:

  • random split
  • meaningful split for model target extrapolation
  • meaningful split for chemical diversity extrapolation
  • split published in the literature

The datasets are stored in the "data" directory in subfolders. Every subfolder needs the following files, with exactly these names:

  • Readme.txt: dataset information (provenance, type of data, descriptors version, and so on)
  • cv.splits: the splits required for a fair training, test, and validation procedure with any ML algorithm
  • dataset.csv: the matrix of features
  • target.csv: the matrix of one or more targets
  • dataset.smi: the SMILES list
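
Before using or submitting a dataset subfolder, it can help to check that all the files listed above are present. The following is a minimal sketch using only the standard library. The helper name missing_files and the example path are hypothetical and are not part of the mlmbench API; only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("Readme.txt", "cv.splits", "dataset.csv", "target.csv", "dataset.smi")


def missing_files(dataset_dir):
    """Return the required file names that are absent from dataset_dir."""
    folder = Path(dataset_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Hypothetical path: replace it with a real dataset subfolder.
print(missing_files("data/my-dataset"))
```

An empty list means the subfolder contains every expected file.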

Install


pip3 install mlmbench

Split types per dataset

For every dataset, mlmbench includes two different splits, plus a third where available:

  • random split, made with "mkrndsplits.py" starting from a list of names
  • target extrapolation split, made with "mktgtextrapsplits.py" starting from the target file. The algorithm first imports the target file; then, for every column, it ranks the values from min to max and cuts the ordered target into "N" splits, where "N" is selected by the user. This split aims to check for "extrapolation".
  • literature split (if available). In this case, we try to preserve particular splits published in the literature.
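
The ranking idea behind the target extrapolation split can be sketched as follows. This is an illustration of the description above for a single target column, not the code in "mktgtextrapsplits.py". The function name is hypothetical, and how the resulting groups are assigned to train, test, and validation is left open because the text does not specify it.

```python
def target_extrapolation_groups(names, values, n_splits):
    """Sort molecules by target value and cut the ranking into contiguous groups."""
    if len(names) != len(values):
        raise ValueError("names and values must have the same length")
    if n_splits < 1 or n_splits > len(names):
        raise ValueError("n_splits must be between 1 and the number of molecules")

    # Rank the molecule names from the lowest to the highest target value.
    ranked = [name for name, _ in sorted(zip(names, values), key=lambda pair: pair[1])]

    # Cut the ranking into n_splits contiguous groups of nearly equal size.
    base, extra = divmod(len(ranked), n_splits)
    groups, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        groups.append(ranked[start:start + size])
        start += size
    return groups


names = ["a", "b", "c", "d", "e"]
values = [3.0, 1.0, 5.0, 2.0, 4.0]
print(target_extrapolation_groups(names, values, 2))
```

Because each group covers a separate range of target values, a model trained on some groups and tested on another has to predict outside the range it has seen.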

Available datasets

  • BACE-moleculenet
  • BACE-random
  • BACE-tgt_extrapolation
  • FU-random
  • FU-tgt_extrapolation
  • HLMCLint-random
  • HLMCLint-tgt_extrapolation
  • MeltingPoint-random
  • MeltingPoint-tgt_extrapolation
  • NIR_Gasoline-random
  • NIR_Gasoline-tgt_extrapolation
  • SteroidsLSS-isomers
  • SteroidsLSS-random
  • SteroidsLSS-tgt_extrapolation
  • esol-chemdiversity
  • esol-random
  • esol-tgt_extrapolation
  • logDpH7.4-random
  • logDpH7.4-tgt_extrapolation

How to use

#!/usr/bin/env python3

from mlmbench.data import Datasets

ds = Datasets()
print(ds.get_available_datasets())
print(f'Dataset info: {ds.get_info("esol-random")}')
for train_data, test_data, val_data in ds.ttv_generator("esol-random"):
    print("train ", train_data["xdata"].shape, train_data["target"].shape, len(train_data["smi"]))
    print("test ", test_data["xdata"].shape, test_data["target"].shape, len(test_data["smi"]))
    print("val ", val_data["xdata"].shape, val_data["target"].shape, len(val_data["smi"]))
    
    # Train, test, and validate your ML model here, then collect the results
    # and store them in an appropriate format for your analysis.

    print("-"*40)
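
As one possible way to fill in the training step, the sketch below scores a trivial baseline that predicts the training mean, and collects one test RMSE per split. It assumes only what the loop above shows (three dictionaries per split, each with a "target" entry) plus the assumption that every target row can be indexed by column. The function name is hypothetical, and the toy data stands in for a real dataset so that the block runs without mlmbench installed.

```python
import math


def mean_baseline_rmse(splits, column=0):
    """Return one test-set RMSE per split for a predict-the-training-mean baseline."""
    scores = []
    for train_data, test_data, _val_data in splits:
        # Assumption: each target row is indexable, e.g. a list or a 2D array row.
        train_y = [float(row[column]) for row in train_data["target"]]
        test_y = [float(row[column]) for row in test_data["target"]]
        mean_y = sum(train_y) / len(train_y)
        mse = sum((y - mean_y) ** 2 for y in test_y) / len(test_y)
        scores.append(math.sqrt(mse))
    return scores


# Toy stand-in for the (train, test, validation) triples yielded per split.
toy_splits = [
    ({"target": [[1.0], [3.0]]}, {"target": [[2.0], [4.0]]}, {"target": [[5.0]]}),
]
print(mean_baseline_rmse(toy_splits))
```

If the assumption about the target layout holds, the same function could be called on ds.ttv_generator("esol-random") in place of the toy data.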

Submit new dataset


  1. Fork the project!

  2. Clone the forked project

  3. Add the dataset in this form:

       • dataset.csv: tabular data for any kind of descriptors
       • target.csv: tabular data for one or multiple targets
       • dataset.smi: SMILES of each molecule followed by its name, e.g. "c1ccccc1 benzene"
       • cv.split: the split you like. This file needs to be compatible with the following standard: every line represents one model, the groups in a line are separated by the ";" character, every group lists compound names, and the names are separated by the "," character, i.e.

                   train keys         ;  test keys          ;  validation keys
           line 1  mol1,mol2,mol3,..  ;  mol200,mol201,...  ;  mol400,mol401,...
           line 2  ...
           line 3  ..

       • Readme.md: info regarding the dataset (i.e. source and so on)

  4. Create a pull request; 99.9% of the time it will be merged
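
The split file layout described in step 3 can be read and written with a few lines of standard-library Python. This is a minimal sketch, not the parser used inside mlmbench. The function names are hypothetical, and tolerating spaces around the separators is an assumption.

```python
def parse_split_line(line):
    """Parse one 'train ; test ; validation' line into three lists of names."""
    groups = line.strip().split(";")
    if len(groups) != 3:
        raise ValueError(f"expected 3 ';'-separated groups, found {len(groups)}")
    # Names inside each group are separated by ','; surrounding spaces are dropped.
    train, test, validation = (
        [name.strip() for name in group.split(",") if name.strip()]
        for group in groups
    )
    return {"train": train, "test": test, "validation": validation}


def format_split_line(train, test, validation):
    """Write three lists of names back as a single split line."""
    return ";".join(",".join(group) for group in (train, test, validation))


example = "mol1,mol2,mol3 ; mol200,mol201 ; mol400,mol401"
split = parse_split_line(example)
print(split["test"])
print(format_split_line(split["train"], split["test"], split["validation"]))
```

Checking that a line round-trips through both functions is a quick way to catch a misplaced separator before opening a pull request.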

Download files

Source Distribution: mlmbench-1.0.3.tar.gz (49.9 MB)

Built Distribution: mlmbench-1.0.3-py3-none-any.whl (50.6 MB, Python 3)

