Package for calculating the Modelability Index (MODI) of a QSAR dataset

MODI

Package for calculating the Modelability Index (MODI) of a QSAR dataset. You can read about MODI in the original paper. MODI is mathematically equivalent to the class-averaged (uniformly weighted) leave-one-out cross-validation accuracy of a 1-nearest-neighbor classifier on the training data. The paper claims that if this value is not high, then no model will be able to predict well on new data. While this may be true for small datasets, scaling laws are real, and the paper was written before neural networks became popular in chemistry. With large non-linear models, the claim might not hold, though MODI is still a useful metric if you are trying to understand how well separated your classes are in feature space.
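To make the definition above concrete, here is a minimal NumPy sketch of a class-averaged leave-one-out 1-nearest-neighbor accuracy. It is an illustration only, not the package's implementation: the function name modi_sketch is hypothetical, and it assumes a dense numeric feature matrix with Euclidean distance rather than fingerprints.

```python
import numpy as np


def modi_sketch(features, labels):
    """Class-averaged leave-one-out 1-NN accuracy (illustrative only)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # Full pairwise Euclidean distance matrix (fine for small toy inputs).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Leave-one-out: a point may not be its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    # True where a point's nearest neighbor shares its class label.
    same_class = labels[nearest] == labels

    # Per-class fraction of points whose nearest neighbor is in the same class.
    contributions = {
        cls: float(same_class[labels == cls].mean())
        for cls in np.unique(labels).tolist()
    }

    # Uniform average over classes, so small classes count as much as large ones.
    value = float(np.mean(list(contributions.values())))
    return value, contributions


toy_features = [[0.0], [1.0], [1.2], [5.0], [5.5]]
toy_labels = ["A", "A", "B", "B", "B"]
toy_value, toy_contributions = modi_sketch(toy_features, toy_labels)
print(toy_value, toy_contributions)
```

Averaging per class first, rather than over all points, keeps a large well-separated class from hiding a small poorly separated one.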

By default, MODI is calculated using the Tanimoto distance between Morgan fingerprints. However, you can also provide your own data matrix and distance metric if you are using non-sparse, non-binary features.
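For readers unfamiliar with the default metric, the sketch below computes the Tanimoto distance between two binary vectors with NumPy. It is an assumption-laden illustration: the toy bit vectors stand in for Morgan fingerprints, the function name tanimoto_distance is hypothetical, and treating two all-zero fingerprints as identical is a convention chosen here, not something the package documents.

```python
import numpy as np


def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two binary vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)

    shared_bits = np.logical_and(a, b).sum()  # bits set in both vectors
    total_bits = np.logical_or(a, b).sum()    # bits set in either vector

    if total_bits == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 0.0

    # Distance is one minus the Tanimoto similarity.
    return float(1.0 - shared_bits / total_bits)


fp_a = [1, 1, 0, 0]
fp_b = [1, 0, 1, 0]
print(tanimoto_distance(fp_a, fp_b))
```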

If you use this package in your work, please cite the original paper:

Golbraikh A, Muratov E, Fourches D, Tropsha A. Data set modelability by QSAR. J Chem Inf Model. 2014 Jan 27;54(1):1-4. doi: 10.1021/ci400572x. Epub 2014 Jan 8. PMID: 24251851; PMCID: PMC3984298.

Installation

You can install the package via pip:

pip install qsar_modi

You can also clone the repository and build from source using poetry:

git clone https://github.com/molecularmodelinglab/modi
cd modi
poetry build

Usage

You can use MODI in your Python scripts as follows:

from qsar_modi import modi

smiles = ["CCO", "CCN", "CCC", "CCCl", "CCBr", "CCCCl", "CCCBr"]
labels = ["A", "A", "B", "B", "A", "C", "C"]

modi_value, class_contributions = modi(chemicals=smiles, labels=labels)

modi returns both the MODI value and a breakdown of contributions from each class. MODI is defined as the uniform average of the class contributions, but observing the class contributions can help identify which classes are well separated and which are not.

MODI supports several types of fingerprints; see the docstring for details.

modi_value, class_contributions = modi(chemicals=smiles, labels=labels, fp_type="bAtomPair")

You can also provide your own data matrix and distance metric:

import numpy

data = numpy.random.random((100, 100))  # pretend this is your feature matrix
labels = [0] * 50 + [1] * 50  # pretend these are your class labels

modi_value, class_contributions = modi(data=data, labels=labels, metric="euclidean")

Lastly, pairwise distances are expensive to compute. MODI will automatically avoid calculating the entire pairwise distance matrix if your dataset is large enough (more than 25,000 chemicals). Instead, it will use a row-by-row approach to save memory (at the cost of speed). You can control this behavior with the force_pdist and force_loop parameters. force_pdist will force the use of the full pairwise distance matrix, while force_loop will force the row-by-row approach. By default, neither is set and MODI will choose the best approach.
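The trade-off between the two strategies can be sketched as follows. This is not the package's code: both function names are hypothetical, and Euclidean distance on a dense matrix is assumed for simplicity. The first function materializes every pairwise distance at once, while the second only ever holds one row of distances in memory.

```python
import numpy as np


def nearest_neighbors_full(data):
    """Nearest neighbor of each row via the full n x n distance matrix."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own search
    return dist.argmin(axis=1)


def nearest_neighbors_rowwise(data):
    """Same result, but only one row of distances exists at a time."""
    n_samples = data.shape[0]
    nearest = np.empty(n_samples, dtype=int)
    for i in range(n_samples):
        row = np.sqrt(((data - data[i]) ** 2).sum(axis=1))
        row[i] = np.inf  # exclude the point itself
        nearest[i] = row.argmin()
    return nearest


rng = np.random.default_rng(0)
demo_data = rng.random((50, 3))
print(np.array_equal(nearest_neighbors_full(demo_data),
                     nearest_neighbors_rowwise(demo_data)))
```

The full version needs memory that grows with the square of the number of samples, which is why a loop becomes attractive for large datasets even though it runs more slowly in Python.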

modi_value, class_contributions = modi(data=data, labels=labels, metric="euclidean", force_loop=True)

Command Line Interface

MODI has a command line interface (CLI) that you can use to calculate MODI from a CSV or SDF file, provided that each chemical has an associated class label. The CLI can be accessed via the modi command after installing the package.

modi my_chemicals.sdf --label-name activity_class

This will return the MODI value and class contributions for the chemicals in my_chemicals.sdf, as well as some extra info about how it was calculated (FP type, number of classes, etc.).

Note: The CLI does not support pre-embedded data. If you have a CSV with pre-calculated features, you will need to use the Python API instead.
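If your features already live in a CSV, one way to get them into the Python API is sketched below. The helper load_feature_csv, the file name my_features.csv, and the CSV layout (one label column, every other column numeric) are all assumptions for illustration. The final call to modi is left as a comment so the snippet runs without the package installed; it mirrors the data/labels/metric usage shown earlier.

```python
import csv

import numpy as np


def load_feature_csv(path, label_column):
    """Read a CSV with one label column and numeric feature columns.

    Hypothetical helper: assumes every column other than label_column
    holds a value that can be parsed as a float.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))

    labels = [row[label_column] for row in rows]
    feature_names = [name for name in rows[0] if name != label_column]
    data = np.array(
        [[float(row[name]) for name in feature_names] for row in rows]
    )
    return data, labels


# Example usage (file name and column name are placeholders):
# from qsar_modi import modi
# data, labels = load_feature_csv("my_features.csv", label_column="activity_class")
# modi_value, class_contributions = modi(data=data, labels=labels, metric="euclidean")
```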
