Automating the generation of human readable descriptions of arbitrary subsets of molecular space.

EXplainable MOlecular SETs

A package that automates the identification of molecular similarity, given an arbitrary set of molecules and associated functions that calculate the values of particular properties (label fingerprints).

Installation

The easiest way to install is with pip, after setting up a new conda environment that has rdkit installed (rdkit does not play well with pip).

  1. conda create -n exmoset python=3.8
  2. conda activate exmoset
  3. conda install -c conda-forge rdkit
  4. pip install exmoset

API

The MolSpace class analyses a given set of molecules according to the list of fingerprints provided. Molecules can be passed to MolSpace in any format; the mol_converters argument specifies how each one is converted into the additional representations that the fingerprints need.

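from rdkit import Chem  # provides Chem.MolFromSmiles used below

# MolSpace comes from exmoset; molecules and fingerprints are assumed to be defined beforehand
analysis = MolSpace(molecules,
                    fingerprints=fingerprints,
                    file="data/QM9_Data.csv",
                    mol_converters={"rd" : Chem.MolFromSmiles, "smiles" : str},
                    index_col="SMILES")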

Fingerprints

Fingerprints are a standardized way for MolSpace to calculate a property for each molecule it analyses. A fingerprint's arguments set the grammatical structure of the label that will be produced (property, noun and verb), the type of label (label_type), the function that calculates the property (calculator), and the molecular format that function operates on (mol_format). The grammatical structure of the resulting labels is a work in progress and may produce poorly worded labels that require further processing.

def contains_C(mol):
    # mol is a SMILES string; this simple check also matches "Cl" and misses aromatic "c"
    return 1 if "C" in mol else 0

contains_carbon = Fingerprint(property="Contains C",
                  verb="contain",
                  noun="Molecule",
                  label_type="binary",
                  calculator=contains_C,
                  mol_format="smiles")
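
As a rough illustration of what a calculator does, the sketch below computes a binary label from SMILES strings without using exmoset at all. The function name contains_carbon_atom and the regular expression are assumptions made for illustration. The pattern is a simplified tokenizer, not a full SMILES parser, but unlike a plain substring test it does not mistake Cl, Co or Sc for carbon and it does recognise aromatic carbon.

```python
import re

# One token per match: the symbol of a bracket atom such as [13CH4] or [Na+]
# (captured in group 1), the two-letter organic-subset atoms Cl and Br, or any
# single letter. Leftover letters inside brackets (e.g. the H in [NH4+]) also
# become tokens, but they can never be "C" or "c".
_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})|Cl|Br|[A-Za-z]")

def contains_carbon_atom(smiles):
    """Return 1 if any atom token is carbon ("C" aliphatic, "c" aromatic), else 0."""
    symbols = (m.group(1) or m.group(0) for m in _ATOM.finditer(smiles))
    return 1 if any(sym in ("C", "c") for sym in symbols) else 0

example_smiles = ["CCO", "c1ccccc1", "Cl", "O", "[Na+].[Cl-]"]
labels = {smi: contains_carbon_atom(smi) for smi in example_smiles}
```

A function like this could be passed as the calculator of a Fingerprint with mol_format="smiles", in the same way as contains_C above.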

Molecule Converters

The mol_converters argument provides the means to transform each molecule into alternate representations. The argument is a dictionary with the following structure {Identifier : Function_that_will_convert} that is expanded in the following way:

formats = {key : mol_converters[key](mol) for key in mol_converters.keys()} # Assigns each identifier to its associated representation by applying its converter to mol
self.Molecules.append(Molecule(mol, **formats)) # Unpacks the new formats as kwargs into the Molecule object

An example is provided below

mol_converters = {"rd" : Chem.MolFromSmiles, "smiles" : str} # Converts molecules provided as SMILES strings into RDKit Mol objects (key "rd") and keeps the SMILES themselves as strings (key "smiles").
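
The expansion described in this section can be tried without RDKit. In the minimal sketch below, the Molecule class is a hypothetical stand-in (not exmoset's own class) and the "n_chars" converter is made up for illustration. Only the dictionary comprehension and the keyword unpacking mirror the pattern used by the package.

```python
class Molecule:
    """Hypothetical stand-in: stores the original input plus one attribute per format."""

    def __init__(self, mol, **formats):
        self.mol = mol
        for name, representation in formats.items():
            setattr(self, name, representation)

# Each key names a representation; each value is the function that builds it.
mol_converters = {"smiles": str, "n_chars": len}

molecules = []
for mol in ["CCO", "c1ccccc1"]:
    # Apply every converter to the raw input, keyed by its identifier.
    formats = {key: convert(mol) for key, convert in mol_converters.items()}
    # Unpack the converted representations as keyword arguments.
    molecules.append(Molecule(mol, **formats))
```

Each stand-in object then exposes one attribute per identifier, for example molecules[0].smiles and molecules[0].n_chars, which is how a fingerprint's mol_format can select the representation its calculator expects.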

Label Types

Binary

Binary labels indicate the presence or absence of a particular element, bond type, or molecular feature (such as aromaticity). They are the simplest to calculate and the best behaved with respect to the entropy estimators. Uses a discrete entropy estimator.

Multiclass

Discrete labels where the value can be any integer. Examples include number of rings, number of atoms, or number of each type of bond. Uses a discrete entropy estimator.

Continuous

Continuous labels where the value can be any real number. Examples include electronic spatial extent, dipole moment, and free energy. Uses the continuous entropy estimator.
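
To give a feel for what a discrete entropy estimator measures on binary and multiclass labels, here is a minimal plug-in (frequency-count) Shannon entropy sketch. It is an illustration under stated assumptions, not exmoset's implementation: the function name discrete_entropy and the example label lists are invented here.

```python
from collections import Counter
from math import log2

def discrete_entropy(labels):
    """Plug-in Shannon entropy (in bits) of a sequence of discrete labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum p * log2(1/p) over the observed label values, with p = n / total.
    return sum((n / total) * log2(total / n) for n in counts.values())

all_contain_carbon = [1, 1, 1, 1]    # binary label, identical across the set
half_contain_carbon = [1, 0, 1, 0]   # binary label, evenly split
ring_counts = [0, 1, 2, 3]           # multiclass label, all values distinct

entropies = [discrete_entropy(x) for x in
             (all_contain_carbon, half_contain_carbon, ring_counts)]
```

With these inputs the three entropies are 0.0, 1.0 and 2.0 bits. A label that takes the same value for every molecule in the set has zero entropy, which is what makes it a candidate for describing what the set has in common.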

References

The mathematical methods employed in this codebase are based on the following publications:

  • Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
  • Ross, B. C. Mutual Information between Discrete and Continuous Data Sets. PLOS ONE 2014, 9, 1–5.

Continuous entropy estimation is provided by Paul Brodersen's entropy estimators package (https://github.com/paulbrodersen/entropy_estimators).
