
Molecular feature generation for machine learning

Molecular Descriptors

Molecular features for machine learning.

This is a collection of methods to generate molecular features for machine learning, including common feature representations such as the Coulomb matrix. A graph generator for graph neural networks is also included in molreps. This repo is currently under construction and can easily be extended following this recommended style:

  • The main classes are listed in molreps.
  • Individual functions are collected in methods.
  • Each function or class is documented with a Google-style docstring.
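
As an illustration of the docstring convention named in the list above, here is a minimal sketch of a Google-style docstring. The function `scale_coordinates` is hypothetical and only serves as an example; it is not part of molreps.

```python
def scale_coordinates(coords, factor=1.0):
    """Scale Cartesian coordinates by a constant factor.

    Hypothetical example function, used only to show the docstring layout.

    Args:
        coords (list): List of [x, y, z] positions.
        factor (float): Multiplicative scaling factor. Defaults to 1.0.

    Returns:
        list: Scaled coordinates with the same shape as the input.
    """
    return [[factor * value for value in xyz] for xyz in coords]
```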


Clone the repository and install it in editable mode:

pip install -e ./molecular_features

or install the latest release via:

pip install molreps


Auto-documentation is generated at: .


Some of the chemical libraries in use, including the following dependencies, cannot easily be installed with pip and must be installed separately, e.g. with conda:

  • rdkit, e.g. via conda install -c rdkit rdkit
  • openbabel, e.g. via conda install -c openbabel openbabel
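
Before using features that rely on these packages, it can help to check whether they are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `check_optional_dependencies` is hypothetical and not part of molreps, and the top-level module names `rdkit` and `openbabel` are assumed to match the installed packages.

```python
from importlib.util import find_spec


def check_optional_dependencies(names=("rdkit", "openbabel")):
    """Return a dict mapping each top-level module name to True if it is importable."""
    # find_spec returns None when a top-level module cannot be located.
    return {name: find_spec(name) is not None for name in names}


for name, available in check_optional_dependencies().items():
    print(f"{name}: {'found' if available else 'missing'}")
```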



Simple molecular representations can be generated from molreps.descriptors.

from molreps.descriptors import coulomb_matrix
atoms = ['C','C']
coords = [[0,0,0],[1,0,0]]
cm = coulomb_matrix(atoms,coords)
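
To make clear what such a matrix contains, here is a minimal pure-Python sketch of the commonly used Coulomb matrix definition: 0.5 * Z_i**2.4 on the diagonal and Z_i * Z_j / |R_i - R_j| off the diagonal. It is not the molreps implementation. The function name `coulomb_matrix_sketch` is hypothetical, it takes atomic numbers rather than element symbols, and any unit conversion, sorting or padding that molreps may apply is not covered.

```python
import math


def coulomb_matrix_sketch(atomic_numbers, coords):
    """Textbook Coulomb matrix for a list of atomic numbers and 3D positions."""
    n = len(atomic_numbers)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term depends only on the nuclear charge.
                cm[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                # Off-diagonal term: pairwise nuclear repulsion.
                dist = math.dist(coords[i], coords[j])
                cm[i][j] = atomic_numbers[i] * atomic_numbers[j] / dist
    return cm


# Two carbon atoms (Z = 6) one length unit apart.
cm_sketch = coulomb_matrix_sketch([6, 6], [[0, 0, 0], [1, 0, 0]])
print(cm_sketch[0][1])  # off-diagonal: 6 * 6 / 1.0 = 36.0
```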

Individual functions from molreps.methods can also be used directly, such as the reverse direction in this case.

from molreps.methods.geo_npy import geometry_from_coulombmat
atom_cord = geometry_from_coulombmat(cm)
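
As a rough illustration of why the reverse direction is possible, the sketch below recovers atomic numbers and pairwise distances from a textbook Coulomb matrix by inverting the diagonal and off-diagonal terms. Whether geometry_from_coulombmat works this way is an assumption. The name `distances_from_coulomb_matrix` is hypothetical, and turning distances into coordinates (which is only determined up to rotation, translation and reflection) is left out.

```python
def distances_from_coulomb_matrix(cm):
    """Recover atomic numbers and pairwise distances from a textbook Coulomb matrix."""
    n = len(cm)
    # Diagonal: M_ii = 0.5 * Z_i**2.4  =>  Z_i = (2 * M_ii) ** (1 / 2.4)
    z = [(2.0 * cm[i][i]) ** (1.0 / 2.4) for i in range(n)]
    # Off-diagonal: M_ij = Z_i * Z_j / d_ij  =>  d_ij = Z_i * Z_j / M_ij
    dist = [
        [0.0 if i == j else z[i] * z[j] / cm[i][j] for j in range(n)]
        for i in range(n)
    ]
    # Atomic numbers are integers, so round away floating-point noise.
    return [round(value) for value in z], dist


# Toy Coulomb matrix of two carbon atoms one length unit apart.
toy_cm = [[0.5 * 6 ** 2.4, 36.0], [36.0, 0.5 * 6 ** 2.4]]
z_numbers, dist_matrix = distances_from_coulomb_matrix(toy_cm)
print(z_numbers)
print(dist_matrix)
```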


Many ML models require a graph representation of the molecule. The class MolGraph from molreps.graph inherits from networkx's nx.Graph and can generate a molecular graph from a mol object provided by a cheminformatics package such as rdkit, openbabel or ase. This makes it possible to combine functionality from networkx with that of packages like rdkit. First, create a mol object.

from rdkit import Chem  # a bare "import rdkit" does not load the Chem submodule
m = Chem.MolFromSmiles("CC1=CC=CC=C1")  # toluene
m = Chem.AddHs(m)  # add explicit hydrogens

The mol object is passed to the MolGraph constructor and remains accessible afterwards.

import networkx as nx
import numpy as np
from molreps.graph import MolGraph
mgraph = MolGraph(m)
mgraph.mol  # Access the mol object.

The networkx graph is generated by make(), where the features and keys can be specified. Pre-defined features are assigned by an identifier, as in 'key': 'identifier', or, if further arguments are required, by 'key': {'class': 'identifier', 'args': {'arg1': value1, 'arg2': value2}}. In the latter form, a custom function or class can be provided instead of an identifier, as in 'key': {'class': my_fun, 'args': {'arg1': value1, 'arg2': value2}}. A dictionary of the predefined identifiers can be listed with print(MolGraph._mols_implemented).

mgraph.make(nodes = {"AtomicNum" : 'AtomicNum'},
            edges = {"BondType" : 'BondType',
                     "Distance" : {'class':'Distance', 'args':{'bonds_only':True}}},
            state = {"ExactMolWt" : 'ExactMolWt'})
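
The two specification forms accepted by make() can be pictured as a small dispatch step. The sketch below shows one way such a specification could be resolved into a callable and its keyword arguments. It does not reproduce molreps internals: `resolve_feature_spec` and `toy_registry` are hypothetical names, and the registry stands in for the pre-defined identifiers.

```python
def resolve_feature_spec(spec, registry):
    """Turn a feature spec into a (callable, kwargs) pair.

    Hypothetical helper: `registry` maps identifier strings to callables.
    """
    if isinstance(spec, str):
        # Short form: 'key': 'identifier'
        return registry[spec], {}
    if isinstance(spec, dict):
        # Long form: 'key': {'class': identifier_or_callable, 'args': {...}}
        target = spec["class"]
        func = registry[target] if isinstance(target, str) else target
        return func, dict(spec.get("args", {}))
    raise TypeError(f"Unsupported feature spec: {spec!r}")


# Toy registry with one stand-in feature function.
toy_registry = {"Distance": lambda bonds_only=False: ("distance", bonds_only)}

func, kwargs = resolve_feature_spec(
    {"class": "Distance", "args": {"bonds_only": True}}, toy_registry
)
print(func(**kwargs))  # ('distance', True)
```

Passing a callable instead of a string under 'class' skips the registry lookup, which mirrors the custom-function form described above.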

Note that a custom function must accept key and this instance as its first arguments; the molecule object is accessible via .mol. For example, make a list of tuples such as [(i, {key: property})] for atoms and [(i, j, {key: property})] for bonds, and then add them to the graph with add_nodes_from() or add_edges_from(), respectively.

The generated graph can then be viewed and treated as a networkx graph, for example plotted with nx.draw(mgraph, with_labels=True).

Finally, a closed-form tensor is collected from selected features defined by the key attribute. For each key, a function to process the features and a default value can optionally be provided; the processing function defaults to np.array. A default value must be provided if any node or edge is missing a key, so that a closed-form tensor can be generated.
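
The tuple layout for custom node and edge features can be sketched without any chemistry package. The example below builds both lists from toy data. The helper `build_attribute_lists` and the keys "Symbol" and "BondOrder" are hypothetical, and a real custom function would read the values from the mol object instead.

```python
def build_attribute_lists(key_node, key_edge, symbols, bonds):
    """Build node and edge attribute lists in the tuple layout described above."""
    # One (index, {key: value}) tuple per atom.
    node_list = [(i, {key_node: sym}) for i, sym in enumerate(symbols)]
    # One (i, j, {key: value}) tuple per bond.
    edge_list = [(i, j, {key_edge: order}) for i, j, order in bonds]
    return node_list, edge_list


# Toy data: two carbon atoms joined by a double bond, hydrogens omitted.
nodes, edges = build_attribute_lists("Symbol", "BondOrder", ["C", "C"], [(0, 1, 2.0)])
print(nodes)  # [(0, {'Symbol': 'C'}), (1, {'Symbol': 'C'})]
print(edges)  # [(0, 1, {'BondOrder': 2.0})]
```

Lists of this shape are what networkx's add_nodes_from() and add_edges_from() accept as input.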

graph_tensors = mgraph.to_tensor(nodes = ["AtomicNum"],
                                 edges = ["BondType"],
                                 state = ["ExactMolWt"],
                                 out_tensor = np.array)

The returned dictionary containing the feature tensors can be passed to graph models.
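
The role of the default value when collecting features can be illustrated with plain dictionaries. This is a sketch of the idea only, not the to_tensor() implementation: `collect_features` is a hypothetical helper, and `out_tensor=list` is used here so that the example runs without numpy.

```python
def collect_features(items, keys, default=None, out_tensor=list):
    """Collect per-item attribute dicts into one rectangular structure.

    Hypothetical sketch: `items` is a list of attribute dicts (one per node
    or edge), `keys` selects the features, `default` fills missing keys.
    """
    rows = []
    for attrs in items:
        row = []
        for key in keys:
            if key in attrs:
                row.append(attrs[key])
            elif default is not None:
                row.append(default)
            else:
                # Without a default the rows would have unequal content.
                raise KeyError(f"Missing {key!r} and no default given")
        rows.append(row)
    return out_tensor(rows)


# The second node lacks "Charge", so the default fills the gap.
node_attrs = [{"AtomicNum": 6, "Charge": 0}, {"AtomicNum": 1}]
print(collect_features(node_attrs, ["AtomicNum", "Charge"], default=0))
# [[6, 0], [1, 0]]
```

Swapping `out_tensor` for np.array would return an array instead of a nested list.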


Example scripts using this repo are collected in examples.


    author = {Patrick Reiser},
    title = {Python Package for Molecular Representations in Machine learning},
    year = {2020},
    publisher = {GitHub},
    journal = {GitHub Repository},
    howpublished = {\url{}},
    url = ""


Files for molreps, version 0.1.1

  • molreps-0.1.1.tar.gz (27.7 kB), source distribution
  • molreps-0.1.1-py3-none-any.whl (26.1 kB), wheel, Python version py3
