fastprop: Fast Molecular Property Prediction with mordredcommunity
Fast, Scalable, and <500 LOC
Announcements
alphaXiv
The `fastprop` paper is freely available online at arxiv.org/abs/2404.02058 and we are conducting open source peer review on alphaXiv - comments are appreciated!
The source for the paper is stored in this repository under the `paper` directory.
Initial Release :tada:
`fastprop` version 1 is officially released, meaning the API is now stable and ready for production use!
Please try `fastprop` on your datasets and let us know what you think.
Feature requests and bug reports are very appreciated!
Installing fastprop
`fastprop` supports Mac, Windows, and Linux on Python versions 3.8 to 3.12.
Installing from `pip` is the best way to get `fastprop`, but if you need to check out a specific GitHub branch or you want to contribute to `fastprop`, a source installation is recommended.
Pending interest from users, a `conda` package will be added.
Check out the demo notebook for a quick intro to `fastprop` via Google Colab - it runs in your browser, GPU included, no install required.
`pip` [recommended]
`fastprop` is available via PyPI with `pip install fastprop`.
To make extending `fastprop` easier and keep the installation size down, the dependencies required for hyperparameter optimization and SHAP analysis are optional.
They can be installed with `pip install fastprop[hopt]`, `pip install fastprop[shap]`, or `pip install fastprop[shap,hopt]` to install them both.
If you want to use `fastprop` but not write new code on top of it, you may want to install these now - you can always do so later, however, and `fastprop` will remind you.
Source
To install `fastprop` from GitHub directly you can:
- Run `pip install git+https://github.com/JacksonBurns/fastprop.git@main` to install from the `main` branch (or specify any other branch you like).
- Clone the repository with `git clone https://github.com/JacksonBurns/fastprop.git`, navigate to `fastprop` with `cd fastprop`, and run `pip install .`

To contribute to `fastprop`, please follow this tutorial (or something similar) to set up a forked version of `fastprop` and open a pull request (similar to option 2 above).
All contributions are appreciated!
See Developing `fastprop` for more details.
About fastprop
`fastprop` is a package for performing deep-QSPR (Quantitative Structure-Property Relationship) with minimal user intervention.
By passing in a list of SMILES strings, `fastprop` will automatically generate and cache a set of molecular descriptors using `mordredcommunity` and train an FNN to predict the corresponding properties.
See the `examples` and `benchmarks` directories to see how to run training - the rest of this documentation will focus on how you can run, configure, and customize `fastprop`.
`fastprop` Framework
There are four distinct steps in `fastprop` that define its framework:
- Featurization - transform the input molecules (as SMILES strings) into an array of molecular descriptors, which are saved
- Preprocessing - clean the descriptors by removing or imputing missing values, then rescaling the remainder
- Training - send the processed input to the neural network, which is a simple FNN (sequential fully-connected layers with an activation function between them)
- Prediction - save the trained model for future use
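As a minimal sketch of the featurization step, the snippet below uses the same helpers as the full training example later in this README; the SMILES strings are placeholders, and it assumes `get_descriptors` accepts RDKit `Mol` objects (as produced by `clean_dataset` in that example):

```python
from rdkit import Chem

from fastprop.defaults import DESCRIPTOR_SET_LOOKUP
from fastprop.descriptors import get_descriptors

# placeholder molecules - swap in your own SMILES strings
smiles = ["c1ccccc1", "CCO"]
rdkit_mols = [Chem.MolFromSmiles(s) for s in smiles]

# calculate the full mordredcommunity descriptor set, caching the result
# in the current directory (".") so repeat runs can skip recalculation
descriptors = get_descriptors(".", DESCRIPTOR_SET_LOOKUP["all"], rdkit_mols)
```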
Configurable Parameters
1. Featurization
   - Input CSV file: comma separated values (CSV) file (with headers) containing SMILES strings representing the molecules and the targets
   - SMILES column name: name of the column containing the SMILES strings
   - Target column name(s): name(s) of the column(s) containing the targets

   and
   - Which `mordred` descriptors to calculate: 'all' or 'optimized' (a smaller set of descriptors; faster, but less accurate)
   - Enable/Disable caching of calculated descriptors: `fastprop` will by default cache calculated descriptors based on the input filename and warn the user when it loads descriptors from the file rather than calculating them on-the-fly

   or
   - Load precomputed descriptors: filepath to where descriptors are already cached, either manually or by `fastprop`
2. Preprocessing
   - not configurable: `fastprop` will always rescale input features, set invariant and missing features to zero, and impute missing values with the per-feature mean (see the sketch after this list)
3. Training
   - Number of Repeats: how many times to split/train/test on the dataset (increments the random seed by 1 each time)

   and
   - Number of FNN layers: how many repeated fully connected layers of the hidden size (default 2)
   - Hidden Size: number of neurons per FNN layer (default 1800)

   or
   - Hyperparameter optimization: runs hyperparameter optimization to identify the optimal number of layers and hidden size

   as well as the generic NN training parameters:
   - Output Directory
   - Learning rate
   - Batch size
   - Problem type (one of: regression, binary, multiclass (start labels from 0), multilabel)
4. Prediction
   - Input SMILES: either a single SMILES string or a file of SMILES strings on individual lines
   - Output format: filepath to write the results, or nothing (defaults to stdout)
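To make the preprocessing step concrete, here is a small sketch using `standard_scale` from `fastprop.data` (imported the same way in the training example below); the NaN-imputation lines are plain `torch` code written to match the behavior described above, not a copy of `fastprop`'s internals:

```python
import torch

from fastprop.data import standard_scale

# toy descriptor matrix with one missing value
descriptors = torch.tensor([[1.0, 2.0], [3.0, float("nan")], [5.0, 6.0]])

# impute missing values with the per-feature mean (fastprop does the
# equivalent of this for you automatically)
col_means = torch.nanmean(descriptors, dim=0)
nan_mask = torch.isnan(descriptors)
descriptors[nan_mask] = col_means.expand_as(descriptors)[nan_mask]

# rescale to zero mean and unit variance, keeping the statistics so the
# validation and test sets can be scaled with the training set's values
scaled, means, variances = standard_scale(descriptors)
```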
Using fastprop
`fastprop` can be run from the command line or as a Python module.
Regardless of the method of use, the parameters described in Configurable Parameters can be modified.
Command Line
After installation, `fastprop` is accessible from the command line via `fastprop subcommand`, where `subcommand` is either `train`, `predict`, or `shap`.
`train` takes in the parameters described in Configurable Parameters sections 1, 2, and 3 (featurization, preprocessing, and training) and trains `fastprop` model(s) on the input data.
`predict` uses the output of a call to `train` to make predictions on arbitrary SMILES strings.
`shap` performs SHAP analysis on a trained model to determine which of the input features are important.
Try `fastprop --help` or `fastprop subcommand --help` for more information and see below.
Configuration File [recommended]
See `examples/example_fastprop_train_config.yaml` for a configuration file that shows all of the options that can be configured during training.
It covers everything shown in the Configurable Parameters section.
Arguments
All of the options shown in the Configuration File section can also be passed as command line flags instead of being written to a file.
When passing the arguments, replace all `_` (underscore) with `-` (hyphen), e.g. `fastprop train --number-epochs 100`.
See `fastprop train --help` or `fastprop predict --help` for more information.
`fastprop shap` and `fastprop predict` have only a couple of arguments and so do not use configuration files.
Python Module
Example
Here's an example of training `fastprop` as a Python module on the Arockiaraj Polycyclic Aromatic Hydrocarbon dataset, pulled largely from `fastprop/cli/train.py`.
With `fastprop` installed, you can copy and run this script as-is:
```python
import torch

from fastprop.data import (
    clean_dataset,
    fastpropDataLoader,
    fastpropDataset,
    split,
    standard_scale,
)
from fastprop.defaults import DESCRIPTOR_SET_LOOKUP
from fastprop.descriptors import get_descriptors
from fastprop.io import read_input_csv
from fastprop.model import fastprop, train_and_test

# prepare the dataset
targets, smiles = read_input_csv("https://raw.githubusercontent.com/JacksonBurns/fastprop/main/benchmarks/pah/arockiaraj_pah_data.csv")
targets, rdkit_mols = clean_dataset(targets, smiles)
descriptors = get_descriptors(".", DESCRIPTOR_SET_LOOKUP["all"], rdkit_mols)
descriptors = descriptors.to_numpy(dtype=float)
descriptors = torch.tensor(descriptors, dtype=torch.float32)
targets = torch.tensor(targets, dtype=torch.float32)

# feature scaling - fit on the training split, apply to validation and test
train_indexes, val_indexes, test_indexes = split(smiles)
descriptors[train_indexes], feature_means, feature_vars = standard_scale(descriptors[train_indexes])
descriptors[val_indexes] = standard_scale(descriptors[val_indexes], feature_means, feature_vars)
descriptors[test_indexes] = standard_scale(descriptors[test_indexes], feature_means, feature_vars)

# initialize dataloaders and model, then train
train_dataloader = fastpropDataLoader(fastpropDataset(descriptors[train_indexes], targets[train_indexes]), shuffle=True)
val_dataloader = fastpropDataLoader(fastpropDataset(descriptors[val_indexes], targets[val_indexes]))
test_dataloader = fastpropDataLoader(fastpropDataset(descriptors[test_indexes], targets[test_indexes]))
model = fastprop(feature_means, feature_vars)
test_results, validation_results = train_and_test(".", model, train_dataloader, val_dataloader, test_dataloader)
```
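After training, `train_and_test` returns the test and validation metrics for inspection. As a purely hypothetical follow-up, the trained Lightning module could be reused for inference along these lines - this assumes calling `model(x)` on a descriptor tensor returns property predictions, which is not confirmed by this README; for the canonical workflow, use the `fastprop predict` subcommand or read the prediction code in `fastprop.cli`:

```python
# hypothetical inference sketch - assumes the model's forward pass maps
# descriptors to predictions; prefer `fastprop predict` in practice
model.eval()
with torch.no_grad():
    predictions = model(descriptors[test_indexes])
print(predictions)
```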
Package Structure
This section documents where the various modules and functions used in `fastprop` are located.
Check each file listed for more information, as each contains additional inline documentation useful for development as a Python module.
To use the core `fastprop` model and dataloaders in your own work, consider looking at `shap.py` or `train.py`, which show how to import and instantiate the relevant classes.
fastprop
- `defaults`: contains the function `init_logger` used to initialize loggers in different submodules, as well as the default configuration for training
- `model`: the model itself and a convenience function for training
- `metrics`: wraps a number of common loss and score functions
- `descriptors`: functions for calculating descriptors
- `data`: functions for cleaning and scaling data
- `io`: functions for loading data from files
fastprop.cli
`fastprop.cli` contains all of the CLI code, which is unlikely to be useful when `fastprop` is used from a script. If you wish to extend the CLI, check the inline documentation there.
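As a quick orientation to the layout above, these imports (all taken from the training example earlier in this README) show where the main entry points live:

```python
from fastprop.data import clean_dataset, fastpropDataLoader, fastpropDataset, split, standard_scale
from fastprop.defaults import DESCRIPTOR_SET_LOOKUP, init_logger
from fastprop.descriptors import get_descriptors
from fastprop.io import load_saved_descriptors, read_input_csv
from fastprop.model import fastprop, train_and_test
```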
Benchmarks
The `benchmarks` directory contains the scripts needed to perform the studies (see `benchmarks/README.md` for more detail; they are a great way to learn how to use `fastprop`).
To just see the results, check out `paper/paper.pdf` (or `paper/paper.md` for the plain-text version).
Relationship to Chemprop
In addition to having a similar name, `fastprop` and Chemprop do similar things: map chemical structures to their corresponding properties in a user-friendly way using machine learning.
I (@JacksonBurns) am also a developer of Chemprop, so some code is inevitably shared between the two (`fastprop` to Chemprop and vice versa).
`fastprop` feels a lot like Chemprop but without a lot of the clutter.
The "fast" in `fastprop` (both in usage and execution time) comes from the basic architecture, the use of caching, and the reduced configurability of `fastprop` (i.e. I hope you like MSE loss for regression tasks, because that's the only training metric `fastprop` will use via the CLI).
Developing fastprop
Bug reports, feature requests, and pull requests are welcome and encouraged! Follow this tutorial from GitHub to get started.
`fastprop` is built around PyTorch Lightning, which defines a rigid API for implementing models that is followed here.
See the section on the package layout for information on where all the other functions are, and check out the docstrings and inline comments in each file for more information on what each does.
Note that the `pyproject.toml` defines optional `dev` and `bmark` extras, which will set you up with the same dependencies used for CI and benchmarking.