
Data-driven materials discovery based on composition.


DiSCoVeR

A materials discovery algorithm geared towards exploring high-performance candidates in new chemical spaces using composition only.

Bulk modulus values overlaid on DensMAP densities (cropped).

We describe the DiSCoVeR algorithm, how to install mat_discover, and basic usage (e.g. fit/predict, custom or built-in datasets, adaptive design). Interactive plots for several types of Pareto front plots are available via the mat_discover documentation. We also describe how to contribute, what to do if you run into bugs or have questions, and citation information. The mat_discover docs have more, such as examples (including a teaching example), the interactive figures mentioned, and the Python API.

The article (ChemRxiv) has been accepted at Digital Discovery (2021-02-03). See Citing.

DiSCoVeR Workflow

Why you'd want to use this tool, whether it's "any good", alternative tools, and summaries of the workflow.

Why DiSCoVeR?

The primary anticipated use-case of DiSCoVeR is that you have some training data (chemical formulas and a target property), and you would like to determine the "next best experiment" to perform based on a user-defined relative importance of performance vs. chemical novelty. You can even run the model without any training targets, which is equivalent to setting the target weight to 0.
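
As a rough illustration of how a user-defined weighting can trade performance against novelty, the sketch below ranks candidates by a weighted sum of scaled values. The helper names (min_max_scale, weighted_scores, ranking), the choice of min-max scaling, and the toy numbers are assumptions made for illustration; this is not the mat_discover implementation.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def weighted_scores(performance, novelty, target_weight=1.0, novelty_weight=1.0):
    """Weighted sum of scaled performance and scaled novelty (higher is better)."""
    perf = min_max_scale(performance)
    nov = min_max_scale(novelty)
    return [target_weight * p + novelty_weight * n for p, n in zip(perf, nov)]


def ranking(scores):
    """Indices of candidates sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): predicted property and a novelty proxy for 4 candidates.
predicted = [120.0, 250.0, 180.0, 90.0]
novelty = [0.2, 0.1, 0.8, 0.7]

balanced = ranking(weighted_scores(predicted, novelty, 1.0, 1.0))
# A target weight of 0 ranks by novelty alone, as when no training targets exist.
novelty_only = ranking(weighted_scores(predicted, novelty, 0.0, 1.0))
print(balanced, novelty_only)
```

With a target weight of 0 the predicted property no longer influences the order, so the ranking is driven by the novelty proxy alone.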

Is it any good?

Take an initial training set of 100 chemical formulas and associated Materials Project bulk moduli, followed by 900 adaptive design iterations (x-axis), and compare four strategies: random search, novelty-only (performance weighted at 0), a 50/50 weighting split, and performance-only (novelty weighted at 0). These are the columns of the figure. The rows are the total number of observed "extraordinary" compounds (top 2%), the total number of additional unique atoms, and the total number of additional unique chemical formula templates. In other words:

  1. How many "extraordinary" compounds have been observed so far?
  2. How many unique atoms have been explored so far? (not counting atoms already in the starting 100 formulas)
  3. How many unique chemical templates (e.g. A2B3, ABC, ABC2) have been explored so far? (not counting templates already in the starting 100 formulas)
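
The three counts above can be sketched in a few lines. This is a simplified illustration, not the benchmarking code: the helper names are hypothetical, the parser only handles simple formulas without parentheses, and the template convention (letters assigned in order of increasing amount, no reduction to smallest integers) is an assumption.

```python
import re
from collections import Counter

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")


def parse_formula(formula):
    """Return {element: amount} for a simple formula such as 'Al2O3'."""
    amounts = Counter()
    for element, amount in _TOKEN.findall(formula):
        amounts[element] += float(amount) if amount else 1.0
    return dict(amounts)


def template(formula):
    """Map a formula to a template such as 'A2B3' (assumed convention)."""
    amounts = sorted(parse_formula(formula).values())
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return "".join(
        letter if amount == 1 else f"{letter}{amount:g}"
        for letter, amount in zip(letters, amounts)
    )


def progress(initial, explored, targets, threshold):
    """Counts of extraordinary compounds, new atoms, and new templates."""
    n_extraordinary = sum(t >= threshold for t in targets)
    seen_atoms = set().union(*(parse_formula(f) for f in initial))
    seen_templates = {template(f) for f in initial}
    new_atoms = set().union(*(parse_formula(f) for f in explored)) - seen_atoms
    new_templates = {template(f) for f in explored} - seen_templates
    return n_extraordinary, len(new_atoms), len(new_templates)


# Toy example (hypothetical values): two starting formulas, three explored ones.
counts = progress(
    initial=["Al2O3", "NaCl"],
    explored=["Fe2O3", "SrTiO3", "NaCl"],
    targets=[150.0, 310.0, 25.0],
    threshold=300.0,
)
print(counts)
```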

The 50/50 weighting split offers a good trade-off between performance and novelty. Click the image to navigate to the interactive figure which includes two additional rows: best so far and current observed.

We also ran some benchmarking against sklearn.neighbors.LocalOutlierFactor (novelty detection algorithm) using mat2vec and mod_petti featurizations. The interactive results are available here.

Alternatives

This approach is similar to what you will find with Bayesian optimization (BO), but with explicit emphasis on chemical novelty. If you're interested in doing Bayesian optimization, I recommend using Facebook/Ax (not affiliated). I am working on an implementation of composition-based Bayesian optimization using Ax (2021-12-10).

For alternative "suggest next experiment" materials discovery tools, see the Citrine Platform (free for non-commercial use), CAMD (trihackathon2020 tutorial notebooks), PyChemia, Heteroscedastic-BO, and thermo.

For materials informatics (MI) and other relevant codebases/links, see:

Visualization

The DiSCoVeR workflow is visualized as follows:

Figure 1: DiSCoVeR workflow to create chemically homogeneous clusters. (a) Training and validation data are obtained in the form of chemical formulas and target properties (i.e. performance). (b) The training and validation chemical formulas are combined and used to compute ElMD pairwise distances. (c) ElMD pairwise distance matrices are used to compute DensMAP embeddings and DensMAP densities. (d) DensMAP embeddings are used to compute HDBSCAN* clusters. (e) Validation target property predictions are made via CrabNet and plotted against the uniqueness proxy (e.g. density proxy) in the form of a Pareto front plot. Discovery scores are assigned based on the (arbitrarily) weighted sum of scaled performance and uniqueness proxy. Higher scores are better. (f) HDBSCAN* clustering results can be used to obtain a cluster-wise performance (e.g. average target property) plotted against a cluster-wise uniqueness proxy (e.g. fraction of validation compounds vs. total compounds within a cluster).
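
Step (e) plots predicted performance against a uniqueness proxy as a Pareto front. To make that idea concrete, here is a minimal sketch that finds the non-dominated points when both quantities are to be maximized. The function name pareto_front and the toy points are hypothetical; the plots in mat_discover are produced by the package itself, not by this code.

```python
def pareto_front(points):
    """Indices of non-dominated (performance, uniqueness) pairs, both maximized."""
    front = []
    for i, (p_i, u_i) in enumerate(points):
        # A point is dominated if another point is at least as good in both
        # coordinates and strictly better in at least one of them.
        dominated = any(
            p_j >= p_i and u_j >= u_i and (p_j > p_i or u_j > u_i)
            for j, (p_j, u_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Toy (performance, uniqueness) pairs; values are illustrative only.
candidates = [(1.0, 0.1), (0.8, 0.5), (0.5, 0.4), (0.3, 0.9), (0.2, 0.2)]
front = pareto_front(candidates)
print(front)
```

This quadratic-time check is fine for small candidate lists; a sort-based sweep would scale better for large ones.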

Tabular Summary

A summary of the DiSCoVeR methods is given in the following table:

Table 1: A description of methods used in this work and each method’s role in DiSCoVeR. ∗A Pareto front is more information-dense than a proxy score in that there are no predefined relative weights for performance vs. uniqueness proxy. Compounds that are closer to the Pareto front are better. The upper areas of the plot represent a higher weight towards performance while the right-most areas of the plot represent a higher weight towards uniqueness.

Method              | What is it?                                   | What is its role in DiSCoVeR?
CrabNet             | Composition-based property regression         | Predict performance for proxy scores
ElMD                | Composition-based distance metric             | Supply distance matrix to DensMAP
DensMAP             | Density-aware dimensionality reduction        | Obtain densities for density proxy
HDBSCAN*            | Density-aware clustering                      | Create chemically homogeneous clusters
Peak proxy          | High performance relative to nearby compounds | Proxy for "surprising" high performance
Density proxy       | Sparsity relative to nearby compounds         | Proxy for chemical novelty
Peak proxy score    | Weighted sum of performance and peak proxy    | Used to rank compounds
Density proxy score | Weighted sum of performance and density proxy | Used to rank compounds
Pareto front        | Optimal performance/uniqueness trade-offs     | Visually screen compounds (no weights*)
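
The two proxies in the table are both defined relative to nearby compounds. The sketch below illustrates that idea with a plain k-nearest-neighbor calculation on a precomputed distance matrix. It is a simplified stand-in under stated assumptions: DiSCoVeR obtains densities from DensMAP rather than from raw neighbor distances, and the function names, the value of k, and the toy data are hypothetical.

```python
def k_nearest(dist_row, self_index, k):
    """Indices of the k nearest other points for one row of a distance matrix."""
    others = [j for j in range(len(dist_row)) if j != self_index]
    return sorted(others, key=lambda j: dist_row[j])[:k]


def sparsity_proxy(dist, k=2):
    """Mean distance to the k nearest neighbors; larger means more isolated."""
    return [
        sum(row[j] for j in k_nearest(row, i, k)) / k
        for i, row in enumerate(dist)
    ]


def peak_proxy(dist, targets, k=2):
    """Target value minus the mean target of the k nearest neighbors."""
    return [
        targets[i] - sum(targets[j] for j in k_nearest(row, i, k)) / k
        for i, row in enumerate(dist)
    ]


# Toy data: four "compounds" placed on a line, with made-up target values.
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
targets = [10.0, 50.0, 12.0, 11.0]

sparsity = sparsity_proxy(dist)      # the last point is far from the others
peaks = peak_proxy(dist, targets)    # the second point stands out from neighbors
print(sparsity, peaks)
```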

Installation

I recommend that you run mat_discover in a separate conda environment, at least for initial testing. After installing Anaconda or Miniconda, you can create a new environment in Python 3.9 (mat_discover is also tested on 3.7 and 3.8) via:

conda create --name mat_discover python==3.9.*

There are three ways to install mat_discover: Anaconda (conda), PyPI (pip), and from source. Anaconda is the preferred method.

Anaconda

To install mat_discover using conda, first, update conda via:

conda update conda

The Anaconda mat_discover package is hosted on the @sgbaird channel and can be installed via:

conda install -c sgbaird mat_discover

Pip

To install via pip, first update pip via:

pip install -U pip

Due to limitations of PyPI distributions of CUDA/PyTorch, you will need to install PyTorch separately via the command that's most relevant to you (PyTorch Getting Started). For example, for Stable/Windows/Pip/Python/CUDA-11.3:

pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Finally, install mat_discover:

pip install mat_discover

From Source

To install from source, clone the mat_discover repository:

git clone https://github.com/sparks-baird/mat_discover.git
cd mat_discover

To perform the local installation, you can use pip, conda, or flit. If using flit, make sure to install it first via conda install flit or pip install flit.

  • pip: pip install -e .
  • conda: conda env create --file environment.yml
  • flit: flit install --pth-file
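
Whichever route you take, one quick way to confirm that the packages are visible to the active interpreter is to ask the import system for them. This is only a sanity check, not an official install test; is_importable is a hypothetical helper built on the standard-library importlib.util.find_spec.

```python
import importlib.util


def is_importable(name):
    """True if a top-level module or package with this name can be located."""
    return importlib.util.find_spec(name) is not None


# Report whether the two packages discussed above are visible.
for package in ("mat_discover", "torch"):
    status = "found" if is_importable(package) else "missing"
    print(f"{package}: {status}")
```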

Basic Usage

How to fit/predict, use custom or built-in datasets, and perform adaptive design.

Fit/Predict

from mat_discover.mat_discover_ import Discover
disc = Discover()
disc.fit(train_df) # DataFrames should have at minimum "formula" and "target" columns
scores = disc.predict(val_df)
disc.plot()
disc.save()
print(disc.dens_score_df.head(10), disc.peak_score_df.head(10))

⚠️ ignore the "validation" mean absolute error (MAE) command line output during disc.fit(train_df) ⚠️

See mat_discover_example.py, Open In Colab (PyPI), or Binder. On Google Colab and Binder, this may take a few minutes to install and load, respectively. During training and prediction, Google Colab will be faster than Binder since Google Colab has access to a GPU while Binder does not. Sometimes Binder takes a long time to load, so please consider using Open In Colab or the normal installation instructions instead.

Load Data

If you're using your own dataset, you will need to supply a Pandas DataFrame that contains formula (string) and target (numeric) columns. If you have a train.csv file (located in current working directory) with these two columns, this can be converted to a DataFrame via:

import pandas as pd
train_df = pd.read_csv("train.csv")

For validation data without known property values to be used with predict, dummy values (all zeros) are assigned internally. In this case, you can read in a CSV file that contains only the formula (string) column:

val_df = pd.read_csv("val.csv")
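
Before passing a CSV to fit or predict, it can help to confirm that the expected columns are present. A minimal sketch using only the standard library follows; missing_columns is a hypothetical helper (not part of mat_discover or pandas), and the in-memory CSV text stands in for real train.csv and val.csv files.

```python
import csv
import io


def missing_columns(stream, required=("formula", "target")):
    """Return the required column names that are absent from a CSV header."""
    header = next(csv.reader(stream), [])
    return [name for name in required if name not in header]


# In practice, pass an open file: with open("train.csv", newline="") as f: ...
train_csv = io.StringIO("formula,target\nAl2O3,250.0\nNaCl,24.0\n")
val_csv = io.StringIO("formula\nMgO\n")

train_missing = missing_columns(train_csv)
# Validation data for predict only needs the formula column.
val_missing = missing_columns(val_csv, required=("formula",))
print(train_missing, val_missing)
```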

Note that you can load any of the datasets within CrabNet/data/, which includes matbench data, other datasets from the CrabNet paper, and a recent (as of Oct 2021) snapshot of K_VRH bulk modulus data from Materials Project. For example, to load the bulk modulus snapshot:

from crabnet.data.materials_data import elasticity
train_df, val_df = disc.data(elasticity, "train.csv") # note that `val.csv` within `elasticity` is every other Materials Project compound (i.e. "target" column filled with zeros)

The built-in data directories are as follows:

{'benchmark_data',
 'benchmark_data.CritExam__Ed',
 'benchmark_data.CritExam__Ef',
 'benchmark_data.OQMD_Bandgap',
 'benchmark_data.OQMD_Energy_per_atom',
 'benchmark_data.OQMD_Formation_Enthalpy',
 'benchmark_data.OQMD_Volume_per_atom',
 'benchmark_data.aflow__Egap',
 'benchmark_data.aflow__ael_bulk_modulus_vrh',
 'benchmark_data.aflow__ael_debye_temperature',
 'benchmark_data.aflow__ael_shear_modulus_vrh',
 'benchmark_data.aflow__agl_thermal_conductivity_300K',
 'benchmark_data.aflow__agl_thermal_expansion_300K',
 'benchmark_data.aflow__energy_atom',
 'benchmark_data.mp_bulk_modulus',
 'benchmark_data.mp_e_hull',
 'benchmark_data.mp_elastic_anisotropy',
 'benchmark_data.mp_mu_b',
 'benchmark_data.mp_shear_modulus',
 'element_properties',
 'matbench',
 'materials_data',
 'materials_data.elasticity',
 'materials_data.example_materials_property'}

To see what .csv files are available (e.g. train.csv), you will probably need to navigate to CrabNet/data/ and explore. For example, to use a snapshot of the Materials Project e_above_hull dataset (mp_e_hull):

from crabnet.data.benchmark_data import mp_e_hull
train_df = disc.data(mp_e_hull, "train.csv", split=False)
val_df = disc.data(mp_e_hull, "val.csv", split=False)
test_df = disc.data(mp_e_hull, "test.csv", split=False)

Finally, to download data from Materials Project directly, see generate_elasticity_data.py.

Adaptive Design

The anticipated end-use of mat_discover is in an adaptive design scheme where the objective function (e.g. wetlab synthesis and characterization) is expensive. After loading some data for a validation scenario (or your own data)

from crabnet.data.materials_data import elasticity
from mat_discover.utils.data import data
from mat_discover.utils.extraordinary import extraordinary_split
from mat_discover.adaptive_design import Adapt
train_df, val_df = data(elasticity, "train.csv", dummy=False, random_state=42)
train_df, val_df, extraordinary_thresh = extraordinary_split(
    train_df, val_df, train_size=100, extraordinary_percentile=0.98, random_state=42
)

you can then predict your first additional experiment to run via:

adapt = Adapt(train_df, val_df, timed=False)
first_experiment = adapt.suggest_first_experiment() # fit Discover() to train_df, then move top-ranked from val_df to train_df

Subsequent experiments are suggested as follows:

second_experiment = adapt.suggest_next_experiment() # refit CrabNet, use existing DensMAP data, move top-ranked from val to train
third_experiment = adapt.suggest_next_experiment()

Alternatively, you can do this in a closed loop via:

n_iter = 100
adapt.closed_loop_adaptive_design(n_experiments=n_iter, print_experiment=False)

However, as the name suggests, the closed loop approach does not allow you to input data after each suggested experiment.
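
When results have to be entered by hand between suggestions, the overall loop has the shape sketched below. This is a generic illustration, not mat_discover code: adaptive_loop, score, and measure are hypothetical stand-ins, where score represents whatever model ranks the remaining candidates and measure represents the expensive experiment.

```python
def adaptive_loop(train, pool, score, measure, n_experiments):
    """Repeatedly pick the best-scoring candidate, measure it, add it to train."""
    train = dict(train)  # formula -> measured target (copied, not mutated)
    pool = list(pool)
    history = []
    for _ in range(min(n_experiments, len(pool))):
        # Re-score the remaining pool using everything measured so far.
        best = max(pool, key=lambda candidate: score(candidate, train))
        pool.remove(best)
        train[best] = measure(best)  # the expensive step in a real campaign
        history.append(best)
    return train, pool, history


# Toy setup with made-up labels and values.
start = {"A": 1.0}
candidates = ["B", "C", "D"]
predicted = {"B": 4.0, "C": 4.5, "D": 2.0}  # stand-in for model predictions
hidden = {"B": 5.0, "C": 3.0, "D": 4.0}     # stand-in for experimental results

final_train, remaining, history = adaptive_loop(
    start,
    candidates,
    score=lambda candidate, train: predicted[candidate],
    measure=lambda candidate: hidden[candidate],
    n_experiments=2,
)
print(history, remaining)
```

Here the toy score ignores the training data; in a real campaign the score would come from a model that is refit as each new measurement arrives.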

Developing and Contributing

This project was developed primarily in Python in Visual Studio Code using black, mypy, pydocstyle, kite, other tools, and various community extensions. Some other notable tools used in this project are:

  • Miniconda
  • pipreqs was used as a starting point for requirements.txt
  • flit is used to create pyproject.toml to publish to PyPI
  • conda env export --from-history -f environment.yml was used as a starting point for environment.yml
  • grayskull and conda-souschef are used to generate and tweak meta.yaml, respectively, for publishing to Anaconda (if you know how to get this up on conda-forge, help is welcome 😉)
  • A variety of GitHub actions are used (see workflows)
  • pytest is used for testing
  • numba is used to accelerate the Wasserstein distance matrix computations via CPU or GPU

For simple changes, navigate to github.com/sparks-baird/mat_discover, click on the relevant file (e.g. README.md), and look for the pencil (✏️). GitHub will walk you through the rest.

To help with in-depth development, you will need to install from source. Note that when using a conda environment (recommended), you may avoid certain issues down the road by opening VS Code via an Anaconda command prompt and entering the command code (at least until the VS Code devs fix some of the issues associated with opening it "normally"). For example, in Windows, press the "Windows" key, type "anaconda", and open "Anaconda Powershell Prompt (miniconda3)" or similar. Then type code and press enter.

To build the docs, first install sphinx and sphinx_rtd_theme. Then run:

cd docs/
make html

Then open docs/build/index.html (e.g. via start index.html on Windows).

Bugs, Questions, and Suggestions

If you find a bug or have suggestions for documentation, please open an issue. If you're reporting a bug, please include a simplified reproducer. If you have questions, have feature suggestions/requests, or are interested in extending/improving mat_discover and would like to discuss, please use the Discussions tab and use the appropriate category ("Ideas", "Q&A", etc.). If you have a question, please ask! I won't bite. Pull requests are welcome and encouraged.

Citing

The preprint is hosted on ChemRxiv:

Baird S, Diep T, Sparks T. DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions. ChemRxiv 2021. doi:10.33774/chemrxiv-2021-5l2f8-v3. This content is a preprint and has not been peer-reviewed.

The BibTeX citation is as follows:

@article{baird_diep_sparks_2021,
place={Cambridge},
title={DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions},
DOI={10.33774/chemrxiv-2021-5l2f8-v3},
journal={ChemRxiv},
publisher={Cambridge Open Engage},
author={Baird, Sterling and Diep, Tran and Sparks, Taylor},
year={2021}
}

The article has been accepted at Digital Discovery.

Looking for more?

See examples, including a teaching example, and the Python API.
