Data-driven materials discovery based on composition.

DiSCoVeR

A materials discovery algorithm geared toward exploring high-performance candidates in new chemical spaces using only chemical composition.

Bulk modulus values overlaid on DensMAP densities (cropped).

Citing

The preprint is hosted on ChemRxiv:

Baird S, Diep T, Sparks T. DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions. ChemRxiv 2021. doi:10.33774/chemrxiv-2021-5l2f8-v2. This content is a preprint and has not been peer-reviewed.

The BibTeX citation is as follows:

@article{baird_diep_sparks_2021,
  place={Cambridge},
  title={DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions},
  DOI={10.33774/chemrxiv-2021-5l2f8-v2},
  journal={ChemRxiv},
  publisher={Cambridge Open Engage},
  author={Baird, Sterling and Diep, Tran and Sparks, Taylor},
  year={2021}
}

DiSCoVeR Workflow

Figure 1. DiSCoVeR workflow to create chemically homogeneous clusters. (a) Training and validation data. (b) ElMD pairwise distances. (c) DensMAP embeddings and DensMAP densities. (d) Clustering via HDBSCAN*. (e) Pareto plot and discovery scores. (f) Pareto plot of cluster properties.

Installation

I recommend that you run mat_discover in a separate conda environment, at least for initial testing. After installing Anaconda or Miniconda, you can create a new environment via:

conda create --name mat_discover
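
Then activate the new environment:

conda activate mat_discover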

There are three ways to install mat_discover: Anaconda (conda), PyPI (pip), and from source. Anaconda is the preferred method.

Anaconda

To install mat_discover using conda, first, update conda via:

conda update conda

The Anaconda mat_discover package is hosted on the @sgbaird channel and can be installed via:

conda install -c sgbaird mat_discover

Pip

To install via pip, first update pip via:

pip install -U pip

Due to limitations of PyPI distributions of CUDA/PyTorch, you will need to install PyTorch separately via the command that's most relevant to you (PyTorch Getting Started). For example, for Stable/Windows/Pip/Python/CUDA-11.3:

pip3 install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
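
After installing PyTorch, you can optionally sanity-check the installation (torch.cuda.is_available() returns False on CPU-only builds, which still works for mat_discover, just more slowly):

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # True if a CUDA-enabled GPU was found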

Finally, install mat_discover:

pip install mat_discover

From Source

To install from source, clone the mat_discover repository:

git clone --recurse-submodules https://github.com/sparks-baird/mat_discover.git
cd mat_discover

To perform the local installation, you can use pip, conda, or flit:

  • pip: pip install -e .
  • conda: conda env create --file environment.yml
  • flit: conda install flit; flit install

Basic Usage

Fit/Predict

from mat_discover.mat_discover_ import Discover
disc = Discover()
disc.fit(train_df) # DataFrames should have at minimum "formula" and "target" columns
scores = disc.predict(val_df)
disc.plot()
disc.save()
print(disc.dens_score_df.head(10), disc.peak_score_df.head(10))
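
If you don't have data on hand, here is a minimal, self-contained sketch of the expected input schema (the formulas and target values below are made up for illustration, and a real run needs far more than a handful of compounds to produce meaningful clusters):

import pandas as pd
from mat_discover.mat_discover_ import Discover

# Toy data: any DataFrames with "formula" and "target" columns will do.
train_df = pd.DataFrame(
    {"formula": ["Al2O3", "SiO2", "MgO"], "target": [252.0, 98.0, 160.0]}
)
val_df = pd.DataFrame({"formula": ["TiO2", "ZnO"], "target": [0.0, 0.0]})

disc = Discover()
disc.fit(train_df)
scores = disc.predict(val_df)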

See mat_discover_example.py, the Colab notebook (PyPI version), or the Binder notebook. Installation (on Google Colab) and loading (on Binder) may take a few minutes. During training and prediction, Google Colab will be faster than Binder since Google Colab has access to a GPU while Binder does not.

Load Data

If you're using your own dataset, you will need to supply a Pandas DataFrame that contains formula and target columns. If you have a train.csv file (located in the current working directory) with these two columns, it can be converted to a DataFrame via:

import pandas as pd
df = pd.read_csv("train.csv")
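
If you only have a single file and also need a validation split, one option (this uses scikit-learn, which is not a mat_discover dependency and may need to be installed separately) is:

from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)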

Note that you can load any of the datasets within CrabNet/data/, which includes matbench data, other datasets from the CrabNet paper, and a recent (as of Oct 2021) snapshot of K_VRH bulk modulus data from Materials Project. For example, to load the bulk modulus snapshot:

from mat_discover.CrabNet.data.materials_data import elasticity
disc = Discover()  # a `Discover` instance, as in the Fit/Predict example above
train_df, val_df = disc.data(elasticity, "train.csv")  # note that `val.csv` within `elasticity` is every other Materials Project compound (i.e. its "target" column is filled with zeros)

The built-in data directories are as follows:

{'benchmark_data',
 'benchmark_data.CritExam__Ed',
 'benchmark_data.CritExam__Ef',
 'benchmark_data.OQMD_Bandgap',
 'benchmark_data.OQMD_Energy_per_atom',
 'benchmark_data.OQMD_Formation_Enthalpy',
 'benchmark_data.OQMD_Volume_per_atom',
 'benchmark_data.aflow__Egap',
 'benchmark_data.aflow__ael_bulk_modulus_vrh',
 'benchmark_data.aflow__ael_debye_temperature',
 'benchmark_data.aflow__ael_shear_modulus_vrh',
 'benchmark_data.aflow__agl_thermal_conductivity_300K',
 'benchmark_data.aflow__agl_thermal_expansion_300K',
 'benchmark_data.aflow__energy_atom',
 'benchmark_data.mp_bulk_modulus',
 'benchmark_data.mp_e_hull',
 'benchmark_data.mp_elastic_anisotropy',
 'benchmark_data.mp_mu_b',
 'benchmark_data.mp_shear_modulus',
 'element_properties',
 'matbench',
 'materials_data',
 'materials_data.elasticity',
 'materials_data.example_materials_property'}

To see what .csv files are available (e.g. train.csv), you will probably need to navigate to CrabNet/data/ and explore.
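
Alternatively, on Python 3.9+ you can list the bundled .csv files programmatically. This is a sketch assuming the data directories above are importable packages (as the elasticity import earlier suggests):

from importlib.resources import files
from mat_discover.CrabNet.data.materials_data import elasticity

# Collect the names of all .csv resources bundled with this data package.
csv_names = [p.name for p in files(elasticity).iterdir() if p.name.endswith(".csv")]
print(csv_names)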

Finally, to download data from Materials Project directly, see generate_elasticity_data.py.
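
For orientation, a bulk modulus query via the legacy pymatgen API looks roughly like the following (an illustrative sketch, not the contents of generate_elasticity_data.py; "YOUR_API_KEY" is a placeholder for your Materials Project API key):

from pymatgen.ext.matproj import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    # Fetch formulas and Voigt-Reuss-Hill bulk moduli for all entries
    # that have elasticity data.
    docs = mpr.query(
        criteria={"elasticity": {"$exists": True}},
        properties=["pretty_formula", "elasticity.K_VRH"],
    )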

Interactive Plots

Interactive versions of several types of Pareto front plots can be found here.

Developing

This project was developed primarily in Visual Studio Code (with the Python extension) using black, mypy, pydocstyle, Kite, and various community extensions. Some other notable tools used in this project are:

  • Miniconda
  • pipreqs was used as a starting point for requirements.txt
  • flit is used to create pyproject.toml and publish to PyPI
  • conda env export --from-history -f environment.yml was used as a starting point for environment.yml
  • grayskull is used to generate meta.yaml for publishing to conda-forge
  • conda-smithy is used to create a feedstock for conda-forge
  • A variety of GitHub actions are used (see workflows)
  • pytest is used for testing
  • numba is used to accelerate the Wasserstein distance matrix computations via CPU or GPU (see the sketch below)
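
To illustrate that last technique (a standalone sketch, not mat_discover's actual distance kernel): the 1D Wasserstein distance between two equal-mass histograms on a shared grid is the L1 norm of the running difference of their cumulative sums, and numba can JIT-compile and parallelize the pairwise matrix:

import numpy as np
from numba import njit, prange

@njit
def emd_1d(u, v):
    # 1D earth mover's distance between two histograms on a shared grid,
    # assuming equal total mass and unit grid spacing: accumulate the
    # running difference of the (unnormalized) CDFs.
    work, cum = 0.0, 0.0
    for k in range(u.shape[0]):
        cum += u[k] - v[k]
        work += abs(cum)
    return work

@njit(parallel=True)
def pairwise_emd(A, B):
    # Dense (n_A x n_B) distance matrix; rows are computed in parallel threads.
    D = np.empty((A.shape[0], B.shape[0]))
    for i in prange(A.shape[0]):
        for j in range(B.shape[0]):
            D[i, j] = emd_1d(A[i], B[j])
    return D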

To help with development, you will need to install from source. Note that when using a conda environment (recommended), you may avoid certain issues down the road by opening VS Code from an Anaconda command prompt via the code command, at least until the VS Code developers fix some of the issues associated with opening it "normally". For example, on Windows, press the Windows key, type "anaconda", and open "Anaconda Powershell Prompt (miniconda3)" or similar. Then type code and press Enter.

Bugs, Questions, and Suggestions

If you find a bug or have suggestions for the documentation, please open an issue. If you're reporting a bug, please include a simplified reproducer. If you have questions or feature suggestions/requests, or are interested in extending/improving mat_discover and would like to discuss, please use the Discussions tab under the appropriate category ("Ideas", "Q&A", etc.). Pull requests are welcome and encouraged.
