Data-driven materials discovery based on composition.

DiSCoVeR

A materials discovery algorithm geared toward exploring high-performance candidates in new chemical spaces using only chemical composition.

Bulk modulus values overlaid on DensMAP densities (cropped).

Citing

The preprint is hosted on ChemRxiv:

Baird S, Diep T, Sparks T. DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions. ChemRxiv 2021. doi:10.33774/chemrxiv-2021-5l2f8-v2. This content is a preprint and has not been peer-reviewed.

The BibTeX citation is as follows:

@article{baird_diep_sparks_2021,
  place={Cambridge},
  title={DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions},
  DOI={10.33774/chemrxiv-2021-5l2f8-v2},
  journal={ChemRxiv},
  publisher={Cambridge Open Engage},
  author={Baird, Sterling and Diep, Tran and Sparks, Taylor},
  year={2021}
}

DiSCoVeR Workflow

Figure 1. DiSCoVeR workflow to create chemically homogeneous clusters. (a) Training and validation data. (b) ElMD pairwise distances. (c) DensMAP embeddings and DensMAP densities. (d) Clustering via HDBSCAN*. (e) Pareto plot and discovery scores. (f) Pareto plot of cluster properties.

Installation

I recommend that you run mat_discover in a separate conda environment, at least for initial testing. After installing Anaconda or Miniconda, you can create a new environment via:

conda create --name mat_discover
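
Then activate the new environment:

conda activate mat_discover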

There are three ways to install mat_discover: Anaconda (conda), PyPI (pip), and from source. Anaconda is the preferred method.

Anaconda

To install mat_discover using conda, first, update conda via:

conda update conda

The Anaconda mat_discover package is hosted on the @sgbaird channel and can be installed via:

conda install -c sgbaird mat_discover

Pip

To install via pip, first update pip via:

pip install -U pip

Due to limitations of PyPI distributions of CUDA/PyTorch, you will need to install PyTorch separately via the command that's most relevant to you (PyTorch Getting Started). For example, for Stable/Windows/Pip/Python/CUDA-11.3:

pip3 install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
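
After installing PyTorch, you can optionally sanity-check the installation (torch.cuda.is_available() returns False on CPU-only builds, which still works for mat_discover, just more slowly):

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # True if a CUDA-enabled GPU was found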

Finally, install mat_discover:

pip install mat_discover

From Source

To install from source, clone the mat_discover repository:

git clone --recurse-submodules https://github.com/sparks-baird/mat_discover.git
cd mat_discover

To perform the local installation, you can use pip, conda, or flit:

  • pip: pip install -e .
  • conda: conda env create --file environment.yml
  • flit: conda install flit; flit install

Basic Usage

Fit/Predict

from mat_discover.mat_discover_ import Discover
disc = Discover()
disc.fit(train_df) # DataFrames should have at minimum "formula" and "target" columns
scores = disc.predict(val_df)
disc.plot()
disc.save()
print(disc.dens_score_df.head(10), disc.peak_score_df.head(10))
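
If you don't have data on hand, here is a minimal, self-contained sketch of the expected input schema (the formulas and target values below are made up for illustration, and a real run needs far more than a handful of compounds to produce meaningful clusters):

import pandas as pd
from mat_discover.mat_discover_ import Discover

# Toy data: any DataFrames with "formula" and "target" columns will do.
train_df = pd.DataFrame(
    {"formula": ["Al2O3", "SiO2", "MgO"], "target": [252.0, 98.0, 160.0]}
)
val_df = pd.DataFrame({"formula": ["TiO2", "ZnO"], "target": [0.0, 0.0]})

disc = Discover()
disc.fit(train_df)
scores = disc.predict(val_df)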

See mat_discover_example.py, the Colab notebook (PyPI version), or the Binder notebook. Installation (on Google Colab) and loading (on Binder) may take a few minutes. During training and prediction, Google Colab will be faster than Binder since Google Colab has access to a GPU while Binder does not.

Load Data

If you're using your own dataset, you will need to supply a Pandas DataFrame that contains formula and target columns. If you have a train.csv file (located in the current working directory) with these two columns, it can be converted to a DataFrame via:

import pandas as pd
df = pd.read_csv("train.csv")
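
If you only have a single file and also need a validation split, one option (this uses scikit-learn, which is not a mat_discover dependency and may need to be installed separately) is:

from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)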

Note that you can load any of the datasets within CrabNet/data/, which includes matbench data, other datasets from the CrabNet paper, and a recent (as of Oct 2021) snapshot of K_VRH bulk modulus data from Materials Project. For example, to load the bulk modulus snapshot:

from mat_discover.CrabNet.data.materials_data import elasticity
disc = Discover()  # a `Discover` instance, as in the Fit/Predict example above
train_df, val_df = disc.data(elasticity, "train.csv")  # note that `val.csv` within `elasticity` is every other Materials Project compound (i.e. its "target" column is filled with zeros)

The built-in data directories are as follows:

{'benchmark_data',
 'benchmark_data.CritExam__Ed',
 'benchmark_data.CritExam__Ef',
 'benchmark_data.OQMD_Bandgap',
 'benchmark_data.OQMD_Energy_per_atom',
 'benchmark_data.OQMD_Formation_Enthalpy',
 'benchmark_data.OQMD_Volume_per_atom',
 'benchmark_data.aflow__Egap',
 'benchmark_data.aflow__ael_bulk_modulus_vrh',
 'benchmark_data.aflow__ael_debye_temperature',
 'benchmark_data.aflow__ael_shear_modulus_vrh',
 'benchmark_data.aflow__agl_thermal_conductivity_300K',
 'benchmark_data.aflow__agl_thermal_expansion_300K',
 'benchmark_data.aflow__energy_atom',
 'benchmark_data.mp_bulk_modulus',
 'benchmark_data.mp_e_hull',
 'benchmark_data.mp_elastic_anisotropy',
 'benchmark_data.mp_mu_b',
 'benchmark_data.mp_shear_modulus',
 'element_properties',
 'matbench',
 'materials_data',
 'materials_data.elasticity',
 'materials_data.example_materials_property'}

To see what .csv files are available (e.g. train.csv), you will probably need to navigate to CrabNet/data/ and explore.
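
Alternatively, on Python 3.9+ you can list the bundled .csv files programmatically. This is a sketch assuming the data directories above are importable packages (as the elasticity import earlier suggests):

from importlib.resources import files
from mat_discover.CrabNet.data.materials_data import elasticity

# Collect the names of all .csv resources bundled with this data package.
csv_names = [p.name for p in files(elasticity).iterdir() if p.name.endswith(".csv")]
print(csv_names)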

Finally, to download data from Materials Project directly, see generate_elasticity_data.py.
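
For orientation, a bulk modulus query via the legacy pymatgen API looks roughly like the following (an illustrative sketch, not the contents of generate_elasticity_data.py; "YOUR_API_KEY" is a placeholder for your Materials Project API key):

from pymatgen.ext.matproj import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    # Fetch formulas and Voigt-Reuss-Hill bulk moduli for all entries
    # that have elasticity data.
    docs = mpr.query(
        criteria={"elasticity": {"$exists": True}},
        properties=["pretty_formula", "elasticity.K_VRH"],
    )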

Interactive Plots

Interactive versions of several types of Pareto front plots can be found here.

Developing

This project was developed primarily in Visual Studio Code (with the Python extension) using black, mypy, pydocstyle, Kite, and various community extensions. Some other notable tools used in this project are:

  • Miniconda
  • pipreqs was used as a starting point for requirements.txt
  • flit is used to create pyproject.toml and publish to PyPI
  • conda env export --from-history -f environment.yml was used as a starting point for environment.yml
  • grayskull is used to generate meta.yaml for publishing to conda-forge
  • conda-smithy is used to create a feedstock for conda-forge
  • A variety of GitHub actions are used (see workflows)
  • pytest is used for testing
  • numba is used to accelerate the Wasserstein distance matrix computations via CPU or GPU (see the sketch below)
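
To illustrate that last technique (a standalone sketch, not mat_discover's actual distance kernel): the 1D Wasserstein distance between two equal-mass histograms on a shared grid is the L1 norm of the running difference of their cumulative sums, and numba can JIT-compile and parallelize the pairwise matrix:

import numpy as np
from numba import njit, prange

@njit
def emd_1d(u, v):
    # 1D earth mover's distance between two histograms on a shared grid,
    # assuming equal total mass and unit grid spacing: accumulate the
    # running difference of the (unnormalized) CDFs.
    work, cum = 0.0, 0.0
    for k in range(u.shape[0]):
        cum += u[k] - v[k]
        work += abs(cum)
    return work

@njit(parallel=True)
def pairwise_emd(A, B):
    # Dense (n_A x n_B) distance matrix; rows are computed in parallel threads.
    D = np.empty((A.shape[0], B.shape[0]))
    for i in prange(A.shape[0]):
        for j in range(B.shape[0]):
            D[i, j] = emd_1d(A[i], B[j])
    return D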

To help with development, you will need to install from source. Note that when using a conda environment (recommended), you may avoid certain issues down the road by opening VS Code from an Anaconda command prompt via the code command, at least until the VS Code developers fix some of the issues associated with opening it "normally". For example, on Windows, press the Windows key, type "anaconda", and open "Anaconda Powershell Prompt (miniconda3)" or similar. Then type code and press Enter.

Bugs, Questions, and Suggestions

If you find a bug or have suggestions for the documentation, please open an issue. If you're reporting a bug, please include a simplified reproducer. If you have questions or feature suggestions/requests, or are interested in extending/improving mat_discover and would like to discuss, please use the Discussions tab under the appropriate category ("Ideas", "Q&A", etc.). Pull requests are welcome and encouraged.
