Data-driven materials discovery based on composition.
DiSCoVeR
A materials discovery algorithm geared toward exploring high-performance candidates in new chemical spaces using only composition data.
Figure: Bulk modulus values overlaid on DensMAP densities (cropped).
Citing
The preprint is hosted on ChemRxiv:
Baird S, Diep T, Sparks T. DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions. ChemRxiv 2021. doi:10.33774/chemrxiv-2021-5l2f8-v2. This content is a preprint and has not been peer-reviewed.
The BibTeX citation is as follows:
```
@article{baird_diep_sparks_2021,
  place={Cambridge},
  title={DiSCoVeR: a Materials Discovery Screening Tool for High Performance, Unique Chemical Compositions},
  DOI={10.33774/chemrxiv-2021-5l2f8-v2},
  journal={ChemRxiv},
  publisher={Cambridge Open Engage},
  author={Baird, Sterling and Diep, Tran and Sparks, Taylor},
  year={2021}
}
```
DiSCoVeR Workflow
Figure 1. DiSCoVeR workflow to create chemically homogeneous clusters. (a) Training and validation data. (b) ElMD pairwise distances. (c) DensMAP embeddings and DensMAP densities. (d) Clustering via HDBSCAN*. (e) Pareto plot and discovery scores. (f) Pareto plot of cluster properties.
Installation
I recommend that you run `mat_discover` in a separate conda environment, at least for initial testing. After installing Anaconda or Miniconda, you can create a new environment via:
conda create --name mat_discover
There are three ways to install `mat_discover`: Anaconda (`conda`), PyPI (`pip`), and from source. Anaconda is the preferred method.
Anaconda
To install `mat_discover` using `conda`, first update `conda` via:
conda update conda
The Anaconda `mat_discover` package is hosted on the @sgbaird channel and can be installed via:
conda install -c sgbaird mat_discover
Pip
To install via `pip`, first update `pip` via:
pip install -U pip
Due to limitations of PyPI distributions of CUDA/PyTorch, you will need to install PyTorch separately via the command most relevant to your setup (see PyTorch Getting Started). For example, for Stable/Windows/Pip/Python/CUDA-11.3:
pip3 install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
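After installing PyTorch, you can verify that the CUDA-enabled build is active. This check uses only standard PyTorch calls; a CPU-only build will simply print `False`:

```python
import torch

print(torch.__version__)          # e.g. 1.10.0+cu113
print(torch.cuda.is_available())  # True if a CUDA build and compatible driver are present
```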
Finally, install `mat_discover`:
pip install mat_discover
From Source
To install from source, clone the `mat_discover` repository:
```
git clone --recurse-submodules https://github.com/sparks-baird/mat_discover.git
cd mat_discover
```
To perform the local installation, you can use `pip`, `conda`, or `flit`:
| pip | conda | flit |
| --- | --- | --- |
| `pip install -e .` | `conda env create --file environment.yml` | `conda install flit; flit install` |
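Regardless of which route you choose, a quick import check confirms the install succeeded. This is a minimal sketch; `importlib.metadata` requires Python 3.8+:

```python
from importlib.metadata import version

import mat_discover  # should import without error

print(version("mat_discover"))  # prints the installed version, e.g. 1.3.3
```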
Basic Usage
Fit/Predict
```python
from mat_discover.mat_discover_ import Discover

disc = Discover()
disc.fit(train_df)  # DataFrames should have at minimum "formula" and "target" columns
scores = disc.predict(val_df)
disc.plot()  # generate interactive plots
disc.save()  # save results to disk
print(disc.dens_score_df.head(10), disc.peak_score_df.head(10))  # top-10 discovery scores
```
See mat_discover_example.py, or run it in your browser via Google Colab or Binder; on those platforms, it may take a few minutes to install and load, respectively. During training and prediction, Google Colab will be faster than Binder since Google Colab has access to a GPU while Binder does not.
Load Data
If you're using your own dataset, you will need to supply a Pandas DataFrame that contains `formula` and `target` columns. If you have a `train.csv` file (located in the current working directory) with these two columns, it can be converted to a DataFrame via:
```python
import pandas as pd

df = pd.read_csv("train.csv")
```
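If your data lives in memory rather than in a CSV file, the same two-column DataFrame can be constructed directly. The formulas and target values below are placeholders for illustration only:

```python
import pandas as pd

# Hypothetical compositions and property values; substitute your own data
train_df = pd.DataFrame(
    {
        "formula": ["Al2O3", "SiO2", "MgAl2O4"],
        "target": [252.0, 98.0, 196.0],
    }
)
```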
Note that you can load any of the datasets within `CrabNet/data/`, which includes `matbench` data, other datasets from the CrabNet paper, and a recent (as of Oct 2021) snapshot of `K_VRH` bulk modulus data from Materials Project. For example, to load the bulk modulus snapshot:
```python
from crabnet.data.materials_data import elasticity

# note that `val.csv` within `elasticity` is every other Materials Project
# compound (i.e. "target" column filled with zeros)
train_df, val_df = disc.data(elasticity, "train.csv")
```
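Putting the pieces together, a minimal end-to-end run on the bulk modulus snapshot might look like the following sketch, which only combines the calls shown above:

```python
from crabnet.data.materials_data import elasticity
from mat_discover.mat_discover_ import Discover

disc = Discover()
train_df, val_df = disc.data(elasticity, "train.csv")  # load the built-in snapshot
disc.fit(train_df)                                     # train on known compounds
scores = disc.predict(val_df)                          # score candidate compositions
print(disc.dens_score_df.head(10))                     # top-10 density-based discovery scores
```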
The built-in data directories are as follows:
{'benchmark_data', 'benchmark_data.CritExam__Ed', 'benchmark_data.CritExam__Ef', 'benchmark_data.OQMD_Bandgap', 'benchmark_data.OQMD_Energy_per_atom', 'benchmark_data.OQMD_Formation_Enthalpy', 'benchmark_data.OQMD_Volume_per_atom', 'benchmark_data.aflow__Egap', 'benchmark_data.aflow__ael_bulk_modulus_vrh', 'benchmark_data.aflow__ael_debye_temperature', 'benchmark_data.aflow__ael_shear_modulus_vrh', 'benchmark_data.aflow__agl_thermal_conductivity_300K', 'benchmark_data.aflow__agl_thermal_expansion_300K', 'benchmark_data.aflow__energy_atom', 'benchmark_data.mp_bulk_modulus', 'benchmark_data.mp_e_hull', 'benchmark_data.mp_elastic_anisotropy', 'benchmark_data.mp_mu_b', 'benchmark_data.mp_shear_modulus', 'element_properties', 'matbench', 'materials_data', 'materials_data.elasticity', 'materials_data.example_materials_property'}
To see what `.csv` files are available (e.g. `train.csv`), you will probably need to navigate to CrabNet/data/ and explore.
Finally, to download data from Materials Project directly, see generate_elasticity_data.py.
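For a rough idea of what such a query involves, here is a sketch using the legacy pymatgen `MPRester` client; the actual script may differ, and `MY_API_KEY` is a placeholder for your Materials Project API key:

```python
import pandas as pd
from pymatgen.ext.matproj import MPRester  # legacy Materials Project client

# MY_API_KEY is a placeholder; obtain a key from your Materials Project dashboard
with MPRester("MY_API_KEY") as mpr:
    docs = mpr.query(
        criteria={"elasticity": {"$exists": True}},
        properties=["pretty_formula", "elasticity.K_VRH"],
    )

# Rename to the "formula"/"target" columns that mat_discover expects
df = pd.DataFrame(docs).rename(
    columns={"pretty_formula": "formula", "elasticity.K_VRH": "target"}
)
```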
Interactive Plots
Interactive versions of several types of Pareto front plots can be found here.
Developing
This project was developed primarily in "Python in Visual Studio Code" using `black`, `mypy`, `pydocstyle`, `kite`, other tools, and various community extensions. Some other notable tools used in this project are:
- Miniconda
- `pipreqs` was used as a starting point for `requirements.txt`
- `flit` is used to create `pyproject.toml` and publish to PyPI
- `conda env export --from-history -f environment.yml` was used as a starting point for `environment.yml`
- `grayskull` is used to generate `meta.yaml` for publishing to `conda-forge`
- `conda-smithy` is used to create a feedstock for `conda-forge`
- A variety of GitHub actions are used (see workflows)
- `pytest` is used for testing
- `numba` is used to accelerate the Wasserstein distance matrix computations via CPU or GPU (see the sketch after this list)
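As an illustration of this numba pattern (not `mat_discover`'s actual Wasserstein kernel), here is a minimal parallel pairwise distance matrix computed with `@njit`:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_l1(X):
    # Symmetric pairwise L1 distance matrix, parallelized across rows
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in prange(n):
        for j in range(i + 1, n):
            d = np.abs(X[i] - X[j]).sum()
            D[i, j] = d
            D[j, i] = d
    return D

X = np.random.rand(100, 10)  # 100 toy feature vectors
D = pairwise_l1(X)           # first call JIT-compiles; later calls are fast
```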
To help with development, you will need to install from source. Note that when using a `conda` environment (recommended), you may avoid certain issues down the road by opening VS Code via an Anaconda command prompt and entering the command `code` (at least until the VS Code devs fix some of the issues associated with opening it "normally"). For example, in Windows, press the "Windows" key, type "anaconda", open "Anaconda Powershell Prompt (miniconda3)" or similar, then type `code` and press Enter.
Bugs, Questions, and Suggestions
If you find a bug or have suggestions for documentation please open an issue. If you're reporting a bug, please include a simplified reproducer. If you have questions, have feature suggestions/requests, or are interested in extending/improving mat_discover
and would like to discuss, please use the Discussions tab and use the appropriate category ("Ideas", "Q&A", etc.). Pull requests are welcome and encouraged.