Thermodynamic stability of chemical compounds using DNNs.
Project description
Machine Learning for Material properties
General Information
The Thermo_stability project trains Machine Learning (ML) models to predict the thermodynamic stability of inorganic crystalline chemical compounds, inspired by the ElemNet paper. The study frames the task as binary classification across several ML methods. For more detailed information about the concept and results of this study, refer to the ThermoStability file.
Dataset Source:
The dataset comes from the Materials Project (MPR) database via the Next-Gen Materials Project API. A free api_key from MPR was used to download it.
Binary Classification with ML Models:
- Logistic Regression
- Random Forest
- Deep Neural Networks (DNN)
ML features:
The ML features used in this study are classified into two categories:
- Data extraction, by directly downloading features from the MPR database: nelements, density, energy_per_atom, formation_energy_per_atom, band_gap, cbm, vbm, vpa, magmom_pa. The last two features (vpa, magmom_pa) are the volume and total_magnetization descriptors normalized by nsites.
- Feature engineering, including atomic fractions and bond-structure statistics.
Both sample weights and model weights are applied to the DNN and LogisticRegression models, whereas only model weights are applied to the RandomForest model.
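The two derived descriptors are per-site normalizations. A minimal sketch (descriptor names follow the Materials Project summary fields; treat the exact naming as an assumption):

```python
def derived_descriptors(volume, total_magnetization, nsites):
    """Normalize cell-level quantities by the number of sites.

    vpa       -- volume per atom
    magmom_pa -- magnetic moment per atom
    """
    vpa = volume / nsites
    magmom_pa = total_magnetization / nsites
    return vpa, magmom_pa
```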
The dataset is split into train, validation, and test sets (80/10/10 percent, respectively). The splits are saved in NumPy array and pandas DataFrame formats.
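An 80/10/10 split can be produced with two calls to `train_test_split`; a hedged sketch (the project's actual split logic lives in test/data.py):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_80_10_10(X, y, seed=42):
    # First carve off 20% for validation + test, then split that part in half.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```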
ML Frameworks:
- Scikit-Learn for Logistic Regression & Random Forest
- TensorFlow for DNN
Technical Details
- Python Version: Python 3.10
- Environment Manager: Conda. Create and activate the conda environment:
conda create -n env_ml
conda activate env_ml
Setting up the Environment:
Required packages to set up the environment:
# Install required packages:
pip install tensorflow==2.12.0
pip install scikit-learn
pip install numpy==1.24.3
pip install pandas
pip install matplotlib
pip install scipy
pip install mp-api
pip install xgboost
pip install seaborn
macOS needs slightly different packages, especially for TensorFlow. Use the env_ml_macos.yml file to set up the conda environment.
HTCondor setup
The bond_structure function in common.py takes a long time (about 36 hours) to run on a single machine. The whole package is in the condor folder. First, prepare a conda environment with the packages needed to run bond_structure:
conda create -n htcon_env -c conda-forge -c defaults python=3.10 numpy=1.26.4 mp-api=0.45.5 pandas=2.2.3 pip
conda activate htcon_env
The above packages are added to env_ml.yml. To set up the environment, run:
conda env create -f env_ml.yml
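A hypothetical HTCondor submit file for fanning bond_structure out across the cluster (directive values are illustrative; the actual submit file ships in the condor folder):

```text
# bond_structure.sub -- illustrative submit description
executable      = scripts/bond_structure.sh
arguments       = $(Process)
output          = logs/bond_$(Process).out
error           = logs/bond_$(Process).err
log             = logs/bond.log
request_cpus    = 1
queue 100
```

Submit with `condor_submit bond_structure.sub`; each of the 100 jobs receives its chunk index via `$(Process)`.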
Code structure:
Thermo_stability project directory tree:
├── __init__.py
├── bond_stats
├── directory_tree.txt
├── files
│ ├── bond_stats
│ ├── logs
│ │ ├── dnn_accuracy
│ │ ├── dnn_auc
│ │ ├── dnn_hypertune_mp_accuracy.txt
│ │ ├── dnn_hypertune_auc.txt
│ │ ├── feature.txt
│ │ ├── plotting.txt
│ │ └── processin.txt
│ ├── MLHypertune_pars
│ │ ├── DNN_hypertune.txt
│ │ └── npzfiles
│ ├── models
│ │ └── MLHypertune_pars
│ └── structures_dict.json
├── scripts
│ ├── bond_structure.sh
│ ├── mp_jobs.sh
│ ├── multiprocessing_hypertune.py
├── test
│ ├── classification_hyperpars.py
│ ├── classification.py
│ ├── data.py
│ ├── dnn_schematic.py
│ ├── dnn_shap.py
│ ├── pdp_dnnresults.py
│ ├── plotting.py
│ ├── precision_recall.py
│ ├── RFLR_perfomance.py
│ ├── scorehyper_barchart.py
│ ├── scorehyper_heatmap.py
│ └── uncertainty.py
└── thermo_stability
├── __init__.py
├── config.py
├── creds.py
├── feature.py
├── processing.py
└── utils.py
Git clone the project:
git clone https://github.com/snabili/thermo_stability.git
Data preparation:
1. API_key
Register on the Materials Project website to directly access the database via an API_key.
2. Data extraction:
To download features from MPR database:
python thermo_stability/feature.py data_acquisition
3. Data Engineering:
To extract atomic_fraction:
python thermo_stability/feature.py atomic_fraction
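Atomic fractions normalize the element counts of each composition; a simplified stand-in for the project's atomic_fraction feature, working from a plain `{element: count}` dict rather than the downloaded formulas:

```python
def atomic_fractions(composition):
    """Turn a {element: count} composition into {element: fraction}."""
    total = sum(composition.values())
    return {el: n / total for el, n in composition.items()}
```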
The bond_structure feature extraction is computationally expensive. The preferred method is to run it with HTCondor (High Throughput Condor). Alternatively, multiprocessing can be used; the cpu_time was 181.06 sec with 4.0 CPUs to process 1000 entries. Run the following to extract bond_structure:
bash scripts/bond_structure.sh
To merge the bond_structure csv files, and later all features:
python thermo_stability/feature.py merge_df_structure
python thermo_stability/feature.py merge_df_file
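The merge step amounts to concatenating the per-job CSV outputs and joining them to the downloaded features on a shared key; a sketch with in-memory stand-ins (column names and `material_id` as the join key are assumptions):

```python
import pandas as pd

# Hypothetical per-job outputs; the real files live under files/bond_stats.
part1 = pd.DataFrame({"material_id": ["mp-1", "mp-2"], "mean_bond_len": [2.1, 1.9]})
part2 = pd.DataFrame({"material_id": ["mp-3"], "mean_bond_len": [2.4]})
bonds = pd.concat([part1, part2], ignore_index=True)

# Join the bond statistics onto the directly downloaded features.
features = pd.DataFrame({"material_id": ["mp-1", "mp-2", "mp-3"],
                         "band_gap": [0.0, 1.2, 3.1]})
merged = features.merge(bonds, on="material_id", how="inner")
```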
4. Split datasets:
To split datasets into train, validation, and save to numpy and pandas dataframes:
python test/data.py
ML hyperparameters:
Hyperparameters are tuned with GridSearchCV from sklearn.model_selection (scikit-learn). Parameters are tuned for each ML model individually by maximizing the AUC score. Hypertuning Logistic Regression can be done on an interactive node (cpu_time = 496 sec) as follows:
python classification_hyperpars.py LR_hypertune
The same command hypertunes RandomForest by replacing LR_hypertune with RF_hypertune (cpu_time = 159 sec).
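The tuning pattern is the same for both models; a minimal sketch on a synthetic dataset with an assumed C grid (the project's actual grids live in test/classification_hyperpars.py):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="roc_auc",                     # tuned on AUC, as in the project
    cv=5,
)
grid.fit(X, y)
```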
Hypertuning the DNN takes more cpu_time, so multiple processing cores are used. To run DNN hypertuning:
python scripts/multiprocessing_hypertune.py
The cpu_time for the DNN run with three CPU cores varies from 653.09 sec to 1069.02 sec.
The hypertuned results are saved into log files, which are later read by classification.py.
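Multiprocessing the search means mapping one worker per hyperparameter configuration; a hedged sketch with a stand-in scoring function (the real script in scripts/multiprocessing_hypertune.py trains one DNN per configuration):

```python
from multiprocessing import Pool

# Hypothetical hyperparameter grid.
CONFIGS = [{"lr": lr, "units": u} for lr in (1e-3, 1e-4) for u in (64, 128)]

def evaluate(cfg):
    # Stand-in score; the real worker trains and scores a DNN.
    score = cfg["units"] / (1 + 1000 * cfg["lr"])
    return (cfg["lr"], cfg["units"]), score

if __name__ == "__main__":
    with Pool(processes=3) as pool:  # three cores, matching the timings above
        results = dict(pool.map(evaluate, CONFIGS))
        best = max(results, key=results.get)
```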
Diagnostic plots
The metrics used in the diagnostic plots that assess ML performance are the ROC Area Under the Curve (AUC), accuracy, and F1-score. Two Python scripts plot the effect of hyperparameter tuning: one for RandomForest & LogisticRegression, and one for the DNN.
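All three metrics come from scikit-learn; a toy example with made-up labels and scores (AUC is threshold-free, while accuracy and F1 need a 0.5 cutoff applied first):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])          # illustrative ground truth
y_prob = np.array([0.1, 0.4, 0.35, 0.8]) # illustrative model scores
y_pred = (y_prob >= 0.5).astype(int)     # threshold for accuracy / F1

auc = roc_auc_score(y_true, y_prob)
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```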
RandomForest & LogisticRegression:
An example of how to plot LogisticRegression and RandomForest hyperparameter results:
python test/RFLR_perfomance.py --script LR_performance --lr_c 0.001 0.01 0.1 1 10 100
python test/RFLR_perfomance.py --script RF_performance --rf_nest 100 200 300 400 500
DNN:
To extract the best hyperparameters:
python test/scorehyper_barchart.py --metric acc
python test/scorehyper_barchart.py --metric roc
python test/scorehyper_barchart.py --metric f1
The output is saved in three text files, scoretune_metric-acc.txt, scoretune_metric-roc.txt, and scoretune_metric-f1.txt, under files/logs. The values used in classification.py are extracted from scoretune_metric-acc.txt; the three metrics select almost the same hyperparameters, and the difference in performance is negligible.
ML Classification:
To run classification:
python test/classification.py dnn_classification
python test/classification.py lr_classification
python test/classification.py rf_classification
Plot results:
A decorator is used to plot a specific result.
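A minimal version of such a plot-registry decorator, assuming the real one in test/plotting.py dispatches functions by their command-line subcommand name:

```python
PLOTS = {}

def register_plot(name):
    """Decorator that registers a plotting function under a CLI name."""
    def wrap(fn):
        PLOTS[name] = fn
        return fn
    return wrap

@register_plot("dnn_metric_evaluation")
def dnn_metric_evaluation():
    # The real function draws and saves the DNN metric figures.
    return "plotting DNN metrics"

def run(name):
    # Look up and invoke the plot selected on the command line.
    return PLOTS[name]()
```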
To plot the mean score cross-validation heatmap plot, run this:
python test/scorehyper_heatmap.py --metric roc --NL 2
To make DNN metric plots:
python test/plotting.py dnn_metric_evaluation