
Thermodynamic stability of chemical compounds using DNNs.

Project description

Machine Learning for Material Properties

General Information

The Thermo_stability project trains Machine Learning (ML) models to predict the thermodynamic stability of inorganic crystalline chemical compounds, inspired by the ElemNet paper. The study frames the task as binary classification across several ML methods. For more detailed information about the concept and results of this study, refer to the ThermoStability file.

Dataset Source:

The dataset is downloaded from the Materials Project (MPR) database through the Next-Gen Materials Project API, using a free api_key obtained from MPR.

Binary Classification with ML Models:

  • Logistic Regression
  • Random Forest
  • Deep Neural Networks (DNN)

ML features:

The ML features used in this study are classified into two categories:

  • Data extraction by directly downloading features from the MPR database: nelements, density, energy_per_atom, formation_energy_per_atom, band_gap, cbm, vbm, vpa, magmom_pa. The last two features (vpa, magmom_pa) are computed from the volume and total_magnetization descriptors, normalized by nsites.
  • Feature engineering, including atomic fractions and bond structure statistics. Both sample weights and model weights are applied to the DNN and LogisticRegression, whereas only model weights are applied to the RandomForest model.
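
The atomic-fraction feature can be illustrated with a minimal sketch like the one below. This is a simplified formula parser written for this example (it handles plain Element+count tokens only, no parentheses), not the project's actual implementation:

```python
import re
from collections import Counter

def atomic_fractions(formula):
    """Parse a simple formula like 'Fe2O3' into per-element atomic fractions."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        if elem:
            counts[elem] += float(num) if num else 1.0
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

fracs = atomic_fractions("Fe2O3")  # {'Fe': 0.4, 'O': 0.6}
```

In the real pipeline each compound's fractions would become one fixed-length feature vector over the full element set.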

Datasets are split into train, validation, and test sets (80/10/10 percent, respectively). The splits are saved in numpy array and pandas dataframe formats.
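
An 80/10/10 split can be sketched as follows (toy random data and a plain numpy shuffle; the project saves the real splits to numpy arrays and dataframes):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 9))       # e.g. the 9 MPR-derived features
y = rng.integers(0, 2, size=1000)    # binary stability labels

# Shuffle indices once, then cut at 80% and 90% for train/val/test.
idx = rng.permutation(len(X))
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

The arrays could then be persisted with `np.savez` or wrapped in dataframes, as the project does.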

ML Frameworks:

  • Scikit-Learn for Logistic Regression & Random Forest
  • TensorFlow for DNN

Technical Details

  • Python Version: Python 3.10
  • Environment Manager: Conda. Create and activate conda environment:
conda create -n env_ml
conda activate env_ml

Setting up the Environment:

Required packages to set up the environment:

# Install required packages:
pip install tensorflow==2.12.0
pip install scikit-learn
pip install numpy==1.24.3 
pip install pandas
pip install matplotlib
pip install scipy
pip install mp-api
pip install xgboost
pip install seaborn

macOS needs slightly different packages, especially for tensorflow. Use the env_ml_macos.yml file to set up the conda environment.

HTCondor setup

The bond_structure function in common.py takes a long time (about 36 hours) to run on a single computer. The whole package is in the condor folder. First prepare the conda environment and install the packages needed to run bond_structure:

conda create -n htcon_env -c conda-forge -c defaults python=3.10 numpy=1.26.4 mp-api=0.45.5 pandas=2.2.3 pip
conda activate htcon_env

The above packages are added to mlenv.yml. To set up the environment run:

conda env create -f env_ml.yml


Code structure:

Thermo_stability project directory tree:

├── __init__.py
├── bond_stats
├── directory_tree.txt
├── files
│   ├── bond_stats
│   ├── logs
│   │   ├── dnn_accuracy
│   │   ├── dnn_auc
│   │   ├── dnn_hypertune_mp_accuracy.txt
│   │   ├── dnn_hypertune_auc.txt 
│   │   ├── feature.txt
│   │   ├── plotting.txt
│   │   └── processin.txt
│   ├── MLHypertune_pars
│   │   ├── DNN_hypertune.txt
│   │   └── npzfiles
│   ├── models
│   │   └── MLHypertune_pars
│   └── structures_dict.json
├── scripts
│   ├── bond_structure.sh
│   ├── mp_jobs.sh
│   └── multiprocessing_hypertune.py
├── test
│   ├── classification_hyperpars.py
│   ├── classification.py
│   ├── data.py
│   ├── dnn_schematic.py
│   ├── dnn_shap.py
│   ├── pdp_dnnresults.py
│   ├── plotting.py
│   ├── precision_recall.py
│   ├── RFLR_perfomance.py
│   ├── scorehyper_barchart.py
│   ├── scorehyper_heatmap.py
│   └── uncertainty.py
└── thermo_stability
    ├── __init__.py
    ├── config.py
    ├── creds.py
    ├── feature.py
    ├── processing.py
    └── utils.py

Git clone the project:

 git clone https://github.com/snabili/thermo_stability.git 

Data preparation:

1. API_key

Register on the Materials Project website to access the database directly via an API_key.

2. Data extraction:

To download features from MPR database:

 python thermo_stability/feature.py data_acquisition 

3. Data Engineering:

To extract atomic_fraction:

 python thermo_stability/feature.py atomic_fraction 

The bond_structure feature extraction is computationally expensive. The preferred method is to use HTCondor. Alternatively, multiprocessing tools can be used; the cpu_time was 181.06 sec with 4.0 CPUs to process 1000 datasets. Run the following to extract bond_structure:

 bash scripts/bond_structure.sh 

To merge bond_structure csv files, and later all features:

python thermo_stability/feature.py merge_df_structure
python thermo_stability/feature.py merge_df_file

4. Split datasets:

To split datasets into train, validation, and save to numpy and pandas dataframes:

 python test/data.py 

ML hyperparameters:

ML hyperparameters are tuned using GridSearchCV from the sklearn.model_selection module of scikit-learn. Parameters are tuned for each ML model individually by maximizing the auc score. Hypertuning Logistic Regression can be done on an interactive node (cpu_time = 496 sec) as follows:

 python classification_hyperpars.py LR_hypertune  

The same command can be used to hypertune RandomForest by replacing LR_hypertune with RF_hypertune (cpu_time = 159 sec).
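
The GridSearchCV step can be sketched as follows (toy separable data and a hypothetical C grid; the project's actual grids live in classification_hyperpars.py):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy binary labels

# Tune by maximizing ROC AUC over 5 cross-validation folds.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Swapping in `RandomForestClassifier` with its own grid gives the RF_hypertune analogue.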

Hypertuning the DNN takes more cpu_time, so multiple cores are used via multiprocessing. To run DNN hypertuning:

 python scripts/multiprocessing_hypertune.py  

The cpu_time for the DNN run with three CPU cores varies from 653.09 sec to 1069.02 sec.

The hypertuned results are saved into log files, which are later read by classification.py.

Diagnostic plots

The metrics used in the diagnostic plots to assess ML performance are the ROC Area Under Curve (auc), accuracy, and F1-score. Two python scripts plot the effect of hyperparameter tuning: one for RandomForest & LogisticRegression, and one for the DNN.
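
For reference, the three metrics can be computed with sklearn.metrics as in this small illustration (toy labels and probabilities):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.9, 0.35, 0.2]   # predicted P(stable)
y_pred = [int(p >= 0.5) for p in y_prob]   # hard labels at the 0.5 cut

auc = roc_auc_score(y_true, y_prob)  # ranking quality of probabilities
acc = accuracy_score(y_true, y_pred) # fraction of correct hard labels
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision/recall
```

Note that auc is computed from the probabilities, while accuracy and F1 depend on the chosen threshold.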

RandomForest & LogisticRegression:

An example of how to plot LogisticRegression and RandomForest hyperparameter results:

python test/RFLR_perfomance.py --script LR_performance --lr_c 0.001 0.01 0.1 1 10 100
python test/RFLR_perfomance.py --script RF_performance --rf_nest 100 200 300 400 500

DNN:

To extract the best hyperparameters:

python test/scorehyper_barchart.py --metric acc
python test/scorehyper_barchart.py --metric roc
python test/scorehyper_barchart.py --metric f1

The output is saved in three text files, scoretune_metric-acc.txt, scoretune_metric-roc.txt and scoretune_metric-f1.txt, under files/logs. The values used in classification.py are extracted from scoretune_metric-acc.txt; the results for the other metrics are almost the same, and the difference in performance is negligible.

ML Classification:

To run classification:

python test/classification.py dnn_classification
python test/classification.py lr_classification
python test/classification.py rf_classification

Plot results:

A decorator is used to plot a specific result.
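
A minimal sketch of what such a plotting decorator could look like (the registry and names here are hypothetical, and the actual matplotlib calls are omitted):

```python
import functools

PLOTTERS = {}  # registry mapping a plot name to its function

def register_plot(name):
    """Hypothetical decorator: register a function under a plot name."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"[plot] {name}")   # common bookkeeping for every plot
            return func(*args, **kwargs)
        PLOTTERS[name] = wrapper
        return wrapper
    return deco

@register_plot("dnn_metric_evaluation")
def dnn_metric_evaluation(history):
    # would draw loss/metric curves with matplotlib here
    return sorted(history)

result = PLOTTERS["dnn_metric_evaluation"]([0.9, 0.7, 0.5])
# result == [0.5, 0.7, 0.9]
```

This pattern lets a single command-line argument select which registered plot to produce.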

To plot the mean score cross-validation heatmap plot, run this:

 python test/scorehyper_heatmap.py --metric roc --NL 2 

(Plot: mean cross-validation score heatmap.)

To make DNN metric plots:

 python test/plotting.py dnn_metric_evaluation 

(Plot: DNN metric evaluation curves.)
