
Thermodynamic stability of chemical compounds using DNNs.

Project description

Machine Learning for Material Properties

General Information

The Thermo_stability project trains Machine Learning (ML) models to predict the thermodynamic stability of inorganic crystalline chemical compounds, inspired by the ElemNet paper. The study frames the task as binary classification across several ML methods. For more detailed information about the concept and results of this study, refer to the ThermoStability file.

Dataset Source:

The dataset is downloaded from the Materials Project (MPR) database through the Next-Gen Materials Project API, using a free api_key obtained from MPR.

Binary Classification with ML Models:

  • Logistic Regression
  • Random Forest
  • Deep Neural Networks (DNN)

ML features:

The ML features used in this study are classified into two categories:

  • Data extraction by directly downloading features from the MPR database: nelements, density, energy_per_atom, formation_energy_per_atom, band_gap, cbm, vbm, vpa, magmom_pa. The last two features (vpa, magmom_pa) are computed from the volume and total_magnetization descriptors, normalized by nsites.
  • Feature engineering, including atomic fractions and bond structure statistics. Both sample weights and model weights are applied to the DNN and LogisticRegression, whereas only model weights are applied to the RandomForest model.
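
The atomic-fraction feature can be illustrated with a minimal sketch like the one below. This is a simplified formula parser written for this example (it handles plain Element+count tokens only, no parentheses), not the project's actual implementation:

```python
import re
from collections import Counter

def atomic_fractions(formula):
    """Parse a simple formula like 'Fe2O3' into per-element atomic fractions."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        if elem:
            counts[elem] += float(num) if num else 1.0
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

fracs = atomic_fractions("Fe2O3")  # {'Fe': 0.4, 'O': 0.6}
```

In the real pipeline each compound's fractions would become one fixed-length feature vector over the full element set.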

Datasets are split into train, validation, and test sets (80/10/10 percent, respectively). The splits are saved in numpy array and pandas dataframe formats.
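
An 80/10/10 split can be sketched as follows (toy random data and a plain numpy shuffle; the project saves the real splits to numpy arrays and dataframes):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 9))       # e.g. the 9 MPR-derived features
y = rng.integers(0, 2, size=1000)    # binary stability labels

# Shuffle indices once, then cut at 80% and 90% for train/val/test.
idx = rng.permutation(len(X))
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

The arrays could then be persisted with `np.savez` or wrapped in dataframes, as the project does.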

ML Frameworks:

  • Scikit-Learn for Logistic Regression & Random Forest
  • TensorFlow for DNN

Technical Details

  • Python Version: Python 3.10
  • Environment Manager: Conda. Create and activate conda environment:
conda create -n env_ml
conda activate env_ml

Setting up the Environment:

Required packages to set up the environment:

# Install required packages:
pip install tensorflow==2.12.0
pip install scikit-learn
pip install numpy==1.24.3 
pip install pandas
pip install matplotlib
pip install scipy
pip install mp-api
pip install xgboost
pip install seaborn

macOS needs slightly different packages, especially for tensorflow. Use the env_ml_macos.yml file to set up the conda environment.

HTCondor setup

The bond_structure function in common.py takes a long time (about 36 hours) to run on a single computer. The whole package is in the condor folder. First prepare the conda environment and install the packages needed to run bond_structure:

conda create -n htcon_env -c conda-forge -c defaults python=3.10 numpy=1.26.4 mp-api=0.45.5 pandas=2.2.3 pip
conda activate htcon_env

The above packages are added to mlenv.yml. To set up the environment run:

conda env create -f env_ml.yml


Code structure:

Thermo_stability project directory tree:

├── __init__.py
├── bond_stats
├── directory_tree.txt
├── files
│   ├── bond_stats
│   ├── logs
│   │   ├── dnn_accuracy
│   │   ├── dnn_auc
│   │   ├── dnn_hypertune_mp_accuracy.txt
│   │   ├── dnn_hypertune_auc.txt 
│   │   ├── feature.txt
│   │   ├── plotting.txt
│   │   └── processin.txt
│   ├── MLHypertune_pars
│   │   ├── DNN_hypertune.txt
│   │   └── npzfiles
│   ├── models
│   │   └── MLHypertune_pars
│   └── structures_dict.json
├── scripts
│   ├── bond_structure.sh
│   ├── mp_jobs.sh
│   └── multiprocessing_hypertune.py
├── test
│   ├── classification_hyperpars.py
│   ├── classification.py
│   ├── data.py
│   ├── dnn_schematic.py
│   ├── dnn_shap.py
│   ├── pdp_dnnresults.py
│   ├── plotting.py
│   ├── precision_recall.py
│   ├── RFLR_perfomance.py
│   ├── scorehyper_barchart.py
│   ├── scorehyper_heatmap.py
│   └── uncertainty.py
└── thermo_stability
    ├── __init__.py
    ├── config.py
    ├── creds.py
    ├── feature.py
    ├── processing.py
    └── utils.py

Git clone the project:

 git clone https://github.com/snabili/thermo_stability.git 

Data preparation:

1. API_key

Register on the Materials Project website to access the database directly via an API_key.

2. Data extraction:

To download features from MPR database:

 python thermo_stability/feature.py data_acquisition 

3. Data Engineering:

To extract atomic_fraction:

 python thermo_stability/feature.py atomic_fraction 

The bond_structure feature extraction is computationally expensive. The preferred method is to use HTCondor. Alternatively, multiprocessing tools can be used; the cpu_time was 181.06 sec with 4.0 CPUs to process 1000 datasets. Run the following to extract bond_structure:

 bash scripts/bond_structure.sh 

To merge bond_structure csv files, and later all features:

python thermo_stability/feature.py merge_df_structure
python thermo_stability/feature.py merge_df_file

4. Split datasets:

To split datasets into train, validation, and save to numpy and pandas dataframes:

 python test/data.py 

ML hyperparameters:

ML hyperparameters are tuned using GridSearchCV from the sklearn.model_selection module of scikit-learn. Parameters are tuned for each ML model individually by maximizing the auc score. Hypertuning Logistic Regression can be done on an interactive node (cpu_time = 496 sec) as follows:

 python classification_hyperpars.py LR_hypertune  

The same command can be used to hypertune RandomForest by replacing LR_hypertune with RF_hypertune (cpu_time = 159 sec).
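
The GridSearchCV step can be sketched as follows (toy separable data and a hypothetical C grid; the project's actual grids live in classification_hyperpars.py):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy binary labels

# Tune by maximizing ROC AUC over 5 cross-validation folds.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Swapping in `RandomForestClassifier` with its own grid gives the RF_hypertune analogue.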

Hypertuning the DNN takes more cpu_time, so multiple cores are used via multiprocessing. To run DNN hypertuning:

 python scripts/multiprocessing_hypertune.py  

The cpu_time for the DNN run with three CPU cores varies from 653.09 sec to 1069.02 sec.

The hypertuned results are saved into log files, which are later read by classification.py.

Diagnostic plots

The metrics used in the diagnostic plots to assess ML performance are the ROC Area Under Curve (auc), accuracy, and F1-score. Two python scripts plot the effect of hyperparameter tuning: one for RandomForest & LogisticRegression, and one for the DNN.
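
For reference, the three metrics can be computed with sklearn.metrics as in this small illustration (toy labels and probabilities):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.9, 0.35, 0.2]   # predicted P(stable)
y_pred = [int(p >= 0.5) for p in y_prob]   # hard labels at the 0.5 cut

auc = roc_auc_score(y_true, y_prob)  # ranking quality of probabilities
acc = accuracy_score(y_true, y_pred) # fraction of correct hard labels
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision/recall
```

Note that auc is computed from the probabilities, while accuracy and F1 depend on the chosen threshold.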

RandomForest & LogisticRegression:

An example of how to plot LogisticRegression and RandomForest hyperparameter results:

python test/RFLR_perfomance.py --script LR_performance --lr_c 0.001 0.01 0.1 1 10 100
python test/RFLR_perfomance.py --script RF_performance --rf_nest 100 200 300 400 500

DNN:

To extract the best hyperparameters:

python test/scorehyper_barchart.py --metric acc
python test/scorehyper_barchart.py --metric roc
python test/scorehyper_barchart.py --metric f1

The output is saved in three text files, scoretune_metric-acc.txt, scoretune_metric-roc.txt and scoretune_metric-f1.txt, under files/logs. The values used in classification.py are extracted from scoretune_metric-acc.txt; the results for the other metrics are almost the same, and the difference in performance is negligible.

ML Classification:

To run classification:

python test/classification.py dnn_classification
python test/classification.py lr_classification
python test/classification.py rf_classification

Plot results:

A decorator is used to plot a specific result.
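
A minimal sketch of what such a plotting decorator could look like (the registry and names here are hypothetical, and the actual matplotlib calls are omitted):

```python
import functools

PLOTTERS = {}  # registry mapping a plot name to its function

def register_plot(name):
    """Hypothetical decorator: register a function under a plot name."""
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"[plot] {name}")   # common bookkeeping for every plot
            return func(*args, **kwargs)
        PLOTTERS[name] = wrapper
        return wrapper
    return deco

@register_plot("dnn_metric_evaluation")
def dnn_metric_evaluation(history):
    # would draw loss/metric curves with matplotlib here
    return sorted(history)

result = PLOTTERS["dnn_metric_evaluation"]([0.9, 0.7, 0.5])
# result == [0.5, 0.7, 0.9]
```

This pattern lets a single command-line argument select which registered plot to produce.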

To plot the mean score cross-validation heatmap plot, run this:

 python test/scorehyper_heatmap.py --metric roc --NL 2 

(Plot: mean cross-validation score heatmap.)

To make DNN metric plots:

 python test/plotting.py dnn_metric_evaluation 

(Plot: DNN metric evaluation curves.)
