EtaML

An automated machine learning platform with a focus on explainability

Project description

Summary

This project aims to provide a template for solving classification problems on tabular data. The template handles both binary and multi-class problems. Among other things, it includes an exploratory data analysis, a preprocessing pipeline applied before train/test splitting, a fold-wise preprocessing pipeline applied after train/test splitting, a scalable and robust Monte Carlo cross-validation scheme, several classification algorithms evaluated on multiple performance metrics, and a set of explainable artificial intelligence (XAI) capabilities, including visualizations.

Content:

Exploratory data analysis
- Report via Pandas Profiling
- Visualization by dimensionality reduction (via PCA, t-SNE and UMAP)
Preprocessing
- Removing all-NA instances
- Removing features with constant value over all instances (ignoring NaNs)
- Removing features whose ratio of missing values exceeds a user-provided threshold
- One hot encoding of non-numeric features
Fold-wise preprocessing
- Normalization / Standardization
- Filling missing values using kNN or MICE imputation
- Resampling for handling label imbalances via SMOTE
Performance estimation using Monte Carlo cross-validation with multiple metrics
- Accuracy
- Area under the receiver operating characteristic curve (AUC)
- Balanced accuracy
- Sensitivity / Recall / True positive rate
- Specificity / True negative rate
- Positive predictive value (Precision)
- Negative predictive value
Receiver operating characteristic curve
Feature selection using mRMR (or univariate filter methods)
Hyperparameter optimization (using cross-validated randomized search)
Training and evaluation of multiple classification algorithms
- Explainable boosting machine (EBM)
- Extreme gradient boosting (XGBoost)
- k-nearest neighbors (kNN)
- Decision tree (DT)
- Random forest (RF)
- Neural network (NN)
- Support vector machine (SVM)
- Logistic regression (LGR)
Probability calibration (not supported for EBM)
- Calibration plots with Brier score
Explainable Artificial Intelligence (XAI)
- Permutation feature importance (+ visualizations)
- Individual conditional expectation (ICE) and partial dependence plots (PDPs)
- EBM-specific global feature-wise PDPs
- SHAP values (+ summary visualization)
- Surrogate models (approximation via DT and EBM)
Visualization of performance evaluation
- Performance metrics for each classification model
- Confusion matrices
- Detailed list of predictions for each cross-validation sample
- Receiver operating characteristic (ROC) curve for each model
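
To illustrate how fold-wise preprocessing and Monte Carlo cross-validation fit together, here is a minimal, dependency-free sketch. It is not EtaML's implementation: all function names are hypothetical, the nearest-centroid classifier is only a toy stand-in for the models listed above, the data are synthetic, and the splits are not stratified. The key point is that the standardization statistics are estimated on the training part of each random split and then applied to the held-out part, so no information from the test data leaks into preprocessing.

```python
import random
import statistics


def monte_carlo_splits(n_samples, n_repeats, test_fraction, seed=0):
    """Yield (train_indices, test_indices) for repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_repeats):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(rows):
    """Estimate per-feature mean and standard deviation on the training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant columns get a scale of 1.0 to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid_predict(train_X, train_y, test_X):
    """Toy stand-in classifier: predict the class with the closest centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [row for row, y in zip(train_X, train_y) if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return [
        min(centroids, key=lambda label: squared_distance(row, centroids[label]))
        for row in test_X
    ]


def binary_metrics(y_true, y_pred, positive=1):
    """Derive the listed binary metrics from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def monte_carlo_cv(X, y, n_repeats=20, test_fraction=0.25, seed=0):
    """Repeat: split, fit preprocessing on train, evaluate on the held-out part."""
    results = []
    for train_idx, test_idx in monte_carlo_splits(len(X), n_repeats, test_fraction, seed):
        train_X, train_y = [X[i] for i in train_idx], [y[i] for i in train_idx]
        test_X, test_y = [X[i] for i in test_idx], [y[i] for i in test_idx]
        means, stds = fit_standardizer(train_X)  # training data only
        predictions = nearest_centroid_predict(
            standardize(train_X, means, stds), train_y, standardize(test_X, means, stds)
        )
        results.append(binary_metrics(test_y, predictions))
    return results


# Synthetic two-class data: one informative feature, one noisy feature.
data_rng = random.Random(42)
X = [[data_rng.gauss(2.0 * label, 1.0), data_rng.gauss(0.0, 5.0)]
     for label in (0, 1) for _ in range(50)]
y = [label for label in (0, 1) for _ in range(50)]

results = monte_carlo_cv(X, y)
print("mean accuracy:", statistics.fmean(r["accuracy"] for r in results))
```

Averaging each metric over the repeated splits gives the performance estimate; the spread across splits indicates how stable that estimate is.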

Output

Output content:
- EDA: results as an HTML report
- Intermediate data: preprocessed data for the final models as CSV
- Input data: input table and settings
- Models: joblib objects and tuned hyperparameters as JSON
- Performance: confusion matrices and overall performance metrics for each model as CSV, with visualizations as SVG
- XAI: partial dependence plots, permutation feature importances and SHAP summary plots as CSV and SVG

Output structure:

Results/
    ├── EDA
    |   ├── exploratory_data_analysis.html
    |   └── umap.html
    ├── Intermediate_data
    |   ├── preprocessed_features.csv
    |   └── preprocessed_labels.csv
    ├── Models
    |   ├── ebm_model.pickle
    |   ├── ebm_model_hyperparameters.json
    |   └── ... (other pickled models and hyperparameters)
    ├── Performance
    |   ├── confusion_matrix-ebm.csv
    |   ├── confusion_matrix-ebm.png
    |   ├── ... (other models confusion matrices)
    |   ├── performance.png
    |   └── performances.csv
    └── XAI
        ├── Partial_dependence_plots
        |   ├── partial_dependence-ebm_feature-1_class-A.png
        |   └── ... (PDPs of other features, models and classes)
        ├── Permutation_importances
        |   ├── permutation_importance_ebm-test.png
        |   ├── permutation_importance_ebm-train.png
        |   └── ... (other models permutation importances for train and test set)
        ├── Surrogate_models
        |   ├── dt_surrogate_model_for_opaque_model.pickle
        |   ├── ebm_surrogate_model_for_opaque_model.pickle
        |   └── dt_surrogate_model_for_opaque_model.svg
        └── SHAP
            ├── label-0_shap-values.csv
            ├── ... (other labels shap values if multiclass)
            ├── shap_summary-ebm.png
            └── ... (other models shap summary plots)
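
After a run, the output tree above can be inspected programmatically. The following sketch uses hypothetical helper names and assumes that the results directory is called `Results` and that `Performance/performances.csv` is a comma-separated file with a header row. The column names are read from the file itself, since they are not documented here.

```python
import csv
from pathlib import Path


def list_results(results_dir):
    """Map each top-level output folder to the sorted relative paths of its files."""
    root = Path(results_dir)
    overview = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        overview[folder.name] = sorted(
            f.relative_to(folder).as_posix() for f in folder.rglob("*") if f.is_file()
        )
    return overview


def read_performance_table(results_dir):
    """Read Performance/performances.csv into a list of dicts, one per row.

    Assumes a comma-separated file with a header row; column names come
    from the file and are not hard-coded.
    """
    path = Path(results_dir) / "Performance" / "performances.csv"
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Only runs if a results directory exists in the current working directory.
results_dir = Path("Results")
if results_dir.is_dir():
    for folder, files in list_results(results_dir).items():
        print(f"{folder}: {len(files)} file(s)")
```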

Installation

Recommended: Create and activate a virtual environment

python3 -m venv /path/to/new/virtual/environment
cd /path/to/new/virtual/environment
source bin/activate

Clone this repository, navigate to its directory and install the dependencies listed in the supplied requirements.txt. The project was built using Python 3.9.5.

pip install -r requirements.txt

Alternatively, the individual packages contained in the requirements.txt file can be installed manually.

Afterwards, run the software with:

python etaml.py --config ../Example_data/settings.ini
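
The `--config` argument above points to a settings file. Assuming this file uses standard INI syntax (suggested by the `.ini` extension, but not confirmed here), a quick way to review its contents before a run is Python's built-in `configparser`. The helper name `read_settings` is hypothetical, and no EtaML-specific section or option names are assumed.

```python
import configparser
from pathlib import Path


def read_settings(path):
    """Return all sections of an INI file as a dict of {section: {option: value}}."""
    # Interpolation is disabled so that values containing '%' are returned as-is.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}


# Only runs if the example settings file is present at this relative path.
settings_path = Path("../Example_data/settings.ini")
if settings_path.is_file():
    for section, options in read_settings(settings_path).items():
        print(f"[{section}]")
        for option, value in options.items():
            print(f"  {option} = {value}")
```

Note that `configparser` returns every value as a string and lowercases option names by default.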

System specifications

The software was tested with the following operating systems and Python versions:

Ubuntu 18.04 LTS (64-bit)
Ubuntu 20.04 LTS (64-bit)
Windows 11 (64-bit)
Python 3.8
Python 3.9

This version

0.0.9

Jan 29, 2023

Download files

Source Distribution

EtaML-0.0.9.tar.gz (32.2 kB)

Uploaded Jan 29, 2023 Source

Built Distribution

EtaML-0.0.9-py3-none-any.whl (35.6 kB)

Uploaded Jan 29, 2023 Python 3

Hashes for EtaML-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`8dd3b87742fc03c4dbce3572e91c16915ab39ab175367159b532078f2fbae9ef`
MD5	`6469816b2f5475378f165415b7c7d467`
BLAKE2b-256	`41432c508c6d9edb3b8566c9e6dcd43380a1f339d6b2c2e917a224712856eebd`

Hashes for EtaML-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`85fef660e12eb7b9db6e7ac2574c635002e0ec70a9cd0354b58f63c3ddb4569d`
MD5	`6138c6ebc00998450cbcd82abc58596a`
BLAKE2b-256	`c8dbbafbf5b215b70569320ca3c606f2b15f62f9b42c191b35ee6a023078a9f9`