
An automated machine learning platform with a focus on explainability

Project description

Summary

This project aims to provide a template for solving classification problems based on tabular data. The template handles binary and multi-class problems. Among other things, it includes exploratory data analysis, a preprocessing pipeline applied before train/test splitting, a fold-wise preprocessing pipeline applied after train/test splitting, a scalable and robust Monte Carlo cross-validation scheme, several classification algorithms evaluated against multiple performance metrics, and a set of capabilities for explainable artificial intelligence, including visualizations.

Workflow diagram

Content:

  • Exploratory data analysis
    • Report via Pandas Profiling
    • Visualization by dimensionality reduction (via PCA, t-SNE and UMAP; sketched after this list)
  • Preprocessing
    • Removing all-NA instances
    • Removing features with constant value over all instances (ignoring NaNs)
    • Removing features with a user-provided ratio of missing values
    • One-hot encoding of non-numeric features
  • Fold-wise preprocessing (see the cross-validation sketch after this list)
    • Normalization / Standardization
    • Filling missing values using kNN or MICE imputation
    • Resampling for handling label imbalances via SMOTE
  • Performance estimation using Monte Carlo cross-validation with multiple metrics
    • Accuracy
    • Area under the receiver operating characteristic curve (AUC)
    • Balanced accuracy
    • Sensitivity / Recall / True positive rate
    • Specificity / True negative rate
    • Positive predictive value (Precision)
    • Negative predictive value
  • Receiver operating characteristic curve
  • Feature selection using mRMR (or univariate filter methods; see the sketch after this list)
  • Hyperparameter optimization (using cross-validated randomized search)
  • Training and evaluation of multiple classification algorithms
    • Explainable boosting machine (EBM)
    • Extreme gradient boosting (XGBoost)
    • k-nearest neighbors (kNN)
    • Decision tree (DT)
    • Random forest (RF)
    • Neural network (NN)
    • Support vector machine (SVM)
    • Logistic regression (LGR)
  • Probability calibration (not supported for EBM; see the sketch after this list)
    • Calibration plots with Brier score
  • Explainable Artificial Intelligence (XAI; see the sketch after this list)
    • Permutation feature importance (+ visualizations)
    • Individual conditional expectation (ICE) and partial dependence plots (PDPs)
    • EBM-specific global feature-wise PDPs
    • SHAP values (+ summary visualization)
    • Surrogate models (approximation via DT and EBM)
  • Visualization of performance evaluation
    • Performance metrics for each classification model
    • Confusion matrices
    • Detailed list of predictions for each cross-validation sample
    • Receiver operating characteristic (ROC) curve for each model
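
The following sketches illustrate several of the steps listed above. They are minimal, hedged examples of the underlying techniques using common Python libraries, not the project's own implementation; all dataset names, estimators and parameters are assumptions chosen for illustration. First, a sketch of the dimensionality-reduction visualizations used in the exploratory data analysis, assuming scikit-learn and the umap-learn package:

# Sketch: 2-D embeddings of a tabular dataset via PCA, t-SNE and UMAP
# (illustrative only; scikit-learn and umap-learn are assumed to be installed).
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the user's tabular data
X_scaled = StandardScaler().fit_transform(X)

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X_scaled),
    "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(X_scaled),
    "UMAP": umap.UMAP(n_components=2, random_state=0).fit_transform(X_scaled),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, embedding) in zip(axes, embeddings.items()):
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5)
    ax.set_title(name)
fig.savefig("embeddings.png")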
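
Next, a sketch of fold-wise preprocessing (standardization, kNN imputation, SMOTE resampling) evaluated with a Monte Carlo cross-validation scheme and multiple metrics, assuming scikit-learn and imbalanced-learn; the classifier and split settings are illustrative assumptions:

# Sketch: fold-wise preprocessing evaluated with Monte Carlo cross-validation.
# The pipeline is refit inside every split, so scaling, imputation and SMOTE
# only ever see the training portion of that split (illustrative only).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # example data (contains no missing values)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer(n_neighbors=5)),    # a no-op here; shows where imputation sits
    ("resample", SMOTE(random_state=0)),      # applied to the training folds only
    ("clf", RandomForestClassifier(random_state=0)),
])

# Monte Carlo cross-validation: repeated random stratified train/test splits.
cv = StratifiedShuffleSplit(n_splits=30, test_size=0.2, random_state=0)
scores = cross_validate(
    pipeline, X, y, cv=cv,
    scoring=["accuracy", "balanced_accuracy", "roc_auc", "recall", "precision"],
)
print({name: values.mean() for name, values in scores.items() if name.startswith("test_")})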
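
A univariate filter for feature selection (named in the list as an alternative to mRMR) combined with cross-validated randomized hyperparameter search could look roughly like this; the parameter ranges are assumptions:

# Sketch: univariate feature selection followed by cross-validated randomized
# hyperparameter search (illustrative only; parameter ranges are assumptions).
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_distributions = {
    "select__k": randint(5, 20),
    "clf__n_estimators": randint(100, 500),
    "clf__max_depth": randint(2, 10),
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=20, cv=5,
    scoring="balanced_accuracy", random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)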
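
Probability calibration with a calibration plot and Brier score might be sketched as follows, assuming scikit-learn's CalibratedClassifierCV with sigmoid (Platt) scaling as one possible choice:

# Sketch: probability calibration plus a calibration (reliability) plot annotated
# with the Brier score (illustrative only).
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Wrap an uncalibrated classifier; sigmoid (Platt) scaling is fit via internal CV.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
prob_pos = calibrated.predict_proba(X_test)[:, 1]

fraction_positive, mean_predicted = calibration_curve(y_test, prob_pos, n_bins=10)
brier = brier_score_loss(y_test, prob_pos)

plt.plot(mean_predicted, fraction_positive, marker="o", label=f"calibrated SVM (Brier = {brier:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.savefig("calibration_plot.png")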
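
Finally, the XAI capabilities roughly correspond to the following scikit-learn and shap calls; the tree-based model used here is an assumption for illustration:

# Sketch: permutation feature importance, partial dependence / ICE plots and a
# SHAP summary plot for a fitted tree-based classifier (illustrative only).
import matplotlib.pyplot as plt
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation feature importance on the held-out test set.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean.argsort()[::-1][:5])  # indices of the five most important features

# Partial dependence and ICE curves for the first two features.
PartialDependenceDisplay.from_estimator(model, X_test, features=[0, 1], kind="both")
plt.savefig("pdp_ice.png")
plt.close("all")

# SHAP values and summary plot.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, show=False)
plt.savefig("shap_summary.png")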

Output

Output content:

  • EDA: results as an HTML report
  • Intermediate data: preprocessed data for the final models as CSV
  • Input data: input table and settings
  • Models: joblib objects and tuned hyperparameters as JSON
  • Performance: confusion matrices and overall performance metrics for each model as CSV and visualizations as SVG
  • XAI: partial dependence plots, permutation feature importances and SHAP summary plots as CSV and SVG

Output structure:

Results/
    ├── EDA
    |   ├── exploratory_data_analysis.html
    |   └── umap.html
    ├── Intermediate_data
    |   ├── preprocessed_features.csv
    |   └── preprocessed_labels.csv
    ├── Models
    |   ├── ebm_model.pickle
    |   ├── ebm_model_hyperparameters.json
    |   └── ... (other pickled models and hyperparameters)
    ├── Performance
    |   ├── confusion_matrix-ebm.csv
    |   ├── confusion_matrix-ebm.png
    |   ├── ... (other models confusion matrices)
    |   ├── performance.png
    |   └── performances.csv
    └── XAI
        ├── Partial_dependence_plots
        |   ├── partial_dependence-ebm_feature-1_class-A.png
        |   └── ... (PDPs of other features, models and classes)
        ├── Permutation_importances   
        |   ├── permutation_importance_ebm-test.png
        |   ├── permutation_importance_ebm-train.png
        |   ├── ... (other models permutation importances for train and test set)
        ├── Surrogate_models
        |   ├── dt_surrogate_model_for_opaque_model.pickle
        |   ├── ebm_surrogate_model_for_opaque_model.pickle 
        |   ├── dt_surrogate_model_for_opaque_model.svg 
        └── SHAP
            ├── label-0_shap-values.csv
            ├── ... (other labels shap values if multiclass)
            ├── shap_summary-ebm.png
            └── ... (other models shap summary plots)
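
The persisted models under Results/Models can be reloaded for further use. A minimal sketch, assuming the .pickle files were written with joblib (as stated under Output content) and that the accompanying JSON files contain plain dictionaries of tuned hyperparameters:

# Sketch: reload a persisted model and its tuned hyperparameters from the
# Results/Models folder (assumes the .pickle files were written with joblib,
# as stated above, and that the JSON files hold plain parameter dictionaries).
import json
import joblib

model = joblib.load("Results/Models/ebm_model.pickle")
with open("Results/Models/ebm_model_hyperparameters.json") as file:
    hyperparameters = json.load(file)

print(type(model).__name__)
print(hyperparameters)
# The reloaded model can then score new, identically preprocessed data,
# e.g. model.predict(new_features).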

Installation

Recommended: Create and activate a virtual environment

python3 -m venv /path/to/new/virtual/environment
cd /path/to/new/virtual/environment
source bin/activate

Clone this repository, navigate to the corresponding directory, and install the dependencies listed in the supplied requirements.txt. The project was built using Python 3.9.5.

pip install -r requirements.txt

Alternatively, the individual packages contained in the requirements.txt file can be installed manually.

Afterwards, run the software using

python etaml.py --config ../Example_data/settings.ini

System specifications

The software was tested with the following specifications:

  • Ubuntu 18.04 LTS (64-bit)
  • Ubuntu 20.04 LTS (64-bit)
  • Windows 11 (64-bit)
  • Python 3.8
  • Python 3.9

Download files

Download the file for your platform.

Source Distribution

EtaML-0.0.9.tar.gz (32.2 kB)

Built Distribution

EtaML-0.0.9-py3-none-any.whl (35.6 kB)

File details

Details for the file EtaML-0.0.9.tar.gz.

File metadata

  • Download URL: EtaML-0.0.9.tar.gz
  • Upload date:
  • Size: 32.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for EtaML-0.0.9.tar.gz
  • SHA256: 8dd3b87742fc03c4dbce3572e91c16915ab39ab175367159b532078f2fbae9ef
  • MD5: 6469816b2f5475378f165415b7c7d467
  • BLAKE2b-256: 41432c508c6d9edb3b8566c9e6dcd43380a1f339d6b2c2e917a224712856eebd

File details

Details for the file EtaML-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: EtaML-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 35.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for EtaML-0.0.9-py3-none-any.whl
  • SHA256: 85fef660e12eb7b9db6e7ac2574c635002e0ec70a9cd0354b58f63c3ddb4569d
  • MD5: 6138c6ebc00998450cbcd82abc58596a
  • BLAKE2b-256: c8dbbafbf5b215b70569320ca3c606f2b15f62f9b42c191b35ee6a023078a9f9
