Forest-Guided Clustering - Explainability method for Random Forest models.

Forest-Guided Clustering - Shedding light into the Random Forest Black Box

Docs | Tutorials

Forest-Guided Clustering (FGC) is an explainability method for Random Forest models. Standard explainability methods (e.g. feature importance) assume that model features are independent and are therefore not suited to data with correlated features. The Forest-Guided Clustering algorithm does not make this assumption, because it computes feature importance based on subgroups of instances that follow similar decision rules within the Random Forest model. Hence, this method is well suited for cases with high correlation among model features.

For a detailed comparison of FGC and Permutation Feature Importance, please have a look at the Notebook Introduction to FGC: Comparison of Forest-Guided Clustering and Feature Importance.

Documentation

Please see here for full documentation on:

  • Getting Started (installation, basic usage)
  • Theoretical Background (introduction, general algorithm, feature importance)
  • Tutorials (simple use cases, special cases)
  • API documentation

Installation

Requirements

This package was tested with Python 3.7 - 3.11 on Ubuntu, macOS and Windows. It depends on the kmedoids Python package. If you are using Windows or macOS, you may need to install Rust/Cargo first with:

conda install -c conda-forge rust

If this does not work, please try to install Cargo from source:

git clone https://github.com/rust-lang/cargo
cd cargo
cargo build --release

For further information on the kmedoids package, please visit this page.

All other required packages are automatically installed if installation is done via pip.

Install Options

The installation of the package is done via pip. Note: if you are using conda, first install pip with: conda install pip.

PyPI install:

pip install fgclustering

Installation from source:

git clone https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering.git
  • Installation as python package (run inside directory):

      pip install .   
    
  • Development Installation as python package (run inside directory):

      pip install -e ".[dev]"
    

Basic Usage

To get explainability of your Random Forest model via Forest-Guided Clustering, you simply need to run the following commands:

from fgclustering import FgClustering
   
# initialize and run fgclustering object
fgc = FgClustering(model=rf, data=data, target_column='target')
fgc.run()
   
# visualize results
fgc.plot_feature_importance()
fgc.plot_decision_paths()
   
# obtain optimal number of clusters and vector that contains the cluster label of each data point
optimal_number_of_clusters = fgc.k
cluster_labels = fgc.cluster_labels

where

  • model=rf is a Random Forest Classifier or Regressor object,
  • data=data is a dataset containing the same features as required by the Random Forest model, and
  • target_column='target' is the name of the target column (i.e. target) in the provided dataset.
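
The inputs described above can be prepared as in the following minimal sketch. It assumes scikit-learn and pandas are installed and uses the Iris dataset purely for illustration; the helper name explain_with_fgc is hypothetical and not part of fgclustering. Only the FgClustering calls shown in the snippet above are used.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset (an assumption): a DataFrame that holds the
# feature columns plus a column named 'target'.
data = load_iris(as_frame=True).frame

# Train the Random Forest on the features only.
X = data.drop(columns="target")
y = data["target"]
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X, y)


def explain_with_fgc(model, data, target_column="target"):
    """Hypothetical helper: run FGC on a trained model and its dataset."""
    # Imported here so the data preparation above runs without fgclustering.
    from fgclustering import FgClustering

    fgc = FgClustering(model=model, data=data, target_column=target_column)
    fgc.run()
    return fgc
```

Calling explain_with_fgc(rf, data) then returns the fitted object, on which the plotting functions and the k and cluster_labels attributes shown above can be used.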

For detailed instructions, please have a look at the Notebook Introduction to FGC: Simple Use Cases.

Usage on big datasets

If you are working with a dataset that contains a large number of samples, you can use one of the following strategies:

  • Use the cores you have at your disposal to parallelize the optimization of the cluster number. You can do so by setting the parameter n_jobs to a value > 1 in the run() function.
  • Use the faster implementation of the PAM method, which the K-Medoids algorithm uses to find the clusters, by setting the parameter method_clustering to fasterpam in the run() function.
  • Use a subsampling technique.

For detailed instructions, please have a look at the Notebook Special Case: FGC for Big Datasets.
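
As a rough sketch, the three strategies can be combined as shown below. The helpers subsample_rows and run_fgc_fast are hypothetical names, and plain random subsampling with pandas is an assumption; the notebook may use a different subsampling scheme. The n_jobs and method_clustering parameters are the ones named in the list above.

```python
import pandas as pd


def subsample_rows(data: pd.DataFrame, max_rows: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: draw a simple random subsample of at most max_rows rows."""
    if len(data) <= max_rows:
        return data
    return data.sample(n=max_rows, random_state=seed)


def run_fgc_fast(model, data, target_column="target", n_jobs=2):
    """Hypothetical helper: run FGC in parallel with the fasterpam method."""
    # Imported here so the subsampling helper runs without fgclustering.
    from fgclustering import FgClustering

    fgc = FgClustering(model=model, data=data, target_column=target_column)
    fgc.run(n_jobs=n_jobs, method_clustering="fasterpam")
    return fgc


# Toy data for illustration only.
toy = pd.DataFrame({"x": range(1000), "target": [i % 2 for i in range(1000)]})
small = subsample_rows(toy, max_rows=100)
print(len(small))
```

A purely random subsample may under-represent rare classes or target ranges, so a stratified draw can be preferable when the target is imbalanced.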

Contributing

Contributions are more than welcome! Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

For any further inquiries please send an email to Lisa Barros de Andrade e Sousa.

How to cite

If Forest-Guided Clustering is useful for your research, consider citing the package:

@software{lisa_sousa_2022_7823042,
    author       = {Lisa Barros de Andrade e Sousa and
                     Helena Pelin and
                     Dominik Thalmeier and
                     Marie Piraud},
    title        = {{Forest-Guided Clustering - Explainability for Random Forest Models}},
    month        = april,
    year         = 2022,
    publisher    = {Zenodo},
    version      = {v1.0.3},
    doi          = {10.5281/zenodo.7823042},
    url          = {https://doi.org/10.5281/zenodo.7823042}
}

License

fgclustering is released under the MIT license. See LICENSE for additional details about it.
