Forest-Guided Clustering - Explainability method for Random Forest models.

Forest-Guided Clustering - Shedding light into the Random Forest Black Box

✨ About this Package

Why Use Forest-Guided Clustering?

Forest-Guided Clustering (FGC) is an explainability method for Random Forest models that addresses a key limitation of many standard XAI techniques: they struggle with correlated features and complex decision patterns. Traditional methods such as permutation importance, SHAP, and LIME often assume feature independence and focus on individual feature contributions, which can lead to misleading or incomplete explanations. As machine learning models are increasingly deployed in sensitive domains such as healthcare, finance, and HR, understanding why a model makes a decision is as important as the decision itself. This is not only a matter of trust and fairness but also a legal requirement in many jurisdictions; the European Union's GDPR, for example, is widely interpreted as providing a “right to explanation” for automated decisions.

FGC offers a different approach: instead of approximating the model with simpler surrogates, it uses the internal structure of the Random Forest itself. By analyzing the tree traversal patterns of individual samples, FGC clusters data points that follow similar decision paths. This reveals how the forest segments the input space, enabling a human-interpretable view of the model's internal logic. FGC is particularly useful when features are highly correlated, as it does not rely on assumptions of feature independence. It bridges the gap between model accuracy and model transparency, offering a powerful tool for global, model-specific interpretation of Random Forests.
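To make the idea of "similar decision paths" concrete, the sketch below computes the classic Random Forest proximity: the fraction of trees in which two samples end up in the same leaf. It is an illustration under stated assumptions, not the fgclustering implementation. The helper names `proximity_matrix` and `distance_matrix` and the toy leaf indices are hypothetical, and the leaf assignments are assumed to come from something like scikit-learn's `RandomForestClassifier.apply(X)`.

```python
# Sketch only: classic Random Forest proximity from leaf assignments.
# leaf_ids[i][t] is the index of the leaf that sample i reaches in tree t
# (assumption: obtained from e.g. scikit-learn's forest.apply(X)).

def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of samples shares a leaf."""
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            shared = sum(
                1 for t in range(n_trees) if leaf_ids[i][t] == leaf_ids[j][t]
            )
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def distance_matrix(prox):
    """Turn proximities into dissimilarities: 1 - proximity."""
    return [[1.0 - p for p in row] for row in prox]


# Hypothetical leaf assignments for 4 samples in a forest of 3 trees.
leaf_ids = [
    [1, 4, 2],
    [1, 4, 3],
    [5, 6, 7],
    [5, 6, 2],
]

prox = proximity_matrix(leaf_ids)
dist = distance_matrix(prox)

for row in dist:
    print([round(d, 2) for d in row])
```

A dissimilarity of this kind can be handed to any clustering algorithm that accepts precomputed distances, such as k-medoids, so that samples routed through the forest in a similar way end up in the same cluster.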

Prefer a visual walkthrough? Watch our short introduction video by clicking below:

Video

Curious how Forest-Guided Clustering compares to standard methods? See our notebook: Introduction to FGC: Comparison of Forest-Guided Clustering and Feature Importance.

Want to dive deeper? Visit our full documentation for:

  • Getting Started – Installation and quick start
  • Tutorials – Use cases for classification, regression, and large datasets
  • API Reference – Detailed descriptions of functions and classes

🛠️ Installation

Requirements

This package was tested with Python 3.8 to 3.13 on Ubuntu, macOS, and Windows. It depends on the kmedoids Python package. If you are using Windows or macOS, you may need to install Rust/Cargo first:

conda install -c conda-forge rust

If this does not work, please try to install Cargo from source:

git clone https://github.com/rust-lang/cargo
cd cargo
cargo build --release

For further information on the kmedoids package, please visit this page.

All other required packages are automatically installed if installation is done via pip.

Install Options

Install the package with pip. Note: if you are using conda, first install pip with: conda install pip.

PyPI install:

pip install fgclustering

Installation from source:

git clone https://github.com/HelmholtzAI-Consultants-Munich/fg-clustering.git

  • Installation as Python package (run inside directory):

      pip install .

  • Development installation as Python package (run inside directory):

      pip install -e .

💻 How to Use Forest-Guided Clustering

Basic Usage

To apply Forest-Guided Clustering (FGC) for explaining a Random Forest model, you can follow the simple workflow consisting of three main steps: computing the forest-guided clusters, evaluating feature importance, and visualizing the results.

# imports (assumption: these names are exported at the package top level)
from fgclustering import (
    forest_guided_clustering,
    forest_guided_feature_importance,
    plot_forest_guided_feature_importance,
    plot_forest_guided_decision_paths,
    DistanceRandomForestProximity,
    ClusteringKMedoids,
)

# compute the forest-guided clusters
fgc = forest_guided_clustering(
    estimator=model,
    X=X,
    y=y,
    clustering_distance_metric=DistanceRandomForestProximity(),
    clustering_strategy=ClusteringKMedoids(),
)

# evaluate feature importance
feature_importance = forest_guided_feature_importance(
    X=X,
    y=y,
    cluster_labels=fgc.cluster_labels,
    model_type=fgc.model_type,
)

# visualize the results
plot_forest_guided_feature_importance(
    feature_importance_local=feature_importance.feature_importance_local,
    feature_importance_global=feature_importance.feature_importance_global,
)

plot_forest_guided_decision_paths(
    data_clustering=feature_importance.data_clustering,
    model_type=fgc.model_type,
)

where

  • estimator is the trained Random Forest model
  • X is the feature matrix
  • y is the target variable
  • clustering_distance_metric defines how similarity between samples is measured based on the Random Forest structure
  • clustering_strategy determines how the proximity-based clustering is performed

For a detailed walkthrough, refer to the Introduction to FGC: Simple Use Cases notebook.

Using FGC on Large Datasets

When working with datasets containing a large number of samples, Forest-Guided Clustering (FGC) provides several strategies to ensure efficient performance and scalability:

  • Parallelize Cluster Optimization: Leverage multiple CPU cores by setting the n_jobs parameter to a value greater than 1 in the forest_guided_clustering() function. This will parallelize the bootstrapping process for evaluating cluster stability.

  • Use a Faster Clustering Algorithm: Improve the efficiency of the K-Medoids clustering step by using the optimized "fasterpam" algorithm. Set the method parameter of your clustering strategy (e.g., ClusteringKMedoids(method="fasterpam")) to activate this faster implementation.

  • Enable Subsampling with CLARA: For extremely large datasets, consider using the CLARA (Clustering Large Applications) variant by choosing ClusteringClara() as your clustering strategy. CLARA performs clustering on smaller random subsamples, making it suitable for high-volume data.
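The subsampling idea behind CLARA can be sketched in plain Python. This is a minimal illustration, not the code behind `ClusteringClara()`: the function names, the toy one-dimensional data, and the absolute-difference metric are all assumptions made for the example. In FGC the dissimilarity would instead be derived from the Random Forest.

```python
import random


def total_cost(data, medoids, points, metric):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(metric(data[p], data[m]) for m in medoids) for p in points)


def pam(data, subset, k, metric, rng):
    """Swap-based k-medoids restricted to the indices in `subset`."""
    medoids = rng.sample(subset, k)
    best = total_cost(data, medoids, subset, metric)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in subset:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with the candidate point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(data, trial, subset, metric)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def clara(data, k, n_draws=5, sample_size=6,
          metric=lambda a, b: abs(a - b), seed=0):
    """Cluster several random subsamples and keep the medoids that
    give the lowest cost on the full dataset."""
    rng = random.Random(seed)
    everyone = list(range(len(data)))
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_draws):
        subset = rng.sample(everyone, min(sample_size, len(everyone)))
        medoids = pam(data, subset, k, metric, rng)
        # Medoids are found on the subsample but scored on all points.
        cost = total_cost(data, medoids, everyone, metric)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Label every point with the position of its nearest medoid.
    labels = [
        min(range(k), key=lambda c: metric(data[p], data[best_medoids[c]]))
        for p in everyone
    ]
    return best_medoids, labels, best_cost


# Toy data: two well-separated groups of five points each.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
medoids, labels, cost = clara(data, k=2, sample_size=6)
print(medoids, labels, round(cost, 2))
```

The expensive swap search only ever sees `sample_size` points, while the full dataset is touched once per draw to score the candidate medoids. That trade-off is what makes this family of methods attractive when the number of samples is large.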

For a detailed example, please refer to the notebook Special Case: FGC for Big Datasets.

🤝 Contributing

We welcome contributions of all kinds—whether it’s improvements to the code, documentation, tutorials, or examples. Your input helps make Forest-Guided Clustering more robust and useful for the community.

To contribute:

  1. Fork the repository.
  2. Make your changes in a feature branch.
  3. Submit a pull request to the main branch.

We’ll review your submission and work with you to get it merged.

If you have any questions or ideas you'd like to discuss before contributing, feel free to reach out to Lisa Barros de Andrade e Sousa.

📝 How to cite

If you find Forest-Guided Clustering useful in your research or applications, please consider citing it:

@software{lisa_sousa_2022_7823042,
    author       = {Lisa Barros de Andrade e Sousa and
                    Dominik Thalmeier and
                    Helena Pelin and
                    Marie Piraud},
    title        = {{Forest-Guided Clustering - Explainability for Random Forest Models}},
    month        = apr,
    year         = 2022,
    publisher    = {Zenodo},
    version      = {v1.0.3},
    doi          = {10.5281/zenodo.7823042},
    url          = {https://doi.org/10.5281/zenodo.7823042}
}

🛡️ License

The fgclustering package is released under the MIT License. You are free to use, modify, and distribute it under the terms outlined in the LICENSE file.

