ClusterExplorer
This repository contains the code for ClusterExplorer, a novel explainability tool for black-box clustering pipelines.
Explaining the results of clustering pipelines
Our approach formulates the explanation of clusters as the identification of concise conjunctions of predicates that maximize the coverage of the cluster's data points while minimizing the separation error with respect to other clusters. We achieve this by reducing the problem to generalized frequent-itemset mining (gFIM), where items correspond to explanation predicates and itemset frequency indicates coverage. To improve efficiency, we exploit inherent properties of the problem and apply attribute selection to further reduce computational cost.
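The reduction described above can be illustrated with a minimal sketch. It is not the gFIM implementation from this repository: it uses plain brute-force enumeration, and the toy data, the function names, the thresholds, and the exact definitions of coverage and separation error are all assumptions made for illustration.

```python
from itertools import combinations

# Toy data (hypothetical): each row maps attributes to already-binned values.
rows = [
    {"alcohol": "high", "color": "dark"},
    {"alcohol": "high", "color": "dark"},
    {"alcohol": "high", "color": "light"},
    {"alcohol": "low", "color": "dark"},
    {"alcohol": "low", "color": "light"},
    {"alcohol": "low", "color": "light"},
]
labels = [0, 0, 0, 1, 1, 1]  # cluster id of each row


def satisfies(row, rule):
    """A rule is a conjunction of (attribute, value) equality predicates."""
    return all(row[attr] == value for attr, value in rule)


def coverage(rule, cluster):
    """Assumed definition: fraction of the cluster's rows that satisfy the rule."""
    members = [r for r, l in zip(rows, labels) if l == cluster]
    return sum(satisfies(r, rule) for r in members) / len(members)


def separation_error(rule, cluster):
    """Assumed definition: fraction of rule-matching rows outside the cluster."""
    matched = [l for r, l in zip(rows, labels) if satisfies(r, rule)]
    if not matched:
        return 0.0
    return sum(l != cluster for l in matched) / len(matched)


def mine_rules(cluster, min_coverage=0.6, max_error=0.2, max_len=2):
    """Enumerate predicate sets (itemsets) and keep those passing both thresholds."""
    # Items are the predicates that occur in the cluster's own rows.
    items = sorted({(a, v) for r, l in zip(rows, labels) if l == cluster
                    for a, v in r.items()})
    found = []
    for size in range(1, max_len + 1):
        for rule in combinations(items, size):
            # Skip itemsets with two predicates on the same attribute.
            if len({attr for attr, _ in rule}) < size:
                continue
            cov = coverage(rule, cluster)
            err = separation_error(rule, cluster)
            if cov >= min_coverage and err <= max_error:
                found.append((rule, cov, err))
    return found


rules_for_cluster_0 = mine_rules(cluster=0)
for rule, cov, err in rules_for_cluster_0:
    print(rule, round(cov, 2), round(err, 2))
```

Here the itemset frequency within the cluster plays the role of coverage, and `max_len` stands in for conciseness. A practical miner would prune candidates instead of enumerating every combination.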
Source Code
The source code is located in the cluster-explorer/src directory. This directory contains the following key components:
- Explainer (explainer.py): Generates rule-based explanations for each cluster using frequent-itemset mining.
- Frequent Itemset Mining (gFIM.py): Contains methods for frequent itemset mining.
- Clustering Rule Evaluation (ScoreMetrics.py, AnalyzeItemsets.py): Provides methods to evaluate and summarize the quality of clustering rules based on metrics such as separation error, coverage, and conciseness.
- Binning Methods (binning_methods.py): Contains methods for binning numeric attributes, including equal width, equal frequency, decision tree-based, and multiclass optimal binning techniques.
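As an illustration of how numeric attributes can be turned into interval predicates, the following sketch shows equal-width and equal-frequency binning using only the standard library. The function names and the sample data are hypothetical and do not reflect the actual interface of binning_methods.py.

```python
def equal_width_bins(values, n_bins):
    """Return n_bins + 1 edges that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins)] + [hi]


def equal_frequency_bins(values, n_bins):
    """Return edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Use the value at each i/n_bins position of the sorted data as an inner edge.
    inner = [ordered[(i * n) // n_bins] for i in range(1, n_bins)]
    return [ordered[0]] + inner + [ordered[-1]]


def to_predicate(value, edges, attribute):
    """Map a numeric value to the interval predicate of the bin that contains it."""
    last = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        if lo <= value < hi:
            return f"{lo} <= {attribute} < {hi}"
        # The last bin is closed on the right so the maximum is included.
        if i == last and value == hi:
            return f"{lo} <= {attribute} <= {hi}"
    raise ValueError(f"{value} is outside the binned range")


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_edges = equal_width_bins(data, 3)
freq_edges = equal_frequency_bins(data, 3)
print(width_edges)
print(freq_edges)
print(to_predicate(5, freq_edges, "x"))
```

With a skewed attribute such as `data`, equal-width binning places almost every value in the first bin, while equal-frequency binning spreads the values evenly. This is one reason to offer several binning strategies.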
Experiment Datasets
ClusterExplorer was evaluated using a diverse set of 98 clustering results obtained from various clustering pipelines and algorithms. The datasets used in these experiments were sourced from the UCI Machine Learning Repository and cover a wide range of data shapes and sizes.
Datasets Overview
The datasets used in the experiments include:
| Dataset | Rows | Attributes | Link |
|---|---|---|---|
| Urban Land Cover | 168 | 148 | Link |
| DARWIN | 174 | 451 | Link |
| Wine | 178 | 13 | Link |
| Flags | 194 | 30 | Link |
| Parkinson Speech | 1,040 | 26 | Link |
| Communities and Crime | 1,994 | 128 | Link |
| Turkiye Student Evaluation | 5,820 | 33 | Link |
| In-vehicle Coupon Recommendation | 12,684 | 23 | Link |
| Human Activity Recognition | 10,299 | 561 | Link |
| Quality Assessment of Digital Colposcopies | 30,000 | 23 | Link |
| RT-IoT2022 | 123,117 | 85 | Link |
| Gender by Name | 147,270 | 4 | Link |
| Multivariate Gait Data | 181,800 | 7 | Link |
| Wave Energy Converters | 288,000 | 49 | Link |
| 3D Road Network | 434,874 | 4 | Link |
| Year Prediction MSD | 515,345 | 90 | Link |
| Online Retail | 1,067,371 | 8 | Link |
| MetroPT-3 Dataset | 1,516,948 | 15 | Link |
| Taxi Trajectory | 1,710,670 | 9 | Link |
Clustering Pipelines
The clustering results were generated using 16 different clustering pipelines, each combining various preprocessing steps and clustering algorithms. The pipelines are located in clustering_pipelines.py. The preprocessing steps included standard scaling for numeric columns, one-hot encoding for categorical data, and dimensionality reduction using PCA. The clustering algorithms used were K-Means, DBSCAN, Birch, Spectral Clustering, and Affinity Propagation.
To run the pipelines, first save the datasets in a datasets folder, then provide that folder along with the folder in which the pipeline results should be saved.
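For orientation, one such pipeline could be sketched with scikit-learn as follows. This is an assumption about the general shape of a pipeline, not the code in clustering_pipelines.py. The toy table, its column names, and the parameter values are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table with two numeric columns and one categorical column.
df = pd.DataFrame({
    "height": [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],
    "weight": [10.0, 11.0, 9.5, 50.0, 52.0, 49.0],
    "kind": ["a", "a", "a", "b", "b", "b"],
})

preprocess = ColumnTransformer(
    transformers=[
        # Standard scaling for numeric columns.
        ("numeric", StandardScaler(), ["height", "weight"]),
        # One-hot encoding for categorical columns.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["kind"]),
    ],
    sparse_threshold=0.0,  # always emit a dense array for the PCA step
)

pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("pca", PCA(n_components=2)),  # dimensionality reduction
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# One cluster label per input row; these labels are what gets explained.
labels = pipeline.fit_predict(df)
print(labels)
```

Swapping the final step for another estimator such as `DBSCAN` or `Birch` yields a different pipeline over the same preprocessing.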
Running the Experiments
To run the experiments (located in cluster-explorer/experiments), provide BaselinesExperiment.py with the folder containing the pipeline results. The results will be saved in cluster-explorer/experiments.
Additional Experiments
The cluster-explorer/additional_experiments folder contains information about the effect of our attribute-selection optimization on both explanation quality and running time.
To run these experiments, provide P_ValueExperiment.py with the folder containing the pipeline results. The results will be saved in cluster-explorer/additional_experiments.
Use Cases and Examples
A simple use case is provided in the example_notebook.ipynb file. In this notebook, we generate a set of explanation rules for the 'Wine' dataset. For each cluster, ClusterExplorer generates a set of rules that describe the common properties of the wine samples within that cluster. These rules help in understanding why certain samples are grouped together and what distinguishes one cluster from another.