ClusterExplorer

This repository contains the code for ClusterExplorer, a novel explainability tool for black-box clustering pipelines. Our approach formulates the explanation of clusters as the identification of concise conjunctions of predicates that maximize the coverage of the cluster's data points while minimizing the separation error with respect to other clusters.

Explaining the results of clustering pipelines

We reduce the cluster-explanation problem to generalized frequent-itemsets mining (gFIM), where items correspond to explanation predicates and itemset frequency indicates coverage. To improve efficiency, we exploit inherent properties of the problem and apply attribute selection to further reduce computational cost.
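To make the reduction concrete, here is a minimal sketch in plain Python. It is not the code in gFIM.py: the toy data, the function names, and the exact metric definitions are assumptions for illustration. Coverage is taken to be the fraction of a cluster's points that satisfy a rule, and separation error the fraction of points satisfying the rule that belong to other clusters. The search is a naive enumeration of predicate conjunctions rather than an optimized miner.

```python
from itertools import combinations

# Toy categorical data (hypothetical): one dict per data point.
rows = [
    {"color": "red", "size": "small"},
    {"color": "red", "size": "small"},
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "large"},
    {"color": "blue", "size": "small"},
    {"color": "blue", "size": "large"},
]
labels = [0, 0, 0, 1, 1, 1]  # cluster label of each row


def satisfies(row, rule):
    """A rule is a conjunction of (attribute, value) predicates."""
    return all(row[attr] == value for attr, value in rule)


def coverage(rule, rows, labels, cluster):
    """Assumed definition: share of the cluster's points matching the rule."""
    members = [r for r, l in zip(rows, labels) if l == cluster]
    return sum(satisfies(r, rule) for r in members) / len(members)


def separation_error(rule, rows, labels, cluster):
    """Assumed definition: share of matching points outside the cluster."""
    matched = [l for r, l in zip(rows, labels) if satisfies(r, rule)]
    if not matched:
        return 0.0
    return sum(l != cluster for l in matched) / len(matched)


def explain_cluster(rows, labels, cluster, min_coverage=0.6, max_len=2):
    """Enumerate 'frequent' conjunctions (coverage >= min_coverage) and
    rank them by separation error, then length, then coverage."""
    members = [r for r, l in zip(rows, labels) if l == cluster]
    # Items are the predicates that occur inside the cluster.
    items = sorted({(a, v) for r in members for a, v in r.items()})
    found = []
    for k in range(1, max_len + 1):
        for rule in combinations(items, k):
            if len({attr for attr, _ in rule}) < k:
                continue  # skip rules that constrain one attribute twice
            cov = coverage(rule, rows, labels, cluster)
            if cov >= min_coverage:
                err = separation_error(rule, rows, labels, cluster)
                found.append((rule, cov, err))
    return sorted(found, key=lambda t: (t[2], len(t[0]), -t[1]))


best_rule, best_cov, best_err = explain_cluster(rows, labels, cluster=0)[0]
print(best_rule, best_cov, best_err)
```

Because adding a predicate to a conjunction can only lower its coverage, a real miner can prune every extension of an infrequent itemset. The sketch above skips that pruning for brevity.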

Source Code

The source code is located in the cluster-explorer/src directory. This directory contains the following key components:

  1. Explainer (explainer.py): Generates rule-based explanations for each cluster using frequent-itemsets mining.

  2. Frequent Itemset Mining (gFIM.py): Contains methods for frequent itemset mining.

  3. Clustering Rule Evaluation (ScoreMetrics.py, AnalyzeItemsets.py): Provides methods to evaluate and summarize the quality of clustering rules based on metrics such as separation error, coverage, and conciseness.

  4. Binning Methods (binning_methods.py): Contains methods for binning numeric attributes, including equal-width, equal-frequency, decision-tree-based, and multiclass optimal binning techniques.
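As an illustration of the two simplest binning strategies named above, the following sketch computes equal-width and equal-frequency bin edges in plain Python. It does not reproduce binning_methods.py; the function names and the toy values are hypothetical. Each resulting interval can then serve as a range predicate on a numeric attribute.

```python
def equal_width_edges(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def equal_frequency_edges(values, n_bins):
    """Pick edges so each interval holds roughly the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    inner = [ordered[(i * n) // n_bins] for i in range(1, n_bins)]
    return [ordered[0]] + inner + [ordered[-1]]


def assign_bin(value, edges):
    """Index of the interval [edges[i], edges[i+1]) containing value.
    The last interval is closed on the right."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


def bin_counts(values, edges):
    """Number of values that fall into each interval."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        counts[assign_bin(v, edges)] += 1
    return counts


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # one outlier skews the range

width_edges = equal_width_edges(values, 3)
freq_edges = equal_frequency_edges(values, 3)
print(width_edges, bin_counts(values, width_edges))
print(freq_edges, bin_counts(values, freq_edges))
```

On skewed data such as this, equal-width binning puts almost every point into the first interval, while equal-frequency binning keeps the intervals balanced. That difference is one reason to offer several binning strategies.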

Experiment Datasets

ClusterExplorer was evaluated on a diverse set of 98 clustering results obtained from various clustering pipelines and algorithms. The datasets used in these experiments were sourced from the UCI Machine Learning Repository and cover a wide range of data shapes and sizes.

Datasets Overview

The datasets used in the experiments include:

| Dataset | Rows | Attributes |
|---|---|---|
| Urban Land Cover | 168 | 148 |
| DARWIN | 174 | 451 |
| Wine | 178 | 13 |
| Flags | 194 | 30 |
| Parkinson Speech | 1,040 | 26 |
| Communities and Crime | 1,994 | 128 |
| Turkiye Student Evaluation | 5,820 | 33 |
| In-vehicle Coupon Recommendation | 12,684 | 23 |
| Human Activity Recognition | 10,299 | 561 |
| Quality Assessment of Digital Colposcopies | 30,000 | 23 |
| RT-IoT2022 | 123,117 | 85 |
| Gender by Name | 147,270 | 4 |
| Multivariate Gait Data | 181,800 | 7 |
| Wave Energy Converters | 288,000 | 49 |
| 3D Road Network | 434,874 | 4 |
| Year Prediction MSD | 515,345 | 90 |
| Online Retail | 1,067,371 | 8 |
| MetroPT-3 Dataset | 1,516,948 | 15 |
| Taxi Trajectory | 1,710,670 | 9 |

Clustering Pipelines

The clustering results were generated using 16 different clustering pipelines (defined in clustering_pipelines.py), each combining various preprocessing steps with a clustering algorithm. The preprocessing steps included standard scaling for numeric columns, one-hot encoding for categorical data, and dimensionality reduction using PCA. The clustering algorithms used were K-Means, DBSCAN, Birch, Spectral Clustering, and Affinity Propagation.
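For orientation, one such pipeline might look like the sketch below, written with scikit-learn. This is an assumption-laden illustration, not the contents of clustering_pipelines.py: the helper name build_pipeline, the column names, the toy data, and every parameter value are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols, n_components=2, n_clusters=2):
    """Standard scaling + one-hot encoding, then PCA, then K-Means."""
    preprocess = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array for PCA
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocess),
            ("pca", PCA(n_components=n_components, random_state=0)),
            ("cluster", KMeans(n_clusters=n_clusters, n_init=10, random_state=0)),
        ]
    )


# Hypothetical toy data with two well-separated groups.
df = pd.DataFrame(
    {
        "alcohol": [13.1, 13.4, 12.9, 10.0, 10.2, 9.8],
        "hue": [1.05, 1.00, 1.10, 0.50, 0.55, 0.45],
        "origin": ["A", "A", "A", "B", "B", "B"],
    }
)

pipeline = build_pipeline(["alcohol", "hue"], ["origin"])
cluster_labels = pipeline.fit_predict(df)
print(list(cluster_labels))
```

Swapping the final step for DBSCAN, Birch, SpectralClustering, or AffinityPropagation from sklearn.cluster gives the other algorithm families listed above. An explainer then only needs the original data frame and the resulting labels, which is what makes the pipeline a black box from its point of view.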

To run the pipelines, provide the path to the datasets folder (save the datasets there first) and the path to a folder where the pipeline results will be saved.

Running the Experiments

To run the experiments (located in cluster-explorer/experiments), provide the folder containing the pipeline results to BaselinesExperiment.py. The results will be saved in cluster-explorer/experiments.

Additional Experiments

This folder contains experiments on the effect of our attribute-selection optimization on both explanation quality and running time. To run these experiments (located in cluster-explorer/additional_experiments), provide the folder containing the pipeline results to P_ValueExperiment.py. The results will be saved in cluster-explorer/additional_experiments.

Use Cases and Examples

A simple use case is provided in example_notebook.ipynb. In this notebook we generate a set of explanation rules for the 'Wine' dataset. For each cluster, ClusterExplorer generates a set of rules that describe the common properties of the wine samples within that cluster. These rules help explain why certain samples are grouped together and what distinguishes one cluster from another.
