
Improving Clustering Problem Analysis with Simple Silhouette Metric Support and Sklearn Estimator Fits

Project description

cluster-ss: Cluster Silhouette Support

Clustering problems are hard, and even harder in Python: few packages exist to speed up the implementation of metrics and other helpful supports such as cluster analysis, visualization, estimator training and dimensionality reduction. Note: this package was heavily influenced by the lazypredict package; I loved the idea of fitting multiple estimators with one command.

To address these problems, cluster-ss is a simple package that facilitates these steps of clustering analysis, adding some simple ways to fit estimators and visualize one of my favorite metrics, the silhouette score.

The package offers:

  • Simple ways to visualize silhouette score and analysis.
  • Fit functions from all sklearn cluster estimators.
  • Setup multiple parameters to fit these estimators.

This is a simple study package proposed by me, and I invite the community to contribute. Please help by trying it out, reporting bugs, making improvements and other cool things. :)

Link pypi: https://pypi.org/project/cluster-ss/

Installation

pip install cluster-ss

for the latest version.

Usage

Basic Silhouette Plots

For basic plots, just prepare your dataset, select one of the sklearn base clustering estimators and a list of k values, then use plot_silhouette or plot_silhouette_score.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Import the plot functions
from cluster_ss import plot_silhouette, plot_silhouette_score

# List of Clusters
cluster_list = [2, 4, 6, 8]

# Simple Random Blobs Dataset
X, y = make_blobs(n_samples=7_000, centers=4, n_features=5)

# A pandas Dataframe of Blobs dataset.
X = pd.DataFrame(X)

# Use the plot_silhouette
ax, fig = plot_silhouette(estimator=KMeans(), X=X, cluster_list=cluster_list)

You can also use plot_silhouette_score to get labels, scores and a plot for each k in the cluster list:

# labels_at_k is a list of dicts mapping each k to its fitted labels.
# sil_scores_at_k is the sklearn silhouette_score result for X and those labels.
labels_at_k, sil_scores_at_k, ax, fig = plot_silhouette_score(estimator=KMeans(), X=X, cluster_list=cluster_list)
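Under the hood, each score in sil_scores_at_k is just sklearn's silhouette_score computed on the fitted labels. A minimal standalone sketch of the same loop, using only sklearn rather than cluster-ss:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Small blobs dataset with 4 true centers.
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=42)

# Fit one KMeans per k and record the overall silhouette score.
scores = {}
for k in [2, 4, 6, 8]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best candidate cluster count.
best_k = max(scores, key=scores.get)
```

With well-separated blobs, the score peaks at the true number of centers.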

Multiple Sklearn Clustering Estimators Fits and Plots

ClusterSupport supports verbose fitting output, a random_state and extra_parameters for fitting the estimators.

# You can import the class for multiple plots and setups.
from cluster_ss import ClusterSupport
cs = ClusterSupport(verbose=True)

# Using the same previous dataset and multiple silhouette scores plot.
# The 'estimators_selection' argument picks which clustering estimators to use:
# 'all' -> fit all estimators.
# 'k_inform' -> estimators that require n_clusters or n_components.
# 'no_k_inform' -> density-based or similar estimators that take no cluster number.
fit_results, no_k_results, axes, fig = cs.plot_multiple_silhouette_score(X=X, cluster_list=cluster_list, estimators_selection='all')

Using estimators_selection='all' you get a silhouette score plot for every fitted estimator.

The fit_results and no_k_results variables are pandas DataFrames with silhouette scores for all estimators.
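The 'k_inform' / 'no_k_inform' split can also be checked by hand, by inspecting whether an estimator's signature accepts n_clusters or n_components. A hypothetical sketch (takes_k is not part of cluster-ss):

```python
import inspect
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.mixture import GaussianMixture

def takes_k(est_cls):
    """True if the estimator's __init__ accepts n_clusters or n_components."""
    params = inspect.signature(est_cls.__init__).parameters
    return 'n_clusters' in params or 'n_components' in params

candidates = [KMeans, GaussianMixture, DBSCAN, MeanShift]
k_inform = [c.__name__ for c in candidates if takes_k(c)]
no_k_inform = [c.__name__ for c in candidates if not takes_k(c)]
```

KMeans and GaussianMixture land in the k-inform group; DBSCAN and MeanShift need no cluster count.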

# Using additional parameters:
# configure a list (or a single dict) keyed by estimator name,
# each mapping to a dict of custom params for that estimator.
extra_parameters = [
    {'KMeans': {'max_iter': 320,
                'tol': 0.001}},
    {'GaussianMixture': {'max_iter': 110,
                         'n_init': 2,
                         'reg_covar': 1e-05,
                         'tol': 0.01}}
]

# Select a list of estimator names
# to skip during setup and fitting.
skipped = ["spectralcoclustering", "agglomerativeclustering", 
           "bayesiangaussianmixture", "dbscan", "optics", "birch"]

# A List of Clusters
cluster_list = [2, 3, 4, 5]

cs = ClusterSupport(
    verbose=True, 
    extra_parameters=extra_parameters, 
    skipped_estimators=skipped
)

axes, fig = cs.plot_multiple_silhouette(cluster_list=cluster_list, fit=True, X=X)
  0%|          | 0/4 [00:00<?, ?it/s]

Training: Estimator: kmeans -> K-Num: 2
Training: Estimator: minibatchkmeans -> K-Num: 2
Training: Estimator: gaussianmixture -> K-Num: 2
Training: Estimator: bisectingkmeans -> K-Num: 2
Training: Estimator: spectralclustering -> K-Num: 2
Training: Estimator: spectralbiclustering -> K-Num: 2

 25%|██▌       | 1/4 [00:14<00:43, 14.65s/it]
 
Training: Estimator: kmeans -> K-Num: 3
Training: Estimator: minibatchkmeans -> K-Num: 3
Training: Estimator: gaussianmixture -> K-Num: 3
Training: Estimator: bisectingkmeans -> K-Num: 3
Training: Estimator: spectralclustering -> K-Num: 3
Training: Estimator: spectralbiclustering -> K-Num: 3
...

An example of one figure generated by plot_multiple_silhouette.

# You can also use fit to get silhouettes for estimators_selection 'k_inform', 'no_k_inform' and 'all'.
# It returns silhouette scores for all estimators plus sils_info_results, a list of dicts with the fit results:
# {estimator_name_k_number: (sklearn_silhouette_samples, labels)}
# e.g. {'Birch_2': (array([0.4647072, ..., 0.43923615]),
#                   array([0, 1, 0, ..., 0, 0, 0]))}
silhouettes, no_k_silhouettes, sils_info_results = cs.fit(X=X, cluster_list=cluster_list, estimators_selection='all')

The silhouettes variable is a pandas DataFrame of sklearn silhouette_score results for all k clusters in cluster_list:

index                      2         3         4         5
bisectingkmeans       0.661884  0.610811  0.688161  0.564153
gaussianmixture       0.661884  0.610806  0.688178  0.548734
kmeans                0.661884  0.610811  0.688178  0.556731
minibatchkmeans       0.661884  0.610811  0.688178  0.512270
spectralbiclustering  0.661884  0.415127  0.542108  0.557163
spectralclustering    0.661884  0.610806  0.688178  0.559767
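With silhouettes in this shape, picking the best (estimator, k) cell is a one-liner in pandas. A small sketch using two rows reproduced from the table above:

```python
import pandas as pd

# Two rows of silhouette scores from the table above (columns are k values).
silhouettes = pd.DataFrame(
    {2: [0.661884, 0.661884],
     3: [0.610811, 0.610806],
     4: [0.688161, 0.688178],
     5: [0.564153, 0.548734]},
    index=['bisectingkmeans', 'gaussianmixture'],
)

# stack() turns the table into a Series indexed by (estimator, k),
# so idxmax() returns the best-scoring pair directly.
best_estimator, best_k = silhouettes.stack().idxmax()
```

Here the highest cell is gaussianmixture at k=4.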

The no_k_silhouettes variable is a pandas DataFrame with sklearn silhouette_score results for the estimators that take no cluster number:

index                Silhouette
affinitypropagation    0.121989
meanshift              0.661884

Finally, the sils_info_results variable is a list of all fits with cluster labels and silhouette_samples. For more details and complete usage, please check the examples folder. Thanks!
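Each sils_info_results entry pairs per-sample silhouettes with labels; the same structure can be rebuilt with plain sklearn (the 'KMeans_3' key here is illustrative, not produced by cluster-ss):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Same shape as a sils_info_results entry:
# {name_k: (per-sample silhouettes, labels)}
entry = {'KMeans_3': (silhouette_samples(X, labels), labels)}

# Per-cluster mean silhouette, useful for spotting weak clusters.
samples, lbls = entry['KMeans_3']
per_cluster_mean = {c: samples[lbls == c].mean() for c in np.unique(lbls)}
```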

References

[1] lazypredict package: a Python package with a similar one-command fit implementation, but for supervised learning.

[2] Selecting the number of clusters with silhouette analysis on KMeans clustering: scikit-learn example showing the raw silhouette analysis code.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cluster_ss-0.0.5.tar.gz (14.9 kB view details)

Uploaded Source

Built Distribution

cluster_ss-0.0.5-py3-none-any.whl (15.0 kB view details)

Uploaded Python 3

File details

Details for the file cluster_ss-0.0.5.tar.gz.

File metadata

  • Download URL: cluster_ss-0.0.5.tar.gz
  • Upload date:
  • Size: 14.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.12 Linux/6.8.0-52-generic

File hashes

Hashes for cluster_ss-0.0.5.tar.gz
Algorithm Hash digest
SHA256 0546478f17fc7601c1433c72391a12215c42419322bd2ad7fdf196df77be1748
MD5 2b2843adf49fc430ce2fc2d589474bb9
BLAKE2b-256 366d9179f585efaf2c6ab8c1fc0b9801f1b67386b668d73b05fb98e761ec6bc2


File details

Details for the file cluster_ss-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: cluster_ss-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 15.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.12 Linux/6.8.0-52-generic

File hashes

Hashes for cluster_ss-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 caf6116e8ca9807d2923732d471e88c093509df366a99b35aaf5db2c4a04638f
MD5 b47aa3a3bddd2cc93202f73de92d4bac
BLAKE2b-256 251866251a85bfb02cecab30ea063cdc58f173f4231aecb8edb9965b34c51aa9

