ClusterSupport
ClusterSupport is a small package designed to enhance scikit-learn's clustering capabilities. It combines scikit-learn's clustering algorithm classes, such as KMeans(), AgglomerativeClustering(), and DBSCAN(), with additional functions for the analysis and optimisation of clustering results.
Dependencies
ClusterSupport is built on scikit-learn, and also requires pandas (its results are returned as DataFrames) and matplotlib (for the plotting functions).
Analysing clustering results
ClusterSupport inherits clustering classes from scikit-learn and wraps their .fit() methods so that calling .fit() returns an instance of the ClusteringResult() class. Let's load one of scikit-learn's toy datasets to see some of the functionality.
import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2, so this requires an earlier version

boston = load_boston()
data = pd.DataFrame(boston['data'], columns = boston['feature_names'])
We can then call the .fit() method to return an instance of the ClusteringResult() class.
import clustersupport as cs
results = cs.KMeans(n_clusters = 3).fit(data)
We can calculate a metric of clustering structure, for example the Calinski-Harabasz score or C-index, using:
CH_score = results.CH_score()
C_index = results.C_index()
Currently, ClusterSupport supports the following clustering metrics:
- Silhouette score
- Calinski-Harabasz score
- C-index
- Inertia (sum of intra-cluster distances)
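Several of these metrics are also exposed directly by scikit-learn, which can be a useful cross-check. A minimal sketch on one of the toy datasets (the C-index has no scikit-learn equivalent, so it is omitted here):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = load_iris()["data"]
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

sil = silhouette_score(X, km.labels_)        # mean silhouette over all points, in [-1, 1]
ch = calinski_harabasz_score(X, km.labels_)  # ratio of between- to within-cluster dispersion
inertia = km.inertia_                        # sum of squared intra-cluster distances
print(round(sil, 3), round(ch, 1), round(inertia, 1))
```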
For another view we can get a summary of the clustering using:
summ = results.get_summary(dist_metric = 'sqeuclidean')
which returns a Pandas DataFrame containing values for the number of points in each cluster as well as the average distance around the cluster mean.
cluster | n | avg_dist_around_mean |
---|---|---|
0 | 102 | 6.18 |
1 | 366 | 8.66 |
2 | 38 | 6.69 |
The distance metric can be specified with the dist_metric argument. ClusterSupport currently supports 'sqeuclidean', 'euclidean', 'cosine', and 'manhattan' distances.
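The per-cluster summary above can also be reproduced by hand, which makes the definition concrete. A rough sketch using SciPy's cdist (the column names mirror the table above but are otherwise illustrative):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris()["data"]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

rows = []
for k in np.unique(labels):
    members = X[labels == k]
    centre = members.mean(axis=0, keepdims=True)
    # average squared-Euclidean distance of members to the cluster mean
    avg_dist = cdist(members, centre, metric="sqeuclidean").mean()
    rows.append({"cluster": k, "n": len(members), "avg_dist_around_mean": avg_dist})

summary = pd.DataFrame(rows).set_index("cluster")
print(summary)
```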
Finally, we can conduct a classifier assessment of the clustering. This involves training a classifier to predict the cluster to which each data point belongs and assessing the classifier's accuracy on a 'test set' of points that were not used in the training. This is particularly useful in contexts where the clustering is run on a reduced version of the full feature space, and we wish to analyse how effectively the clustering captures the full detail of the complete feature space.
from sklearn.decomposition import PCA
# set seed and apply PCA on the data
data_reduced = PCA(n_components = 3, random_state = 123).fit_transform(data)
# fit to the reduced data and save the labels
reduced_fit_labels = cs.KMeans(n_clusters = 8, random_state = 123).fit(data_reduced).labels_
# run fit with the full data set
clustering = cs.KMeans(n_clusters = 8).fit(data)
# run classifier assessment with the reduced fit labels
clf_assessment = clustering.classifier_assessment(classifier = 'logreg', labels = reduced_fit_labels, roc_plot = True, n = 50, save_fig = True, random_state = 123)
Specifying roc_plot = True uses matplotlib to plot an ROC curve for each cluster so that the user can see how well each cluster is classified.
The function also outputs a Pandas DataFrame with classification metrics (precision, recall, and f1 score) calculated for each cluster.
The AUC for each cluster reflects the classifier's ability to distinguish instances in that cluster, thus providing an estimate of:
- How well the reduced feature space represents the complete feature space
- How well-separated each cluster is
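The idea behind the classifier assessment can be sketched without the package: cluster in the reduced space, train a classifier on the full feature space to predict those labels, and score it on held-out points. This is a plain illustration of the idea, not ClusterSupport's implementation, and the variable names are made up here:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = load_iris()["data"]
X_reduced = PCA(n_components=2, random_state=123).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=123).fit_predict(X_reduced)

# train on the FULL feature space; the targets are the reduced-space cluster labels
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=123)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

High held-out accuracy suggests the reduced space preserved the structure that the clustering found.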
Analysing feature importances in clustering
ClusterSupport also provides methods for analysing the importance of different features in the clustering, either at a global or per-cluster level.
These functions are called as methods under the clustering classes inherited from scikit-learn (e.g. clustersupport.KMeans()).
The t_test method calculates a t-statistic for each feature in each cluster, computed as a scaled difference of means between feature values for instances inside the cluster and those outside it. It returns a Pandas DataFrame of size (n_clusters, n_features), with the calculated t-statistic or p-value for the respective cluster/feature combination in each cell. Welch's two-sample t-test, which does not assume equal variances, is used for the calculation.
feature_t_tests = cs.KMeans(n_clusters = 3).t_test(X = data, output = 'p-value')
You can also output the raw t-statistics with output = 't-statistic'.
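The per-feature Welch test can be sketched directly with SciPy; this is a plain re-implementation of the idea, not the package's code:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

pvals = pd.DataFrame(index=np.unique(labels), columns=X.columns, dtype=float)
for k in np.unique(labels):
    inside, outside = X[labels == k], X[labels != k]
    for col in X.columns:
        # Welch's t-test: inside-cluster vs outside-cluster feature values
        pvals.loc[k, col] = ttest_ind(inside[col], outside[col], equal_var=False).pvalue
print(pvals)
```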
Alternatively you can conduct a non-parametric Mann-Whitney U test to test the ranks of feature values inside/outside each cluster.
feature_MW = cs.KMeans(n_clusters = 3).mann_whitney(X = data, output = 'p-value')
This also returns a DataFrame of size (n_clusters, n_features), with the calculated statistic or p-value for the cluster/feature combination in each cell.
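An analogous sketch of the rank-based test with SciPy's mannwhitneyu (again an illustration, not the package's code):

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

mw_pvals = pd.DataFrame(index=np.unique(labels), columns=X.columns, dtype=float)
for k in np.unique(labels):
    for col in X.columns:
        # two-sided rank test of inside-cluster vs outside-cluster values
        mw_pvals.loc[k, col] = mannwhitneyu(
            X.loc[labels == k, col], X.loc[labels != k, col],
            alternative="two-sided",
        ).pvalue
print(mw_pvals)
```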
The leave_one_out() function assesses the global contribution of each feature to the clustering by removing each feature in turn and recalculating a global metric such as the Calinski-Harabasz score or the sum of intra-cluster distances (a.k.a. inertia).
feature_LOO = cs.KMeans(n_clusters = 3).leave_one_out(X = data, metric = 'CH_score')
This generates a Pandas Series showing the change in the clustering metric that was calculated when each feature was removed.
feature | change_in_CH_score |
---|---|
0 | 12.3713 |
1 | 8.71682 |
2 | -8.93893 |
3 | 36.7465 |
4 | -9.20392 |
5 | 26.1994 |
6 | -3.35618 |
7 | -7.85918 |
8 | -15.0409 |
9 | -21.3239 |
10 | 21.898 |
11 | 20.8256 |
12 | 3.82213 |
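The leave-one-out idea can be sketched as: refit after dropping each column and record the change in the metric (here the Calinski-Harabasz score, via scikit-learn). The sign convention below, where a positive value means the metric improved when the feature was removed, is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = pd.DataFrame(load_iris()["data"])

def ch_of(frame):
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(frame)
    return calinski_harabasz_score(frame, labels)

baseline = ch_of(X)
# positive change => the metric improved when the feature was removed
change = pd.Series(
    {col: ch_of(X.drop(columns=col)) - baseline for col in X.columns},
    name="change_in_CH_score",
)
print(change)
```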
Finally we can build a logistic regression model to calculate a coefficient for each feature in each cluster and return the p-value of the coefficient under the null hypothesis:
$$\frac{\beta}{\text{SE}(\beta)} \sim \mathcal{N}(\mu = 0, \sigma^{2} = 1)$$
This is done using the logistic_regression() function, which builds a logistic regression model for each cluster with y = 1 if an instance is in the cluster, and y = 0 if not:
feature_LR = cs.KMeans(n_clusters = 3).logistic_regression(X = data, output = 'p-value')
which returns a DataFrame of size (n_clusters, n_features), with the calculated value for the cluster/feature combination in each cell. Types of output can be chosen from 'coef', 'z-score', and 'p-value'.
cluster | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.398823 | 0.0147451 | 0.259453 | 0.398709 | 0.21047 | 0.391893 | 0.104119 | 0.0995283 | 0.396368 | 0.308628 | 0.308155 | 0.389234 | 0.292119 |
1 | 0.358399 | 0.398903 | 0.105571 | 0.37434 | 0.0327718 | 0.30657 | 0.385749 | 0.302824 | 0.0947453 | 0.00961577 | 0.202549 | 0.369796 | 0.169429 |
2 | 0.277988 | 0.00147331 | 0.198039 | 0.396121 | 0.0952189 | 0.35859 | 0.159246 | 0.191815 | 0.299553 | 0.124311 | 0.362565 | 0.390554 | 0.354773 |
These p-values are not adjusted for multiple comparisons.
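A hand-rolled sketch of the per-cluster Wald test above: fit a one-vs-rest logistic regression, estimate standard errors from the inverse Fisher information, and convert z-scores to two-sided p-values under the normal null stated above. This only illustrates the formula, not ClusterSupport's implementation; the default L2 penalty is kept so coefficients stay finite when a cluster is linearly separable, which makes these p-values approximate:

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris()["data"])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def wald_pvalues(X, y):
    # penalised fit keeps betas finite under perfect separation (approximate Wald test)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    Xd = np.hstack([np.ones((len(X), 1)), X])      # add intercept column
    p = clf.predict_proba(X)[:, 1]
    W = p * (1 - p)                                # Fisher weights
    cov = np.linalg.inv(Xd.T @ (Xd * W[:, None]))  # inverse observed information
    beta = np.concatenate([clf.intercept_, clf.coef_[0]])
    z = beta / np.sqrt(np.diag(cov))               # beta / SE(beta)
    return 2 * norm.sf(np.abs(z))[1:]              # two-sided p-values, intercept dropped

pv = {k: wald_pvalues(X, (labels == k).astype(int)) for k in np.unique(labels)}
for k, p in pv.items():
    print(k, np.round(p, 4))
```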
Optimizing clustering hyperparameters
ClusterSupport also provides functions for optimizing clustering hyperparameters.
Like the feature methods, these functions are called as methods under the clustering classes inherited from scikit-learn (e.g. clustersupport.KMeans()).
The simplest optimization method is elbow_plot(), which plots hyperparameter values against a clustering metric.
cs.KMeans().elbow_plot(X = data, parameter = 'n_clusters', parameter_range = range(2,10), metric = 'silhouette_score')
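The same kind of plot can be sketched with matplotlib alone; here using inertia rather than the silhouette score for brevity:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris()["data"]
ks = range(2, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("n_clusters")
plt.ylabel("inertia")
plt.savefig("elbow.png")
```

The "elbow" is the value of n_clusters after which the metric stops improving sharply.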
The gap_statistic() method is another function that can be used to optimise hyperparameters. It calculates the gap statistic and its standard error across a range of hyperparameter values.
For example, to optimise the number of clusters used in K-means clustering, we call the following:
gap_statistics = cs.KMeans().gap_statistic(X = data, parameter = 'n_clusters', parameter_range = range(2,10), metric = 'inertia', random_state = 123)
Note that the n_clusters argument of KMeans() is not used, since the function iterates through a specified range of hyperparameter values.
The parameter argument is passed as a string, and the parameter_range argument should be an iterable containing the values of the hyperparameter over which the gap statistic should be calculated.
The function defaults to calculating the gap statistic in terms of changes in inertia, but can also calculate the change in Calinski-Harabasz score, silhouette score, or C-index via the metric argument.
Tibshirani and colleagues propose taking the first value of k clusters at which the value of the gap statistic at k clusters is greater than the value for k+1 clusters minus the standard error at k+1 clusters.
The function returns a DataFrame of size (len(parameter_range), 2), containing the gap statistic and its standard error for each hyperparameter value.
n_clusters | gap_statistic | standard_error |
---|---|---|
2 | 0.01415 | 0.00134 |
3 | 0.04581 | 0.0019 |
4 | 0.06136 | 0.0018 |
5 | 0.06562 | 0.00212 |
6 | 0.07478 | 0.00244 |
7 | 0.118 | 0.0031 |
8 | 0.11725 | 0.00321 |
9 | 0.1476 | 0.00314 |
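The gap statistic itself can be sketched following Tibshirani et al.: compare log(inertia) on the data with its average under B uniform reference datasets drawn over the data's bounding box, with the usual sqrt(1 + 1/B) correction on the standard error. This is a compact illustration, not the package's implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

def gap_statistic(X, k, B=10, seed=0):
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = []
    for _ in range(B):
        # uniform reference sample over the data's bounding box
        Xr = rng.uniform(lo, hi, size=X.shape)
        ref.append(np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xr).inertia_))
    ref = np.asarray(ref)
    gap = ref.mean() - log_wk
    se = ref.std(ddof=0) * np.sqrt(1 + 1 / B)
    return gap, se

X = load_iris()["data"]
results = {k: gap_statistic(X, k) for k in range(2, 5)}
for k, (gap, se) in results.items():
    print(k, round(gap, 4), round(se, 4))
```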
We can also use the consensus_cluster() function to run Monti consensus clustering over a range of hyperparameter values. It is called in a similar way to the gap_statistic() function.
consensus_data = cs.KMeans().consensus_cluster(X = data, parameter = 'n_clusters', parameter_range = range(2,10), plot = True, random_state = 123)
Consensus clustering does not rely on any particular clustering metric. The plot argument defaults to True and causes the function to output the empirical CDFs of consensus values for the different hyperparameter values.
Monti et al. suggest picking the number of clusters k at which the largest increase is seen in the area under the CDF between k-1 and k clusters.
Șenbabaoğlu et al. suggest a different method: selecting the number of clusters k at which the proportion of unambiguous consensus values (values <0.1 or >0.9) is greatest.
The consensus_cluster() function returns a Pandas DataFrame of size (len(parameter_range), 2), containing columns for the proportion of unambiguous clusterings and the area under the CDF for every value of the hyperparameter of interest.
n_clusters | proportion_unambiguous_clusterings | area_under_cdf |
---|---|---|
2 | 1 | 0.396 |
3 | 0.999 | 0.431 |
4 | 0.975 | 0.64 |
5 | 0.827 | 0.704 |
6 | 0.87 | 0.766 |
7 | 0.918 | 0.81 |
8 | 0.902 | 0.827 |
9 | 0.896 | 0.845 |
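Monti consensus clustering can be sketched as: repeatedly subsample the data, cluster each subsample, and record how often each pair of points lands in the same cluster among the runs where both were sampled. The "proportion unambiguous" below follows the Șenbabaoğlu et al. thresholds quoted above; function and variable names are illustrative, not the package's API:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_iter=30, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))  # times i and j were co-clustered
    sampled = np.zeros((n, n))   # times i and j were both in the subsample
    for it in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=5, random_state=it).fit_predict(X[idx])
        co = labels[:, None] == labels[None, :]
        together[np.ix_(idx, idx)] += co
        sampled[np.ix_(idx, idx)] += 1
    with np.errstate(invalid="ignore"):
        return np.where(sampled > 0, together / sampled, 0.0)

X = load_iris()["data"]
M = consensus_matrix(X, k=3)
off_diag = M[np.triu_indices_from(M, k=1)]
prop_unambiguous = np.mean((off_diag < 0.1) | (off_diag > 0.9))
print(round(float(prop_unambiguous), 3))
```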