KScorer: Auto-select optimal K-means clusters with advanced scoring
Basic Usage
LIVE demo-notebook is here
Load Modules
In [1]: import numpy as np
...: import pandas as pd
...: from sklearn import datasets
...: from sklearn.metrics import balanced_accuracy_score
...: from sklearn.model_selection import train_test_split
...: from kscorer.kscorer import KScorer
Init KScorer
In [2]: ks = KScorer()
Get Data
In [3]: X, y = datasets.load_digits(return_X_y=True)
...: X.shape
Out[3]: (1797, 64)
Train/Test Split
In [4]: X_train, X_test, y_train, y_test = train_test_split(
...:     X, y, test_size=0.2, random_state=1234)
Fit KScorer (i.e. Perform Unsupervised Clustering)
In [5]: labels, centroids, _ = ks.fit_predict(X_train, retall=True)
100%|██████████| 13/13 [00:09<00:00, 1.39it/s]
Optimal Clusters
In [6]: ks.show()
In [7]: ks.optimal_
Out[7]: 10
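To give a feel for what automatic selection of the cluster count involves, here is a minimal sketch that tries several values of k with plain scikit-learn K-means and keeps the one with the highest silhouette score. It is an illustration under stated assumptions, not KScorer's actual scoring procedure: the helper `pick_k_by_silhouette`, the synthetic data, and the choice of silhouette as the only criterion are all assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative only).
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X_demo, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=0.5, random_state=0)


def pick_k_by_silhouette(X, k_values, random_state=0):
    """Hypothetical helper: return (best_k, scores), scores maps k to silhouette."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        scores[k] = silhouette_score(X, km.fit_predict(X))
    # The k with the highest silhouette score wins.
    return max(scores, key=scores.get), scores


best_k, scores = pick_k_by_silhouette(X_demo, range(2, 8))
print(best_k)
```

On real data a single criterion is often ambiguous, which is one reason a tool may combine several scores before settling on a value.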
Confusion Matrix
In [8]: labels_mtx = (pd.Series(y_train)
...:               .groupby([labels, y_train])
...:               .count()
...:               .unstack()
...:               .fillna(0))
...: # match arbitrary labels to ground-truth labels
...: order = []
...:
...: for i, r in labels_mtx.iterrows():
...:     left = [x for x in np.unique(y_train) if x not in order]
...:     order.append(r.iloc[left].idxmax())
...:
...: confusion_mtx = labels_mtx[order]
...: confusion_mtx
Out[8]:
cluster | 5 | 9 | 4 | 2 | 0 | 6 | 1 | 7 | 8 | 3 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 124.0 | 5.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 7.0 | 4.0 | 2.0 |
1 | 12.0 | 95.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 90.0 |
2 | 2.0 | 0.0 | 122.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 108.0 | 0.0 | 0.0 | 22.0 | 0.0 | 1.0 | 20.0 |
4 | 1.0 | 2.0 | 1.0 | 0.0 | 147.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 2.0 | 1.0 | 0.0 | 0.0 | 2.0 | 145.0 | 3.0 | 0.0 | 4.0 | 0.0 |
6 | 0.0 | 1.0 | 2.0 | 22.0 | 0.0 | 0.0 | 67.0 | 7.0 | 57.0 | 6.0 |
7 | 0.0 | 5.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 130.0 | 4.0 | 6.0 |
8 | 0.0 | 15.0 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 57.0 | 21.0 |
9 | 0.0 | 22.0 | 3.0 | 0.0 | 0.0 | 1.0 | 48.0 | 0.0 | 6.0 | 2.0 |
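The matching above is greedy: each cluster, in row order, claims the most frequent ground-truth label that is still free. As an alternative sketch (not what the session above does), the assignment can be solved globally with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`. The small `counts` table below is made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical contingency table: rows are cluster ids, columns are true classes.
counts = np.array([[ 1, 50,  2],
                   [40,  3,  5],
                   [ 4,  2, 60]])

# One-to-one assignment that maximizes the total number of matched samples.
rows, cols = linear_sum_assignment(counts, maximize=True)
mapping = {int(r): int(c) for r, c in zip(rows, cols)}
matched = int(counts[rows, cols].sum())

print(mapping)   # cluster id -> true class
print(matched)   # samples on the matched "diagonal"
```

The greedy and the global approach agree when every cluster has a clear majority class; they can differ when two clusters compete for the same label.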
Cluster Unseen Data (in practice, you would usually train a classifier instead)
In [9]: labels_unseen = ks.predict(X_test, init=centroids)
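The classifier route mentioned in the heading could look roughly like the sketch below: cluster ids obtained on the training set become targets for a supervised model, which then labels new samples. Plain `KMeans` with a fixed `n_clusters=10` stands in for the KScorer step here, and `KNeighborsClassifier` is an arbitrary choice; both are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=1234)

# Stand-in for the clustering step: plain K-means with a fixed k (assumption).
cluster_ids = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_train)

# Treat the cluster ids as targets and learn a classifier that reproduces them.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, cluster_ids)

# Assign unseen samples to clusters through the classifier.
unseen_ids = clf.predict(X_test)
print(unseen_ids[:10])
```

A classifier of this kind can also expose class probabilities, which a nearest-centroid assignment does not provide directly.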
Evaluate Accuracy
In [10]: y_clustd = pd.Series(labels).replace(dict(enumerate(order)))
...: y_unseen = pd.Series(labels_unseen).replace(dict(enumerate(order)))
In [11]: balanced_accuracy_score(y_train, y_clustd) # train data
Out[11]: 0.6940733254455871
In [12]: balanced_accuracy_score(y_test, y_unseen) # unseen data
Out[12]: 0.646615365026082
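For reference, balanced accuracy is the mean of per-class recall, so every class counts equally regardless of its size. The pure-Python sketch below reproduces that definition on a tiny made-up example; the function name `balanced_accuracy` and the toy labels are illustrative, not part of the session above.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Recall is 3/4 for class 0 and 1/2 for class 1, so the mean is 0.625.
score = balanced_accuracy(y_true, y_pred)
# Plain accuracy counts all samples equally: 4 correct out of 6.
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, plain)
```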
ToDo:
- consider applying power-transform before initial scaling
- consider pyckmeans
- consider pyxmeans
- consider spherecluster
- consider benchmark testing
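As a rough idea of what the first ToDo item might involve, the sketch below applies scikit-learn's `PowerTransformer` (Yeo-Johnson) to skewed synthetic data and checks that skewness drops. How this would fit into KScorer's own preprocessing is not stated in the source; the data and the `skewness` helper are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def skewness(a):
    """Sample skewness (third standardized moment) of each column."""
    z = (a - a.mean(axis=0)) / a.std(axis=0)
    return (z ** 3).mean(axis=0)


# Strongly right-skewed synthetic features (illustrative only).
rng = np.random.default_rng(0)
X_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 3))

# Yeo-Johnson transform; standardize=True also rescales each column
# to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_gauss = pt.fit_transform(X_skewed)

print(skewness(X_skewed))
print(skewness(X_gauss))
```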