A quantitative approach to select the optimal number of clusters in a dataset.
The Curvature Method
Introduction
Clustering is a major area of unsupervised machine learning. Many clustering algorithms take the desired number of clusters as a parameter. Selecting a dataset's true cluster number k can be challenging: the measured fit improves as clusters are added, yet too high a k value leads to overfitting and a less meaningful model. Because the value of k has a dramatic impact on clustering results, it is important to select it carefully.
The most common way to select the cluster number is the "Elbow Method", which involves manually picking the point on an evaluation graph that appears to form the sharpest corner. This approach has several problems: it is subjective and requires direct human intervention. Additionally, the axes of the evaluation graph tend to lie on significantly different scales, which makes it difficult to recognize the optimal k value visually. In contrast, the Curvature Method is a recent approach that finds the optimal k value quantitatively [1]. It can be used in a broad range of clustering applications, further decoupling the learning process from human intervention.
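The scale problem can be illustrated with a small sketch. It measures the angle at each interior point of an evaluation curve, first with the raw axes and then with both axes rescaled to [0, 1]. The evaluation values are made up for illustration, and the helper names (vertex_angle, min_max, angles_by_k) are hypothetical; none of this is part of the package or of the method in [1].

```python
import math

def vertex_angle(prev_pt, pt, next_pt):
    """Return the angle in degrees at pt between the segments to its neighbours."""
    ax, ay = prev_pt[0] - pt[0], prev_pt[1] - pt[1]
    bx, by = next_pt[0] - pt[0], next_pt[1] - pt[1]
    cos_theta = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def min_max(values):
    """Rescale a sequence linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def angles_by_k(labels, xs, ys):
    """Map each interior label to the angle of the polyline (xs, ys) at that point."""
    pts = list(zip(xs, ys))
    return {labels[i]: vertex_angle(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)}

# Hypothetical evaluation values for k = 1..7 (illustrative, not real output).
ks = [1, 2, 3, 4, 5, 6, 7]
ys = [1000.0, 600.0, 300.0, 100.0, 95.0, 92.0, 90.0]

raw_angles = angles_by_k(ks, ks, ys)                      # axes on very different scales
scaled_angles = angles_by_k(ks, min_max(ks), min_max(ys))  # both axes in [0, 1]

for k in ks[1:-1]:
    print(k, round(raw_angles[k], 1), round(scaled_angles[k], 1))
```

With these made-up values, the corner at k = 4 is close to a straight line on the raw axes and becomes much more pronounced after rescaling, which is why reading the elbow by eye depends on how the graph happens to be drawn.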
Installation
This project can be installed using pip:
pip install curve-method
Examples
First, obtain a dataset as a 2D NumPy array. In these examples, we use the make_blobs() generator from scikit-learn to simulate a real dataset.
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=10000, n_features=4, centers=5)
Evaluation
To view the curvature index for each k value up to a specified maximum, use the curve_scores() function.
from curve_method import curve_scores
curve_scores(X, k_max=10)
Or, to obtain the k value with maximum curvature, use the true_k() function.
from curve_method import true_k
true_k(X, k_max=10)
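The relationship between per-k scores and the selected k can be sketched without the package. The example below assigns each interior k a score and returns the k with the largest one. It uses a plain second difference on a min-max normalized curve as a stand-in for curvature; this is an assumption made for illustration, not the curvature index defined in [1] and not necessarily what curve_scores() or true_k() compute. The function names and the evaluation values are hypothetical.

```python
def second_difference_scores(values, k_start=1):
    """Score each interior k by the second difference of the normalized curve.

    values[i] is the evaluation value for k = k_start + i.
    """
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat curve
    norm = [(v - lo) / span for v in values]
    return {k_start + i: norm[i - 1] - 2 * norm[i] + norm[i + 1]
            for i in range(1, len(norm) - 1)}

def best_k(scores):
    """Return the k whose score is largest."""
    return max(scores, key=scores.get)

# Hypothetical evaluation values for k = 1..7 (illustrative, not real output).
evaluation = [1000.0, 600.0, 300.0, 100.0, 95.0, 92.0, 90.0]

scores = second_difference_scores(evaluation)
print(best_k(scores))  # prints 4 for the values above
```

The first and last k receive no score because a second difference needs a neighbour on each side, so in a sketch like this the maximum k examined should be comfortably larger than the expected answer.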
Plotting
To view the evaluation graph from the Curvature Method, use the scatter() function. If desired, points can be connected on the graph by setting line=True.
from curve_method import scatter
scatter(X, k_max=12, line=False)
As an alternative, use the polyfit() function to generate an evaluation graph with a polynomial approximation. The degree n of the polynomial can be specified by setting deg=n.
from curve_method import polyfit
polyfit(X, k_max=12, deg=3)
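One reason a polynomial approximation can be useful is that the curvature of a polynomial has a closed form: for y = p(x), the curvature is |p''(x)| / (1 + p'(x)^2)^(3/2). The sketch below evaluates that formula for a polynomial given by its coefficients, highest degree first (the order that numpy.polyfit returns). Whether the package's polyfit() uses this formula internally is not stated here, so treat this as an independent illustration; the helper names are hypothetical.

```python
def poly_derivative(coeffs):
    """Differentiate a polynomial given as coefficients, highest degree first."""
    n = len(coeffs) - 1
    # A constant polynomial differentiates to zero.
    return [c * (n - i) for i, c in enumerate(coeffs[:-1])] or [0.0]

def poly_eval(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

def poly_curvature(coeffs, x):
    """Curvature of y = p(x) at x: |p''| / (1 + p'^2) ** 1.5."""
    d1 = poly_derivative(coeffs)
    d2 = poly_derivative(d1)
    return abs(poly_eval(d2, x)) / (1.0 + poly_eval(d1, x) ** 2) ** 1.5

# Example: y = x^2 has curvature 2 at its vertex and flattens out away from it.
print(poly_curvature([1.0, 0.0, 0.0], 0.0))
print(poly_curvature([1.0, 0.0, 0.0], 1.0))
```

Because the formula involves the first derivative, its result changes if either axis is rescaled, so a fitted curve is usually normalized before curvature values at different k are compared.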
Dependencies
- NumPy
- Matplotlib
- Scikit-learn
References
[1] Zhang, Y., Mańdziuk, J., Quek, C.H. and Goh, B.W., 2017. Curvature-based method for determining the number of clusters. Information Sciences, 415, pp.414-428.
License
This project is licensed under the terms of the MIT License.