Skip to main content

Clustering with maximum distance between points inside clusters

Project description

Clustering with maximum diameter

Clustering algorithms with maximum distance between points inside clusters.

When we have interpetable metric like cosine distance it could be nice to have clusters with maximum distance between points. Then we can find good threshold for maximum distance and be confident that points inside clusters are really similar. Unfortunately popular clustering algorithms don't have such behavior.

Main algorithm is MaxDiameterClustering. It is a simple greedy algorithm, in which we add points one by one. If there is a cluster with all points close enough to new points, then we add new point to this cluster. If there is no such cluster, this point starts new cluster.

Also two similar algorithms are added - Leader Clustering and Quality Threshold Clustering.

Usage

MaxDiameterClustering

Basic usage of MaxDiameterClustering:

from sklearn.datasets import make_blobs
from diameter_clustering import MaxDiameterClustering

X, y = make_blobs(n_samples=100, n_features=50)

model = MaxDiameterClustering(max_distance=0.3, metric='cosine')
labels = model.fit_predict(X)

When we want to compute cosine distance and our vectors are normalized, it is better to use inner_product as metric because it is much faster:

X_normalized = X/(np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)

model = MaxDiameterClustering(max_distance=0.3, metric='inner_product')
labels = model.fit_predict(X_normalized)

Instead of using feature matrix X we can pass precomputed distance matrix:

from diameter_clustering.dist_matrix import compute_dist_matrix

dist_matrix = compute_dist_matrix(X, metric='cosine')

model = MaxDiameterClustering(max_distance=0.3, precomputed_dist=True)
labels = model.fit_predict(dist_matrix)

Calculation of full distance matrix between all points is expensive, so for big datasets it is better to use distance matrix in sparse format:

model = MaxDiameterClustering(max_distance=0.3, metric='cosine', sparse_dist=True)
labels = model.fit_predict(X)

model = MaxDiameterClustering(max_distance=0.3, sparse_dist=True, precomputed_dist=True)
dist_matrix = compute_sparse_dist_matrix(X, max_distance=0.3, metric='cosine')
labels = model.fit_predict(dist_matrix)

With deterministic=True we can get reproducible results:

model = MaxDiameterClustering(max_distance=0.3, metric='cosine', deterministic=True)
labels = model.fit_predict(X)

Leader Clustering

from diameter_clustering import LeaderClustering

model = LeaderClustering(max_radius=0.15, metric='cosine')
labels = model.fit_predict(X)

Precomputed distance, sparse distance, deterministic behavior and inner_product could be used as in MaxDiameterClustering.

Quality Threshold Clustering

from diameter_clustering import QTClustering

model = QTClustering(max_radius=0.15, metric='cosine', min_cluster_size=5)
labels = model.fit_predict(X)

Precomputed distance, sparse distance and inner_product could be used as in MaxDiameterClustering. This algorithm is deterministic by design.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

diameter-clustering-0.0.1.tar.gz (8.3 kB view hashes)

Uploaded Source

Built Distribution

diameter_clustering-0.0.1-py3-none-any.whl (11.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page