Clustering with maximum diameter
Clustering algorithms with maximum distance between points inside clusters.
When we have an interpretable metric like cosine distance, it is useful to have clusters with a bounded maximum distance between points. We can then pick a good threshold for the maximum distance and be confident that points inside a cluster are really similar. Unfortunately, popular clustering algorithms don't have this behavior.
The main algorithm is MaxDiameterClustering. It is a simple greedy algorithm in which we add points one by one. If there is a cluster in which all points are close enough to the new point, the new point is added to that cluster. If there is no such cluster, the point starts a new cluster.
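The greedy scheme above can be sketched in a few lines of NumPy (a simplified illustration with cosine distance, not the library's actual implementation; `max_diameter_clustering` is a hypothetical name):

```python
import numpy as np

def max_diameter_clustering(X, max_distance):
    """Greedy max-diameter clustering sketch: assign each point to the
    first cluster whose every member is within max_distance (cosine),
    otherwise start a new cluster."""
    # cosine distance = 1 - inner product of L2-normalized vectors
    X_norm = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)
    clusters = []  # list of lists of point indices
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X_norm):
        for label, members in enumerate(clusters):
            dists = 1.0 - X_norm[members] @ x
            if np.all(dists <= max_distance):  # diameter stays bounded
                members.append(i)
                labels[i] = label
                break
        else:  # no cluster can accept the point: open a new one
            clusters.append([i])
            labels[i] = len(clusters) - 1
    return labels
```

By construction, the cosine distance between any two points of a resulting cluster never exceeds `max_distance`.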
Two similar algorithms are also included: Leader Clustering and Quality Threshold Clustering.
Usage
MaxDiameterClustering
Basic usage of MaxDiameterClustering:
from sklearn.datasets import make_blobs
from diameter_clustering import MaxDiameterClustering
X, y = make_blobs(n_samples=100, n_features=50)
model = MaxDiameterClustering(max_distance=0.3, metric='cosine')
labels = model.fit_predict(X)
When we want to compute cosine distance and our vectors are normalized, it is better to use inner_product as the metric because it is much faster:
import numpy as np

X_normalized = X/(np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)
model = MaxDiameterClustering(max_distance=0.3, metric='inner_product')
labels = model.fit_predict(X_normalized)
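The reason this works: for unit-length vectors the norms in the cosine formula are 1, so cosine distance reduces to one minus the inner product, and the normalization can be done once up front. A quick NumPy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
X_normalized = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)

a, b = X_normalized[0], X_normalized[1]
cosine_dist = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner_product_dist = 1.0 - a @ b  # norms are 1, so the division is unnecessary

assert np.isclose(cosine_dist, inner_product_dist)
```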
Instead of the feature matrix X we can pass a precomputed distance matrix:
from diameter_clustering.dist_matrix import compute_dist_matrix
dist_matrix = compute_dist_matrix(X, metric='cosine')
model = MaxDiameterClustering(max_distance=0.3, precomputed_dist=True)
labels = model.fit_predict(dist_matrix)
Computing the full distance matrix between all points is expensive, so for big datasets it is better to use a distance matrix in sparse format:
model = MaxDiameterClustering(max_distance=0.3, metric='cosine', sparse_dist=True)
labels = model.fit_predict(X)
from diameter_clustering.dist_matrix import compute_sparse_dist_matrix

dist_matrix = compute_sparse_dist_matrix(X, max_distance=0.3, metric='cosine')
model = MaxDiameterClustering(max_distance=0.3, sparse_dist=True, precomputed_dist=True)
labels = model.fit_predict(dist_matrix)
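A sparse matrix only needs to store pairs closer than max_distance, which is all the algorithm requires to check the diameter constraint. A rough sketch of how such a matrix could be built with SciPy (an assumption for illustration; the library's compute_sparse_dist_matrix may work differently):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_cosine_dist(X, max_distance):
    """Build a sparse matrix holding only cosine distances <= max_distance."""
    Xn = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)
    dist = 1.0 - Xn @ Xn.T           # dense cosine distances (fine for a demo)
    rows, cols = np.nonzero(dist <= max_distance)
    # keep only the near pairs; everything else is implicit (missing)
    return csr_matrix((dist[rows, cols], (rows, cols)), shape=dist.shape)
```

For large datasets a real implementation would avoid materializing the dense matrix, e.g. by computing distances in chunks.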
With deterministic=True we can get reproducible results:
model = MaxDiameterClustering(max_distance=0.3, metric='cosine', deterministic=True)
labels = model.fit_predict(X)
Leader Clustering
from diameter_clustering import LeaderClustering
model = LeaderClustering(max_radius=0.15, metric='cosine')
labels = model.fit_predict(X)
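Leader Clustering constrains the radius rather than the diameter: the first point of a cluster becomes its leader, and a new point joins the nearest leader within max_radius. A compact sketch of this idea (illustrative only, with cosine distance; `leader_clustering` is a hypothetical name):

```python
import numpy as np

def leader_clustering(X, max_radius):
    """Assign each point to the nearest leader within max_radius (cosine),
    else make it the leader of a new cluster."""
    Xn = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)
    leaders = []  # indices of cluster leaders
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(Xn):
        if leaders:
            dists = 1.0 - Xn[leaders] @ x
            nearest = int(np.argmin(dists))
            if dists[nearest] <= max_radius:
                labels[i] = nearest
                continue
        leaders.append(i)  # no leader is close enough: start a new cluster
        labels[i] = len(leaders) - 1
    return labels
```

Note that a radius bound of max_radius implies a diameter bound of at most 2 * max_radius for metrics satisfying the triangle inequality.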
Precomputed distances, sparse distances, deterministic behavior, and inner_product can be used as in MaxDiameterClustering.
Quality Threshold Clustering
from diameter_clustering import QTClustering
model = QTClustering(max_radius=0.15, metric='cosine', min_cluster_size=5)
labels = model.fit_predict(X)
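Quality Threshold clustering repeatedly extracts the largest candidate cluster and discards clusters smaller than min_cluster_size. A simplified radius-based sketch of this idea (an illustration, not the library's implementation; the -1 label for noise points is an assumption):

```python
import numpy as np

def qt_clustering(X, max_radius, min_cluster_size):
    """Quality-threshold sketch: repeatedly take the largest candidate
    cluster (all unassigned points within max_radius of some center);
    points left only in too-small clusters get label -1 (noise)."""
    Xn = X / (np.linalg.norm(X, axis=-1, keepdims=True) + 1e-16)
    dist = 1.0 - Xn @ Xn.T
    labels = np.full(len(X), -1)
    unassigned = np.ones(len(X), dtype=bool)
    cluster = 0
    while True:
        # candidate cluster size for each center: unassigned neighbors in radius
        counts = ((dist <= max_radius) & unassigned[None, :]).sum(axis=1)
        counts[~unassigned] = 0
        center = int(np.argmax(counts))  # ties break by lowest index: deterministic
        if counts[center] < min_cluster_size:
            break
        members = (dist[center] <= max_radius) & unassigned
        labels[members] = cluster
        unassigned[members] = False
        cluster += 1
    return labels
```

Because the best candidate is always chosen by size (with index-based tie-breaking), this procedure is deterministic, matching the note above.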
Precomputed distances, sparse distances, and inner_product can be used as in MaxDiameterClustering. This algorithm is deterministic by design.