Skip to main content

Size Constrained Clustering solver

Project description

Size Constrained Clustering Solver

Build Status PyPI version GitHub codecov PyPI - Downloads

Implementation of Size Constrained Clustering. Size constrained clustering can be treated as an optimization problem. Details could be found in a set of reference paper.

Installation

Requirement Python >= 3.6, Numpy >= 1.13, Cython >= 0.29

  • install from PyPI
pip install size-constrained-clustering

Methods

  • Fuzzy C-means Algorithm: Similar to KMeans, but use membership probability, not 0 or 1
  • Same Size Contrained KMeans Heuristics: Use Heuristics methods to reach same size clustering
  • Same Size Contrained KMeans Inspired by Minimum Cost Flow Problem
  • Minimum and Maximum Size Constrained KMeans Inspired by Minimum Cost Flow Problem
  • Deterministic Annealling Algorithm: Input target cluster distribution, return correspondent clusters
  • Shrinkage Clustering: base algorithm and minimum size constraints

Usage:

# setup
from size_constrained_clustering import fcm, equal, minmax, shrinkage
# by default it is euclidean distance, but can select others
from sklearn.metrics.pairwise import haversine_distances
import numpy as np

Fuzzy C-means

n_samples = 2000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                    centers=centers, shuffle=False, random_state=42)
model = fcm.FCM(n_clusters)
# use other distance function: e.g. haversine distance
# model = fcm.FCM(n_clusters, distance_func=haversine_distances)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

alt text

Equal Size Constraint

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# use minimum cost flow framework to solve
model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# use heuristics method to solve
model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

alt text

Cluster size: 667, 667 and 666 in the figure above.

Minimum and Maximum Size Constraint

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400,   size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

alt text

Cluster size: 753, 645 and 602 in the figure above.

Deterministic Annealing

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# distribution is the distribution of cluster sizes
model = da.DeterministicAnnealing(n_clusters, distribution=[0.1, 0.6, 0.3])
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

alt text

Cluster size: 1200, 600 and 200 in the figure above, corresponding to distribution [0.6, 0.3, 0.1]

Shrinkage Clustering

The result might be not available.

n_samples = 1000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, shuffle=False, random_state=42)

model = shrinkage.Shrinkage(n_clusters, size_min=100)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

alt text

Copyright

Copyright (c) 2020 Jing Wang. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Reference

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

size_constrained_clustering-0.1.2.tar.gz (51.5 kB view hashes)

Uploaded Source

Built Distribution

size_constrained_clustering-0.1.2-cp36-cp36m-macosx_10_7_x86_64.whl (433.6 kB view hashes)

Uploaded CPython 3.6m macOS 10.7+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page