Skip to main content

Size Constrained Clustering solver

Project description

Size Constrained Clustering Solver

Build Status PyPI version GitHub

Implementation of Size Constrained Clustering. Size constrained clustering can be treated as an optimization problem. Details could be found in a set of reference paper.

Installation

Requirement Python >= 3.6, Numpy >= 1.13, Cython >= 0.29

  • install from PyPI
pip install size-constrained-clustering

Methods

  • Fuzzy C-means Algorithm: Similar to KMeans, but use membership probability, not 0 or 1
  • Same Size Contrained KMeans Heuristics: Use Heuristics methods to reach same size clustering
  • Same Size Contrained KMeans Inspired by Minimum Cost Flow Problem
  • Minimum and Maximum Size Constrained KMeans Inspired by Minimum Cost Flow Problem
  • Deterministic Annealling Algorithm: Input target cluster distribution, return correspondent clusters
  • Shrinkage Clustering: base algorithm and minimum size constraints

Usage:

# setup
from size_constrained_clustering import fcm, equal, minmax, shrinkage
# 默认都是欧式距离计算,可接受其它distance函数,比如haversine
from sklearn.metrics.pairwise import haversine_distances

Fuzzy C-means

n_samples = 2000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                    centers=centers, shuffle=False, random_state=42)
model = fcm.FCM(n_clusters)
# use other distance function: e.g. haversine distance
# model = fcm.FCM(n_clusters, distance_func=haversine_distances)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

![alt text][https://github.com/jingw2/size_constrained_clustering/tree/master/pic/fcm.png]

等大聚类

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# 使用minmax flow方式求解
model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# 使用heuristics方法求解
model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

![alt text][https://github.com/jingw2/size_constrained_clustering/tree/master/pic/equal.png]

图中共2000个正态分布的点,聚成3类,分别有667,667和666个点。

最小和最大规模限制

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400,   size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

![alt text][https://github.com/jingw2/size_constrained_clustering/tree/master/pic/minmax.png]

获取结果聚类size分别为753, 645, 602。

Deterministic Annealing

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)
# distribution 表明各cluster目标的比例
model = da.DeterministicAnnealing(n_clusters, distribution=[0.1, 0.6, 0.3])
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

![alt text][https://github.com/jingw2/size_constrained_clustering/tree/master/pic/da.png]

获取的结果cluster size分别为:1200,600和200。对应比例为0.6, 0.3和0.1。

Shrinkage Clustering

只能保证收敛到局部最优,且获取的结果不一定可用。

n_samples = 1000
n_clusters = 4
centers = [(-5, -5), (0, 0), (5, 5), (7, 10)]
X, _ = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, shuffle=False, random_state=42)

model = shrinkage.Shrinkage(n_clusters, size_min=100)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_

![alt text][https://github.com/jingw2/size_constrained_clustering/tree/master/pic/shrinkage.png]

Copyright

Copyright (c) 2020 Jing Wang. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Reference

TO DO

  • Size constraint API
  • Examples to show
  • Readme Modification, badges, travis CI
  • Upload PyPI

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

size_constrained_clustering-0.1.1.tar.gz (52.3 kB view hashes)

Uploaded Source

Built Distribution

size_constrained_clustering-0.1.1-cp36-cp36m-macosx_10_7_x86_64.whl (446.4 kB view hashes)

Uploaded CPython 3.6m macOS 10.7+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page