Skip to main content

K-Means clustering constrained with minimum and maximum cluster size

Project description

PyPI Python Build Documentation

k-means-constrained

K-means clustering implementation whereby a minimum and/or maximum size for each cluster can be specified.

This K-means implementation modifies the cluster assignment step (E in EM) by formulating it as a Minimum Cost Flow (MCF) linear network optimisation problem. This is then solved using a cost-scaling push-relabel algorithm and uses Google's Operations Research tools's SimpleMinCostFlow which is a fast C++ implementation.

This package is inspired by Bradley et al.. The original Minimum Cost Flow (MCF) network proposed by Bradley et al. has been modified so maximum cluster sizes can also be specified along with minimum cluster size.

The code is based on scikit-lean's KMeans and implements the same API with modifications.

Ref:

  1. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.
  2. Google's SimpleMinCostFlow C++ implementation

Installation

You can install the k-means-constrained from PyPI:

pip install k-means-constrained

It is supported on Python 3.10, 3.11 and 3.12. Previous versions of k-means-constrained support older versions of Python and Numpy.

Example

More details can be found in the API documentation.

>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...                [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
Code only
from k_means_constrained import KMeansConstrained
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
                [4, 2], [4, 4], [4, 0]])
clf = KMeansConstrained(
     n_clusters=2,
     size_min=2,
     size_max=5,
     random_state=0
 )
clf.fit_predict(X)
clf.cluster_centers_
clf.labels_

Time complexity and runtime

k-means-constrained is a more complex algorithm than vanilla k-means and therefore will take longer to execute and has worse scaling characteristics.

Given a number of data points $n$ and clusters $c$, the time complexity of:

  • k-means: $\mathcal{O}(nc)$
  • k-means-constrained1: $\mathcal{O}((n^3c+n^2c^2+nc^3)\log(n+c)))$

This assumes a constant number of algorithm iterations and data-point features/dimensions.

If you consider the case where $n$ is the same order as $c$ ($n \backsim c$) then:

  • k-means: $\mathcal{O}(n^2)$
  • k-means-constrained1: $\mathcal{O}(n^4\log(n)))$

Below is a runtime comparison between k-means and k-means-constrained whereby the number of iterations, initializations, multi-process pool size and dimension size are fixed. The number of clusters is also always one-tenth the number of data points $n=10c$. It is shown above that the runtime is independent of the minimum or maximum cluster size, and so none is included below.

Data-points vs execution time for k-means vs k-means-constrained. Data-points=10*clusters. No min/max constraints

System details
  • OS: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
  • CPU: AMD EPYC 7763 64-Core Processor
  • CPU cores: 120
  • k-means-constrained version: 0.7.3
  • numpy version: 1.24.2
  • scipy version: 1.11.1
  • ortools version: 9.6.2534
  • joblib version: 1.3.1
  • sklearn version: 1.3.0
---

1: Ortools states the time complexity of their cost-scaling push-relabel algorithm for the min-cost flow problem as $\mathcal{O}(n^2m\log(nC))$ where $n$ is the number of nodes, $m$ is the number of edges and $C$ is the maximum absolute edge cost.

Change log

  • v0.7.5 fix comment in README on Python version that is supported
  • v0.7.4 compatible with Numpy +v2.1.1. Added Python 3.12 support and dropped Python 3.8 and 3.9 support (due to Numpy). Linux ARM support has been dropped as we use GitHub runners to build the package and ARM machines was being emulated using QEMU. This however was producing numerical errors. GitHub should natively support Ubuntu ARM images soon and then we can start to re-build them.
  • v0.7.3 compatible with Numpy v1.23.0 to 1.26.4

Citations

If you use this software in your research, please use the following citation:

@software{Levy-Kramer_k-means-constrained_2018,
  author = {Levy-Kramer, Josh},
  month = apr,
  title = {{k-means-constrained}},
  url = {https://github.com/joshlk/k-means-constrained},
  year = {2018}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

k_means_constrained-0.7.5-cp312-cp312-win_amd64.whl (379.8 kB view details)

Uploaded CPython 3.12 Windows x86-64

k_means_constrained-0.7.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

k_means_constrained-0.7.5-cp312-cp312-macosx_11_0_arm64.whl (392.1 kB view details)

Uploaded CPython 3.12 macOS 11.0+ ARM64

k_means_constrained-0.7.5-cp312-cp312-macosx_10_13_x86_64.whl (430.6 kB view details)

Uploaded CPython 3.12 macOS 10.13+ x86-64

k_means_constrained-0.7.5-cp311-cp311-win_amd64.whl (379.0 kB view details)

Uploaded CPython 3.11 Windows x86-64

k_means_constrained-0.7.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

k_means_constrained-0.7.5-cp311-cp311-macosx_11_0_arm64.whl (387.0 kB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

k_means_constrained-0.7.5-cp311-cp311-macosx_10_9_x86_64.whl (423.2 kB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

k_means_constrained-0.7.5-cp310-cp310-win_amd64.whl (378.4 kB view details)

Uploaded CPython 3.10 Windows x86-64

k_means_constrained-0.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

k_means_constrained-0.7.5-cp310-cp310-macosx_11_0_arm64.whl (387.3 kB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

k_means_constrained-0.7.5-cp310-cp310-macosx_10_9_x86_64.whl (422.5 kB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

File details

Details for the file k_means_constrained-0.7.5-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 72aac79a6084331794703d4b77db13b0e765280eeae3bb5160cd14d4fce9cefd
MD5 e983794044a4f368b053550df8c1687b
BLAKE2b-256 595f13b6b65cf1db62fd72c9519d35760a982a6e5ebd9db25ddeff6021ed9acd

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 dd1d02ae0fbe29673415a1d9c979e1b6952467b90ee04fbd7f9f22e759e2d464
MD5 44c5027046144a8f2dc1b9abb2351c52
BLAKE2b-256 164e6b320e6308fe62b3cb9df8009ce849f185b10c60b4a24c4db68df32a03f1

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 afaad977c3ccf72b7f67efe4d05f88b59baaf9d63fd222f1fd6c118d2f767b15
MD5 0468e733a6d249acb04c616e15b4603f
BLAKE2b-256 64380c2d06ab83359801bd809d8d3385cf2aad5bd087cd7eeb3a4c71cc2b5c52

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp312-cp312-macosx_10_13_x86_64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 57590920903420e20df73725c32fe842c9ba19c052192c8f3152b0948a6b58ec
MD5 2ddb4dcb7083f725f838615d1592cd46
BLAKE2b-256 469f9b536016cb82b3eea776a49dc2eab047cd9fd94579744e316850b3d488d8

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 ff966716793721feccaa6a620c17599eed21753492a48cea8b73adb4c8ee41ba
MD5 94939775a2a6f49a162c4f83c41fdc9b
BLAKE2b-256 b34398d8dcfed0a3dc120eaaab8e5410d5c0ee2aaf0958d933717af5d7c19a61

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f99d864300d509cd0aa369ff5953027f40754823a22a7a8465e145bb82274012
MD5 0ce65b58baf9eeaed89d0ea9a8ed6397
BLAKE2b-256 e125bb0d2d7882ba273486e6180d55ee9becac9cad85383dab77d1f316102c9d

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 dd62d195ea625bc36fbca7f2ba3271492e16c8230c619fb15fbdc8729b32ff89
MD5 3374dddc778601ae986611accd9b6bed
BLAKE2b-256 aaced0ad4a399f323130baa41ee8cb06b3b796f4d46542e942f08a30f93758cd

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 fb151028d0a281229c717dae211b97c9701d17be73332f925fbee4364de95065
MD5 051aa77ab5824307a2a9424c823f41e1
BLAKE2b-256 2ed51ae15b6dc17b6881a2b494605dd8b5e92f4f6bddb8891ce7a911a0f52854

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 35ed1d9d9b6acdc1783aa42af93a512ba01f97cea4199af9580950f60d6402f7
MD5 ddccc3e3892892715b1f4c2150e95755
BLAKE2b-256 61ca8b50e15056b20d588cda27e864e1e95e75b1ae4985747eb6a09cc370e13c

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 affb55c66f7717fff3f03d5c6c5343f93d257217240451167dbb97246c56dcd7
MD5 d0cba0b2c4fc3ab66710b07035a7d6f9
BLAKE2b-256 bcbbff7e56971817f524a0fb9cbe35dbad565983f530a484b8f6c4f6d754152e

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3d0bac38fbe492daeca3a90f1cf2bb92a0ac9dfcfb64c90344696635db61bd94
MD5 d17920a950093683efcd8a0dd465c5ce
BLAKE2b-256 493c82de88b2d958c21773108f775fc64c5f8c62f618f0257758ccfad5393007

See more details on using hashes here.

File details

Details for the file k_means_constrained-0.7.5-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for k_means_constrained-0.7.5-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 aa936573bd3d7b26f47156afbc050ae24f84098c0e42e083704b5f7b81b14a0d
MD5 8e10440869bc28bfebd3f9473a154d90
BLAKE2b-256 95afdf7975962828d6fd0b59eb306f890533f18869c99be07c9e1aef9677bb6b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page