Skip to main content

The breathing k-means algorithm

Project description

The Breathing K-Means Algorithm

A novel approximation algorithm for the k-means problem

Breathing k-means finds solutions which

  • are usually better than solutions from k-means++
  • are lower-bounded by k-means++ (the worst possible result is a solution from k-means++)
  • require more computation time than k-means++ (ca 50%-300% depending on the problem)

The second claim is fulfilled, since the default seeding method of breathing k-means is k-means++. If no improvement is found by breathing k-means, it simply returns the k-means++ result used as seed.

The more interesting part about breathing k-means is probably its ability to improve upon k-means++ in many cases and to do so at moderate extra computational cost (considering that the k-means problem is NP-hard).

Further Info

For a detailed motivation and description of the breathing k-means algorithm see the preprint https://arxiv.org/abs/2006.15666.

An extended software repo including the data sets used in the above preprint and jupyter notebooks to experiment with them can be found at https://github.com/gittar/breathing-k-means

API

The included class BKMeans is subclassed from scikit-learn's KMeans class and has thus the same API, only extended by two parameters which, however, can be ignored for normal usage and thus simply left at their default values:

  • m (breathing depth), default: 5, (m=0 is equivalent to running k-means++)
  • theta (neighborhood freezing range), default: 1.1

Installation

pip install bkmeans

Example 1: running on simple random data set

Code:

import numpy as np
from bkmeans import BKMeans

# generate random data set
X=np.random.rand(1000,2)

# create BKMeans instance
bkm = BKMeans(n_clusters=100)

# run the algorithm
bkm.fit(X)

# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)

Output:

1.1775040547902602

Example 2: comparison with k-means++ (multiple runs)

Code:

import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans

# random 2D data set
X=np.random.rand(1000,2)

# number of centroids
k=100

for i in range(5):
    # kmeans++
    km = KMeans(n_clusters=k)
    km.fit(X)

    # breathing k-means
    bkm = BKMeans(n_clusters=k)
    bkm.fit(X)

    # relative SSE improvement of bkm over km++
    imp = 1 - bkm.inertia_/km.inertia_
    print(f"SSE improvement over k-means++: {imp:.2%}")

Output:

SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bkmeans-1.0.tar.gz (5.7 kB view hashes)

Uploaded Source

Built Distribution

bkmeans-1.0-py3-none-any.whl (7.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page