The breathing k-means algorithm

## Project description

# The Breathing *K*-Means Algorithm

## A novel approximation algorithm for the *k*-means problem

**Breathing k-means** finds solutions which

- are usually better than solutions from
*k*-means++ - are lower-bounded by
(the worst possible result*k*-means++*is*a solution from)*k*-means++ - require more computation time than
(ca 50%-300% depending on the problem)*k*-means++

The second claim is fulfilled, since
the default seeding method of **breathing k-means**

*is*

**. If no improvement is found by**

*k*-means++**breathing**, it simply returns the

*k*-means**result used as seed.**

*k*-means++The more interesting part about **breathing k-means** is probably its ability to improve upon

**in many cases and to do so at moderate extra computational cost (considering that the**

*k*-means++*k*-means problem is NP-hard).

## Further Info

For a detailed motivation and description of the **breathing k-means algorithm** see the preprint https://arxiv.org/abs/2006.15666.

An extended software repo including the data sets used in the above preprint and jupyter notebooks to experiment with them can be found at https://github.com/gittar/breathing-k-means

## API

The included class **BKMeans** is subclassed from scikit-learn's **KMeans** class
and has thus the same API, only extended by two parameters which, however, can be ignored for normal usage and thus simply left at their default values:

*m*(breathing depth), default: 5, (m=0 is equivalent to running k-means++)*theta*(neighborhood freezing range), default: 1.1

## Installation

```
pip install bkmeans
```

## Example 1: running on simple random data set

Code:

```
import numpy as np
from bkmeans import BKMeans
# generate random data set
X=np.random.rand(1000,2)
# create BKMeans instance
bkm = BKMeans(n_clusters=100)
# run the algorithm
bkm.fit(X)
# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)
```

Output:

```
1.1775040547902602
```

## Example 2: comparison with *k*-means++ (multiple runs)

Code:

```
import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans
# random 2D data set
X=np.random.rand(1000,2)
# number of centroids
k=100
for i in range(5):
# kmeans++
km = KMeans(n_clusters=k)
km.fit(X)
# breathing k-means
bkm = BKMeans(n_clusters=k)
bkm.fit(X)
# relative SSE improvement of bkm over km++
imp = 1 - bkm.inertia_/km.inertia_
print(f"SSE improvement over k-means++: {imp:.2%}")
```

Output:

```
SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%
```

## Project details

## Release history Release notifications | RSS feed

## Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.