The breathing k-means algorithm
Project description
The Breathing K-Means Algorithm
An approximation algorithm for the k-means problem that (on average) is better (higher solution quality) and faster (lower CPU time usage) than k-means++ (in the widely-used implementation from scikit-learn
and with n_init=10
, the longtime default value for repetitions).
Remark: Version 1.4 of scikit-learn
(released 1/2024) introduced a new default value n_init=1
for k-means++. As a consequence, the advantage of breathing k-means in solution quality has become even larger but k-means++ is faster now (by sacrificing solution quality).
Technical Report: https://arxiv.org/abs/2006.15666
Example Code: Google Colab notebook
API
The included class BKMeans is subclassed from scikit-learn's KMeans class and has, therefore, the same API. It can be used as a plug-in replacement for scikit-learn's KMeans.
There is one new parameter that can be ignored (meaning: left at default) for normal usage:
- m (breathing depth), default: 5
The parameter m can also be used, however, to generate faster ( 1 < m < 5) or better (m>5) solutions. For details see the above technical report.
Release Notes
Version 1.3
- bug fix: BKMeans.predict(df) now matches BKMeans.labels_
- bug fix: BKMeans is now fully deterministic if parameter random_state is set to non-Null
- compatibility with scikit-learn: unused parameter y was not set in the
fit
method
Remark: the fixed bugs did not affect the quality of the codebook results in previous version
Version 1.2
- make use of the optional
sample_weight
parameter in thefit
method - (contributed by Björn Wiescholek)
Version 1.1
- speed improvement by setting
n_init=1
by default - close centroids now defined by nearest neighbor criterion
- parameter
theta
abolished
Version 1.0
- (initial release)
- "close centroids" were based on distance and a parameter theta
Example 1: running on a simple random data set
Code:
import numpy as np
from bkmeans import BKMeans
# generate random data set
X=np.random.rand(1000,2)
# create BKMeans instance
bkm = BKMeans(n_clusters=100)
# run the algorithm
bkm.fit(X)
# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)
Output:
1.1775040547902602
Example 2: comparison with k-means++ (multiple runs)
Code:
import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans
import time
# random 2D data set
X=np.random.rand(1000,2)
# number of centroids
k=100
for i in range(5):
# kmeans++
kmp = KMeans(n_clusters=k, n_init=10)
kmp.fit(X)
# breathing k-means
bkm = BKMeans(n_clusters=k)
bkm.fit(X)
# relative SSE improvement of bkm over km++
imp = 1 - bkm.inertia_/kmp.inertia_
print(f"SSE improvement over k-means++: {imp:.2%}")
Output:
SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file bkmeans-1.3.tar.gz
.
File metadata
- Download URL: bkmeans-1.3.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.57.0 urllib3/1.26.5 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 526def8bf501c18d08eb31c69b5071cedae138acd01939454d6d006a9945b452 |
|
MD5 | 8f83a7984b51ea7e72955a720d21d5c0 |
|
BLAKE2b-256 | 122c4b81650238f045c130555e40a7ef947f2e73ee3a25ab1dadd1b318dc9c00 |
File details
Details for the file bkmeans-1.3-py3-none-any.whl
.
File metadata
- Download URL: bkmeans-1.3-py3-none-any.whl
- Upload date:
- Size: 10.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/4.6.4 keyring/23.5.0 pkginfo/1.8.2 readme-renderer/34.0 requests-toolbelt/0.9.1 requests/2.25.1 rfc3986/1.5.0 tqdm/4.57.0 urllib3/1.26.5 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f064c4546123e6fbe6bf50296a246ab1a6114b1fa52e4dcb813f89ad1ee34881 |
|
MD5 | 921bb3ce8912bd817541043a8616f3c5 |
|
BLAKE2b-256 | b288002267f728d3efd81e34c63f6a66fc44e38d5ab205171d533eafa3d197e0 |