kmeanstf
k-means++ implementation based on tensorflow-gpu
TL;DR
To use k-means++ with GPU support do the following:
pip install kmeanstf
Execute the following test program (https://github.com/gittar/kmeanstf/blob/master/demo/test.py):
```python
import tensorflow as tf
import matplotlib.pyplot as plt
from kmeanstf import KMeansTF

# create data set
X = tf.random.normal([50000, 2])
# create KMeansTF object
km = KMeansTF(n_clusters=100)
# adapt
km.fit(X)
# plot result (optional)
m = 10000  # max number of data points to display
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:m, 0], X[:m, 1], s=1)
ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=8, c="r")
plt.show()
```
(requires tensorflow-gpu, either 1.14 or 2.0b)
What is k-means?
k-means, a.k.a. Lloyd's algorithm, is a well-known clustering method that positions k centroids (means) over a given data set.
Starting from some initial positions of the centroids, a number of Lloyd iterations is performed until a stopping criterion is met. One Lloyd iteration consists of the following two steps:
- determine for each centroid the subset of data points for which this centroid is closest
- move each centroid to the mean of its associated data subset (hence the name "k-means")
It can be shown that each Lloyd iteration decreases the sum of squared Euclidean distances of all data points to their respective nearest centroid, or leaves it unchanged, in which case the algorithm has converged.
The final sum of distances depends strongly on the initial centroid positions. A common goal is to find a set of centroids with a small sum of distances (see example below).
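The two steps above can be sketched in a few lines of numpy (a simplified illustration, not the actual kmeanstf code; the helper name lloyd_iteration is made up here):

```python
import numpy as np

def lloyd_iteration(X, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # step 1: [n,k] matrix of squared distances, nearest centroid per point
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # step 2: move each centroid to the mean of its associated data subset
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:  # leave centroids with no assigned points unchanged
            new_centroids[j] = members.mean(axis=0)
    return new_centroids, labels
```

Repeating this iteration can only decrease (or keep constant) the sum of squared distances, which is the convergence argument mentioned above.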
k-means++
In 2007 a particular initialization method was proposed which produces provably good (but generally still suboptimal) results: k-means++. One can describe the core of k-means++ as the sequential positioning of centroids in areas where the given data points are far from the already positioned centroids. k-means++ is said to both generate better solutions than k-means with random initialization and to converge faster.
Probably due to its good results, k-means++ has been chosen as the default initialization method for the k-means implementation in the popular scikit-learn Python library for machine learning.
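The core of k-means++, sampling each new center with probability proportional to the squared distance to the nearest center chosen so far, can be sketched as follows (a simplified numpy illustration; the function name kmeanspp_init is made up here and this is neither the scikit-learn nor the kmeanstf code, which use additional refinements):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k initial centers via distance-proportional sampling (simplified k-means++)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance of every point to its nearest already-chosen center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample the next center proportional to these squared distances,
        # favoring regions far from all existing centers
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```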
Why kmeanstf?
k-means for large data sets and large numbers of centroids is computationally expensive (slow!). For portability reasons scikit-learn does not support the use of GPUs (https://scikit-learn.org/stable/faq.html#will-you-add-gpu-support). Therefore we provide here an implementation of k-means (and the k-means++ initialization) based on tensorflow, making use of a GPU if available. The k-means++ initialization is a port of the scikit-learn implementation of k-means++ to tensorflow (using tensorflow matrix operations instead of the numpy ones used by scikit-learn). The tensorflow-based implementation of the Lloyd iterations borrows some aspects from Shawn Simister's gist https://gist.github.com/narphorium/d06b7ed234287e319f18 and adds the splitting of large data sets into subsets of a given size, as well as finding a suitable size automatically for a given GPU if memory errors occur.
Please note: only a subset of the scikit-learn KMeans class is implemented in KMeansTF, in particular the fit(), predict() and fit_predict() methods. There is, however, no handling of sparse matrices, no mini-batch version and no sample_weights. For these features please use scikit-learn. Using kmeanstf makes most sense if you want to perform standard k-means (with k-means++ initialization) on a GPU.
Speedup
On a Linux machine (Ubuntu 18.04 LTS) equipped with an NVidia GTX 1060 6GB graphics card we observed a speedup of 7–10 for larger data sets (n > 200k points). Example: for a data set of n = 10⁶ (1 million) 2-dimensional vectors and k = 100 centroids the execution time for 30 Lloyd iterations was 6.34 seconds as opposed to 62.17 seconds for scikit-learn (a speedup of 9.81).
Below you see speedup values measured for different values of the data set size n and the number of centroids k for 2D data. For larger data sets the speedup tends to be higher. For small data sets, however, KMeansTF is actually often slower than scikit-learn KMeans. Perhaps this is caused by tensorflow's startup time. One could argue that this is not so relevant, but it may also indicate potential for improvement in the kmeanstf implementation.
Why is it so fast?
Three lines of code, which seem to originate from Shawn Simister (https://twitter.com/narphorium/status/668234313662525440?lang=en), are at the core of the tensorflow k-means implementation:
```python
expanded_vectors = tf.expand_dims(samples, 0)      # [1,n,d]
expanded_centroids = tf.expand_dims(centroids, 1)  # [k,1,d]
distances = tf.reduce_sum(
    tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2)  # [k,n]
```
Here, samples is the [n,d] tensor containing n data vectors of dimension d, and centroids is the [k,d] tensor containing the k centroids of dimension d.
The first two lines add a dimension of size 1 to samples and centroids, at position 0 and 1, respectively.
The third line contains the inner statement
```python
tf.subtract(expanded_vectors, expanded_centroids)  # [k,n,d]
```
This makes use of "broadcasting", an operation in tensorflow and numpy which aligns the shapes of tensors before an arithmetic operation requiring equal shapes is applied. In this case the [1,n,d] tensor expanded_vectors is converted to shape [k,n,d] by stacking k copies of samples along dimension 0, and the [k,1,d] tensor expanded_centroids is likewise converted to shape [k,n,d] by stacking n copies of centroids along dimension 1.
This makes it possible to compute the element-wise mutual distances of all n data vectors and all k centroids. Adding the outer tf.square and tf.reduce_sum operations, we have a single tensor statement which does the work of four nested conventional loops.
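That this single broadcasted statement is equivalent to the explicit nested loops can be checked with a small example (a numpy sketch; numpy broadcasting behaves like the tensorflow operations above):

```python
import numpy as np

n, k, d = 6, 3, 2
samples = np.arange(n * d, dtype=float).reshape(n, d)  # [n,d]
centroids = np.ones((k, d))                            # [k,d]

# broadcast version: one statement
expanded_vectors = samples[None, :, :]      # [1,n,d]
expanded_centroids = centroids[:, None, :]  # [k,1,d]
distances = ((expanded_vectors - expanded_centroids) ** 2).sum(axis=2)  # [k,n]

# explicit-loop version for comparison
loop_distances = np.zeros((k, n))
for i in range(k):
    for j in range(n):
        for dim in range(d):
            loop_distances[i, j] += (samples[j, dim] - centroids[i, dim]) ** 2

assert np.allclose(distances, loop_distances)
```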
Pure elegance!
And it is ideally suited to run on a GPU (if the GPU has enough memory, because k·n·d can be quite large). To keep GPU memory from becoming a limiting factor, the implementation in kmeanstf is able to split this tensor along the n-dimension into several pieces, each fitting on the GPU. See the parameter max_mem below.
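The splitting along the n-dimension can be illustrated with a small numpy sketch (the function name and the max_elems parameter are illustrative; the actual kmeanstf code bounds the tensor size in bytes via max_mem):

```python
import numpy as np

def chunked_distances(samples, centroids, max_elems=10_000):
    """Compute the [k,n] squared-distance matrix in slices along n,
    so that no intermediate [k,chunk,d] tensor exceeds max_elems elements."""
    k, d = centroids.shape
    # largest slice of samples such that k * chunk * d stays below max_elems
    chunk = max(1, max_elems // (k * d))
    parts = []
    for start in range(0, len(samples), chunk):
        piece = samples[start:start + chunk]              # [c,d]
        diff = piece[None, :, :] - centroids[:, None, :]  # [k,c,d]
        parts.append((diff ** 2).sum(axis=2))             # [k,c]
    return np.concatenate(parts, axis=1)                  # [k,n]
```

The result is identical to the one-shot broadcast computation; only the peak size of the intermediate tensor changes.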
class KMeansTF
class kmeanstf.KMeansTF(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=1e-4)
Parameters (subset of sklearn.cluster.KMeans)

n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.

init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence.
‘random’: choose k observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a single run.

tol : float, default: 1e-4
Relative tolerance with regard to inertia to declare convergence: the algorithm stops if the relative decrease of the mean squared error is smaller than tol (this is a slightly different definition of tol than the one in sklearn).

verbose : int, default: 0
If larger than 0, outputs messages to analyze the working of the algorithm.

max_mem : int, default: 1300000000 (no equivalent in sklearn)
Size in bytes of the largest tensor to be used during the k-means computation. The default value is suitable for an NVidia GTX 1060. If a too large value is specified and a memory error occurs, a sequence of decreasing values is automatically tried out (and printed) until a working value for the given GPU is found. This value should subsequently be used as the max_mem parameter. Large data sets are split into fractions in order not to exceed this threshold.
Attributes

cluster_centers_ : array, [n_clusters, n_features]
Coordinates of cluster centers.

inertia_ : float
Sum of squared distances of samples to their closest cluster center.

n_iter_ : int
Number of iterations run.
Methods

fit (self, X)
Compute kmeans clustering.

fit_predict (self, X)
Compute cluster centers and predict cluster index for each sample.
Parameters X: data to be clustered and predicted, shape = [n_samples, n_features]
Returns: labels : array, shape [n_samples,]
Index of the cluster each sample belongs to

predict (self, X)
Predict cluster index for each sample.
Parameters X: data to be predicted, shape = [n_samples, n_features]
Returns: labels : array, shape [n_samples,]
Index of the cluster each sample belongs to
Areas for further improvements
- improve the runtime behaviour for smaller data sets
- directly determine a suitable max_mem value from the physical features of the present GPU
- enable the use of several GPUs
- enable the use of TPUs (kmeanstf does work on colab with GPU, but not yet with TPU)
Pull requests are welcome!