kmeans++ implementation based on tensorflowgpu
Project description
kmeanstf
kmeans++ implementation based on tensorflowgpu
Quick Start
To use kmeans++ with GPUsupport do the following:
pip install kmeanstf
(requires tensorflowgpu, at least version 1.14 or 2.0b)
Execute the following test program to produce the above graphic.
import tensorflow as tf import matplotlib.pyplot as plt from kmeanstf import KMeansTF # create data set X = tf.random.normal([50000,2]) # create kmeanstf object km = KMeansTF(n_clusters=100, n_init=1) # adapt km.fit(X) # plot result (optional) m=10000 # max number of data points to display fig,ax = plt.subplots(figsize=(8,8)) ax.scatter(X[:m,0],X[:m,1],s=1) ax.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],s=8,c="r") plt.show()
What is kmeans?
kmeans  a.k.a. Lloyds Algorithm  is perhaps the most wellknown clustering method. It positions k centroids (means) over a given data set in order to represent/cluster the data set.
Starting from some initial positions of the centroids a number of socalled "Lloyd iterations" is performed until a stopping criterion is met. One Lloyd iteration consists of the following two steps:
 determine for each centroid the subset of data points for which this centroid is closest
 move each centroid to the mean of its associated data subset (hence the name "kmeans")
It can be shown that each Lloyd iteration decreases the sum of squared Euclidean distances of all data points to their respective nearest centroid or leaves it unchanged in which case the algorithm has converged.
The final sum of distances depends strongly on the initial centroid positions. A common goal is to find a set of centroids with a small sum of distances (see example below).
kmeans++
In 2006 a particular initialization method was proposed which produces provably good (but generally still suboptimal) results: kmeans++. One can describe the core of kmeans++ as the sequential positioning of centroids in areas where the given data points are far from the already positioned centroids. kmeans++ is said to both generate better solutions than kmeans with random initialization as well as needing fewer Lloyd iterations.
Probably due to its good results kmeans++ has been chosen as the default initialization method for the implementation of kmeans in the popular scikitlearn python library for machine learning
why kmeanstf?
kmeans for large data sets and large number of centroids is computationally expensive (slow!). For portability reasons scikitlearn does not support the use of GPUs (https://scikitlearn.org/stable/faq.html#willyouaddgpusupport). Therefore we provide here an implementation of kmeans (and the kmeans++ initialization) based on tensorflow and making use of a GPU if available. The kmeans++ initialization is a port from the scikitlearn implementation of kmeans++ to tensorflow (using tensorflow matrix operations instead of the numpy ones used by scikitlearn). The tensorflowbased implementation of Lloyd iterations borrows some aspects from Shawn Simister's gist https://gist.github.com/narphorium/d06b7ed234287e319f18 and added the splitting of large data sets into subsets of a given size as well as finding a suitable size automatically for a given GPU if memory errors occur with the default size.
Please note: Only a subset of the scikitlearn KMeans class is implemented in KMeansTF, in particular the fit(), predict() and fit_predict() methods. There is e.g. no handling of sparse matrices, no minibatch version, no sample weights. For these features please use scikitlearn. Using kmeanstf makes most sense if you like to perform standard kmeans (kmeans++) on a GPU for large data sets.
Speedup
On a linux machine (Ubuntu 18.04 LTS) equipped with a NVidia GTX1060 6MB graphics card we observed a speedup of 710 for larger data sets (n > 200k points). Example: for a data set of n = 10⁶ (1 million) 2dimensional vectors and k=100 centroids the execution time for 30 Lloyd iterations was 6.34 seconds a.o.t. 62.17 seconds for scikitlearn (this is a speedup of 9.81)
Below you see speedup values measured for different values of data set size n and number of centroids k for 2Ddata. For larger data sets also the speedup tends to be higher. For small data sets, however, KMeansTF is actually often slower than scikitlearn KMeans. Perhaps this is caused caused by a startup time for tensorflow. One could argue that this is not so relevant but it may also indicate potential for improvement for the kmeanstf implementation.
Why is it so fast?
Three lines of code which seem to originate from Shawn Simister (https://twitter.com/narphorium/status/668234313662525440?lang=en) are at the core of the tensorflow kmeans implementation:
expanded_vectors = tf.expand_dims(samples, 0) #[1,n,d] expanded_centroids = tf.expand_dims(centroids, 1) # [k,1,d] distances = tf.reduce_sum( tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2) #[k,n]
Thereby samples is the [n,d] tensor containing n vectors of dimension d.
centroids is the [k,d] tensor containing k vectors (centroids) of dimension d.
The first two lines add a dimension of size 1 to samples and centroids in position 0 and 1 respectively
The third line contains the inner statement
tf.subtract(expanded_vectors, expanded_centroids) #[k,n,d]
This makes use of "broadcasting", an operation in tensorflow and numpy which aligns the shape of tensors before an arithmetic operation is applied which requires them to have the same shape. In this case the [1,n,d]tensor expanded_vectors is converted to shape [k,n,d] by stacking k copies of samples along dimension 0. The [k,1,d] tensor expanded_centroids is converted to shape [k,n,d] as well by stacking n copies of centroids along dimension 1.
This makes it possible to compute the elementwise mutual distance of all n data vectors and all k centroids. Adding the outer tf.square and tf.reduce_sum operations we have one single tensor statement which does the work of 4 nested conventional loops.
Pure elegance!
And ideally suited to run on a GPU (if it has enough memory, because k*n*d can be quite large). To not make the GPU memory a limiting factor, the implementation in kmeanstf is able to split up this tensor along the ndimension into several pieces each fitting on the GPU. See parameter max_mem below.
class KMeansTF
Please note: the interface is designed to be a subset of sklearn.cluster.KMeans.
Current exception: parameter max_mem (see below)
class kmeanstf.KMeansTF(n_clusters=8, init='kmeans++', n_init=10, max_iter=300, tol=1e4, verbose=0, max_mem=1300000000)
Parameters

n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.

init : {‘kmeans++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘kmeans++’:
‘kmeans++’ : selects initial cluster centers for kmean clustering in a smart way to speed up convergence.
‘random’: choose k observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int, default: 10
Number of time the kmeans algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default: 300
Maximum number of iterations of the kmeans algorithm for a single run.

tol : float, default: 1e4
Relative tolerance with regards to inertia to declare convergence

verbose : int, default: 0 if larger 0 outputs messages to analyze the working of the algorithm

max_mem : int, default: 1300000000 (no equivalent in sklearn) size in bytes of largest tensor to be used during kmeans computation. Default value is suitable for NVidia GTX1060. If a too large value is specified and a memory error occurs, a sequence of decreasing values is automatically tried out (and printed) until a working value for the given GPU is found. This value should subsequently be used as the ḿax_mem parameter. Large data sets ar split into fractions in order to not pass this threshold.
Attributes

cluster_centers_ : array, [n_clusters, n_features]
Coordinates of cluster centers.

inertia_ : float
Sum of squared distances of samples to their closest cluster center.

n_iter_ : int
Number of iterations run.
Methods

fit (self, X)
Compute kmeans clustering.

fit_predict (self, X)
Compute cluster centers and predict cluster index for each sample.
Parameters X: data to be clustered and predicted, shape = [n_samples, n_features]
Returns: labels : array, shape [n_samples,]
Index of the cluster each sample belongs to

predict (self, X)
Predict cluster index for each sample.
Parameters X: data to be predicted, shape = [n_samples, n_features]
Returns: labels : array, shape [n_samples,]
Index of the cluster each sample belongs to
Areas for further improvements
 improve the time behavior for smaller data sets
 directly determine suitable max_mem value from physical features of present GPU
 enable the use of several GPUs
 enable the use of TPUs (kmeanstf does work on colab with GPU, but not yet with TPU)
Both feedback and pull requests are welcome.
Project details
Release history Release notifications
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size kmeanstf0.4.0py2.py3noneany.whl (11.8 kB)  File type Wheel  Python version py2.py3  Upload date  Hashes View hashes 
Filename, size kmeanstf0.4.0.tar.gz (11.7 kB)  File type Source  Python version None  Upload date  Hashes View hashes 
Hashes for kmeanstf0.4.0py2.py3noneany.whl
Algorithm  Hash digest  

SHA256  669d6c625f49c159ab0e06e7eca7dc350d92ae37a38f7dd16f576543a130e46e 

MD5  d2e1551c3c8fc0973cae61bf1611bda9 

BLAKE2256  7a27452ddcc09d6c5804ce32e2f43aedb8e8420c8fbce521447048087ed36645 