Scalable and parallel programming implementation of Affinity Propagation clustering
Affinity Propagation is a clustering algorithm based on passing messages between data-points.
Storing and updating the matrices of ‘similarities’, ‘responsibilities’ and ‘availabilities’ between samples can be memory-intensive. We address this issue by keeping these matrices in an HDF5 data structure on disk, allowing Affinity Propagation clustering of arbitrarily large data-sets, where other Python implementations would raise a MemoryError on most machines.
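To illustrate the idea, the sketch below (not Concurrent_AP's actual code; it assumes the h5py package as the HDF5 binding, and the dataset names are made up) keeps the N x N message matrices in chunked HDF5 datasets on disk and updates them one block of rows at a time, so memory usage stays bounded regardless of N:

```python
import numpy as np
import h5py

N = 1000  # number of samples
with h5py.File('ap_matrices.h5', 'w') as f:
    # One on-disk dataset per matrix; chunking by blocks of rows means
    # only the slice currently being updated needs to reside in RAM.
    for name in ('similarities', 'responsibilities', 'availabilities'):
        f.create_dataset(name, shape=(N, N), dtype='float64', chunks=(100, N))
    S = f['similarities']
    # Fill (or update) the matrix block by block instead of materializing
    # the full N x N array in memory.
    for start in range(0, N, 100):
        S[start:start + 100, :] = np.random.rand(100, N)
    print(S.shape)
```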
We also significantly speed up the computations by splitting them up across subprocesses, thereby taking full advantage of the resources of multi-core processors and bypassing the Global Interpreter Lock of the standard Python interpreter, CPython.
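The row-wise splitting can be sketched as follows (a toy illustration, not Concurrent_AP's actual message-passing code): each worker process receives a block of rows, applies a responsibility-style update to it, and the parent process reassembles the results. Because the work happens in separate processes, the Global Interpreter Lock is not a bottleneck:

```python
import numpy as np
from multiprocessing import Pool

def update_rows(args):
    """Toy row-block update: subtract each row's maximum (responsibility-like)."""
    start, stop, S = args
    block = S[start:stop]
    return start, block - block.max(axis=1, keepdims=True)

if __name__ == '__main__':
    N, n_workers = 400, 4
    S = np.random.rand(N, N)
    # Partition the rows evenly among the worker processes.
    bounds = np.linspace(0, N, n_workers + 1, dtype=int)
    jobs = [(bounds[i], bounds[i + 1], S) for i in range(n_workers)]
    pool = Pool(n_workers)
    try:
        for start, block in pool.map(update_rows, jobs):
            S[start:start + block.shape[0]] = block
    finally:
        pool.close()
        pool.join()
    print(S.max())  # each row's maximum is now 0
```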
Concurrent_AP requires Python 2.7 along with the following packages and a few modules from the Standard Python Library:
It is advisable to check that the required dependencies are installed, although the pip command below should handle this automatically for you. The most convenient way to install Concurrent_AP is from the official Python Package Index (PyPI), as follows:
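For instance (assuming pip is available on your system and that the PyPI distribution name matches the module name):

```shell
pip install Concurrent_AP
```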
The code herewith has been tested on Fedora, OS X and Ubuntu and should work fine on any other member of the Unix-like family of operating systems.
See the docstrings associated with each function of the Concurrent_AP module for more information and an understanding of how the different tasks are organized and shared among subprocesses.
Usage: Concurrent_AP [options] file_name, where file_name denotes the path to the data to be processed by Affinity Propagation clustering. The data must consist of tab-separated rows of samples, with each column corresponding to a particular feature.
The following few lines illustrate the use of Concurrent_AP on the ‘Iris data-set’ from the UCI Machine Learning Repository. While the number of samples is here way too small for the benefits of the present multi-tasking implementation and the use of an HDF5 data structure to come fully into play, this data-set has the advantage of being well-known and prone to a quick comparison with scikit-learn’s version of Affinity Propagation clustering.
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> with open('./iris_data.txt', 'w') as f:
...     np.savetxt(f, data, fmt='%.4f', delimiter='\t')
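Following the usage pattern above, Concurrent_AP can then be launched on this file from the command line (no options are passed here, so defaults apply):

```shell
Concurrent_AP iris_data.txt
```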
Running Concurrent_AP without specifying a preference value will automatically compute a preference parameter from the data-set.
When the rounds of message-passing among data-points have completed, a folder is created in your current working directory; it contains a file of cluster labels and a file of cluster centers' indices, both in tab-separated format.
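For a quick cross-check of those output files against scikit-learn's in-memory implementation on the same Iris data (a sketch with scikit-learn's default parameters; the labels it prints can be compared against the tab-separated label file written by Concurrent_AP):

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AffinityPropagation

iris = datasets.load_iris()
# In-memory reference run; compare af.labels_ with Concurrent_AP's output file.
af = AffinityPropagation().fit(iris.data)
print('scikit-learn found %d clusters' % len(af.cluster_centers_indices_))
```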
Brendan J. Frey and Delbert Dueck, “Clustering by Passing Messages Between Data Points”, Science 315, 972–976 (2007).