Scalable and parallel programming implementation of Affinity Propagation clustering
A scalable and concurrent programming implementation of Affinity Propagation clustering.
Affinity Propagation is a clustering algorithm based on passing messages between data-points.
Storing and updating matrices of ‘affinities’, ‘responsibilities’ and ‘similarities’ between samples can be memory-intensive. We address this issue through the use of an HDF5 data structure, allowing Affinity Propagation clustering of arbitrary large data-sets, where other Python implementations would return a MemoryError on most machines.
We also significantly speed up the computations by splitting them up across subprocesses, thereby taking full advantage of the resources of multi-core processors and bypassing the Global Interpreter Lock of the standard Python interpreter, CPython.
Installation and Requirements
Concurrent_AP requires Python 2.7 along with the following packages and a few modules from the Standard Python Library:
- NumPy >= 1.9
It is suggested that you check that the required dependencies are installed, although the pip command below should do this automatically for you. You can indeed most conveniently download Concurrent_AP from the official Python Package Index (PyPI) as follows:
- open a terminal window;
- type in the command pip install Concurrent_AP.
The code herewith has been tested on Fedora, OS X and Ubuntu and should work fine on any other member of the Unix-like family of operating systems.
Usage and Command Line Options
See the docstrings associated to each function of the Concurrent_AP module for more information and an understanding of how different tasks are organized and shared among subprocesses.
Usage: Concurrent_AP [options] file_name, where file_name denotes the path where the data to be processed by Affinity Propagation clustering is held. The data must consist in tab-separated rows of samples, each column corresponding to a particular feature.
- -c or --convergence: specify the number of iterations without change in the number of clusters that signals convergence (defaults to 15);
- -d or --damping: the damping parameter of Affinity Propagation (defaults to 0.5);
- -f or --file: option to specify the file name or file handle of the hierarchical data format where the matrices involved in Affinity Propagation clustering will be stored (defaults to a temporary file);
- -i or --iterations: maximum number of message-passing iterations (defaults to 200);
- -m or --multiprocessing: the number of processes to use;
- -p or --preference: the preference parameter of Affinity Propagation (if not specified, will be determined as the median of the distribution of pairwise L2 Euclidean distances between samples);
- -s or --similarities: determine if a similarity matrix has been pre-computed and stored in the HDF5 data structure accessible at the location specified through the command line option -f or --file (see above);
- -v or --verbose: whether to be verbose.
Demo of Concurrent_AP
The following few lines illustrate the use of Concurrent_AP on the ‘Iris data-set’ from the UCI Machine Learning Repository. While the number of samples is here way too small for the benefits of the present multi-tasking implementation and the use of an HDF5 data structure to come fully into play, this data-set has the advantage of being well-known and prone to a quick comparison with scikit-learn’s version of Affinity Propagation clustering.
- In a Python interpreter console, enter the following few lines, whose purpose is to create a file containing the Iris data-set that will be later subjected to Affinity Propagation clustering via Concurrent_AP:
>>> import numpy as np >>> from sklearn import datasets >>> iris = datasets.load_iris() >>> data = iris.data >>> with open('./iris_data.txt', 'w') as f: np.savetxt(f, data, fmt = '%.4f', delimiter = '\t')
- Open a terminal window.
- Type in Concurrent_AP --preference 5.47 --v iris_data.txt or simply Concurrent_AP iris_data.txt.
The latter will automatically compute a preference parameter from the data-set.
When the rounds of message-passing among data-points have completed, a folder containing a file of cluster labels and a file of cluster centers indices both in tab-separated format is created in your current working directory.
Brendan J. Frey and Delbert Dueck. “Clustering by Passing Messages between Data Points”, Science Feb. 2007